Commit 66f7832a authored by Hongkuan Zhou, committed by GitHub

feat(planner): unify throughput and load scaling on FPM regression (#7961)

parent 0b7a18ce
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# SLA Planner Load Test
@@ -19,52 +15,13 @@ You have two options to obtain the pre-deployment profiling data:
### Option A: Use Test Configuration (Quickstart)
Use the pre-configured test deployment with sample profiling data. We provide the results and the deployment configuration for the following model and hardware configuration:
- `nvidia/Llama-3.1-8B-Instruct-FP8` on H200 with max context length 16384, TP1 Prefill, and TP1 Decode. At ISL/OSL 3000/150, it achieves 40k tokens/s/gpu prefill with 80ms TTFT and 10k tokens/s/gpu decode with 10ms ITL. See `../tests/data/profiling_results/H200_TP1P_TP1D/`.
### Option B: Use Your Own Profiling Results
1. Run pre-deployment profiling for your specific setup. See the [pre-deployment profiling documentation](../../../../../../docs/components/profiler/profiler-guide.md) for detailed instructions.
## Interpolator Testing
SLA planner uses two interpolators to estimate the performance of prefill and decode. You can test the interpolators with the following command:
```bash
python components/src/dynamo/planner/core/throughput/interpolation.py \
--profile_results_dir <path_to_profile_results> \
--isl <ISL> \
--osl <OSL> \
--ttft <TTFT(ms)> \
--itl <ITL(ms)>
```
The script interpolates from the given ISL, OSL, and the TTFT and ITL SLAs, and reports the request rate that saturates the engine.
For example, to test the interpolator for `nvidia/Llama-3.1-8B-Instruct-FP8` on H200 (target TTFT=200ms, ITL=10ms):
```bash
python components/src/dynamo/planner/core/throughput/interpolation.py \
--profile_results_dir components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D/ \
--isl 3000 \
--osl 300 \
--ttft 200 \
--itl 10
# output:
ISL=3000, OSL=300
TTFT=200ms, ITL=10ms
Using profile results from components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D/
Interpolating prefill performance ...
Estimated TTFT=60.00ms <= target TTFT=200.00ms. Requests can queue 140.00ms maximally while meeting TTFT SLA.
Estimated throughput: 49481.09 tokens/s/gpu. Request rate at 16.49 requests/s will saturate one GPU.
Interpolating decode performance ...
Average context length: isl + osl/2 = 3150.
Estimated ITL=9.70ms <= target ITL=10.00ms at 16.16% active kv usage.
Estimated throughput: 4555.68 token/s/gpu. Request rate at 15.19 requests/s will saturate one GPU.
```
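The saturating request rates in the sample output appear to follow from dividing each per-GPU throughput estimate by the number of tokens a request contributes to that phase. The sketch below reproduces that arithmetic from the printed numbers only. The variable names are illustrative and are not taken from `interpolation.py`.

```python
# Values copied from the sample output above.
isl, osl = 3000, 300
ttft_sla_ms, est_ttft_ms = 200.0, 60.0
prefill_thpt = 49481.09  # tokens/s/gpu, prefill estimate
decode_thpt = 4555.68    # tokens/s/gpu, decode estimate

# TTFT headroom that a request may spend queueing.
queue_budget_ms = ttft_sla_ms - est_ttft_ms

# Each request contributes ISL prefill tokens and OSL decode tokens,
# so throughput / tokens-per-request gives the saturating request rate.
prefill_rps = prefill_thpt / isl
decode_rps = decode_thpt / osl

# Average context length during decode, as printed by the script.
avg_context = isl + osl / 2

print(f"queue budget: {queue_budget_ms:.2f} ms")
print(f"prefill saturates at {prefill_rps:.2f} requests/s")
print(f"decode saturates at {decode_rps:.2f} requests/s")
print(f"average context length: {avg_context:.0f}")
```

Rounded to two decimals, these values agree with the figures in the sample output, which suggests that decode (about 15.19 requests/s per GPU) saturates slightly before prefill for this ISL/OSL pair.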
## Generating Load Dataset
We provide a tool to generate a load dataset with a varying request rate. More details can be found in [sin_load_generator](../../../../../../benchmarks/sin_load_generator/README.md).
@@ -89,36 +46,6 @@ python benchmarks/sin_load_generator/sin_synth.py \
The dataset starts at 5 requests/s, increases to 45 requests/s at t=300s, decreases back to 5 requests/s at t=600s, and repeats.
The total duration is 30 minutes (1800 seconds).
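As a rough illustration of the load shape described above, the sketch below models the request rate as a raised cosine with a 600-second period. This curve is an assumption chosen to match the stated endpoints (5 requests/s at t=0s, 45 requests/s at t=300s); the exact formula used by `sin_synth.py` is not shown here and may differ. The function name `request_rate` is hypothetical.

```python
import math


def request_rate(t: float, low: float = 5.0, high: float = 45.0,
                 period: float = 600.0) -> float:
    """Request rate (requests/s) at time t seconds.

    Assumed form: a raised cosine that starts at `low`, peaks at `high`
    half a period later, and returns to `low` after one full period.
    """
    mid = (low + high) / 2
    amplitude = (high - low) / 2
    return mid - amplitude * math.cos(2 * math.pi * t / period)


# Sample one full period of the assumed curve.
for t in (0, 150, 300, 450, 600):
    print(f"t={t:4d}s  rate={request_rate(t):5.1f} requests/s")
```

Under this assumed curve the mean rate is 25 requests/s, so a 1800-second run covers three full periods.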
## Planner Dry Run
Before testing SLA planner on a real deployment, you can use the dry run feature to test the autoscaling behavior on a given dataset. Specifically, in dry run mode:
- The load predictor is tested. However, the load metrics differ from those of a real deployment because the actual OSL is only known after the requests are processed.
- There are no SLA predictions. Instead, SLA planner shows the safe throughput limit that ensures requests can be processed within the SLA.
- The correction factor is disabled because there are no SLA metrics to use as a reference.
To dry run SLA planner:
```bash
python components/src/dynamo/planner/tests/manual/unit/planner_sla_dryrun.py \
--config '{"environment":"kubernetes","backend":"vllm","ttft":200,"itl":10,"profile_results_dir":"components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D","throughput_adjustment_interval":60,"no_correction":true}' \
--dataset rr-5-45_i3000o300.jsonl \
--start-num-p 1 \
--start-num-d 1 \
--output-plot dryrun_plot.png
```
Below is the dryrun result:
![Dryrun Plot](./figures/dryrun_plot.png)
The first plot shows the actual request rate and the predicted request rate (in units of requests/adjustment_interval).
The second plot shows the actual ISL/OSL and the predicted ISL/OSL. The first two plots are useful when tuning the performance of the load predictor.
The third plot shows the actual prefill throughput, the number of prefill workers that the planner scales to, and the safe throughput limit for that number of prefill workers. If the actual throughput is below the safe throughput limit, the deployment has the capacity to meet the TTFT SLA. Note that in a real deployment, other factors such as queueing, load balancing, KV cache transfer latency, and ISL variance mean that meeting the TTFT SLA is not guaranteed.
The fourth plot, similar to the third plot, shows the actual decode throughput, the number of decode workers that the planner scales to, and the safe throughput limit for that number of decode workers. If the actual throughput is below the safe throughput limit, the deployment has the capacity to meet the ITL SLA. Note that in a real deployment, other factors such as load balancing and OSL variance mean that meeting the ITL SLA is not guaranteed.
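The comparison between actual throughput and the safe throughput limit can be sketched as a capacity check. The example below is a simplification under stated assumptions: it treats the safe limit as scaling linearly with the worker count, assumes one GPU per worker (TP1), and reuses the per-GPU throughput estimates from the interpolator sample output earlier in this README. The helper `workers_needed` is hypothetical and is not part of the planner.

```python
import math


def workers_needed(request_rate: float, tokens_per_request: float,
                   per_worker_thpt: float) -> int:
    """Smallest worker count whose combined safe throughput covers the load.

    Assumes the safe limit is num_workers * per_worker_thpt (linear scaling).
    """
    load = request_rate * tokens_per_request  # tokens/s
    return max(1, math.ceil(load / per_worker_thpt))


# Per-GPU estimates from the interpolator sample output (ISL 3000, OSL 300).
PREFILL_THPT = 49481.09  # tokens/s/gpu
DECODE_THPT = 4555.68    # tokens/s/gpu

for rps in (5, 45):  # trough and peak of the rr-5-45 dataset
    p = workers_needed(rps, 3000, PREFILL_THPT)
    d = workers_needed(rps, 300, DECODE_THPT)
    print(f"{rps:2d} requests/s -> {p} prefill worker(s), {d} decode worker(s)")
```

This ignores queueing, load balancing, and ISL/OSL variance, so it gives only a lower bound on the worker count rather than the planner's actual decision.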
## Scaling Tests
This directory contains comprehensive tests for validating the SLA planner's scaling behavior. The tests validate both the replica calculation logic and end-to-end scaling behavior. The scaling test uses a graduated load approach rather than dataset files, as it proved more reliable for metric generation and scaling triggers.
@@ -132,6 +59,7 @@ This directory contains comprehensive tests for validating the SLA planner's sca
### Quick Start for Unit Tests and End-to-End Tests
#### Run Unit Tests Only
Test the replica calculation logic without requiring Kubernetes:
```bash
@@ -175,6 +103,7 @@ components/src/dynamo/planner/tests/manual/scaling/run_scaling_test.sh --namespa
### Instructions for End-to-End Perf Tests
In this test, we compare performance (goodput and goodput/GPU) across the following four deployments using the aforementioned 8B FP8 model on H200 and the dataset used in the dry run:
- Config 1 with inefficient P/D ratio: 3xTP1P_1xTP1D_4GPU
`./perf_test_configs/disagg_8b_3p1d.yaml`
- Config 2 with best static deployment: 2xTP1P_2xTP1D_4GPU
@@ -214,12 +143,13 @@ aiperf profile \
#### E2E Perf Test Results
![Results](./figures/sla_planner_perf.png)
The table below shows the performance improvement of SLA planner across different deployment configurations:

| Baseline                            | Goodput Improvement | Goodput/GPU Improvement |
| ----------------------------------- | ------------------- | ----------------------- |
| Inefficient P/D ratio               | 725%                | 600%                    |
| Inefficient parallelization mapping | 311%                | 249%                    |
| Best static deployment              | 52%                 | 29%                     |
@@ -82,7 +82,7 @@ spec:
  - dynamo.planner
  args:
  - --config
- - '{"environment": "kubernetes", "backend": "vllm", "ttft": 200, "itl": 10, "profile_results_dir": "/workspace/components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D/", "throughput_adjustment_interval": 60, "metric_reporting_prometheus_port": 9085, "no_correction": true}'
+ - '{"environment": "kubernetes", "backend": "vllm", "ttft": 200, "itl": 10, "profile_results_dir": "/workspace/components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D/", "throughput_adjustment_interval": 60, "metric_reporting_prometheus_port": 9085}'
  VllmDecodeWorker:
  envFromSecret: hf-token-secret
  componentType: worker
...
@@ -25,7 +25,7 @@ spec:
  - dynamo.planner
  args:
  - --config
- - '{"environment": "kubernetes", "backend": "vllm", "throughput_adjustment_interval": 60, "profile_results_dir": "/workspace/components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D", "no_correction": true}'
+ - '{"environment": "kubernetes", "backend": "vllm", "throughput_adjustment_interval": 60, "profile_results_dir": "/workspace/components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D"}'
  VllmDecodeWorker:
  envFromSecret: hf-token-secret
  componentType: worker
...
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

import argparse
import logging

from dynamo.planner.config.planner_config import PlannerConfig
from dynamo.planner.offline.dryrun import run_sla_planner_dryrun

logger = logging.getLogger(__name__)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Planner Dryrun")
    parser.add_argument(
        "--config",
        required=True,
        help="JSON string or path to a JSON/YAML config file",
    )
    parser.add_argument(
        "--dataset", type=str, required=True, help="Path to the jsonl dataset file"
    )
    parser.add_argument(
        "--start-num-p",
        type=int,
        default=1,
        help="Number of prefill workers to start with",
    )
    parser.add_argument(
        "--start-num-d",
        type=int,
        default=1,
        help="Number of decode workers to start with",
    )
    parser.add_argument(
        "--output-plot",
        type=str,
        default="dryrun_plot.png",
        help="Path to the output plot file",
    )
    args = parser.parse_args()

    config = PlannerConfig.from_config_arg(args.config)
    run_sla_planner_dryrun(
        config,
        dataset=args.dataset,
        start_num_p=args.start_num_p,
        start_num_d=args.start_num_d,
        output_plot=args.output_plot,
    )
@@ -19,7 +19,7 @@ from dynamo.common.forward_pass_metrics import (
  )
  from dynamo.planner.config.planner_config import PlannerConfig
  from dynamo.planner.core.decode import DecodePlanner
- from dynamo.planner.core.load.fpm_regression import (
+ from dynamo.planner.core.perf_model import (
      AggRegressionModel,
      DecodeRegressionModel,
      PrefillRegressionModel,
...@@ -70,19 +70,24 @@ def _make_fpm( ...@@ -70,19 +70,24 @@ def _make_fpm(
class TestPrefillRegressionModel: class TestPrefillRegressionModel:
def test_insufficient_data(self): def test_insufficient_data(self):
model = PrefillRegressionModel(window_size=50, min_observations=5) model = PrefillRegressionModel(
max_num_fpm_samples=50, min_observations=5, bucket_count=16
)
assert not model.has_sufficient_data() assert not model.has_sufficient_data()
assert model.estimate_next_ttft(0, 2048) is None assert model.estimate_next_ttft(0, 2048) is None
def test_heartbeat_skipped(self): def test_heartbeat_skipped(self):
model = PrefillRegressionModel(window_size=50, min_observations=3) model = PrefillRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
fpm = _make_fpm(wall_time=0.0, sum_prefill_tokens=100, num_prefill_requests=1) fpm = _make_fpm(wall_time=0.0, sum_prefill_tokens=100, num_prefill_requests=1)
model.add_observation(fpm) model.add_observation(fpm)
assert model.num_observations == 0 assert model.num_observations == 0
def test_basic_regression_and_ttft_estimate(self): def test_basic_regression_and_ttft_estimate(self):
model = PrefillRegressionModel(window_size=50, min_observations=3) model = PrefillRegressionModel(
# wall_time = 0.001 * prefill_tokens + 0.002 (linear relationship) max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
for tokens in [500, 1000, 1500, 2000, 2500]: for tokens in [500, 1000, 1500, 2000, 2500]:
fpm = _make_fpm( fpm = _make_fpm(
sum_prefill_tokens=tokens, sum_prefill_tokens=tokens,
...@@ -93,9 +98,6 @@ class TestPrefillRegressionModel: ...@@ -93,9 +98,6 @@ class TestPrefillRegressionModel:
assert model.has_sufficient_data() assert model.has_sufficient_data()
# Single iteration: queued=0, avg_isl should be mean of [500..2500]=1500
# total_tokens = 0 + avg_isl ≈ 1500
# 1 iteration at max_num_batched_tokens=2048 (1500 < 2048)
est = model.estimate_next_ttft( est = model.estimate_next_ttft(
queued_prefill_tokens=0, max_num_batched_tokens=2048 queued_prefill_tokens=0, max_num_batched_tokens=2048
) )
...@@ -103,8 +105,9 @@ class TestPrefillRegressionModel: ...@@ -103,8 +105,9 @@ class TestPrefillRegressionModel:
assert est > 0 assert est > 0
def test_chunked_ttft_simulation(self): def test_chunked_ttft_simulation(self):
model = PrefillRegressionModel(window_size=50, min_observations=3) model = PrefillRegressionModel(
# Simple: wall_time = 0.001 * prefill_tokens (slope=0.001, intercept≈0) max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
for tokens in [100, 200, 300, 400, 500]: for tokens in [100, 200, 300, 400, 500]:
fpm = _make_fpm( fpm = _make_fpm(
sum_prefill_tokens=tokens, sum_prefill_tokens=tokens,
...@@ -113,11 +116,6 @@ class TestPrefillRegressionModel: ...@@ -113,11 +116,6 @@ class TestPrefillRegressionModel:
) )
model.add_observation(fpm) model.add_observation(fpm)
# avg_isl = mean([100,200,300,400,500]) = 300
# total_tokens = 5000 (queued) + 300 (next ISL) = 5300
# max_num_batched_tokens = 2048
# iterations: ceil(5300/2048) = 3
# chunk1=2048, chunk2=2048, chunk3=1204
est = model.estimate_next_ttft( est = model.estimate_next_ttft(
queued_prefill_tokens=5000, max_num_batched_tokens=2048 queued_prefill_tokens=5000, max_num_batched_tokens=2048
) )
...@@ -125,7 +123,9 @@ class TestPrefillRegressionModel: ...@@ -125,7 +123,9 @@ class TestPrefillRegressionModel:
assert est > 0.003 # at least 3 iterations worth assert est > 0.003 # at least 3 iterations worth
def test_avg_isl_tracking(self): def test_avg_isl_tracking(self):
model = PrefillRegressionModel(window_size=50, min_observations=3) model = PrefillRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
for isl in [1000, 2000, 3000]: for isl in [1000, 2000, 3000]:
fpm = _make_fpm( fpm = _make_fpm(
sum_prefill_tokens=isl, num_prefill_requests=1, wall_time=0.01 sum_prefill_tokens=isl, num_prefill_requests=1, wall_time=0.01
...@@ -133,39 +133,219 @@ class TestPrefillRegressionModel: ...@@ -133,39 +133,219 @@ class TestPrefillRegressionModel:
model.add_observation(fpm) model.add_observation(fpm)
assert abs(model.avg_isl - 2000.0) < 1.0 assert abs(model.avg_isl - 2000.0) < 1.0
def test_sliding_window_eviction(self): def test_find_best_engine_prefill_rps(self):
model = PrefillRegressionModel(window_size=5, min_observations=3) model = PrefillRegressionModel(
for i in range(10): max_num_fpm_samples=50, min_observations=3, bucket_count=16
fpm = _make_fpm(sum_prefill_tokens=100 * (i + 1), wall_time=0.01) )
for tokens in [500, 1000, 1500, 2000, 2500]:
fpm = _make_fpm(
sum_prefill_tokens=tokens,
num_prefill_requests=1,
wall_time=0.001 * tokens + 0.002,
)
model.add_observation(fpm) model.add_observation(fpm)
rps, actual_ttft_ms = model.find_best_engine_prefill_rps(
ttft_sla=2000.0, isl=1000.0
)
assert rps > 0
# wall_time ~1.002s for 1000 tokens -> rps ~ 1/1.002 ~ 0.998
assert 0.5 < rps < 2.0
assert actual_ttft_ms > 0
assert 1000 < actual_ttft_ms < 2000
def test_find_best_engine_prefill_rps_zero_isl(self):
model = PrefillRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
for tokens in [500, 1000, 1500]:
fpm = _make_fpm(
sum_prefill_tokens=tokens,
num_prefill_requests=1,
wall_time=0.001 * tokens,
)
model.add_observation(fpm)
rps, _ = model.find_best_engine_prefill_rps(ttft_sla=1000.0, isl=0.0)
assert rps == 0.0
def test_load_benchmark_fpms(self):
model = PrefillRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
fpms = [
_make_fpm(sum_prefill_tokens=t, num_prefill_requests=1, wall_time=0.001 * t)
for t in [500, 1000, 1500, 2000, 2500]
]
model.load_benchmark_fpms(fpms)
assert model.num_observations == 5 assert model.num_observations == 5
assert model.has_sufficient_data()
est = model.estimate_next_ttft(
queued_prefill_tokens=0, max_num_batched_tokens=2048
)
assert est is not None
# ── Bucketed retirement tests ─────────────────────────────────────────
class TestBucketedRetirement:
def test_total_capped_at_max(self):
"""Total observations never exceed max_num_fpm_samples."""
model = PrefillRegressionModel(
max_num_fpm_samples=10, min_observations=3, bucket_count=4
)
for i in range(20):
fpm = _make_fpm(
sum_prefill_tokens=100 * (i + 1),
num_prefill_requests=1,
wall_time=0.01 * (i + 1),
)
model.add_observation(fpm)
assert model.num_observations == 10
def test_most_populated_bucket_loses_oldest(self):
"""When evicting, the oldest entry from the most-populated bucket is removed."""
model = PrefillRegressionModel(
max_num_fpm_samples=6, min_observations=1, bucket_count=4
)
# 3 observations at low tokens (bucket 0 area)
for i in range(3):
fpm = _make_fpm(
sum_prefill_tokens=10 + i,
num_prefill_requests=1,
wall_time=0.001 * (10 + i),
)
model.add_observation(fpm)
# 3 observations at high tokens (different bucket)
for i in range(3):
fpm = _make_fpm(
sum_prefill_tokens=1000 + i * 100,
num_prefill_requests=1,
wall_time=0.001 * (1000 + i * 100),
)
model.add_observation(fpm)
assert model.num_observations == 6
# One more at low tokens; total would exceed 6 so most-populated
# bucket loses its oldest entry.
fpm = _make_fpm(
sum_prefill_tokens=15,
num_prefill_requests=1,
wall_time=0.015,
)
model.add_observation(fpm)
assert model.num_observations == 6
def test_uniform_distribution_preserved(self):
"""Bucketed eviction keeps observations across operating points."""
model = DecodeRegressionModel(
max_num_fpm_samples=10, min_observations=3, bucket_count=16
)
# Many observations at a single operating point
for _ in range(15):
fpm = _make_fpm(
num_decode_requests=32,
sum_decode_kv_tokens=32000,
wall_time=0.01,
)
model.add_observation(fpm)
assert model.num_observations == 10
# Add a different operating point; the concentrated bucket loses one
fpm = _make_fpm(
num_decode_requests=4,
sum_decode_kv_tokens=4000,
wall_time=0.005,
)
model.add_observation(fpm)
assert model.num_observations == 10
def test_2d_bucketed_retirement(self):
"""2D models retire from the most-populated grid cell."""
model = AggRegressionModel(
max_num_fpm_samples=8, min_observations=1, bucket_count=16
)
# Fill with varied data
for p, d in [(100, 500), (200, 1000), (300, 1500), (400, 2000)]:
fpm = _make_fpm(
sum_prefill_tokens=p,
num_prefill_requests=1,
sum_decode_kv_tokens=d,
num_decode_requests=5,
wall_time=0.001 * p + 0.0001 * d,
)
model.add_observation(fpm)
# Concentrate 4 more in one region
for _ in range(4):
fpm = _make_fpm(
sum_prefill_tokens=100,
num_prefill_requests=1,
sum_decode_kv_tokens=500,
num_decode_requests=5,
wall_time=0.15,
)
model.add_observation(fpm)
assert model.num_observations == 8
# Overflow triggers retirement from the concentrated cell
fpm = _make_fpm(
sum_prefill_tokens=350,
num_prefill_requests=1,
sum_decode_kv_tokens=1800,
num_decode_requests=5,
wall_time=0.5,
)
model.add_observation(fpm)
assert model.num_observations == 8
# ── DecodeRegressionModel tests ────────────────────────────────────── # ── DecodeRegressionModel tests ──────────────────────────────────────
class TestDecodeRegressionModel: class TestDecodeRegressionModel:
def _train_2d(self, model: DecodeRegressionModel) -> None:
"""Populate with 2D data: wall_time = f(num_decode_requests, sum_decode_kv_tokens)."""
for n_req, kv in [
(5, 1000),
(10, 2000),
(15, 3000),
(20, 4000),
(25, 5000),
]:
fpm = _make_fpm(
sum_decode_kv_tokens=kv,
num_decode_requests=n_req,
wall_time=0.0001 * kv + 0.0005 * n_req + 0.001,
)
model.add_observation(fpm)
def test_insufficient_data(self): def test_insufficient_data(self):
model = DecodeRegressionModel(window_size=50, min_observations=5) model = DecodeRegressionModel(
max_num_fpm_samples=50, min_observations=5, bucket_count=16
)
assert not model.has_sufficient_data() assert not model.has_sufficient_data()
assert model.estimate_next_itl(0, 0) is None assert model.estimate_next_itl(0, 0) is None
def test_heartbeat_skipped(self): def test_heartbeat_skipped(self):
model = DecodeRegressionModel(window_size=50, min_observations=3) model = DecodeRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
fpm = _make_fpm(wall_time=0.0, sum_decode_kv_tokens=100, num_decode_requests=1) fpm = _make_fpm(wall_time=0.0, sum_decode_kv_tokens=100, num_decode_requests=1)
model.add_observation(fpm) model.add_observation(fpm)
assert model.num_observations == 0 assert model.num_observations == 0
def test_basic_itl_estimate(self): def test_basic_itl_estimate(self):
model = DecodeRegressionModel(window_size=50, min_observations=3) model = DecodeRegressionModel(
# wall_time = 0.0001 * decode_kv + 0.001 max_num_fpm_samples=50, min_observations=3, bucket_count=16
for kv in [1000, 2000, 3000, 4000, 5000]: )
fpm = _make_fpm( self._train_2d(model)
sum_decode_kv_tokens=kv,
num_decode_requests=10,
wall_time=0.0001 * kv + 0.001,
)
model.add_observation(fpm)
assert model.has_sufficient_data() assert model.has_sufficient_data()
est = model.estimate_next_itl(scheduled_decode_kv=3000, queued_decode_kv=0) est = model.estimate_next_itl(scheduled_decode_kv=3000, queued_decode_kv=0)
...@@ -173,7 +353,9 @@ class TestDecodeRegressionModel: ...@@ -173,7 +353,9 @@ class TestDecodeRegressionModel:
assert est > 0 assert est > 0
def test_avg_decode_length_tracking(self): def test_avg_decode_length_tracking(self):
model = DecodeRegressionModel(window_size=50, min_observations=3) model = DecodeRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
for total_kv, num_req in [(1000, 10), (2000, 10), (3000, 10)]: for total_kv, num_req in [(1000, 10), (2000, 10), (3000, 10)]:
fpm = _make_fpm( fpm = _make_fpm(
sum_decode_kv_tokens=total_kv, sum_decode_kv_tokens=total_kv,
...@@ -183,35 +365,99 @@ class TestDecodeRegressionModel: ...@@ -183,35 +365,99 @@ class TestDecodeRegressionModel:
model.add_observation(fpm) model.add_observation(fpm)
assert abs(model.avg_decode_length - 200.0) < 1.0 assert abs(model.avg_decode_length - 200.0) < 1.0
def _train_thpt_model(self, model: DecodeRegressionModel) -> None:
"""Populate with 2D data at decode-realistic wall-time scale."""
for n_req, kv in [
(5, 5000),
(10, 10000),
(20, 20000),
(30, 30000),
(40, 40000),
]:
fpm = _make_fpm(
sum_decode_kv_tokens=kv,
num_decode_requests=n_req,
wall_time=0.00001 * kv + 0.001,
)
model.add_observation(fpm)
def test_find_best_engine_decode_rps(self):
model = DecodeRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
self._train_thpt_model(model)
rps, actual_itl = model.find_best_engine_decode_rps(
itl=50.0, context_length=1000.0, osl=150.0
)
assert rps > 0
assert actual_itl > 0
assert actual_itl <= 50.0
def test_find_best_engine_decode_rps_zero_context(self):
model = DecodeRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
self._train_2d(model)
rps, itl_ms = model.find_best_engine_decode_rps(
itl=50.0, context_length=0.0, osl=150.0
)
assert rps == 0.0
assert itl_ms == 0.0
def test_load_benchmark_fpms(self):
model = DecodeRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
fpms = [
_make_fpm(
num_decode_requests=n,
sum_decode_kv_tokens=n * 1000,
wall_time=0.001 * n,
)
for n in [5, 10, 15, 20, 25]
]
model.load_benchmark_fpms(fpms)
assert model.num_observations == 5
assert model.has_sufficient_data()
# ── AggRegressionModel tests ───────────────────────────────────────── # ── AggRegressionModel tests ─────────────────────────────────────────
class TestAggRegressionModel: class TestAggRegressionModel:
def _train_agg(self, model: AggRegressionModel) -> None:
for p, d in [(100, 1000), (200, 2000), (300, 3000), (400, 4000), (500, 5000)]:
fpm = _make_fpm(
sum_prefill_tokens=p,
num_prefill_requests=1,
sum_decode_kv_tokens=d,
num_decode_requests=10,
wall_time=0.001 * p + 0.0001 * d + 0.001,
)
model.add_observation(fpm)
def test_insufficient_data(self): def test_insufficient_data(self):
model = AggRegressionModel(window_size=50, min_observations=5) model = AggRegressionModel(
max_num_fpm_samples=50, min_observations=5, bucket_count=16
)
assert not model.has_sufficient_data() assert not model.has_sufficient_data()
assert model.estimate_next_ttft(0, 2048, 0) is None assert model.estimate_next_ttft(0, 2048, 0) is None
assert model.estimate_next_itl(0, 0) is None assert model.estimate_next_itl(0, 0) is None
def test_heartbeat_skipped(self): def test_heartbeat_skipped(self):
model = AggRegressionModel(window_size=50, min_observations=3) model = AggRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
fpm = _make_fpm(wall_time=0.0, sum_prefill_tokens=100, sum_decode_kv_tokens=200) fpm = _make_fpm(wall_time=0.0, sum_prefill_tokens=100, sum_decode_kv_tokens=200)
model.add_observation(fpm) model.add_observation(fpm)
assert model.num_observations == 0 assert model.num_observations == 0
def test_2d_regression(self): def test_2d_regression(self):
model = AggRegressionModel(window_size=50, min_observations=3) model = AggRegressionModel(
# wall_time = 0.001 * prefill + 0.0001 * decode_kv + 0.001 max_num_fpm_samples=50, min_observations=3, bucket_count=16
for p, d in [(100, 1000), (200, 2000), (300, 3000), (400, 4000), (500, 5000)]: )
fpm = _make_fpm( self._train_agg(model)
sum_prefill_tokens=p,
num_prefill_requests=1,
sum_decode_kv_tokens=d,
num_decode_requests=10,
wall_time=0.001 * p + 0.0001 * d + 0.001,
)
model.add_observation(fpm)
assert model.has_sufficient_data() assert model.has_sufficient_data()
...@@ -227,6 +473,37 @@ class TestAggRegressionModel: ...@@ -227,6 +473,37 @@ class TestAggRegressionModel:
assert itl is not None assert itl is not None
assert itl > 0 assert itl > 0
def test_find_best_engine_agg_rps(self):
model = AggRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
self._train_agg(model)
thpt, actual_ttft, actual_itl = model.find_best_engine_agg_rps(
isl=2048.0,
osl=150.0,
max_num_batched_tokens=4096,
ttft_sla=500.0,
itl_sla=50.0,
)
assert isinstance(thpt, float)
assert thpt > 0
assert actual_ttft >= 0
assert actual_itl >= 0
def test_find_best_engine_agg_rps_insufficient_data(self):
model = AggRegressionModel(
max_num_fpm_samples=50, min_observations=5, bucket_count=16
)
thpt, _, _ = model.find_best_engine_agg_rps(
isl=2048.0,
osl=150.0,
max_num_batched_tokens=4096,
ttft_sla=500.0,
itl_sla=50.0,
)
assert thpt == 0.0
# ── Planner integration tests (with mocked FPM subscriber) ────────── # ── Planner integration tests (with mocked FPM subscriber) ──────────
@@ -249,7 +526,6 @@ def _build_load_config(**overrides) -> PlannerConfig:
  itl=50.0,
  backend="vllm",
  no_operation=True,
- no_correction=True,
  metric_pulling_prometheus_endpoint="http://localhost:9090",
  metric_reporting_prometheus_port=0,
  load_predictor="constant",
@@ -266,7 +542,8 @@ def _build_load_config(**overrides) -> PlannerConfig:
  enable_load_scaling=True,
  enable_throughput_scaling=True,
  load_adjustment_interval=5,
- load_learning_window=50,
+ max_num_fpm_samples=50,
+ fpm_sample_bucket_size=16,
  load_scaling_down_sensitivity=80,
  load_metric_samples=10,
  load_min_observations=5,
...@@ -294,7 +571,6 @@ class TestPrefillFpmScaling: ...@@ -294,7 +571,6 @@ class TestPrefillFpmScaling:
planner.model_name = "test-model" planner.model_name = "test-model"
planner.prefill_worker_info = WorkerInfo(max_num_batched_tokens=2048) planner.prefill_worker_info = WorkerInfo(max_num_batched_tokens=2048)
# Train regression: wall_time grows linearly with prefill tokens
for tokens in range(200, 1200, 100): for tokens in range(200, 1200, 100):
fpm = _make_fpm( fpm = _make_fpm(
sum_prefill_tokens=tokens, sum_prefill_tokens=tokens,
...@@ -303,7 +579,6 @@ class TestPrefillFpmScaling: ...@@ -303,7 +579,6 @@ class TestPrefillFpmScaling:
) )
planner.ttft_regression.add_observation(fpm) planner.ttft_regression.add_observation(fpm)
# Both engines have heavy queued prefill -> high estimated TTFT
stats = { stats = {
("w1", 0): _make_fpm( ("w1", 0): _make_fpm(
worker_id="w1", worker_id="w1",
@@ -335,8 +610,6 @@ class TestPrefillFpmScaling:
         planner.model_name = "test-model"
         planner.prefill_worker_info = WorkerInfo(max_num_batched_tokens=2048)
-        # Train with short ISL (100 tokens each) so avg_isl stays low.
-        # Regression: wall_time ≈ 0.001 * prefill_tokens
         for tokens in range(100, 600, 50):
             fpm = _make_fpm(
                 sum_prefill_tokens=tokens,
@@ -345,9 +618,6 @@ class TestPrefillFpmScaling:
             )
             planner.ttft_regression.add_observation(fpm)
-        # All engines idle (no queued prefill).
-        # estimate_next_ttft: total = 0 + avg_isl(~100) = ~100 tokens
-        # predicted wall_time ≈ 0.001 * 100 = 0.1s = 100ms < 500ms SLA
         stats = {
             (f"w{i}", 0): _make_fpm(
                 worker_id=f"w{i}",
@@ -372,7 +642,6 @@ class TestPrefillFpmScaling:
         planner.model_name = "test-model"
         planner.prefill_worker_info = WorkerInfo(max_num_batched_tokens=2048)
-        # Only 2 observations, need 5
         for tokens in [100, 200]:
             fpm = _make_fpm(sum_prefill_tokens=tokens, wall_time=0.01)
             planner.ttft_regression.add_observation(fpm)
@@ -394,11 +663,18 @@ class TestDecodeFpmScaling:
         planner = DecodePlanner(None, config, shared_state=shared_state)
         planner.model_name = "test-model"
-        for kv in range(1000, 6000, 500):
+        # 2D regression: vary both num_decode_requests and sum_decode_kv_tokens
+        for n_req, kv in [
+            (5, 1000),
+            (10, 2000),
+            (15, 3000),
+            (20, 4000),
+            (25, 5000),
+        ]:
             fpm = _make_fpm(
                 sum_decode_kv_tokens=kv,
-                num_decode_requests=10,
-                wall_time=0.0001 * kv + 0.001,
+                num_decode_requests=n_req,
+                wall_time=0.0001 * kv + 0.0005 * n_req + 0.001,
             )
             planner.itl_regression.add_observation(fpm)
@@ -431,40 +707,17 @@ class TestDecodeFpmScaling:
         planner = DecodePlanner(None, config, shared_state=shared_state)
         planner.model_name = "test-model"
-        fpm = _make_fpm(sum_decode_kv_tokens=1000, wall_time=0.01)
+        fpm = _make_fpm(
+            sum_decode_kv_tokens=1000, num_decode_requests=5, wall_time=0.01
+        )
         planner.itl_regression.add_observation(fpm)
-        stats = {("w1", 0): _make_fpm(sum_decode_kv_tokens=5000, wall_time=0.5)}
+        stats = {
+            ("w1", 0): _make_fpm(
+                sum_decode_kv_tokens=5000, num_decode_requests=10, wall_time=0.5
+            )
+        }
         planner.fpm_subscriber = _mock_fpm_subscriber(stats)
         result = planner.load_plan_adjustment()
         assert result is None
-
-
-# ── Correction factor auto-disable tests ─────────────────────────────
-
-
-class TestCorrectionFactorAutoDisable:
-    def test_correction_factor_disabled_when_load_enabled(self):
-        config = PlannerConfig(
-            enable_load_scaling=True,
-            enable_throughput_scaling=True,
-            no_correction=False,
-        )
-        assert config.no_correction is True
-
-    def test_correction_factor_stays_disabled_if_already_set(self):
-        config = PlannerConfig(
-            enable_load_scaling=True,
-            enable_throughput_scaling=True,
-            no_correction=True,
-        )
-        assert config.no_correction is True
-
-    def test_correction_factor_not_disabled_without_loadbased(self):
-        config = PlannerConfig(
-            enable_load_scaling=False,
-            enable_throughput_scaling=True,
-            no_correction=False,
-        )
-        assert config.no_correction is False
@@ -87,3 +87,36 @@ def test_throughput_metrics_source_invalid():
     """throughput_metrics_source rejects invalid values."""
     with pytest.raises(ValidationError):
         PlannerConfig(namespace="test-ns", throughput_metrics_source="invalid")
+
+
+@pytest.mark.parametrize("bucket_size", [1, 4, 9, 16, 25])
+def test_fpm_sample_bucket_size_accepts_perfect_squares(bucket_size):
+    """fpm_sample_bucket_size must be a perfect square (valid values)."""
+    config = PlannerConfig(namespace="test-ns", fpm_sample_bucket_size=bucket_size)
+    assert config.fpm_sample_bucket_size == bucket_size
+
+
+@pytest.mark.parametrize("bucket_size", [2, 3, 5, 7, 10])
+def test_fpm_sample_bucket_size_rejects_non_squares(bucket_size):
+    """fpm_sample_bucket_size rejects values that are not perfect squares."""
+    with pytest.raises(ValidationError, match="perfect square"):
+        PlannerConfig(namespace="test-ns", fpm_sample_bucket_size=bucket_size)
+
+
+def test_max_num_fpm_samples_field():
+    """max_num_fpm_samples configures the FPM sample retention (formerly load_learning_window)."""
+    config = PlannerConfig(namespace="test-ns", max_num_fpm_samples=100)
+    assert config.max_num_fpm_samples == 100
+
+
+def test_agg_mode_supports_throughput_scaling():
+    """Agg mode supports throughput-based scaling."""
+    config = PlannerConfig(
+        namespace="test-ns",
+        mode="agg",
+        enable_throughput_scaling=True,
+        enable_load_scaling=False,
+    )
+    assert config.mode == "agg"
+    assert config.enable_throughput_scaling is True
+    assert config.scaling_enabled() is True
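The two parametrized tests above only fix the observable contract: 1, 4, 9, 16, and 25 are accepted, while 2, 3, 5, 7, and 10 raise a `ValidationError` mentioning "perfect square". The validator itself is not part of this hunk. As a rough sketch of the kind of check that would satisfy those cases, assuming an integer input, a hypothetical helper `is_perfect_square` (not a name from the commit) could look like this:

```python
import math


def is_perfect_square(n: int) -> bool:
    """Return True when n is a positive perfect square (1, 4, 9, ...).

    Hypothetical helper for illustration; the real PlannerConfig validator
    is not shown in this diff.
    """
    if n < 1:
        return False
    root = math.isqrt(n)  # exact integer square root, no float rounding
    return root * root == n


# Same values the tests parametrize over.
accepted = [n for n in (1, 4, 9, 16, 25) if is_perfect_square(n)]
rejected = [n for n in (2, 3, 5, 7, 10) if not is_perfect_square(n)]
```

Using `math.isqrt` rather than `math.sqrt` keeps the check exact for large integers, where floating-point rounding could otherwise misclassify a value.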
@@ -5,7 +5,7 @@
 Unit tests for SLA planner replica calculation logic.
 These tests focus specifically on the replica calculation formulas without
-testing load prediction, interpolation, or correction factors.
+testing load prediction or regression internals.
 """
 import asyncio
@@ -42,9 +42,9 @@ class PlannerHarness:
         if not self.shared_state.last_metrics.is_valid():
             return
-        p_endpoints, d_endpoints = await self.prefill_planner.get_workers_info()
-        self.shared_state.p_endpoints = p_endpoints
-        self.shared_state.d_endpoints = d_endpoints
+        num_p, num_d, is_stable = await self.prefill_planner.get_workers_info()
+        self.shared_state.num_p_workers = num_p
+        self.shared_state.num_d_workers = num_d
         next_num_p = self.prefill_planner.plan_adjustment()
         next_num_d = self.decode_planner.plan_adjustment()
@@ -86,14 +86,12 @@ class PlannerHarness:
             "config",
         }
         prefill_attrs = {
-            "prefill_interpolator",
+            "ttft_regression",
             "prefill_worker_info",
-            "p_correction_factor",
         }
         decode_attrs = {
-            "decode_interpolator",
+            "itl_regression",
             "decode_worker_info",
-            "d_correction_factor",
         }
         if name == "last_metrics":
             return self.shared_state.last_metrics
@@ -119,8 +117,8 @@ class PlannerHarness:
             "config",
             "get_workers_info",
         }
-        prefill_attrs = {"prefill_interpolator", "p_correction_factor"}
-        decode_attrs = {"decode_interpolator", "d_correction_factor"}
+        prefill_attrs = {"ttft_regression"}
+        decode_attrs = {"itl_regression"}
         if name == "last_metrics":
             self.shared_state.last_metrics = value
             return None
@@ -159,7 +157,6 @@ def planner():
         itl=10.0,
         backend="vllm",
         no_operation=True,
-        no_correction=False,
         metric_pulling_prometheus_endpoint="http://localhost:9090",
         metric_reporting_prometheus_port=0,
         load_predictor="constant",
@@ -176,12 +173,13 @@ def planner():
         enable_load_scaling=False,
         load_predictor_warmup_trace=None,
         load_predictor_log1p=False,
+        max_num_fpm_samples=50,
+        fpm_sample_bucket_size=16,
+        load_min_observations=5,
     )
-    # Mock the runtime
     mock_runtime = Mock()
-    # Patch Prometheus Gauge to avoid registry conflicts
     with patch("dynamo.planner.monitoring.planner_metrics.Gauge") as mock_gauge:
         mock_gauge.return_value = Mock()
@@ -206,9 +204,21 @@ def planner():
         decode_planner.prefill_worker_info = prefill_planner.prefill_worker_info
         decode_planner.decode_worker_info = prefill_planner.decode_worker_info
-        # Mock the interpolators to return fixed values for testing
-        planner.prefill_interpolator = Mock()
-        planner.decode_interpolator = Mock()
+        planner.ttft_regression = Mock()
+        # Default: 40000 tokens/s at isl=3000 → 40000/3000 rps
+        planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
+            40000.0 / 3000.0,
+            75.0,
+        )
+        planner.ttft_regression.has_sufficient_data.return_value = True
+        planner.itl_regression = Mock()
+        # Default: 10000 tokens/s at osl=150 → 10000/150 rps
+        planner.itl_regression.find_best_engine_decode_rps.return_value = (
+            10000.0 / 150.0,
+            9.5,
+        )
+        planner.itl_regression.has_sufficient_data.return_value = True
         # Mock the predictors to return fixed values
         planner.num_req_predictor = Mock()
@@ -221,14 +231,9 @@ def planner():
         # Mock prometheus client
         planner.prometheus_traffic_client = Mock()
-        # Set up some baseline correction factors
-        planner.p_correction_factor = 1.0
-        planner.d_correction_factor = 1.0
         planner.config = config
         yield planner
-        # Cleanup is automatic with context manager
 class TestReplicaCalculation:
@@ -239,59 +244,40 @@ class TestReplicaCalculation:
     @pytest.mark.performance
     def test_prefill_replica_calculation_basic(self, planner):
         """Test basic prefill replica calculation."""
-        # Setup test data
         next_num_req = 10
         next_isl = 3000
-        prefill_thpt_per_gpu = 40000  # tokens/s/gpu (from the test data)
+        engine_rps = 40000.0 / next_isl
-        # Mock the predictor outputs
         planner.num_req_predictor.predict_next.return_value = next_num_req
         planner.isl_predictor.predict_next.return_value = next_isl
         planner.osl_predictor.predict_next.return_value = 150
-        # Mock interpolator output
-        planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = (
-            prefill_thpt_per_gpu
+        planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
+            engine_rps,
+            75.0,
         )
-        planner.decode_interpolator.find_best_throughput_per_gpu.return_value = (
-            10000,
-            0.01,
-            0.5,
+        planner.itl_regression.find_best_engine_decode_rps.return_value = (
+            10000.0 / 150.0,
+            9.5,
         )
-        # Calculate expected result manually
-        pred_prefill_load_per_gpu = (
-            next_num_req
-            * next_isl
-            / planner.config.throughput_adjustment_interval
-            * min(1, planner.p_correction_factor)
-        )
-        expected_prefill_replicas = math.ceil(
-            pred_prefill_load_per_gpu
-            / prefill_thpt_per_gpu
-            / planner.config.prefill_engine_num_gpu
+        # Formula: ceil(num_req / interval / engine_rps)
+        pred_prefill_demand = (
+            next_num_req / planner.config.throughput_adjustment_interval
         )
+        expected_prefill_replicas = math.ceil(pred_prefill_demand / engine_rps)
-        # Set up valid metrics to trigger calculation
         planner.last_metrics = Metrics(
             num_req=10, isl=3000, osl=150, ttft=80.0, itl=10.0, request_duration=100.0
         )
-        # Mock workers info
-        async def mock_get_workers_info():
-            return (["prefill1"], ["decode1"])
+        async def mock_get_workers_info(*args, **kwargs):
+            return (1, 1, True)
         planner.get_workers_info = mock_get_workers_info
-        # Mock interpolation calls for correction factor calculation
-        planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
-        planner.decode_interpolator.interpolate_itl.return_value = 10.0
-        # Run the calculation
         asyncio.run(planner.make_adjustments())
-        # Extract the calculated values from the log calls or by checking the mock calls
-        # Since we mocked the connector, we can check what replicas were requested
         prefill_component = "VllmPrefillWorker"
         calculated_prefill_replicas = _replica_count(
             planner.last_target_replicas, prefill_component
@@ -299,7 +285,6 @@ class TestReplicaCalculation:
         print(f"Expected prefill replicas: {expected_prefill_replicas}")
         print(f"Calculated prefill replicas: {calculated_prefill_replicas}")
-        # Allow for small differences due to min_endpoint constraints
         assert (
             max(expected_prefill_replicas, planner.config.min_endpoint)
             == calculated_prefill_replicas
@@ -310,52 +295,39 @@ class TestReplicaCalculation:
     @pytest.mark.performance
     def test_decode_replica_calculation_basic(self, planner):
         """Test basic decode replica calculation."""
-        # Setup test data
         next_num_req = 10
         next_osl = 150
-        decode_thpt_per_gpu = 10000  # tokens/s/gpu
+        engine_rps = 10000.0 / next_osl
-        # Mock the predictor outputs
         planner.num_req_predictor.predict_next.return_value = next_num_req
         planner.isl_predictor.predict_next.return_value = 3000
         planner.osl_predictor.predict_next.return_value = next_osl
-        # Mock interpolator outputs
-        planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = 40000
-        planner.decode_interpolator.find_best_throughput_per_gpu.return_value = (
-            decode_thpt_per_gpu,
-            0.01,
-            0.5,
+        planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
+            40000.0 / 3000.0,
+            75.0,
+        )
+        planner.itl_regression.find_best_engine_decode_rps.return_value = (
+            engine_rps,
+            9.5,
         )
-        # Calculate expected result manually
+        # Formula: ceil(num_req / interval / engine_rps)
         expected_decode_replicas = math.ceil(
-            next_num_req
-            * next_osl
-            / planner.config.throughput_adjustment_interval
-            / decode_thpt_per_gpu
-            / planner.config.decode_engine_num_gpu
+            next_num_req / planner.config.throughput_adjustment_interval / engine_rps
         )
-        # Set up valid metrics
         planner.last_metrics = Metrics(
             num_req=10, isl=3000, osl=150, ttft=80.0, itl=10.0, request_duration=100.0
         )
-        # Mock workers info
-        async def mock_get_workers_info():
-            return (["prefill1"], ["decode1"])
+        async def mock_get_workers_info(*args, **kwargs):
+            return (1, 1, True)
         planner.get_workers_info = mock_get_workers_info
-        # Mock interpolation calls for correction factor calculation
-        planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
-        planner.decode_interpolator.interpolate_itl.return_value = 10.0
-        # Run the calculation
         asyncio.run(planner.make_adjustments())
-        # Check the results
         decode_component = "VllmDecodeWorker"
         calculated_decode_replicas = _replica_count(
             planner.last_target_replicas, decode_component
@@ -363,46 +335,43 @@ class TestReplicaCalculation:
         print(f"Expected decode replicas: {expected_decode_replicas}")
         print(f"Calculated decode replicas: {calculated_decode_replicas}")
-        # Allow for small differences due to min_endpoint constraints
         assert (
             max(expected_decode_replicas, planner.config.min_endpoint)
             == calculated_decode_replicas
         )
     @pytest.mark.parametrize(
-        "num_req,decode_thpt,expected_p,expected_d",
+        "num_req,decode_rps,expected_p,expected_d",
         [
-            (10, 10000, 1, 1),  # low_load_10_req_per_second
-            (500, 1000, 1, 2),  # high_load_500_req_per_second (lower decode throughput)
+            (10, 10000.0 / 150.0, 1, 1),  # low_load_10_req_per_second
+            (
+                500,
+                1000.0 / 150.0,
+                1,
+                2,
+            ),  # high_load_500_req_per_second (lower decode rps)
         ],
     )
     @pytest.mark.nightly
     @pytest.mark.gpu_2
     @pytest.mark.performance
     def test_scaling_scenario_low_to_high_load(
-        self, planner, num_req, decode_thpt, expected_p, expected_d
+        self, planner, num_req, decode_rps, expected_p, expected_d
     ):
         """Test scaling from low to high load scenarios."""
-        # Reset the planner state
-        planner.p_correction_factor = 1.0
-        planner.d_correction_factor = 1.0
-        # Mock predictor outputs for this case
         planner.num_req_predictor.predict_next.return_value = num_req
         planner.isl_predictor.predict_next.return_value = 3000
         planner.osl_predictor.predict_next.return_value = 150
-        # Mock interpolator outputs (based on H200 1P1D profiling data)
-        planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = (
-            40000  # tokens/s/gpu
+        planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
+            40000.0 / 3000.0,
+            75.0,
         )
-        planner.decode_interpolator.find_best_throughput_per_gpu.return_value = (
-            decode_thpt,
-            0.01,
-            0.5,
+        planner.itl_regression.find_best_engine_decode_rps.return_value = (
+            decode_rps,
+            9.5,
         )
-        # Set up metrics
         planner.last_metrics = Metrics(
             num_req=num_req,
             isl=3000,
@@ -412,23 +381,14 @@ class TestReplicaCalculation:
             request_duration=100.0,
         )
-        # Mock workers info
-        async def mock_get_workers_info():
-            return (["prefill1"], ["decode1"])
+        async def mock_get_workers_info(*args, **kwargs):
+            return (1, 1, True)
         planner.get_workers_info = mock_get_workers_info
-        # Mock interpolation calls for correction factor calculation
-        planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
-        planner.decode_interpolator.interpolate_itl.return_value = 10.0
-        # Reset the mock
         planner.connector.reset_mock()
-        # Run calculation
         asyncio.run(planner.make_adjustments())
-        # Verify results
         prefill_replicas = _replica_count(
             planner.last_target_replicas, "VllmPrefillWorker"
         )
@@ -449,41 +409,32 @@ class TestReplicaCalculation:
     @pytest.mark.performance
     def test_gpu_budget_constraint(self, planner):
         """Test that GPU budget constraints are properly applied."""
-        # Set a low GPU budget
         planner.config.max_gpu_budget = 3
-        # Mock predictor outputs that would normally require more GPUs
-        planner.num_req_predictor.predict_next.return_value = 50  # High load
+        planner.num_req_predictor.predict_next.return_value = 50
         planner.isl_predictor.predict_next.return_value = 3000
         planner.osl_predictor.predict_next.return_value = 150
-        # Mock interpolator outputs
-        planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = 40000
-        planner.decode_interpolator.find_best_throughput_per_gpu.return_value = (
-            10000,
-            0.01,
-            0.5,
+        planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
+            40000.0 / 3000.0,
+            75.0,
+        )
+        planner.itl_regression.find_best_engine_decode_rps.return_value = (
+            10000.0 / 150.0,
+            9.5,
         )
-        # Set up metrics
         planner.last_metrics = Metrics(
             num_req=50, isl=3000, osl=150, ttft=80.0, itl=10.0, request_duration=100.0
         )
-        # Mock workers info
-        async def mock_get_workers_info():
-            return (["prefill1"], ["decode1"])
+        async def mock_get_workers_info(*args, **kwargs):
+            return (1, 1, True)
         planner.get_workers_info = mock_get_workers_info
-        # Mock interpolation calls
-        planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
-        planner.decode_interpolator.interpolate_itl.return_value = 10.0
-        # Run calculation
         asyncio.run(planner.make_adjustments())
-        # Verify that total GPU usage doesn't exceed budget
         prefill_replicas = _replica_count(
             planner.last_target_replicas, "VllmPrefillWorker"
         )
@@ -510,38 +461,30 @@ class TestReplicaCalculation:
         """Test that minimum endpoint constraints are respected."""
         planner.config.min_endpoint = 2
-        # Mock predictor outputs that would normally require fewer workers
-        planner.num_req_predictor.predict_next.return_value = 1  # Very low load
+        planner.num_req_predictor.predict_next.return_value = 1
         planner.isl_predictor.predict_next.return_value = 100
         planner.osl_predictor.predict_next.return_value = 10
-        # Mock interpolator outputs
-        planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = 40000
-        planner.decode_interpolator.find_best_throughput_per_gpu.return_value = (
-            10000,
-            0.01,
-            0.5,
+        planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
+            40000.0 / 100.0,
+            75.0,
+        )
+        planner.itl_regression.find_best_engine_decode_rps.return_value = (
+            10000.0 / 10.0,
+            9.5,
         )
-        # Set up metrics
         planner.last_metrics = Metrics(
             num_req=1, isl=100, osl=10, ttft=80.0, itl=10.0, request_duration=100.0
         )
-        # Mock workers info
-        async def mock_get_workers_info():
-            return (["prefill1"], ["decode1"])
+        async def mock_get_workers_info(*args, **kwargs):
+            return (1, 1, True)
         planner.get_workers_info = mock_get_workers_info
-        # Mock interpolation calls
-        planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
-        planner.decode_interpolator.interpolate_itl.return_value = 10.0
-        # Run calculation
         asyncio.run(planner.make_adjustments())
-        # Verify minimum constraints are respected
         prefill_replicas = _replica_count(
             planner.last_target_replicas, "VllmPrefillWorker"
         )
@@ -557,182 +500,47 @@ class TestReplicaCalculation:
             decode_replicas >= planner.config.min_endpoint
         ), "Decode replicas below minimum"
-
-    @pytest.mark.nightly
-    @pytest.mark.gpu_2
-    @pytest.mark.performance
-    def test_prefill_correction_factor_clamping(self, planner):
-        """Test that prefill correction factor > 1 is clamped to 1."""
-        # Set a high correction factor > 1
-        planner.p_correction_factor = 2.5
-        planner.d_correction_factor = 1.0
-        # Mock predictor outputs
-        planner.num_req_predictor.predict_next.return_value = 10
-        planner.isl_predictor.predict_next.return_value = 3000
-        planner.osl_predictor.predict_next.return_value = 150
-        # Mock interpolator outputs
-        planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = 40000
-        planner.decode_interpolator.find_best_throughput_per_gpu.return_value = (
-            10000,
-            0.01,
-            0.5,
-        )
-        # Set up metrics
-        planner.last_metrics = Metrics(
-            num_req=10, isl=3000, osl=150, ttft=80.0, itl=10.0, request_duration=100.0
-        )
-        # Mock workers info
-        async def mock_get_workers_info():
-            return (["prefill1"], ["decode1"])
-        planner.get_workers_info = mock_get_workers_info
-        # Mock interpolation calls
-        planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
-        planner.decode_interpolator.interpolate_itl.return_value = 10.0
-        # Calculate expected result manually with clamping
-        # Should use min(1, 2.5) = 1
-        pred_prefill_load_per_gpu = (
-            10
-            * 3000
-            / planner.config.throughput_adjustment_interval
-            * min(1, 2.5)  # Should be * 1
-        )
-        expected_prefill_replicas = math.ceil(
-            pred_prefill_load_per_gpu / 40000 / planner.config.prefill_engine_num_gpu
-        )
-        # Run calculation
-        asyncio.run(planner.make_adjustments())
-        # Verify that correction factor was effectively clamped
-        prefill_replicas = _replica_count(
-            planner.last_target_replicas, "VllmPrefillWorker"
-        )
-        print(
-            f"Correction factor clamping test: Expected={expected_prefill_replicas}, Got={prefill_replicas}"
-        )
-        assert prefill_replicas == max(
-            expected_prefill_replicas, planner.config.min_endpoint
-        ), "Prefill correction factor should be clamped to 1"
-
-    @pytest.mark.nightly
-    @pytest.mark.gpu_2
-    @pytest.mark.performance
-    def test_decode_correction_factor_zero_handling(self, planner):
-        """Test handling of d_correction_factor <= 0."""
-        # Test both 0 and negative values
-        for correction_factor in [0.0, -1.0]:
-            planner.p_correction_factor = 1.0
-            planner.d_correction_factor = correction_factor
-            # Mock predictor outputs
-            planner.num_req_predictor.predict_next.return_value = 10
-            planner.isl_predictor.predict_next.return_value = 3000
-            planner.osl_predictor.predict_next.return_value = 150
-            # Mock interpolator outputs
-            planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = 40000
-            planner.decode_interpolator.find_best_throughput_per_gpu.return_value = (
-                10000,
-                0.01,
-                0.5,
-            )
-            # Set up metrics
-            planner.last_metrics = Metrics(
-                num_req=10,
-                isl=3000,
-                osl=150,
-                ttft=80.0,
-                itl=10.0,
-                request_duration=100.0,
-            )
-            # Mock workers info
-            async def mock_get_workers_info():
-                return (["prefill1"], ["decode1"])
-            planner.get_workers_info = mock_get_workers_info
-            # Mock interpolation calls
-            planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
-            planner.decode_interpolator.interpolate_itl.return_value = 10.0
-            # Run calculation
-            asyncio.run(planner.make_adjustments())
-            # Should handle gracefully without crashing
-            # The code should use args.itl directly instead of dividing by 0
-            decode_replicas = _replica_count(
-                planner.last_target_replicas, "VllmDecodeWorker"
-            )
-            print(
-                f"Correction factor {correction_factor} test: Decode replicas={decode_replicas}"
-            )
-            # Should get a valid result (not crash)
-            assert (
-                decode_replicas >= 1
-            ), f"Should handle correction factor {correction_factor} gracefully"
     @pytest.mark.nightly
     @pytest.mark.gpu_2
     @pytest.mark.performance
     def test_multi_gpu_engines(self, planner):
         """Test replica calculation with multi-GPU engines."""
-        # Set multi-GPU configuration
         planner.config.prefill_engine_num_gpu = 2
         planner.config.decode_engine_num_gpu = 4
-        # Mock predictor outputs
         planner.num_req_predictor.predict_next.return_value = 20
         planner.isl_predictor.predict_next.return_value = 3000
         planner.osl_predictor.predict_next.return_value = 150
-        # Mock interpolator outputs
-        planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = 40000
-        planner.decode_interpolator.find_best_throughput_per_gpu.return_value = (
-            5000,
-            0.01,
-            0.5,
-        )  # Lower for scaling
+        # Engine-level request rate (already accounts for multi-GPU)
+        prefill_engine_rps = 40000.0 / 3000.0
+        decode_engine_rps = 5000.0 / 150.0
+        planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
+            prefill_engine_rps,
+            75.0,
+        )
+        planner.itl_regression.find_best_engine_decode_rps.return_value = (
+            decode_engine_rps,
+            9.5,
+        )
-        # Set up metrics
         planner.last_metrics = Metrics(
             num_req=20, isl=3000, osl=150, ttft=80.0, itl=10.0, request_duration=100.0
         )
-        # Mock workers info
-        async def mock_get_workers_info():
-            return (["prefill1"], ["decode1"])
+        async def mock_get_workers_info(*args, **kwargs):
+            return (1, 1, True)
         planner.get_workers_info = mock_get_workers_info
-        # Mock interpolation calls
-        planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
-        planner.decode_interpolator.interpolate_itl.return_value = 10.0
-        # Calculate expected results manually
-        pred_prefill_load_per_gpu = (
-            20 * 3000 / planner.config.throughput_adjustment_interval * 1.0
-        )
+        # No engine_num_gpu division — regression returns engine-level rps
         expected_prefill_replicas = math.ceil(
-            pred_prefill_load_per_gpu / 40000 / 2
-        )  # 2 GPUs per engine
+            20 / planner.config.throughput_adjustment_interval / prefill_engine_rps
+        )
         expected_decode_replicas = math.ceil(
-            20 * 150 / planner.config.throughput_adjustment_interval / 5000 / 4
-        )  # 4 GPUs per engine
+            20 / planner.config.throughput_adjustment_interval / decode_engine_rps
+        )
-        # Run calculation
         asyncio.run(planner.make_adjustments())
         prefill_replicas = _replica_count(
@@ -742,10 +550,10 @@ class TestReplicaCalculation:
             planner.last_target_replicas, "VllmDecodeWorker"
         )
         print(
-            f"Multi-GPU test: P={prefill_replicas} (expected ~{expected_prefill_replicas}), D={decode_replicas} (expected ~{expected_decode_replicas})"
+            f"Multi-GPU test: P={prefill_replicas} (expected ~{expected_prefill_replicas}), "
+            f"D={decode_replicas} (expected ~{expected_decode_replicas})"
         )
-        # Verify calculations account for multiple GPUs per engine
         assert prefill_replicas == max(
             expected_prefill_replicas, planner.config.min_endpoint
         )
@@ -757,42 +565,39 @@ class TestReplicaCalculation:
@pytest.mark.gpu_2 @pytest.mark.gpu_2
@pytest.mark.performance @pytest.mark.performance
def test_complex_gpu_budget_scaling(self, planner): def test_complex_gpu_budget_scaling(self, planner):
"""Test complex GPU budget scaling with proportional reduction and decode adjustment.""" """Test complex GPU budget scaling with proportional reduction."""
# Set tight GPU budget that will trigger complex scaling
planner.config.max_gpu_budget = 5 planner.config.max_gpu_budget = 5
planner.config.prefill_engine_num_gpu = 2 planner.config.prefill_engine_num_gpu = 2
planner.config.decode_engine_num_gpu = 2 planner.config.decode_engine_num_gpu = 2
planner.config.min_endpoint = 1 planner.config.min_endpoint = 1
# High load that would normally require more GPUs
planner.num_req_predictor.predict_next.return_value = 100 planner.num_req_predictor.predict_next.return_value = 100
planner.isl_predictor.predict_next.return_value = 3000 planner.isl_predictor.predict_next.return_value = 3000
planner.osl_predictor.predict_next.return_value = 150 planner.osl_predictor.predict_next.return_value = 150
# Lower throughput to trigger higher replica needs planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = 10000 10000.0 / 3000.0,
planner.decode_interpolator.find_best_throughput_per_gpu.return_value = ( 300.0,
1000, )
0.01, planner.itl_regression.find_best_engine_decode_rps.return_value = (
0.5, 1000.0 / 150.0,
9.5,
) )
# Set up metrics
planner.last_metrics = Metrics( planner.last_metrics = Metrics(
num_req=100, isl=3000, osl=150, ttft=80.0, itl=10.0, request_duration=100.0 num_req=100,
isl=3000,
osl=150,
ttft=80.0,
itl=10.0,
request_duration=100.0,
) )
# Mock workers info async def mock_get_workers_info(*args, **kwargs):
async def mock_get_workers_info(): return (1, 1, True)
return (["prefill1"], ["decode1"])
planner.get_workers_info = mock_get_workers_info planner.get_workers_info = mock_get_workers_info
# Mock interpolation calls
planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
planner.decode_interpolator.interpolate_itl.return_value = 10.0
# Run calculation
asyncio.run(planner.make_adjustments()) asyncio.run(planner.make_adjustments())
prefill_replicas = _replica_count( prefill_replicas = _replica_count(
@@ -801,14 +606,14 @@ class TestReplicaCalculation:
decode_replicas = _replica_count( decode_replicas = _replica_count(
planner.last_target_replicas, "VllmDecodeWorker" planner.last_target_replicas, "VllmDecodeWorker"
) )
# Verify total GPU usage doesn't exceed budget
total_gpus = ( total_gpus = (
prefill_replicas * planner.config.prefill_engine_num_gpu prefill_replicas * planner.config.prefill_engine_num_gpu
+ decode_replicas * planner.config.decode_engine_num_gpu + decode_replicas * planner.config.decode_engine_num_gpu
) )
print( print(
f"Complex GPU budget test: P={prefill_replicas}, D={decode_replicas}, Total GPUs={total_gpus}" f"Complex GPU budget test: P={prefill_replicas}, D={decode_replicas}, "
f"Total GPUs={total_gpus}"
) )
assert ( assert (
@@ -820,6 +625,3 @@ class TestReplicaCalculation:
assert ( assert (
decode_replicas >= planner.config.min_endpoint decode_replicas >= planner.config.min_endpoint
), "Should respect min_endpoint for decode" ), "Should respect min_endpoint for decode"
# No need for unittest.main() with pytest!
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-import argparse
 import asyncio
 import math
 import os
-from unittest.mock import Mock, patch
+from unittest.mock import MagicMock, Mock, patch
 import pytest
@@ -15,7 +14,6 @@ from dynamo.planner.core.decode import DecodePlanner
from dynamo.planner.core.prefill import PrefillPlanner from dynamo.planner.core.prefill import PrefillPlanner
from dynamo.planner.core.state import PlannerSharedState from dynamo.planner.core.state import PlannerSharedState
from dynamo.planner.errors import DeploymentValidationError from dynamo.planner.errors import DeploymentValidationError
from dynamo.planner.offline.dryrun import run_sla_planner_dryrun
pytestmark = [ pytestmark = [
pytest.mark.gpu_0, pytest.mark.gpu_0,
@@ -24,6 +22,10 @@ pytestmark = [
pytest.mark.planner, pytest.mark.planner,
] ]
PREFILL_ENGINE_RPS = 10.0
DECODE_ENGINE_RPS = 5.0
DECODE_ACTUAL_ITL_MS = 40.0
@pytest.fixture(autouse=True) @pytest.fixture(autouse=True)
def mock_prometheus_metrics(): def mock_prometheus_metrics():
@@ -43,7 +45,6 @@ def _build_config():
itl=50.0, itl=50.0,
backend="vllm", backend="vllm",
no_operation=True, no_operation=True,
no_correction=True,
metric_pulling_prometheus_endpoint="http://localhost:9090", metric_pulling_prometheus_endpoint="http://localhost:9090",
metric_reporting_prometheus_port=0, metric_reporting_prometheus_port=0,
load_predictor="constant", load_predictor="constant",
@@ -90,6 +91,20 @@ def _build_planners(config, prometheus_client):
prefill_planner.model_name = "test-model" prefill_planner.model_name = "test-model"
decode_planner.model_name = "test-model" decode_planner.model_name = "test-model"
prefill_planner.ttft_regression = MagicMock()
prefill_planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
PREFILL_ENGINE_RPS,
75.0,
)
prefill_planner.ttft_regression.has_sufficient_data.return_value = True
decode_planner.itl_regression = MagicMock()
decode_planner.itl_regression.find_best_engine_decode_rps.return_value = (
DECODE_ENGINE_RPS,
DECODE_ACTUAL_ITL_MS,
)
decode_planner.itl_regression.has_sufficient_data.return_value = True
async def mock_get_workers_info(require_prefill=True, require_decode=True): async def mock_get_workers_info(require_prefill=True, require_decode=True):
return ( return (
1 if require_prefill else 0, 1 if require_prefill else 0,
@@ -103,32 +118,20 @@ def _build_planners(config, prometheus_client):
def _expected_prefill(config, prefill_planner, sample): def _expected_prefill(config, prefill_planner, sample):
pred_prefill_throughput = ( demand_rps = sample["num_req"] / config.throughput_adjustment_interval
sample["num_req"] * sample["isl"] / config.throughput_adjustment_interval engine_rps, _ = prefill_planner.ttft_regression.find_best_engine_prefill_rps(
) ttft_sla=config.ttft, isl=sample["isl"]
thpt_per_gpu = prefill_planner.prefill_interpolator.interpolate_thpt_per_gpu(
sample["isl"]
)
expected = math.ceil(
pred_prefill_throughput / thpt_per_gpu / config.prefill_engine_num_gpu
) )
expected = math.ceil(demand_rps / engine_rps)
return max(expected, config.min_endpoint) return max(expected, config.min_endpoint)
def _expected_decode(config, decode_planner, sample): def _expected_decode(config, decode_planner, sample):
( demand_rps = sample["num_req"] / config.throughput_adjustment_interval
pred_decode_thpt_per_gpu, engine_rps, _ = decode_planner.itl_regression.find_best_engine_decode_rps(
_,
_,
) = decode_planner.decode_interpolator.find_best_throughput_per_gpu(
itl=config.itl, context_length=sample["isl"] + sample["osl"] / 2 itl=config.itl, context_length=sample["isl"] + sample["osl"] / 2
) )
pred_decode_throughput = ( expected = math.ceil(demand_rps / engine_rps)
sample["num_req"] * sample["osl"] / config.throughput_adjustment_interval
)
expected = math.ceil(
pred_decode_throughput / pred_decode_thpt_per_gpu / config.decode_engine_num_gpu
)
return max(expected, config.min_endpoint) return max(expected, config.min_endpoint)
@@ -210,128 +213,114 @@ def test_disagg_scale_down():
assert low_d < high_d assert low_d < high_d
# Tests for _initialize_gpu_counts
class TestInitializeGpuCounts: class TestInitializeGpuCounts:
@staticmethod
def _make_config(**overrides):
defaults = dict(prefill_engine_num_gpu=None, decode_engine_num_gpu=None)
defaults.update(overrides)
return PlannerConfig.model_construct(**defaults)
def test_kubernetes_mode_reads_from_dgd(self): def test_kubernetes_mode_reads_from_dgd(self):
"""Test that GPU counts are read from DGD in Kubernetes mode""" """Test that GPU counts are read from DGD in Kubernetes mode"""
args = argparse.Namespace() config = self._make_config()
args.prefill_engine_num_gpu = None
args.decode_engine_num_gpu = None
connector = Mock() connector = Mock()
connector.get_gpu_counts = Mock(return_value=(2, 4)) connector.get_gpu_counts = Mock(return_value=(2, 4))
_initialize_gpu_counts( _initialize_gpu_counts(
args, connector, require_prefill=True, require_decode=True config, connector, require_prefill=True, require_decode=True
) )
assert args.prefill_engine_num_gpu == 2 assert config.prefill_engine_num_gpu == 2
assert args.decode_engine_num_gpu == 4 assert config.decode_engine_num_gpu == 4
connector.get_gpu_counts.assert_called_once_with( connector.get_gpu_counts.assert_called_once_with(
require_prefill=True, require_decode=True require_prefill=True, require_decode=True
) )
def test_kubernetes_mode_prefill_only(self): def test_kubernetes_mode_prefill_only(self):
"""Test GPU count initialization for prefill-only mode""" """Test GPU count initialization for prefill-only mode"""
args = argparse.Namespace() config = self._make_config()
args.prefill_engine_num_gpu = None
args.decode_engine_num_gpu = None
connector = Mock() connector = Mock()
connector.get_gpu_counts = Mock(return_value=(2, 0)) connector.get_gpu_counts = Mock(return_value=(2, 0))
_initialize_gpu_counts( _initialize_gpu_counts(
args, connector, require_prefill=True, require_decode=False config, connector, require_prefill=True, require_decode=False
) )
assert args.prefill_engine_num_gpu == 2 assert config.prefill_engine_num_gpu == 2
assert args.decode_engine_num_gpu == 0 assert config.decode_engine_num_gpu == 0
connector.get_gpu_counts.assert_called_once_with( connector.get_gpu_counts.assert_called_once_with(
require_prefill=True, require_decode=False require_prefill=True, require_decode=False
) )
def test_virtual_mode_uses_cli_args(self): def test_virtual_mode_uses_cli_args(self):
"""Test that GPU counts come from CLI args in virtual mode""" """Test that GPU counts come from config in virtual mode"""
args = argparse.Namespace() config = self._make_config(prefill_engine_num_gpu=2, decode_engine_num_gpu=4)
args.prefill_engine_num_gpu = 2
args.decode_engine_num_gpu = 4
# Virtual connector doesn't have get_gpu_counts method
connector = Mock(spec=[]) connector = Mock(spec=[])
_initialize_gpu_counts( _initialize_gpu_counts(
args, connector, require_prefill=True, require_decode=True config, connector, require_prefill=True, require_decode=True
) )
# Values should remain unchanged assert config.prefill_engine_num_gpu == 2
assert args.prefill_engine_num_gpu == 2 assert config.decode_engine_num_gpu == 4
assert args.decode_engine_num_gpu == 4
def test_virtual_mode_missing_prefill_raises_error(self): def test_virtual_mode_missing_prefill_raises_error(self):
"""Test that missing prefill GPU flag raises error in virtual mode""" """Test that missing prefill GPU config raises error in virtual mode"""
args = argparse.Namespace() config = self._make_config(decode_engine_num_gpu=4)
args.prefill_engine_num_gpu = None
args.decode_engine_num_gpu = 4
connector = Mock(spec=[]) connector = Mock(spec=[])
with pytest.raises(DeploymentValidationError) as exc_info: with pytest.raises(DeploymentValidationError) as exc_info:
_initialize_gpu_counts( _initialize_gpu_counts(
args, connector, require_prefill=True, require_decode=True config, connector, require_prefill=True, require_decode=True
) )
assert "prefill_engine_num_gpu" in str(exc_info.value) assert "prefill_engine_num_gpu" in str(exc_info.value)
def test_virtual_mode_missing_decode_raises_error(self): def test_virtual_mode_missing_decode_raises_error(self):
"""Test that missing decode GPU flag raises error in virtual mode""" """Test that missing decode GPU config raises error in virtual mode"""
args = argparse.Namespace() config = self._make_config(prefill_engine_num_gpu=2)
args.prefill_engine_num_gpu = 2
args.decode_engine_num_gpu = None
connector = Mock(spec=[]) connector = Mock(spec=[])
with pytest.raises(DeploymentValidationError) as exc_info: with pytest.raises(DeploymentValidationError) as exc_info:
_initialize_gpu_counts( _initialize_gpu_counts(
args, connector, require_prefill=True, require_decode=True config, connector, require_prefill=True, require_decode=True
) )
assert "decode_engine_num_gpu" in str(exc_info.value) assert "decode_engine_num_gpu" in str(exc_info.value)
def test_virtual_mode_missing_both_raises_error_with_both_messages(self): def test_virtual_mode_missing_both_raises_error_with_both_messages(self):
"""Test that missing both GPU flags shows both error messages""" """Test that missing both GPU configs shows both error messages"""
args = argparse.Namespace() config = self._make_config()
args.prefill_engine_num_gpu = None
args.decode_engine_num_gpu = None
connector = Mock(spec=[]) connector = Mock(spec=[])
with pytest.raises(DeploymentValidationError) as exc_info: with pytest.raises(DeploymentValidationError) as exc_info:
_initialize_gpu_counts( _initialize_gpu_counts(
args, connector, require_prefill=True, require_decode=True config, connector, require_prefill=True, require_decode=True
) )
assert len(exc_info.value.errors) == 2 assert len(exc_info.value.errors) == 2
def test_virtual_mode_decode_only_no_prefill_error(self): def test_virtual_mode_decode_only_no_prefill_error(self):
"""Test decode-only mode doesn't require prefill GPU flag""" """Test decode-only mode doesn't require prefill GPU config"""
args = argparse.Namespace() config = self._make_config(decode_engine_num_gpu=4)
args.prefill_engine_num_gpu = None
args.decode_engine_num_gpu = 4
connector = Mock(spec=[]) connector = Mock(spec=[])
# Should not raise - prefill not required
_initialize_gpu_counts( _initialize_gpu_counts(
args, connector, require_prefill=False, require_decode=True config, connector, require_prefill=False, require_decode=True
) )
assert args.decode_engine_num_gpu == 4 assert config.decode_engine_num_gpu == 4
def test_kubernetes_mode_fallback_to_cli_on_dgd_error(self): def test_kubernetes_mode_fallback_to_cli_on_dgd_error(self):
"""Test that K8s mode falls back to CLI flags when DGD parsing fails""" """Test that K8s mode falls back to config when DGD parsing fails"""
args = argparse.Namespace() config = self._make_config(prefill_engine_num_gpu=2, decode_engine_num_gpu=4)
args.prefill_engine_num_gpu = 2
args.decode_engine_num_gpu = 4
connector = Mock() connector = Mock()
connector.get_gpu_counts = Mock( connector.get_gpu_counts = Mock(
@@ -339,18 +328,15 @@ class TestInitializeGpuCounts:
) )
_initialize_gpu_counts( _initialize_gpu_counts(
args, connector, require_prefill=True, require_decode=True config, connector, require_prefill=True, require_decode=True
) )
# Should use CLI flag values after fallback assert config.prefill_engine_num_gpu == 2
assert args.prefill_engine_num_gpu == 2 assert config.decode_engine_num_gpu == 4
assert args.decode_engine_num_gpu == 4
def test_kubernetes_mode_fallback_missing_cli_flags_raises_error(self): def test_kubernetes_mode_fallback_missing_cli_flags_raises_error(self):
"""Test that K8s fallback raises error when CLI flags are also missing""" """Test that K8s fallback raises error when config also missing"""
args = argparse.Namespace() config = self._make_config()
args.prefill_engine_num_gpu = None
args.decode_engine_num_gpu = None
connector = Mock() connector = Mock()
connector.get_gpu_counts = Mock( connector.get_gpu_counts = Mock(
@@ -359,16 +345,14 @@ class TestInitializeGpuCounts:
with pytest.raises(DeploymentValidationError) as exc_info: with pytest.raises(DeploymentValidationError) as exc_info:
_initialize_gpu_counts( _initialize_gpu_counts(
args, connector, require_prefill=True, require_decode=True config, connector, require_prefill=True, require_decode=True
) )
assert len(exc_info.value.errors) == 2 assert len(exc_info.value.errors) == 2
def test_kubernetes_mode_fallback_partial_cli_flags(self): def test_kubernetes_mode_fallback_partial_cli_flags(self):
"""Test K8s fallback with only one CLI flag provided""" """Test K8s fallback with only one config value provided"""
args = argparse.Namespace() config = self._make_config(prefill_engine_num_gpu=2)
args.prefill_engine_num_gpu = 2
args.decode_engine_num_gpu = None
connector = Mock() connector = Mock()
connector.get_gpu_counts = Mock( connector.get_gpu_counts = Mock(
@@ -377,73 +361,7 @@ class TestInitializeGpuCounts:
with pytest.raises(DeploymentValidationError) as exc_info: with pytest.raises(DeploymentValidationError) as exc_info:
_initialize_gpu_counts( _initialize_gpu_counts(
args, connector, require_prefill=True, require_decode=True config, connector, require_prefill=True, require_decode=True
) )
assert "decode_engine_num_gpu" in str(exc_info.value) assert "decode_engine_num_gpu" in str(exc_info.value)
# Tests for dryrun GPU defaults
class TestDryrunGpuDefaults:
@staticmethod
def _build_dryrun_config(**overrides) -> PlannerConfig:
defaults = dict(
throughput_adjustment_interval=60,
prefill_engine_num_gpu=1,
decode_engine_num_gpu=1,
min_endpoint=1,
max_gpu_budget=-1,
ttft=500.0,
itl=50.0,
backend="vllm",
no_operation=True,
no_correction=True,
metric_pulling_prometheus_endpoint="http://localhost:9090",
metric_reporting_prometheus_port=0,
load_predictor="constant",
load_predictor_warmup_trace=None,
load_predictor_log1p=False,
profile_results_dir=os.path.join(
os.path.dirname(__file__),
"..",
"data",
"profiling_results",
"H200_TP1P_TP1D",
),
environment="kubernetes",
namespace="test-namespace",
mode="disagg",
enable_throughput_scaling=True,
enable_load_scaling=False,
)
defaults.update(overrides)
return PlannerConfig.model_construct(**defaults)
def test_dryrun_defaults_gpu_counts_when_none(self):
"""Test that dryrun sets default GPU counts of 1 when None"""
config = self._build_dryrun_config(
prefill_engine_num_gpu=None, decode_engine_num_gpu=None
)
try:
run_sla_planner_dryrun(config, dataset="nonexistent.jsonl")
except (FileNotFoundError, ValueError):
pass
assert config.prefill_engine_num_gpu == 1
assert config.decode_engine_num_gpu == 1
def test_dryrun_preserves_cli_gpu_counts(self):
"""Test that dryrun preserves GPU counts provided via config"""
config = self._build_dryrun_config(
prefill_engine_num_gpu=2, decode_engine_num_gpu=4
)
try:
run_sla_planner_dryrun(config, dataset="nonexistent.jsonl")
except (FileNotFoundError, ValueError):
pass
assert config.prefill_engine_num_gpu == 2
assert config.decode_engine_num_gpu == 4
@@ -12,4 +12,4 @@ pmdarima==2.1.1
 prometheus-api-client==0.6.0
 prophet==1.2.1
 scikit-learn==1.7.2
-scipy<1.14.0 # Upper bound for pmdarima compatibility
+scipy>=1.14.0,<2.0
@@ -25,8 +25,8 @@ The Dynamo **Planner** is an autoscaler purpose-built for these constraints. It
 The Planner supports two scaling modes that can run independently or together:
-- **Throughput-based scaling**: Uses pre-deployment profiling data and traffic prediction to compute the replica count needed to meet TTFT and ITL targets. Adjusts on a longer interval (default 180s). This is the primary mode for production deployments.
+- **Throughput-based scaling**: Uses pre-deployment engine performance data (from self-benchmark or profiler) and traffic prediction to compute the replica count needed to meet TTFT and ITL targets. Adjusts on a longer interval (default 180s). This is the primary mode for production deployments.
-- **Load-based scaling**: Uses ForwardPassMetrics (FPM) from the Dynamo event plane and fits an online linear regression to make scaling decisions. No profiling data or KV Router required. Adjusts on a short interval (default 5s) to respond quickly to bursts.
+- **Load-based scaling**: Uses ForwardPassMetrics (FPM) from the Dynamo event plane and fits an online linear regression to make scaling decisions. No pre-deployment data or KV Router required. Adjusts on a short interval (default 5s) to respond quickly to bursts.
 When both modes are enabled, throughput-based scaling provides a capacity floor (long-term planning) while load-based scaling handles real-time adjustments above that floor.
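As a rough illustration of the floor behaviour described above, the two recommendations can be combined by taking the larger one. The function and parameter names below are hypothetical and are not taken from the planner source; this is a sketch of the idea, not the planner's implementation.

```python
def combine_replica_targets(throughput_floor: int, load_target: int) -> int:
    """Return the replica count to apply when both scaling modes are enabled.

    Assumption: the throughput-based result acts as a lower bound, and the
    load-based result may only raise the count above that bound.
    """
    return max(throughput_floor, load_target)


# Load-based scaling asks for fewer replicas than the floor: the floor wins.
print(combine_replica_targets(throughput_floor=3, load_target=2))

# A burst pushes the load-based target above the floor: the burst wins.
print(combine_replica_targets(throughput_floor=3, load_target=5))
```

Under this assumption, load-based scale-down can never take the deployment below the capacity that the longer-interval throughput plan considers necessary.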
@@ -36,12 +36,12 @@ When both modes are enabled, throughput-based scaling provides a capacity floor
 |---------|:----------------:|:-------------------------:|
 | **Deployment** | | |
 | Disaggregated | Supported | Supported |
-| Aggregated | Unsupported | Supported |
+| Aggregated | Supported | Supported |
 | **LLM Framework** | | |
 | SGLang | Supported | Supported |
 | TensorRT-LLM | Supported | Supported |
 | vLLM | Supported | Supported |
-| **Requires Profiling Data** | Yes | No |
+| **Requires Pre-deployment Data** | Yes (self-benchmark or profiler) | No |
 | **Load Predictors** | ARIMA, Prophet, Kalman, Constant | N/A |
 | **Router** | | |
 | Any (round-robin, random, etc.) | Supported | Not supported |
@@ -52,8 +52,8 @@ When both modes are enabled, throughput-based scaling provides a capacity floor
 ## When to Use Which Mode
-- **Throughput-based scaling** should be enabled whenever engine profiling data is available (through pre-deployment profiling). It provides stable, prediction-based capacity planning.
+- **Throughput-based scaling** should be enabled whenever engine performance data is available (through self-benchmark or pre-deployment profiling). It provides stable, prediction-based capacity planning.
-- **Load-based scaling** should be enabled when traffic is bursty or hard to predict. It reacts quickly to real-time load changes without requiring profiling data.
+- **Load-based scaling** should be enabled when traffic is bursty or hard to predict. It reacts quickly to real-time load changes without requiring pre-deployment data.
 - **Both modes together**: For the best of both worlds, enable both. Throughput-based scaling provides a lower bound (long-term capacity), while load-based scaling handles bursts above that floor. When both are enabled, use a longer `--adjustment-interval` for throughput-based scaling.
 ## Quick Start
@@ -63,7 +63,7 @@ When both modes are enabled, throughput-based scaling provides a capacity floor
 - Dynamo platform installed on Kubernetes ([Installation Guide](../../kubernetes/installation-guide.md))
 - kube-prometheus-stack installed ([Metrics Setup](../../kubernetes/observability/metrics.md))
-For throughput-based scaling, pre-deployment profiling is also required ([Profiling Guide](../profiler/profiler-guide.md)).
+For throughput-based scaling, pre-deployment engine performance data is also required (via self-benchmark mode or [Profiling Guide](../profiler/profiler-guide.md)).
 ### Throughput-Based Scaling (with DGDR)
@@ -141,13 +141,11 @@ Load-based scaling has the following known limitations. Throughput-based scaling
 | `--adjustment-interval` | `180` | Seconds between throughput-based scaling decisions |
 | `--profile-results-dir` | `profiling_results` | Path to profiling data (NPZ/JSON) |
 | `--load-predictor` | `arima` | Prediction model (`arima`, `prophet`, `kalman`, `constant`) |
-| `--no-correction` | `true` | Disable correction factors (auto-disabled when load-based scaling is on) |
 | **Load-based scaling** | | |
 | `--enable-loadbased-scaling` | `false` | Enable load-based scaling |
-| `--disable-throughput-scaling` | `false` | Disable throughput-based scaling (required for `agg` mode) |
-| `--loadbased-router-metrics-url` | auto-discovered | URL to router's `/metrics` endpoint |
-| `--loadbased-adjustment-interval` | `5` | Seconds between load-based scaling decisions |
-| `--loadbased-learning-window` | `50` | Sliding window size for regression model |
+| `--loadbased-adjustment-interval` | `5` | Seconds between FPM regression updates and load-based scaling decisions |
+| `--max-num-fpm-samples` | `64` | Maximum retained FPM observations for regression |
+| `--fpm-sample-bucket-size` | `16` | Number of buckets for observation retirement (must be perfect square) |
 | `--loadbased-scaling-down-sensitivity` | `80` | Scale-down sensitivity 0-100 (0=never, 100=aggressive) |
 | `--loadbased-metric-samples` | `10` | Number of metric samples per adjustment interval |
 | `--loadbased-min-observations` | `5` | Minimum observations before regression activates |
@@ -175,7 +173,7 @@ The dashboard shows:
 - Worker counts and GPU usage over time
 - Observed TTFT, ITL, request rate, sequence lengths
 - Predicted load and recommended replica counts
-- Correction factors (actual vs. expected performance)
+- FPM regression model status
 ### Prometheus Metrics
@@ -12,12 +12,12 @@ For a quick overview, see the [Planner overview](README.md). For architecture in
 The planner supports two scaling modes that can be used independently or together:
-- **Throughput-based scaling** (`enable_throughput_scaling: true`): Uses pre-deployment engine interpolation data and traffic prediction to plan capacity. Best for stable, predictable workloads. Requires profiling data generated by the [Profiler](../profiler/profiler-guide.md).
+- **Throughput-based scaling** (`enable_throughput_scaling: true`): Uses pre-deployment engine performance data (from self-benchmark or profiler) and traffic prediction to plan capacity. Best for stable, predictable workloads.
-- **Load-based scaling** (`enable_load_scaling: true`): Uses real-time per-worker engine metrics and online regression. Best for bursty or unpredictable traffic. Does not require profiling data. Requires the [KV Router](../router/README.md) — see [Current Limitations](README.md#current-limitations).
+- **Load-based scaling** (`enable_load_scaling: true`): Uses real-time ForwardPassMetrics (FPM) from the Dynamo event plane and online regression to make scaling decisions. Best for bursty or unpredictable traffic. Does not require pre-deployment data.
 **When to use which:**
-- Enable **throughput-based scaling** whenever profiling data is available. It provides stable, prediction-based capacity planning.
+- Enable **throughput-based scaling** whenever pre-deployment performance data is available (via self-benchmark or profiler). It provides stable, prediction-based capacity planning.
 - Enable **load-based scaling** when traffic is bursty. It reacts quickly to real-time load changes.
 - Enable **both** for the best of both worlds: throughput-based provides a capacity floor, load-based handles bursts above it. When both are enabled, use a longer `throughput_adjustment_interval`.
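The updated test helpers in this change (`_expected_prefill` and `_expected_decode`) compute the expected throughput-based replica count as the predicted request rate divided by the engine-level requests per second returned by the regression, rounded up and clamped to `min_endpoint`. A minimal standalone sketch of that arithmetic follows; `replicas_for_demand` is a hypothetical name and the numbers are illustrative only.

```python
import math


def replicas_for_demand(
    num_req: float,
    interval_s: float,
    engine_rps: float,
    min_endpoint: int = 1,
) -> int:
    """Replicas needed to serve the predicted demand within the SLA.

    num_req      -- predicted number of requests in the next interval
    interval_s   -- length of the adjustment interval in seconds
    engine_rps   -- requests per second one engine sustains within the SLA
                    (assumed here to come from the regression model)
    min_endpoint -- lower bound on the replica count
    """
    if interval_s <= 0 or engine_rps <= 0:
        raise ValueError("interval_s and engine_rps must be positive")
    demand_rps = num_req / interval_s
    # engine_rps is already an engine-level figure, so there is no further
    # division by the number of GPUs per engine.
    return max(math.ceil(demand_rps / engine_rps), min_endpoint)


# 3600 requests over 180 s is 20 req/s; at 10 req/s per engine, 2 engines.
print(replicas_for_demand(num_req=3600, interval_s=180, engine_rps=10.0))
```

Because the regression returns an engine-level rate, the same expression applies to prefill and decode; only the source of `engine_rps` differs.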
@@ -39,8 +39,8 @@ features:
 | Field | Type | Default | Description |
 |-------|------|---------|-------------|
-| `enable_throughput_scaling` | bool | `true` | Enable throughput-based scaling (requires pre-deployment profiling data). |
+| `enable_throughput_scaling` | bool | `true` | Enable throughput-based scaling (requires pre-deployment performance data). |
-| `enable_load_scaling` | bool | `false` | Enable load-based scaling (no pre-deployment profiling data required). |
+| `enable_load_scaling` | bool | `false` | Enable load-based scaling. |
 At least one scaling mode must be enabled.
@@ -48,9 +48,9 @@ At least one scaling mode must be enabled.
 | Field | Type | Default | Description |
 |-------|------|---------|-------------|
-| `pre_deployment_sweeping_mode` | string | `rapid` | How to generate engine interpolation data: `rapid` (AIC simulation, ~30s), `thorough` (real GPUs, 2-4h), or `none` (skip). |
+| `pre_deployment_sweeping_mode` | string | `rapid` | How to generate engine performance data: `rapid` (AIC simulation, ~30s), `thorough` (real GPUs, 2-4h), or `none` (skip). |
-When throughput-based scaling is enabled, the planner needs interpolation curves that map ISL to TTFT (prefill) and KV-cache utilization to ITL (decode). The profiler generates this data based on the `pre_deployment_sweeping_mode` setting. See the [Profiler Guide](../profiler/profiler-guide.md) for details on how this data is produced.
+When throughput-based scaling is enabled, the planner needs engine performance data. At startup, it first tries to fetch self-benchmark results from the `get_perf_metrics` Dynamo endpoint (see PR #7779). If unavailable, it falls back to profiler-generated data (npz or JSON) at `profile_results_dir`. Both sources are converted to ForwardPassMetrics and fed into the FPM regression model.
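The startup order described here (self-benchmark first, profiler data as the fallback) can be sketched as follows. Every name in this block is hypothetical: the two loaders are passed in as callables because the planner's real loading functions are not shown in this change, and raising an error when neither source is available is an assumption.

```python
from typing import Callable, Optional, Tuple


def load_engine_perf_data(
    fetch_self_benchmark: Callable[[], Optional[dict]],
    load_profiler_results: Callable[[], Optional[dict]],
) -> Tuple[str, dict]:
    """Pick the first available source of engine performance data.

    Both callables are stand-ins: the first for a query against the
    self-benchmark endpoint, the second for reading profiler output from
    the configured results directory. Each returns None when it has no data.
    """
    data = fetch_self_benchmark()
    if data:
        return "self-benchmark", data

    data = load_profiler_results()
    if data:
        return "profiler", data

    # Assumption: with neither source present, throughput-based scaling
    # cannot be seeded, so the sketch fails loudly.
    raise RuntimeError("no engine performance data available")


# The self-benchmark source is unavailable, so the profiler data is used.
source, data = load_engine_perf_data(
    fetch_self_benchmark=lambda: None,
    load_profiler_results=lambda: {"isl": [1000, 3000], "ttft_ms": [30.0, 80.0]},
)
print(source)
```

Passing the loaders as callables keeps the ordering logic separate from I/O, which also makes the fallback path easy to exercise in a unit test.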
### Throughput-Based Scaling Settings
@@ -61,14 +61,14 @@ When throughput-based scaling is enabled, the planner needs interpolation curves
| `max_gpu_budget` | int | `8` | Maximum total GPUs the planner may allocate. |
| `ttft` | float | `500.0` | TTFT SLA target (ms) for scaling decisions. |
| `itl` | float | `50.0` | ITL SLA target (ms) for scaling decisions. |
-| `no_correction` | bool | `true` | Disable latency correction factor. Auto-disabled when load-based scaling is on. |
### Load-Based Scaling Settings
| Field | Type | Default | Description |
|-------|------|---------|-------------|
-| `load_adjustment_interval` | int | `5` | Seconds between load-based scaling decisions. Must be shorter than `throughput_adjustment_interval`. |
-| `load_learning_window` | int | `50` | Sliding window size for regression model. |
+| `load_adjustment_interval` | int | `5` | Seconds between FPM regression updates and load-based scaling decisions. Even when only throughput scaling is enabled, live FPM observations are fed into the regression at this interval. Must be shorter than `throughput_adjustment_interval`. |
+| `max_num_fpm_samples` | int | `64` | Maximum retained FPM observations for regression. |
+| `fpm_sample_bucket_size` | int | `16` | Number of buckets for observation retirement (must be a perfect square). |
| `load_scaling_down_sensitivity` | int | `80` | Scale-down sensitivity 0–100 (0=never, 100=aggressive). |
| `load_metric_samples` | int | `10` | Number of metric samples to collect per decision. |
| `load_min_observations` | int | `5` | Minimum observations before making scaling decisions. |
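The constraints stated in the load-based settings table (interval ordering, perfect-square bucket count, 0–100 sensitivity) can be checked with a few lines. This is a hedged sketch: `check_load_scaling_settings` is a hypothetical helper, not part of the planner. The default of `throughput_adjustment_interval` is not shown in this hunk, so it is a required argument, and rejecting bucket sizes below 1 is an assumption.

```python
import math
from typing import List


def check_load_scaling_settings(
    load_adjustment_interval: int,
    throughput_adjustment_interval: int,
    fpm_sample_bucket_size: int = 16,
    load_scaling_down_sensitivity: int = 80,
) -> List[str]:
    """Return a list of human-readable constraint violations (empty if OK)."""
    errors: List[str] = []

    # The load interval must be strictly shorter than the throughput interval.
    if load_adjustment_interval >= throughput_adjustment_interval:
        errors.append(
            "load_adjustment_interval must be shorter than "
            "throughput_adjustment_interval"
        )

    # Bucket count must be a perfect square (assumption: at least 1).
    if (
        fpm_sample_bucket_size < 1
        or math.isqrt(fpm_sample_bucket_size) ** 2 != fpm_sample_bucket_size
    ):
        errors.append("fpm_sample_bucket_size must be a perfect square")

    # Sensitivity is documented as a 0-100 scale.
    if not 0 <= load_scaling_down_sensitivity <= 100:
        errors.append("load_scaling_down_sensitivity must be within 0-100")

    return errors
```

Returning a list rather than raising on the first problem lets a caller report every misconfigured field at once.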
@@ -105,8 +105,8 @@ When throughput-based scaling is enabled, the planner needs interpolation curves
When the profiler runs with planner enabled, it:
1. Selects the best prefill and decode engine configurations
-2. Generates interpolation curves (TTFT vs ISL, ITL vs KV-cache utilization)
-3. Saves the `PlannerConfig` and profiling data into separate Kubernetes ConfigMaps
+2. Generates engine performance data (prefill TTFT vs ISL, decode ITL vs KV-cache utilization)
+3. Saves the `PlannerConfig` and performance data into separate Kubernetes ConfigMaps
4. Adds the planner service to the generated DGD, configured to read from those ConfigMaps
The planner receives its config via `--config /path/to/planner_config.json` which is mounted from the `planner-config-XXXX` ConfigMap. Profiling data is mounted from the `planner-profile-data-XXXX` ConfigMap.