feat(planner): unify throughput and load scaling on FPM regression (#7961)

Unverified Commit 66f7832a authored Apr 08, 2026 by Hongkuan Zhou Committed by GitHub Apr 08, 2026
11 changed files
--- a/components/src/dynamo/planner/tests/manual/README.md
+++ b/components/src/dynamo/planner/tests/manual/README.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES.
-All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-->
 # SLA Planner Load Test
@@ -19,52 +15,13 @@ You have two options to obtain the pre-deployment profiling data:
 ### Option A: Use Test Configuration (Quickstart)
 Use the pre-configured test deployment with sample profiling data. We provide the results and the deployment configuration for the following model and hardware configurations:
 - `nvidia/Llama-3.1-8B-Instruct-FP8` on H200 with max context length 16384, TP1 Prefill, and TP1 Decode. At ISL/OSL 3000/150, it achieves 40k tokens/s/gpu prefill with 80ms TTFT and 10k tokens/s/gpu decode with 10ms ITL. See `../tests/data/profiling_results/H200_TP1P_TP1D/`.
 ### Option B: Use Your Own Profiling Results
 1. Run pre-deployment profiling for your specific setup. See the [pre-deployment profiling documentation](../../../../../../docs/components/profiler/profiler-guide.md) for detailed instructions.
-## Interpolator Testing
-SLA planner uses two interpolators to estimate the performance of prefill and decode. You can test the interpolators with the following command:
-```bash
-python components/src/dynamo/planner/core/throughput/interpolation.py \
-  --profile_results_dir <path_to_profile_results> \
-  --isl <ISL> \
-  --osl <OSL> \
-  --ttft <TTFT(ms)> \
-  --itl <ITL(ms)>
-```
-The script will perform the interpolation based on ISL, OSL, and TTFT and ITL SLAs and advise the load that can saturate the engine.
-For example, to test the interpolator for `nvidia/Llama-3.1-8B-Instruct-FP8` on H200 (target TTFT=200ms, ITL=10ms):
-```bash
-python components/src/dynamo/planner/core/throughput/interpolation.py \
-  --profile_results_dir components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D/ \
-  --isl 3000 \
-  --osl 300 \
-  --ttft 200 \
-  --itl 10
-# output:
-ISL=3000, OSL=300
-TTFT=200ms, ITL=10ms
-Using profile results from components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D/
-Interpolating prefill performance ...
-        Estimated TTFT=60.00ms <= target TTFT=200.00ms. Requests can queue 140.00ms maximally while meeting TTFT SLA.
-        Estimated throughput: 49481.09 tokens/s/gpu. Request rate at 16.49 requests/s will saturate one GPU.
-Interpolating decode performance ...
-        Average context length: isl + osl/2 = 3150.
-        Estimated ITL=9.70ms <= target ITL=10.00ms at 16.16% active kv usage.
-        Estimated throughput: 4555.68 token/s/gpu. Request rate at 15.19 requests/s will saturate one GPU.
-```
 ## Generating Load Dataset
 We provide a tool to generate a load dataset with a varying request rate. More details can be found in [sin_load_generator](../../../../../../benchmarks/sin_load_generator/README.md).
@@ -89,36 +46,6 @@ python benchmarks/sin_load_generator/sin_synth.py \
 The dataset starts at 5 requests/s, increases to 45 requests/s at t=300s, decreases back to 5 requests/s at t=600s, and repeats.
 The total duration is 30 minutes or 1800 seconds.
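The schedule described above (5 requests/s at t=0, 45 requests/s at t=300s, back to 5 requests/s at t=600s) can be sketched as a raised cosine. This is only an illustration of the shape: the function name `request_rate` and its parameters are hypothetical, and the exact formula used by `sin_synth.py` is not shown in this diff.

```python
import math


def request_rate(
    t_s: float,
    low: float = 5.0,
    high: float = 45.0,
    period_s: float = 600.0,
) -> float:
    """Requests/s at time t_s for a raised-cosine schedule that starts at `low`.

    Hypothetical helper: it reproduces the endpoints quoted in the README,
    not necessarily the generator's exact curve.
    """
    mid = (low + high) / 2.0
    amplitude = (high - low) / 2.0
    return mid - amplitude * math.cos(2.0 * math.pi * t_s / period_s)


# Sample one full period; 1800 s of load corresponds to three such periods.
for t in (0, 150, 300, 450, 600):
    print(t, round(request_rate(t), 1))
```

With these defaults the rate is 5 requests/s at t=0 and t=600s and peaks at 45 requests/s at t=300s, matching the description.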
-## Planner Dry Run
-Before testing SLA planner on real deployments, we provide a dry run feature to test the autoscaling behavior on a given dataset. Specifically, in dry run mode,
- The load predictor will be tested. However, the load metrics will be different from the real deployment because the actual OSL is only known after the requests are processed.
- There will be no SLA predictions. Instead, sla planner will show the safe throughput limit that will ensure the requests can be processed within the SLA.
- The correction factor will be disabled because there is no SLA metrics as reference.
-To dry run SLA planner,
-```bash
-python components/src/dynamo/planner/tests/manual/unit/planner_sla_dryrun.py \
-    --config '{"environment":"kubernetes","backend":"vllm","ttft":200,"itl":10,"profile_results_dir":"components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D","throughput_adjustment_interval":60,"no_correction":true}' \
-    --dataset rr-5-45_i3000o300.jsonl \
-    --start-num-p 1 \
-    --start-num-d 1 \
-    --output-plot dryrun_plot.png
-```
-Below is the dryrun result:
-![Dryrun Plot](./figures/dryrun_plot.png)
-The first plot shows the actual request rate and the predicted request rate (in the unit of requests/adjustment_interval).
-The second plot shows the actual ISL/OSL and the predicted ISL/OSL. The first two plots are useful when tuning the performance of the load predictor.
-The third plot shows the actual prefill throughput, number of prefill workers that planner scales, and the safe throughput limit with the number of prefill workers. If the actual throughput is below the safe throughput limit, the deployment has the capacity to adhere the TTFT SLA. Note that in the real deployment, due to other factors such as queueing, load balancing, KV cache transfer latency, and ISL variance, it is not guaranteed that the actual deployment can adhere the TTFT SLA.
-The fourth plot, similar to the third plot, shows the actual decode throughput, number of decode workers that planner scales, and the safe throughput limit with the number of decode workers. If the actual throughput is below the safe throughput limit, the deployment has the capacity to adhere the ITL SLA. Note that in the real deployment, due to other factors such as load balancing and OSL variance, it is not guaranteed that the actual deployment can adhere the ITL SLA.
 ## Scaling Tests
 This directory contains comprehensive tests for validating the SLA planner's scaling behavior. The tests validate both the replica calculation logic and end-to-end scaling behavior. The scaling test uses a graduated load approach rather than dataset files, as it proved more reliable for metric generation and scaling triggers.
@@ -132,6 +59,7 @@ This directory contains comprehensive tests for validating the SLA planner's sca
 ### Quick Start for Unit Tests and End-to-End Tests
 #### Run Unit Tests Only
 Test the replica calculation logic without requiring Kubernetes:
 ```bash
@@ -175,6 +103,7 @@ components/src/dynamo/planner/tests/manual/scaling/run_scaling_test.sh --namespa
 ### Instructions for End-to-End Perf Tests
 In this test, we compare performance (goodput and goodput/GPU) across the following four deployments, using the aforementioned 8B FP8 model on H200 and the dataset used in the dry run:
 - Config 1 with inefficient P/D ratio: 3xTP1P_1xTP1D_4GPU
 `./perf_test_configs/disagg_8b_3p1d.yaml`
 - Config 2 with best static deployment: 2xTP1P_2xTP1D_4GPU
@@ -214,12 +143,13 @@ aiperf profile \
 #### E2E Perf Test Results
-![Results](./figures/sla_planner_perf.png)
+Results
 The table below shows the performance improvement of SLA planner across different deployment configurations:
-| Baseline | Goodput Improvement | Goodput/GPU Improvement |
-|---------------|-----------------|-------------------------|
-| Inefficient P/D ratio | 725% | 600% |
-| Inefficient parallelization mapping | 311% | 249% |
-| Best static deployment | 52% | 29% |
+| Baseline                            | Goodput Improvement | Goodput/GPU Improvement |
+| ----------------------------------- | ------------------- | ----------------------- |
+| Inefficient P/D ratio               | 725%                | 600%                    |
+| Inefficient parallelization mapping | 311%                | 249%                    |
+| Best static deployment              | 52%                 | 29%                     |
--- a/components/src/dynamo/planner/tests/manual/perf_test_configs/disagg_8b_planner.yaml
+++ b/components/src/dynamo/planner/tests/manual/perf_test_configs/disagg_8b_planner.yaml
@@ -82,7 +82,7 @@ spec:
            - dynamo.planner
          args:
            - --config
-            - '{"environment": "kubernetes", "backend": "vllm", "ttft": 200, "itl": 10, "profile_results_dir": "/workspace/components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D/", "throughput_adjustment_interval": 60, "metric_reporting_prometheus_port": 9085, "no_correction": true}'
+            - '{"environment": "kubernetes", "backend": "vllm", "ttft": 200, "itl": 10, "profile_results_dir": "/workspace/components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D/", "throughput_adjustment_interval": 60, "metric_reporting_prometheus_port": 9085}'
    VllmDecodeWorker:
      envFromSecret: hf-token-secret
      componentType: worker

--- a/components/src/dynamo/planner/tests/manual/scaling/disagg_planner_throughput.yaml
+++ b/components/src/dynamo/planner/tests/manual/scaling/disagg_planner_throughput.yaml
@@ -25,7 +25,7 @@ spec:
            - dynamo.planner
          args:
            - --config
-            - '{"environment": "kubernetes", "backend": "vllm", "throughput_adjustment_interval": 60, "profile_results_dir": "/workspace/components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D", "no_correction": true}'
+            - '{"environment": "kubernetes", "backend": "vllm", "throughput_adjustment_interval": 60, "profile_results_dir": "/workspace/components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D"}'
    VllmDecodeWorker:
      envFromSecret: hf-token-secret
      componentType: worker
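
Both YAML changes above remove the `"no_correction"` key from the planner `--config` JSON. A minimal sketch for stripping that key from an existing config string follows. The helper name `drop_no_correction` is hypothetical, and whether the planner rejects or ignores the key after this commit is not shown in the diff.

```python
import json


def drop_no_correction(config_json: str) -> str:
    """Return the config JSON with the "no_correction" key removed, if present.

    Hypothetical migration helper; all other keys are passed through unchanged.
    """
    cfg = json.loads(config_json)
    cfg.pop("no_correction", None)  # no-op when the key is already absent
    return json.dumps(cfg)


old = (
    '{"environment": "kubernetes", "backend": "vllm", '
    '"throughput_adjustment_interval": 60, "no_correction": true}'
)
print(drop_no_correction(old))
```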

--- a/components/src/dynamo/planner/tests/manual/unit/planner_sla_dryrun.py
+++ b/components/src/dynamo/planner/tests/manual/unit/planner_sla_dryrun.py
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-import argparse
-import logging
-from dynamo.planner.config.planner_config import PlannerConfig
-from dynamo.planner.offline.dryrun import run_sla_planner_dryrun
-logger = logging.getLogger(__name__)
-if __name__ == "__main__":
-    parser = argparse.ArgumentParser(description="Planner Dryrun")
-    parser.add_argument(
-        "--config",
-        required=True,
-        help="JSON string or path to a JSON/YAML config file",
-    )
-    parser.add_argument(
-        "--dataset", type=str, required=True, help="Path to the jsonl dataset file"
-    )
-    parser.add_argument(
-        "--start-num-p",
-        type=int,
-        default=1,
-        help="Number of prefill workers to start with",
-    )
-    parser.add_argument(
-        "--start-num-d",
-        type=int,
-        default=1,
-        help="Number of decode workers to start with",
-    )
-    parser.add_argument(
-        "--output-plot",
-        type=str,
-        default="dryrun_plot.png",
-        help="Path to the output plot file",
-    )
-    args = parser.parse_args()
-    config = PlannerConfig.from_config_arg(args.config)
-    run_sla_planner_dryrun(
-        config,
-        dataset=args.dataset,
-        start_num_p=args.start_num_p,
-        start_num_d=args.start_num_d,
-        output_plot=args.output_plot,
-    )
--- a/components/src/dynamo/planner/tests/unit/test_load_based_scaling.py
+++ b/components/src/dynamo/planner/tests/unit/test_load_based_scaling.py
@@ -19,7 +19,7 @@ from dynamo.common.forward_pass_metrics import (
 )
 from dynamo.planner.config.planner_config import PlannerConfig
 from dynamo.planner.core.decode import DecodePlanner
-from dynamo.planner.core.load.fpm_regression import (
+from dynamo.planner.core.perf_model import (
    AggRegressionModel,
    DecodeRegressionModel,
    PrefillRegressionModel,
@@ -70,19 +70,24 @@ def _make_fpm(
 class TestPrefillRegressionModel:
    def test_insufficient_data(self):
-        model = PrefillRegressionModel(window_size=50, min_observations=5)
+        model = PrefillRegressionModel(
+            max_num_fpm_samples=50, min_observations=5, bucket_count=16
+        )
        assert not model.has_sufficient_data()
        assert model.estimate_next_ttft(0, 2048) is None
    def test_heartbeat_skipped(self):
-        model = PrefillRegressionModel(window_size=50, min_observations=3)
+        model = PrefillRegressionModel(
+            max_num_fpm_samples=50, min_observations=3, bucket_count=16
+        )
        fpm = _make_fpm(wall_time=0.0, sum_prefill_tokens=100, num_prefill_requests=1)
        model.add_observation(fpm)
        assert model.num_observations == 0
    def test_basic_regression_and_ttft_estimate(self):
-        model = PrefillRegressionModel(window_size=50, min_observations=3)
-        # wall_time = 0.001 * prefill_tokens + 0.002 (linear relationship)
+        model = PrefillRegressionModel(
+            max_num_fpm_samples=50, min_observations=3, bucket_count=16
+        )
        for tokens in [500, 1000, 1500, 2000, 2500]:
            fpm = _make_fpm(
                sum_prefill_tokens=tokens,
@@ -93,9 +98,6 @@ class TestPrefillRegressionModel:
        assert model.has_sufficient_data()
-        # Single iteration: queued=0, avg_isl should be mean of [500..2500]=1500
-        # total_tokens = 0 + avg_isl ≈ 1500
-        # 1 iteration at max_num_batched_tokens=2048 (1500 < 2048)
        est = model.estimate_next_ttft(
            queued_prefill_tokens=0, max_num_batched_tokens=2048
        )
@@ -103,8 +105,9 @@ class TestPrefillRegressionModel:
        assert est > 0
    def test_chunked_ttft_simulation(self):
-        model = PrefillRegressionModel(window_size=50, min_observations=3)
-        # Simple: wall_time = 0.001 * prefill_tokens (slope=0.001, intercept≈0)
+        model = PrefillRegressionModel(
+            max_num_fpm_samples=50, min_observations=3, bucket_count=16
+        )
        for tokens in [100, 200, 300, 400, 500]:
            fpm = _make_fpm(
                sum_prefill_tokens=tokens,
@@ -113,11 +116,6 @@ class TestPrefillRegressionModel:
            )
            model.add_observation(fpm)
-        # avg_isl = mean([100,200,300,400,500]) = 300
-        # total_tokens = 5000 (queued) + 300 (next ISL) = 5300
-        # max_num_batched_tokens = 2048
-        # iterations: ceil(5300/2048) = 3
-        # chunk1=2048, chunk2=2048, chunk3=1204
        est = model.estimate_next_ttft(
            queued_prefill_tokens=5000, max_num_batched_tokens=2048
        )
@@ -125,7 +123,9 @@ class TestPrefillRegressionModel:
        assert est > 0.003  # at least 3 iterations worth
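
The comments removed from this test describe the chunked estimate: total tokens = queued (5000) + average ISL (300) = 5300, split into ceil(5300/2048) = 3 iterations of 2048, 2048, and 1204 tokens. A minimal sketch of that arithmetic, assuming a one-variable least-squares fit of wall time against prefill tokens, is below. The helpers `fit_line`, `chunk_sizes`, and `estimate_ttft_s` are hypothetical and are not the `PrefillRegressionModel` implementation, whose internals and units this diff does not show.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def chunk_sizes(total_tokens, max_num_batched_tokens):
    """Split total_tokens into full chunks plus one remainder chunk."""
    if total_tokens <= 0:
        return []
    n_iter = math.ceil(total_tokens / max_num_batched_tokens)
    sizes = [max_num_batched_tokens] * (n_iter - 1)
    sizes.append(total_tokens - max_num_batched_tokens * (n_iter - 1))
    return sizes


def estimate_ttft_s(xs, ys, queued_tokens, avg_isl, max_num_batched_tokens):
    """Sum the predicted wall time (seconds, by assumption) of each chunk."""
    slope, intercept = fit_line(xs, ys)
    total = queued_tokens + avg_isl
    return sum(
        slope * c + intercept for c in chunk_sizes(total, max_num_batched_tokens)
    )


tokens = [100, 200, 300, 400, 500]
wall_times = [0.001 * t for t in tokens]  # same synthetic data as the test
avg_isl = sum(tokens) // len(tokens)
print(chunk_sizes(5000 + avg_isl, 2048))
print(estimate_ttft_s(tokens, wall_times, 5000, avg_isl, 2048))
```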
    def test_avg_isl_tracking(self):
-        model = PrefillRegressionModel(window_size=50, min_observations=3)
+        model = PrefillRegressionModel(
+            max_num_fpm_samples=50, min_observations=3, bucket_count=16
+        )
        for isl in [1000, 2000, 3000]:
            fpm = _make_fpm(
                sum_prefill_tokens=isl, num_prefill_requests=1, wall_time=0.01
@@ -133,39 +133,219 @@ class TestPrefillRegressionModel:
            model.add_observation(fpm)
        assert abs(model.avg_isl - 2000.0) < 1.0
-    def test_sliding_window_eviction(self):
-        model = PrefillRegressionModel(window_size=5, min_observations=3)
-        for i in range(10):
-            fpm = _make_fpm(sum_prefill_tokens=100 * (i + 1), wall_time=0.01)
+    def test_find_best_engine_prefill_rps(self):
+        model = PrefillRegressionModel(
+            max_num_fpm_samples=50, min_observations=3, bucket_count=16
+        )
+        for tokens in [500, 1000, 1500, 2000, 2500]:
+            fpm = _make_fpm(
+                sum_prefill_tokens=tokens,
+                num_prefill_requests=1,
+                wall_time=0.001 * tokens + 0.002,
+            )
            model.add_observation(fpm)
+        rps, actual_ttft_ms = model.find_best_engine_prefill_rps(
+            ttft_sla=2000.0, isl=1000.0
+        )
+        assert rps > 0
+        # wall_time ~1.002s for 1000 tokens -> rps ~ 1/1.002 ~ 0.998
+        assert 0.5 < rps < 2.0
+        assert actual_ttft_ms > 0
+        assert 1000 < actual_ttft_ms < 2000
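
The expectation in the comment above (about 1.002 s for 1000 tokens, so about 0.998 requests/s) can be reproduced with a short sketch. It assumes the sustainable request rate is the inverse of the predicted per-request prefill time, and that a zero ISL or an SLA violation yields a rate of 0.0. The function `best_prefill_rps` and its `slope`/`intercept` parameters are hypothetical; the actual `find_best_engine_prefill_rps` logic is not part of this diff.

```python
def best_prefill_rps(slope, intercept, ttft_sla_ms, isl):
    """Return (requests/s, predicted TTFT in ms) for one engine.

    Assumes wall_time_s = slope * isl + intercept and one request per forward pass.
    """
    if isl <= 0:
        return 0.0, 0.0
    ttft_ms = (slope * isl + intercept) * 1000.0
    if ttft_ms > ttft_sla_ms:
        # Assumption: an engine that cannot meet the SLA contributes no throughput.
        return 0.0, ttft_ms
    return 1000.0 / ttft_ms, ttft_ms


# Same coefficients as the synthetic training data in the test.
rps, ttft_ms = best_prefill_rps(slope=0.001, intercept=0.002, ttft_sla_ms=2000.0, isl=1000.0)
print(rps, ttft_ms)
```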
+    def test_find_best_engine_prefill_rps_zero_isl(self):
+        model = PrefillRegressionModel(
+            max_num_fpm_samples=50, min_observations=3, bucket_count=16
+        )
+        for tokens in [500, 1000, 1500]:
+            fpm = _make_fpm(
+                sum_prefill_tokens=tokens,
+                num_prefill_requests=1,
+                wall_time=0.001 * tokens,
+            )
+            model.add_observation(fpm)
+        rps, _ = model.find_best_engine_prefill_rps(ttft_sla=1000.0, isl=0.0)
+        assert rps == 0.0
+    def test_load_benchmark_fpms(self):
+        model = PrefillRegressionModel(
+            max_num_fpm_samples=50, min_observations=3, bucket_count=16
+        )
+        fpms = [
+            _make_fpm(sum_prefill_tokens=t, num_prefill_requests=1, wall_time=0.001 * t)
+            for t in [500, 1000, 1500, 2000, 2500]
+        ]
+        model.load_benchmark_fpms(fpms)
        assert model.num_observations == 5
+        assert model.has_sufficient_data()
+        est = model.estimate_next_ttft(
+            queued_prefill_tokens=0, max_num_batched_tokens=2048
+        )
+        assert est is not None
+# ── Bucketed retirement tests ─────────────────────────────────────────
+class TestBucketedRetirement:
+    def test_total_capped_at_max(self):
+        """Total observations never exceed max_num_fpm_samples."""
+        model = PrefillRegressionModel(
+            max_num_fpm_samples=10, min_observations=3, bucket_count=4
+        )
+        for i in range(20):
+            fpm = _make_fpm(
+                sum_prefill_tokens=100 * (i + 1),
+                num_prefill_requests=1,
+                wall_time=0.01 * (i + 1),
+            )
+            model.add_observation(fpm)
+        assert model.num_observations == 10
+    def test_most_populated_bucket_loses_oldest(self):
+        """When evicting, the oldest entry from the most-populated bucket is removed."""
+        model = PrefillRegressionModel(
+            max_num_fpm_samples=6, min_observations=1, bucket_count=4
+        )
+        # 3 observations at low tokens (bucket 0 area)
+        for i in range(3):
+            fpm = _make_fpm(
+                sum_prefill_tokens=10 + i,
+                num_prefill_requests=1,
+                wall_time=0.001 * (10 + i),
+            )
+            model.add_observation(fpm)
+        # 3 observations at high tokens (different bucket)
+        for i in range(3):
+            fpm = _make_fpm(
+                sum_prefill_tokens=1000 + i * 100,
+                num_prefill_requests=1,
+                wall_time=0.001 * (1000 + i * 100),
+            )
+            model.add_observation(fpm)
+        assert model.num_observations == 6
+        # One more at low tokens; total would exceed 6 so most-populated
+        # bucket loses its oldest entry.
+        fpm = _make_fpm(
+            sum_prefill_tokens=15,
+            num_prefill_requests=1,
+            wall_time=0.015,
+        )
+        model.add_observation(fpm)
+        assert model.num_observations == 6
+    def test_uniform_distribution_preserved(self):
+        """Bucketed eviction keeps observations across operating points."""
+        model = DecodeRegressionModel(
+            max_num_fpm_samples=10, min_observations=3, bucket_count=16
+        )
+        # Many observations at a single operating point
+        for _ in range(15):
+            fpm = _make_fpm(
+                num_decode_requests=32,
+                sum_decode_kv_tokens=32000,
+                wall_time=0.01,
+            )
+            model.add_observation(fpm)
+        assert model.num_observations == 10
+        # Add a different operating point; the concentrated bucket loses one
+        fpm = _make_fpm(
+            num_decode_requests=4,
+            sum_decode_kv_tokens=4000,
+            wall_time=0.005,
+        )
+        model.add_observation(fpm)
+        assert model.num_observations == 10
+    def test_2d_bucketed_retirement(self):
+        """2D models retire from the most-populated grid cell."""
+        model = AggRegressionModel(
+            max_num_fpm_samples=8, min_observations=1, bucket_count=16
+        )
+        # Fill with varied data
+        for p, d in [(100, 500), (200, 1000), (300, 1500), (400, 2000)]:
+            fpm = _make_fpm(
+                sum_prefill_tokens=p,
+                num_prefill_requests=1,
+                sum_decode_kv_tokens=d,
+                num_decode_requests=5,
+                wall_time=0.001 * p + 0.0001 * d,
+            )
+            model.add_observation(fpm)
+        # Concentrate 4 more in one region
+        for _ in range(4):
+            fpm = _make_fpm(
+                sum_prefill_tokens=100,
+                num_prefill_requests=1,
+                sum_decode_kv_tokens=500,
+                num_decode_requests=5,
+                wall_time=0.15,
+            )
+            model.add_observation(fpm)
+        assert model.num_observations == 8
+        # Overflow triggers retirement from the concentrated cell
+        fpm = _make_fpm(
+            sum_prefill_tokens=350,
+            num_prefill_requests=1,
+            sum_decode_kv_tokens=1800,
+            num_decode_requests=5,
+            wall_time=0.5,
+        )
+        model.add_observation(fpm)
+        assert model.num_observations == 8
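
The eviction rule these tests exercise (cap the total sample count, and on overflow drop the oldest entry from the most-populated bucket) can be sketched in one dimension as follows. The class `BucketedSamples`, its fixed `x_max` range, and the equal-width bucketing are assumptions made for illustration; the bucket boundaries used by the real regression models are not shown in this diff.

```python
from collections import deque


class BucketedSamples:
    """Capped sample store that evicts from the fullest bucket first."""

    def __init__(self, max_samples, bucket_count, x_max):
        self.max_samples = max_samples
        self.bucket_count = bucket_count
        self.x_max = x_max  # assumed fixed upper bound of the x range
        self.buckets = [deque() for _ in range(bucket_count)]

    def _bucket_index(self, x):
        # Equal-width buckets over [0, x_max); out-of-range values are clamped.
        idx = int(x / self.x_max * self.bucket_count)
        return min(max(idx, 0), self.bucket_count - 1)

    def __len__(self):
        return sum(len(b) for b in self.buckets)

    def add(self, x, y):
        self.buckets[self._bucket_index(x)].append((x, y))
        if len(self) > self.max_samples:
            # Oldest entry of the most-populated bucket is retired.
            max(self.buckets, key=len).popleft()


store = BucketedSamples(max_samples=6, bucket_count=4, x_max=1600)
for x in (10, 11, 12, 1000, 1100, 1200):
    store.add(x, 0.001 * x)
store.add(15, 0.015)  # overflows; the low-token bucket is the fullest
print(len(store), [x for x, _ in store.buckets[0]])
```

Compared with a plain sliding window, this keeps sparse operating points in the fit even when most traffic lands in one region.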
 # ── DecodeRegressionModel tests ──────────────────────────────────────
 class TestDecodeRegressionModel:
+    def _train_2d(self, model: DecodeRegressionModel) -> None:
+        """Populate with 2D data: wall_time = f(num_decode_requests, sum_decode_kv_tokens)."""
+        for n_req, kv in [
+            (5, 1000),
+            (10, 2000),
+            (15, 3000),
+            (20, 4000),
+            (25, 5000),
+        ]:
+            fpm = _make_fpm(
+                sum_decode_kv_tokens=kv,
+                num_decode_requests=n_req,
+                wall_time=0.0001 * kv + 0.0005 * n_req + 0.001,
+            )
+            model.add_observation(fpm)
    def test_insufficient_data(self):
-        model = DecodeRegressionModel(window_size=50, min_observations=5)
+        model = DecodeRegressionModel(
+            max_num_fpm_samples=50, min_observations=5, bucket_count=16
+        )
        assert not model.has_sufficient_data()
        assert model.estimate_next_itl(0, 0) is None
    def test_heartbeat_skipped(self):
-        model = DecodeRegressionModel(window_size=50, min_observations=3)
+        model = DecodeRegressionModel(
+            max_num_fpm_samples=50, min_observations=3, bucket_count=16
+        )
        fpm = _make_fpm(wall_time=0.0, sum_decode_kv_tokens=100, num_decode_requests=1)
        model.add_observation(fpm)
        assert model.num_observations == 0
    def test_basic_itl_estimate(self):
-        model = DecodeRegressionModel(window_size=50, min_observations=3)
-        # wall_time = 0.0001 * decode_kv + 0.001
-        for kv in [1000, 2000, 3000, 4000, 5000]:
-            fpm = _make_fpm(
-                sum_decode_kv_tokens=kv,
-                num_decode_requests=10,
-                wall_time=0.0001 * kv + 0.001,
-            )
-            model.add_observation(fpm)
+        model = DecodeRegressionModel(
+            max_num_fpm_samples=50, min_observations=3, bucket_count=16
+        )
+        self._train_2d(model)
        assert model.has_sufficient_data()
        est = model.estimate_next_itl(scheduled_decode_kv=3000, queued_decode_kv=0)
@@ -173,7 +353,9 @@ class TestDecodeRegressionModel:
        assert est > 0
    def test_avg_decode_length_tracking(self):
-        model = DecodeRegressionModel(window_size=50, min_observations=3)
+        model = DecodeRegressionModel(
+            max_num_fpm_samples=50, min_observations=3, bucket_count=16
+        )
        for total_kv, num_req in [(1000, 10), (2000, 10), (3000, 10)]:
            fpm = _make_fpm(
                sum_decode_kv_tokens=total_kv,
@@ -183,35 +365,99 @@ class TestDecodeRegressionModel:
            model.add_observation(fpm)
        assert abs(model.avg_decode_length - 200.0) < 1.0
+    def _train_thpt_model(self, model: DecodeRegressionModel) -> None:
+        """Populate with 2D data at decode-realistic wall-time scale."""
+        for n_req, kv in [
+            (5, 5000),
+            (10, 10000),
+            (20, 20000),
+            (30, 30000),
+            (40, 40000),
+        ]:
+            fpm = _make_fpm(
+                sum_decode_kv_tokens=kv,
+                num_decode_requests=n_req,
+                wall_time=0.00001 * kv + 0.001,
+            )
+            model.add_observation(fpm)
+    def test_find_best_engine_decode_rps(self):
+        model = DecodeRegressionModel(
+            max_num_fpm_samples=50, min_observations=3, bucket_count=16
+        )
+        self._train_thpt_model(model)
+        rps, actual_itl = model.find_best_engine_decode_rps(
+            itl=50.0, context_length=1000.0, osl=150.0
+        )
+        assert rps > 0
+        assert actual_itl > 0
+        assert actual_itl <= 50.0
+    def test_find_best_engine_decode_rps_zero_context(self):
+        model = DecodeRegressionModel(
+            max_num_fpm_samples=50, min_observations=3, bucket_count=16
+        )
+        self._train_2d(model)
+        rps, itl_ms = model.find_best_engine_decode_rps(
+            itl=50.0, context_length=0.0, osl=150.0
+        )
+        assert rps == 0.0
+        assert itl_ms == 0.0
+    def test_load_benchmark_fpms(self):
+        model = DecodeRegressionModel(
+            max_num_fpm_samples=50, min_observations=3, bucket_count=16
+        )
+        fpms = [
+            _make_fpm(
+                num_decode_requests=n,
+                sum_decode_kv_tokens=n * 1000,
+                wall_time=0.001 * n,
+            )
+            for n in [5, 10, 15, 20, 25]
+        ]
+        model.load_benchmark_fpms(fpms)
+        assert model.num_observations == 5
+        assert model.has_sufficient_data()
 # ── AggRegressionModel tests ─────────────────────────────────────────
 class TestAggRegressionModel:
+    def _train_agg(self, model: AggRegressionModel) -> None:
+        for p, d in [(100, 1000), (200, 2000), (300, 3000), (400, 4000), (500, 5000)]:
+            fpm = _make_fpm(
+                sum_prefill_tokens=p,
+                num_prefill_requests=1,
+                sum_decode_kv_tokens=d,
+                num_decode_requests=10,
+                wall_time=0.001 * p + 0.0001 * d + 0.001,
+            )
+            model.add_observation(fpm)
    def test_insufficient_data(self):
-        model = AggRegressionModel(window_size=50, min_observations=5)
+        model = AggRegressionModel(
+            max_num_fpm_samples=50, min_observations=5, bucket_count=16
+        )
        assert not model.has_sufficient_data()
        assert model.estimate_next_ttft(0, 2048, 0) is None
        assert model.estimate_next_itl(0, 0) is None
    def test_heartbeat_skipped(self):
-        model = AggRegressionModel(window_size=50, min_observations=3)
+        model = AggRegressionModel(
+            max_num_fpm_samples=50, min_observations=3, bucket_count=16
+        )
        fpm = _make_fpm(wall_time=0.0, sum_prefill_tokens=100, sum_decode_kv_tokens=200)
        model.add_observation(fpm)
        assert model.num_observations == 0
    def test_2d_regression(self):
-        model = AggRegressionModel(window_size=50, min_observations=3)
-        # wall_time = 0.001 * prefill + 0.0001 * decode_kv + 0.001
-        for p, d in [(100, 1000), (200, 2000), (300, 3000), (400, 4000), (500, 5000)]:
-            fpm = _make_fpm(
-                sum_prefill_tokens=p,
-                num_prefill_requests=1,
-                sum_decode_kv_tokens=d,
-                num_decode_requests=10,
-                wall_time=0.001 * p + 0.0001 * d + 0.001,
-            )
-            model.add_observation(fpm)
+        model = AggRegressionModel(
+            max_num_fpm_samples=50, min_observations=3, bucket_count=16
+        )
+        self._train_agg(model)
        assert model.has_sufficient_data()
@@ -227,6 +473,37 @@ class TestAggRegressionModel:
        assert itl is not None
        assert itl > 0
+    def test_find_best_engine_agg_rps(self):
+        model = AggRegressionModel(
+            max_num_fpm_samples=50, min_observations=3, bucket_count=16
+        )
+        self._train_agg(model)
+        thpt, actual_ttft, actual_itl = model.find_best_engine_agg_rps(
+            isl=2048.0,
+            osl=150.0,
+            max_num_batched_tokens=4096,
+            ttft_sla=500.0,
+            itl_sla=50.0,
+        )
+        assert isinstance(thpt, float)
+        assert thpt > 0
+        assert actual_ttft >= 0
+        assert actual_itl >= 0
+    def test_find_best_engine_agg_rps_insufficient_data(self):
+        model = AggRegressionModel(
+            max_num_fpm_samples=50, min_observations=5, bucket_count=16
+        )
+        thpt, _, _ = model.find_best_engine_agg_rps(
+            isl=2048.0,
+            osl=150.0,
+            max_num_batched_tokens=4096,
+            ttft_sla=500.0,
+            itl_sla=50.0,
+        )
+        assert thpt == 0.0
 # ── Planner integration tests (with mocked FPM subscriber) ──────────
@@ -249,7 +526,6 @@ def _build_load_config(**overrides) -> PlannerConfig:
        itl=50.0,
        backend="vllm",
        no_operation=True,
-        no_correction=True,
        metric_pulling_prometheus_endpoint="http://localhost:9090",
        metric_reporting_prometheus_port=0,
        load_predictor="constant",
@@ -266,7 +542,8 @@ def _build_load_config(**overrides) -> PlannerConfig:
        enable_load_scaling=True,
        enable_throughput_scaling=True,
        load_adjustment_interval=5,
-        load_learning_window=50,
+        max_num_fpm_samples=50,
+        fpm_sample_bucket_size=16,
        load_scaling_down_sensitivity=80,
        load_metric_samples=10,
        load_min_observations=5,
@@ -294,7 +571,6 @@ class TestPrefillFpmScaling:
        planner.model_name = "test-model"
        planner.prefill_worker_info = WorkerInfo(max_num_batched_tokens=2048)
-        # Train regression: wall_time grows linearly with prefill tokens
        for tokens in range(200, 1200, 100):
            fpm = _make_fpm(
                sum_prefill_tokens=tokens,
@@ -303,7 +579,6 @@ class TestPrefillFpmScaling:
            )
            planner.ttft_regression.add_observation(fpm)
-        # Both engines have heavy queued prefill -> high estimated TTFT
        stats = {
            ("w1", 0): _make_fpm(
                worker_id="w1",
@@ -335,8 +610,6 @@ class TestPrefillFpmScaling:
        planner.model_name = "test-model"
        planner.prefill_worker_info = WorkerInfo(max_num_batched_tokens=2048)
-        # Train with short ISL (100 tokens each) so avg_isl stays low.
-        # Regression: wall_time ≈ 0.001 * prefill_tokens
        for tokens in range(100, 600, 50):
            fpm = _make_fpm(
                sum_prefill_tokens=tokens,
@@ -345,9 +618,6 @@ class TestPrefillFpmScaling:
            )
            planner.ttft_regression.add_observation(fpm)
-        # All engines idle (no queued prefill).
-        # estimate_next_ttft: total = 0 + avg_isl(~100) = ~100 tokens
-        # predicted wall_time ≈ 0.001 * 100 = 0.1s = 100ms < 500ms SLA
        stats = {
            (f"w{i}", 0): _make_fpm(
                worker_id=f"w{i}",
@@ -372,7 +642,6 @@ class TestPrefillFpmScaling:
        planner.model_name = "test-model"
        planner.prefill_worker_info = WorkerInfo(max_num_batched_tokens=2048)
-        # Only 2 observations, need 5
        for tokens in [100, 200]:
            fpm = _make_fpm(sum_prefill_tokens=tokens, wall_time=0.01)
            planner.ttft_regression.add_observation(fpm)
@@ -394,11 +663,18 @@ class TestDecodeFpmScaling:
        planner = DecodePlanner(None, config, shared_state=shared_state)
        planner.model_name = "test-model"
-        for kv in range(1000, 6000, 500):
+        # 2D regression: vary both num_decode_requests and sum_decode_kv_tokens
+        for n_req, kv in [
+            (5, 1000),
+            (10, 2000),
+            (15, 3000),
+            (20, 4000),
+            (25, 5000),
+        ]:
            fpm = _make_fpm(
                sum_decode_kv_tokens=kv,
-                num_decode_requests=10,
+                num_decode_requests=n_req,
-                wall_time=0.0001 * kv + 0.001,
+                wall_time=0.0001 * kv + 0.0005 * n_req + 0.001,
            )
            planner.itl_regression.add_observation(fpm)
@@ -431,40 +707,17 @@ class TestDecodeFpmScaling:
        planner = DecodePlanner(None, config, shared_state=shared_state)
        planner.model_name = "test-model"
-        fpm = _make_fpm(sum_decode_kv_tokens=1000, wall_time=0.01)
+        fpm = _make_fpm(
+            sum_decode_kv_tokens=1000, num_decode_requests=5, wall_time=0.01
+        )
        planner.itl_regression.add_observation(fpm)
-        stats = {("w1", 0): _make_fpm(sum_decode_kv_tokens=5000, wall_time=0.5)}
+        stats = {
+            ("w1", 0): _make_fpm(
+                sum_decode_kv_tokens=5000, num_decode_requests=10, wall_time=0.5
+            )
+        }
        planner.fpm_subscriber = _mock_fpm_subscriber(stats)
        result = planner.load_plan_adjustment()
        assert result is None
-# ── Correction factor auto-disable tests ─────────────────────────────
-class TestCorrectionFactorAutoDisable:
-    def test_correction_factor_disabled_when_load_enabled(self):
-        config = PlannerConfig(
-            enable_load_scaling=True,
-            enable_throughput_scaling=True,
-            no_correction=False,
-        )
-        assert config.no_correction is True
-    def test_correction_factor_stays_disabled_if_already_set(self):
-        config = PlannerConfig(
-            enable_load_scaling=True,
-            enable_throughput_scaling=True,
-            no_correction=True,
-        )
-        assert config.no_correction is True
-    def test_correction_factor_not_disabled_without_loadbased(self):
-        config = PlannerConfig(
-            enable_load_scaling=False,
-            enable_throughput_scaling=True,
-            no_correction=False,
-        )
-        assert config.no_correction is False
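The comments removed from `TestPrefillFpmScaling` above describe the expected arithmetic: a regression with wall_time ≈ 0.001 * prefill_tokens, idle engines, and an average ISL of about 100 tokens give a predicted TTFT of about 100 ms, which is below the 500 ms SLA. A minimal sketch of that arithmetic follows, assuming a plain one-dimensional least-squares fit. `fit_line` is a hypothetical helper written for this illustration, not the planner's regression class, and the training wall times are assumed from the removed comment because the diff elides them.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Training points mirror the test loop: tokens in range(100, 600, 50).
# The wall times are an assumption taken from the removed comment.
tokens = list(range(100, 600, 50))
wall_times = [0.001 * t for t in tokens]
slope, intercept = fit_line(tokens, wall_times)

# Idle engines: nothing queued, so only the next request's tokens count.
queued_tokens = 0
avg_isl = 100
predicted_ttft_ms = (slope * (queued_tokens + avg_isl) + intercept) * 1000.0

# Compare against the 500 ms TTFT SLA named in the removed comment.
ttft_sla_ms = 500.0
needs_scale_up = predicted_ttft_ms > ttft_sla_ms
```

Under these assumptions the prediction stays well under the SLA, so no scale-up would be requested.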
--- a/components/src/dynamo/planner/tests/unit/test_planner_config.py
+++ b/components/src/dynamo/planner/tests/unit/test_planner_config.py
@@ -87,3 +87,36 @@ def test_throughput_metrics_source_invalid():
    """throughput_metrics_source rejects invalid values."""
    with pytest.raises(ValidationError):
        PlannerConfig(namespace="test-ns", throughput_metrics_source="invalid")
+@pytest.mark.parametrize("bucket_size", [1, 4, 9, 16, 25])
+def test_fpm_sample_bucket_size_accepts_perfect_squares(bucket_size):
+    """fpm_sample_bucket_size must be a perfect square (valid values)."""
+    config = PlannerConfig(namespace="test-ns", fpm_sample_bucket_size=bucket_size)
+    assert config.fpm_sample_bucket_size == bucket_size
+@pytest.mark.parametrize("bucket_size", [2, 3, 5, 7, 10])
+def test_fpm_sample_bucket_size_rejects_non_squares(bucket_size):
+    """fpm_sample_bucket_size rejects values that are not perfect squares."""
+    with pytest.raises(ValidationError, match="perfect square"):
+        PlannerConfig(namespace="test-ns", fpm_sample_bucket_size=bucket_size)
+def test_max_num_fpm_samples_field():
+    """max_num_fpm_samples configures the FPM sample retention (formerly load_learning_window)."""
+    config = PlannerConfig(namespace="test-ns", max_num_fpm_samples=100)
+    assert config.max_num_fpm_samples == 100
+def test_agg_mode_supports_throughput_scaling():
+    """Agg mode supports throughput-based scaling."""
+    config = PlannerConfig(
+        namespace="test-ns",
+        mode="agg",
+        enable_throughput_scaling=True,
+        enable_load_scaling=False,
+    )
+    assert config.mode == "agg"
+    assert config.enable_throughput_scaling is True
+    assert config.scaling_enabled() is True
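The new `fpm_sample_bucket_size` tests above accept 1, 4, 9, 16, 25 and reject 2, 3, 5, 7, 10 with an error that mentions "perfect square". A minimal sketch of such a check follows. `is_perfect_square` and `validate_bucket_size` are hypothetical names used only here; how `PlannerConfig` actually implements its validator is not shown in this diff.

```python
import math


def is_perfect_square(n: int) -> bool:
    """Return True when n is a positive integer of the form k * k."""
    if n < 1:
        return False
    root = math.isqrt(n)  # exact integer square root, no float rounding
    return root * root == n


def validate_bucket_size(n: int) -> int:
    """Hypothetical stand-in for the config validator exercised by the tests."""
    if not is_perfect_square(n):
        raise ValueError(f"fpm_sample_bucket_size must be a perfect square, got {n}")
    return n


# Same value sets as the two parametrized tests.
accepted = [n for n in (1, 4, 9, 16, 25) if is_perfect_square(n)]
rejected = [n for n in (2, 3, 5, 7, 10) if not is_perfect_square(n)]
```

Using `math.isqrt` rather than `math.sqrt` keeps the check exact for large integers.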
--- a/components/src/dynamo/planner/tests/unit/test_replica_calculation.py
+++ b/components/src/dynamo/planner/tests/unit/test_replica_calculation.py
@@ -5,7 +5,7 @@
 Unit tests for SLA planner replica calculation logic.
 These tests focus specifically on the replica calculation formulas without
-testing load prediction, interpolation, or correction factors.
+testing load prediction or regression internals.
 """
 import asyncio
@@ -42,9 +42,9 @@ class PlannerHarness:
        if not self.shared_state.last_metrics.is_valid():
            return
-        p_endpoints, d_endpoints = await self.prefill_planner.get_workers_info()
+        num_p, num_d, is_stable = await self.prefill_planner.get_workers_info()
-        self.shared_state.p_endpoints = p_endpoints
+        self.shared_state.num_p_workers = num_p
-        self.shared_state.d_endpoints = d_endpoints
+        self.shared_state.num_d_workers = num_d
        next_num_p = self.prefill_planner.plan_adjustment()
        next_num_d = self.decode_planner.plan_adjustment()
@@ -86,14 +86,12 @@ class PlannerHarness:
            "config",
        }
        prefill_attrs = {
-            "prefill_interpolator",
+            "ttft_regression",
            "prefill_worker_info",
-            "p_correction_factor",
        }
        decode_attrs = {
-            "decode_interpolator",
+            "itl_regression",
            "decode_worker_info",
-            "d_correction_factor",
        }
        if name == "last_metrics":
            return self.shared_state.last_metrics
@@ -119,8 +117,8 @@ class PlannerHarness:
            "config",
            "get_workers_info",
        }
-        prefill_attrs = {"prefill_interpolator", "p_correction_factor"}
+        prefill_attrs = {"ttft_regression"}
-        decode_attrs = {"decode_interpolator", "d_correction_factor"}
+        decode_attrs = {"itl_regression"}
        if name == "last_metrics":
            self.shared_state.last_metrics = value
            return None
@@ -159,7 +157,6 @@ def planner():
        itl=10.0,
        backend="vllm",
        no_operation=True,
-        no_correction=False,
        metric_pulling_prometheus_endpoint="http://localhost:9090",
        metric_reporting_prometheus_port=0,
        load_predictor="constant",
@@ -176,12 +173,13 @@ def planner():
        enable_load_scaling=False,
        load_predictor_warmup_trace=None,
        load_predictor_log1p=False,
+        max_num_fpm_samples=50,
+        fpm_sample_bucket_size=16,
+        load_min_observations=5,
    )
-    # Mock the runtime
    mock_runtime = Mock()
-    # Patch Prometheus Gauge to avoid registry conflicts
    with patch("dynamo.planner.monitoring.planner_metrics.Gauge") as mock_gauge:
        mock_gauge.return_value = Mock()
@@ -206,9 +204,21 @@ def planner():
        decode_planner.prefill_worker_info = prefill_planner.prefill_worker_info
        decode_planner.decode_worker_info = prefill_planner.decode_worker_info
-        # Mock the interpolators to return fixed values for testing
-        planner.prefill_interpolator = Mock()
-        planner.decode_interpolator = Mock()
+        planner.ttft_regression = Mock()
+        # Default: 40000 tokens/s at isl=3000 → 40000/3000 rps
+        planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
+            40000.0 / 3000.0,
+            75.0,
+        )
+        planner.ttft_regression.has_sufficient_data.return_value = True
+        planner.itl_regression = Mock()
+        # Default: 10000 tokens/s at osl=150 → 10000/150 rps
+        planner.itl_regression.find_best_engine_decode_rps.return_value = (
+            10000.0 / 150.0,
+            9.5,
+        )
+        planner.itl_regression.has_sufficient_data.return_value = True
        # Mock the predictors to return fixed values
        planner.num_req_predictor = Mock()
@@ -221,14 +231,9 @@ def planner():
        # Mock prometheus client
        planner.prometheus_traffic_client = Mock()
-        # Set up some baseline correction factors
-        planner.p_correction_factor = 1.0
-        planner.d_correction_factor = 1.0
        planner.config = config
        yield planner
-        # Cleanup is automatic with context manager
 class TestReplicaCalculation:
@@ -239,59 +244,40 @@ class TestReplicaCalculation:
    @pytest.mark.performance
    def test_prefill_replica_calculation_basic(self, planner):
        """Test basic prefill replica calculation."""
-        # Setup test data
        next_num_req = 10
        next_isl = 3000
-        prefill_thpt_per_gpu = 40000  # tokens/s/gpu (from the test data)
+        engine_rps = 40000.0 / next_isl
-        # Mock the predictor outputs
        planner.num_req_predictor.predict_next.return_value = next_num_req
        planner.isl_predictor.predict_next.return_value = next_isl
        planner.osl_predictor.predict_next.return_value = 150
-        # Mock interpolator output
-        planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = (
-            prefill_thpt_per_gpu
+        planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
+            engine_rps,
+            75.0,
        )
-        planner.decode_interpolator.find_best_throughput_per_gpu.return_value = (
-            10000,
-            0.01,
-            0.5,
+        planner.itl_regression.find_best_engine_decode_rps.return_value = (
+            10000.0 / 150.0,
+            9.5,
        )
-        # Calculate expected result manually
-        pred_prefill_load_per_gpu = (
-            next_num_req
-            * next_isl
-            / planner.config.throughput_adjustment_interval
-            * min(1, planner.p_correction_factor)
-        )
-        expected_prefill_replicas = math.ceil(
-            pred_prefill_load_per_gpu
-            / prefill_thpt_per_gpu
-            / planner.config.prefill_engine_num_gpu
+        # Formula: ceil(num_req / interval / engine_rps)
+        pred_prefill_demand = (
+            next_num_req / planner.config.throughput_adjustment_interval
        )
+        expected_prefill_replicas = math.ceil(pred_prefill_demand / engine_rps)
-        # Set up valid metrics to trigger calculation
        planner.last_metrics = Metrics(
            num_req=10, isl=3000, osl=150, ttft=80.0, itl=10.0, request_duration=100.0
        )
-        # Mock workers info
-        async def mock_get_workers_info():
-            return (["prefill1"], ["decode1"])
+        async def mock_get_workers_info(*args, **kwargs):
+            return (1, 1, True)
        planner.get_workers_info = mock_get_workers_info
-        # Mock interpolation calls for correction factor calculation
-        planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
-        planner.decode_interpolator.interpolate_itl.return_value = 10.0
-        # Run the calculation
        asyncio.run(planner.make_adjustments())
-        # Extract the calculated values from the log calls or by checking the mock calls
-        # Since we mocked the connector, we can check what replicas were requested
        prefill_component = "VllmPrefillWorker"
        calculated_prefill_replicas = _replica_count(
            planner.last_target_replicas, prefill_component
@@ -299,7 +285,6 @@ class TestReplicaCalculation:
        print(f"Expected prefill replicas: {expected_prefill_replicas}")
        print(f"Calculated prefill replicas: {calculated_prefill_replicas}")
-        # Allow for small differences due to min_endpoint constraints
        assert (
            max(expected_prefill_replicas, planner.config.min_endpoint)
            == calculated_prefill_replicas
@@ -310,52 +295,39 @@ class TestReplicaCalculation:
    @pytest.mark.performance
    def test_decode_replica_calculation_basic(self, planner):
        """Test basic decode replica calculation."""
-        # Setup test data
        next_num_req = 10
        next_osl = 150
-        decode_thpt_per_gpu = 10000  # tokens/s/gpu
+        engine_rps = 10000.0 / next_osl
-        # Mock the predictor outputs
        planner.num_req_predictor.predict_next.return_value = next_num_req
        planner.isl_predictor.predict_next.return_value = 3000
        planner.osl_predictor.predict_next.return_value = next_osl
-        # Mock interpolator outputs
-        planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = 40000
-        planner.decode_interpolator.find_best_throughput_per_gpu.return_value = (
-            decode_thpt_per_gpu,
-            0.01,
-            0.5,
+        planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
+            40000.0 / 3000.0,
+            75.0,
+        )
+        planner.itl_regression.find_best_engine_decode_rps.return_value = (
+            engine_rps,
+            9.5,
        )
-        # Calculate expected result manually
+        # Formula: ceil(num_req / interval / engine_rps)
        expected_decode_replicas = math.ceil(
-            next_num_req
+            next_num_req / planner.config.throughput_adjustment_interval / engine_rps
-            * next_osl
-            / planner.config.throughput_adjustment_interval
-            / decode_thpt_per_gpu
-            / planner.config.decode_engine_num_gpu
        )
-        # Set up valid metrics
        planner.last_metrics = Metrics(
            num_req=10, isl=3000, osl=150, ttft=80.0, itl=10.0, request_duration=100.0
        )
-        # Mock workers info
-        async def mock_get_workers_info():
-            return (["prefill1"], ["decode1"])
+        async def mock_get_workers_info(*args, **kwargs):
+            return (1, 1, True)
        planner.get_workers_info = mock_get_workers_info
-        # Mock interpolation calls for correction factor calculation
-        planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
-        planner.decode_interpolator.interpolate_itl.return_value = 10.0
-        # Run the calculation
        asyncio.run(planner.make_adjustments())
-        # Check the results
        decode_component = "VllmDecodeWorker"
        calculated_decode_replicas = _replica_count(
            planner.last_target_replicas, decode_component
@@ -363,46 +335,43 @@ class TestReplicaCalculation:
        print(f"Expected decode replicas: {expected_decode_replicas}")
        print(f"Calculated decode replicas: {calculated_decode_replicas}")
-        # Allow for small differences due to min_endpoint constraints
        assert (
            max(expected_decode_replicas, planner.config.min_endpoint)
            == calculated_decode_replicas
        )
    @pytest.mark.parametrize(
-        "num_req,decode_thpt,expected_p,expected_d",
+        "num_req,decode_rps,expected_p,expected_d",
        [
-            (10, 10000, 1, 1),  # low_load_10_req_per_second
+            (10, 10000.0 / 150.0, 1, 1),  # low_load_10_req_per_second
-            (500, 1000, 1, 2),  # high_load_500_req_per_second (lower decode throughput)
+            (
+                500,
+                1000.0 / 150.0,
+                1,
+                2,
+            ),  # high_load_500_req_per_second (lower decode rps)
        ],
    )
    @pytest.mark.nightly
    @pytest.mark.gpu_2
    @pytest.mark.performance
    def test_scaling_scenario_low_to_high_load(
-        self, planner, num_req, decode_thpt, expected_p, expected_d
+        self, planner, num_req, decode_rps, expected_p, expected_d
    ):
        """Test scaling from low to high load scenarios."""
-        # Reset the planner state
-        planner.p_correction_factor = 1.0
-        planner.d_correction_factor = 1.0
-        # Mock predictor outputs for this case
        planner.num_req_predictor.predict_next.return_value = num_req
        planner.isl_predictor.predict_next.return_value = 3000
        planner.osl_predictor.predict_next.return_value = 150
-        # Mock interpolator outputs (based on H200 1P1D profiling data)
-        planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = (
-            40000  # tokens/s/gpu
+        planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
+            40000.0 / 3000.0,
+            75.0,
        )
-        planner.decode_interpolator.find_best_throughput_per_gpu.return_value = (
-            decode_thpt,
-            0.01,
-            0.5,
+        planner.itl_regression.find_best_engine_decode_rps.return_value = (
+            decode_rps,
+            9.5,
        )
-        # Set up metrics
        planner.last_metrics = Metrics(
            num_req=num_req,
            isl=3000,
@@ -412,23 +381,14 @@ class TestReplicaCalculation:
            request_duration=100.0,
        )
-        # Mock workers info
-        async def mock_get_workers_info():
-            return (["prefill1"], ["decode1"])
+        async def mock_get_workers_info(*args, **kwargs):
+            return (1, 1, True)
        planner.get_workers_info = mock_get_workers_info
-        # Mock interpolation calls for correction factor calculation
-        planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
-        planner.decode_interpolator.interpolate_itl.return_value = 10.0
-        # Reset the mock
        planner.connector.reset_mock()
-        # Run calculation
        asyncio.run(planner.make_adjustments())
-        # Verify results
        prefill_replicas = _replica_count(
            planner.last_target_replicas, "VllmPrefillWorker"
        )
@@ -449,41 +409,32 @@ class TestReplicaCalculation:
    @pytest.mark.performance
    def test_gpu_budget_constraint(self, planner):
        """Test that GPU budget constraints are properly applied."""
-        # Set a low GPU budget
        planner.config.max_gpu_budget = 3
-        # Mock predictor outputs that would normally require more GPUs
-        planner.num_req_predictor.predict_next.return_value = 50  # High load
+        planner.num_req_predictor.predict_next.return_value = 50
        planner.isl_predictor.predict_next.return_value = 3000
        planner.osl_predictor.predict_next.return_value = 150
-        # Mock interpolator outputs
-        planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = 40000
-        planner.decode_interpolator.find_best_throughput_per_gpu.return_value = (
-            10000,
-            0.01,
-            0.5,
+        planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
+            40000.0 / 3000.0,
+            75.0,
+        )
+        planner.itl_regression.find_best_engine_decode_rps.return_value = (
+            10000.0 / 150.0,
+            9.5,
        )
-        # Set up metrics
        planner.last_metrics = Metrics(
            num_req=50, isl=3000, osl=150, ttft=80.0, itl=10.0, request_duration=100.0
        )
-        # Mock workers info
-        async def mock_get_workers_info():
-            return (["prefill1"], ["decode1"])
+        async def mock_get_workers_info(*args, **kwargs):
+            return (1, 1, True)
        planner.get_workers_info = mock_get_workers_info
-        # Mock interpolation calls
-        planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
-        planner.decode_interpolator.interpolate_itl.return_value = 10.0
-        # Run calculation
        asyncio.run(planner.make_adjustments())
-        # Verify that total GPU usage doesn't exceed budget
        prefill_replicas = _replica_count(
            planner.last_target_replicas, "VllmPrefillWorker"
        )
@@ -510,38 +461,30 @@ class TestReplicaCalculation:
        """Test that minimum endpoint constraints are respected."""
        planner.config.min_endpoint = 2
-        # Mock predictor outputs that would normally require fewer workers
-        planner.num_req_predictor.predict_next.return_value = 1  # Very low load
+        planner.num_req_predictor.predict_next.return_value = 1
        planner.isl_predictor.predict_next.return_value = 100
        planner.osl_predictor.predict_next.return_value = 10
-        # Mock interpolator outputs
-        planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = 40000
-        planner.decode_interpolator.find_best_throughput_per_gpu.return_value = (
-            10000,
-            0.01,
-            0.5,
+        planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
+            40000.0 / 100.0,
+            75.0,
+        )
+        planner.itl_regression.find_best_engine_decode_rps.return_value = (
+            10000.0 / 10.0,
+            9.5,
        )
-        # Set up metrics
        planner.last_metrics = Metrics(
            num_req=1, isl=100, osl=10, ttft=80.0, itl=10.0, request_duration=100.0
        )
-        # Mock workers info
-        async def mock_get_workers_info():
-            return (["prefill1"], ["decode1"])
+        async def mock_get_workers_info(*args, **kwargs):
+            return (1, 1, True)
        planner.get_workers_info = mock_get_workers_info
-        # Mock interpolation calls
-        planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
-        planner.decode_interpolator.interpolate_itl.return_value = 10.0
-        # Run calculation
        asyncio.run(planner.make_adjustments())
-        # Verify minimum constraints are respected
        prefill_replicas = _replica_count(
            planner.last_target_replicas, "VllmPrefillWorker"
        )
@@ -557,182 +500,47 @@ class TestReplicaCalculation:
            decode_replicas >= planner.config.min_endpoint
        ), "Decode replicas below minimum"
-    @pytest.mark.nightly
-    @pytest.mark.gpu_2
-    @pytest.mark.performance
-    def test_prefill_correction_factor_clamping(self, planner):
-        """Test that prefill correction factor > 1 is clamped to 1."""
-        # Set a high correction factor > 1
-        planner.p_correction_factor = 2.5
-        planner.d_correction_factor = 1.0
-        # Mock predictor outputs
-        planner.num_req_predictor.predict_next.return_value = 10
-        planner.isl_predictor.predict_next.return_value = 3000
-        planner.osl_predictor.predict_next.return_value = 150
-        # Mock interpolator outputs
-        planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = 40000
-        planner.decode_interpolator.find_best_throughput_per_gpu.return_value = (
-            10000,
-            0.01,
-            0.5,
-        )
-        # Set up metrics
-        planner.last_metrics = Metrics(
-            num_req=10, isl=3000, osl=150, ttft=80.0, itl=10.0, request_duration=100.0
-        )
-        # Mock workers info
-        async def mock_get_workers_info():
-            return (["prefill1"], ["decode1"])
-        planner.get_workers_info = mock_get_workers_info
-        # Mock interpolation calls
-        planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
-        planner.decode_interpolator.interpolate_itl.return_value = 10.0
-        # Calculate expected result manually with clamping
-        # Should use min(1, 2.5) = 1
-        pred_prefill_load_per_gpu = (
-            10
-            * 3000
-            / planner.config.throughput_adjustment_interval
-            * min(1, 2.5)  # Should be * 1
-        )
-        expected_prefill_replicas = math.ceil(
-            pred_prefill_load_per_gpu / 40000 / planner.config.prefill_engine_num_gpu
-        )
-        # Run calculation
-        asyncio.run(planner.make_adjustments())
-        # Verify that correction factor was effectively clamped
-        prefill_replicas = _replica_count(
-            planner.last_target_replicas, "VllmPrefillWorker"
-        )
-        print(
-            f"Correction factor clamping test: Expected={expected_prefill_replicas}, Got={prefill_replicas}"
-        )
-        assert prefill_replicas == max(
-            expected_prefill_replicas, planner.config.min_endpoint
-        ), "Prefill correction factor should be clamped to 1"
-    @pytest.mark.nightly
-    @pytest.mark.gpu_2
-    @pytest.mark.performance
-    def test_decode_correction_factor_zero_handling(self, planner):
-        """Test handling of d_correction_factor <= 0."""
-        # Test both 0 and negative values
-        for correction_factor in [0.0, -1.0]:
-            planner.p_correction_factor = 1.0
-            planner.d_correction_factor = correction_factor
-            # Mock predictor outputs
-            planner.num_req_predictor.predict_next.return_value = 10
-            planner.isl_predictor.predict_next.return_value = 3000
-            planner.osl_predictor.predict_next.return_value = 150
-            # Mock interpolator outputs
-            planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = 40000
-            planner.decode_interpolator.find_best_throughput_per_gpu.return_value = (
-                10000,
-                0.01,
-                0.5,
-            )
-            # Set up metrics
-            planner.last_metrics = Metrics(
-                num_req=10,
-                isl=3000,
-                osl=150,
-                ttft=80.0,
-                itl=10.0,
-                request_duration=100.0,
-            )
-            # Mock workers info
-            async def mock_get_workers_info():
-                return (["prefill1"], ["decode1"])
-            planner.get_workers_info = mock_get_workers_info
-            # Mock interpolation calls
-            planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
-            planner.decode_interpolator.interpolate_itl.return_value = 10.0
-            # Run calculation
-            asyncio.run(planner.make_adjustments())
-            # Should handle gracefully without crashing
-            # The code should use args.itl directly instead of dividing by 0
-            decode_replicas = _replica_count(
-                planner.last_target_replicas, "VllmDecodeWorker"
-            )
-            print(
-                f"Correction factor {correction_factor} test: Decode replicas={decode_replicas}"
-            )
-            # Should get a valid result (not crash)
-            assert (
-                decode_replicas >= 1
-            ), f"Should handle correction factor {correction_factor} gracefully"
    @pytest.mark.nightly
    @pytest.mark.gpu_2
    @pytest.mark.performance
    def test_multi_gpu_engines(self, planner):
        """Test replica calculation with multi-GPU engines."""
-        # Set multi-GPU configuration
        planner.config.prefill_engine_num_gpu = 2
        planner.config.decode_engine_num_gpu = 4
-        # Mock predictor outputs
        planner.num_req_predictor.predict_next.return_value = 20
        planner.isl_predictor.predict_next.return_value = 3000
        planner.osl_predictor.predict_next.return_value = 150
-        # Mock interpolator outputs
-        planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = 40000
-        planner.decode_interpolator.find_best_throughput_per_gpu.return_value = (
-            5000,
-            0.01,
-            0.5,
-        )  # Lower for scaling
+        # Engine-level request rate (already accounts for multi-GPU)
+        prefill_engine_rps = 40000.0 / 3000.0
+        decode_engine_rps = 5000.0 / 150.0
+        planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
+            prefill_engine_rps,
+            75.0,
+        )
+        planner.itl_regression.find_best_engine_decode_rps.return_value = (
+            decode_engine_rps,
+            9.5,
+        )
-        # Set up metrics
        planner.last_metrics = Metrics(
            num_req=20, isl=3000, osl=150, ttft=80.0, itl=10.0, request_duration=100.0
        )
-        # Mock workers info
-        async def mock_get_workers_info():
-            return (["prefill1"], ["decode1"])
+        async def mock_get_workers_info(*args, **kwargs):
+            return (1, 1, True)
        planner.get_workers_info = mock_get_workers_info
-        # Mock interpolation calls
+        # No engine_num_gpu division — regression returns engine-level rps
-        planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
-        planner.decode_interpolator.interpolate_itl.return_value = 10.0
-        # Calculate expected results manually
-        pred_prefill_load_per_gpu = (
-            20 * 3000 / planner.config.throughput_adjustment_interval * 1.0
-        )
        expected_prefill_replicas = math.ceil(
-            pred_prefill_load_per_gpu / 40000 / 2
+            20 / planner.config.throughput_adjustment_interval / prefill_engine_rps
-        )  # 2 GPUs per engine
+        )
        expected_decode_replicas = math.ceil(
-            20 * 150 / planner.config.throughput_adjustment_interval / 5000 / 4
+            20 / planner.config.throughput_adjustment_interval / decode_engine_rps
-        )  # 4 GPUs per engine
+        )
-        # Run calculation
        asyncio.run(planner.make_adjustments())
        prefill_replicas = _replica_count(
@@ -742,10 +550,10 @@ class TestReplicaCalculation:
            planner.last_target_replicas, "VllmDecodeWorker"
        )
        print(
-            f"Multi-GPU test: P={prefill_replicas} (expected ~{expected_prefill_replicas}), D={decode_replicas} (expected ~{expected_decode_replicas})"
+            f"Multi-GPU test: P={prefill_replicas} (expected ~{expected_prefill_replicas}), "
+            f"D={decode_replicas} (expected ~{expected_decode_replicas})"
        )
-        # Verify calculations account for multiple GPUs per engine
        assert prefill_replicas == max(
            expected_prefill_replicas, planner.config.min_endpoint
        )
@@ -757,42 +565,39 @@ class TestReplicaCalculation:
    @pytest.mark.gpu_2
    @pytest.mark.performance
    def test_complex_gpu_budget_scaling(self, planner):
-        """Test complex GPU budget scaling with proportional reduction and decode adjustment."""
+        """Test complex GPU budget scaling with proportional reduction."""
-        # Set tight GPU budget that will trigger complex scaling
        planner.config.max_gpu_budget = 5
        planner.config.prefill_engine_num_gpu = 2
        planner.config.decode_engine_num_gpu = 2
        planner.config.min_endpoint = 1
-        # High load that would normally require more GPUs
        planner.num_req_predictor.predict_next.return_value = 100
        planner.isl_predictor.predict_next.return_value = 3000
        planner.osl_predictor.predict_next.return_value = 150
-        # Lower throughput to trigger higher replica needs
-        planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = 10000
-        planner.decode_interpolator.find_best_throughput_per_gpu.return_value = (
-            1000,
-            0.01,
-            0.5,
+        planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
+            10000.0 / 3000.0,
+            300.0,
+        )
+        planner.itl_regression.find_best_engine_decode_rps.return_value = (
+            1000.0 / 150.0,
+            9.5,
        )
-        # Set up metrics
        planner.last_metrics = Metrics(
-            num_req=100, isl=3000, osl=150, ttft=80.0, itl=10.0, request_duration=100.0
+            num_req=100,
+            isl=3000,
+            osl=150,
+            ttft=80.0,
+            itl=10.0,
+            request_duration=100.0,
        )
-        # Mock workers info
-        async def mock_get_workers_info():
-            return (["prefill1"], ["decode1"])
+        async def mock_get_workers_info(*args, **kwargs):
+            return (1, 1, True)
        planner.get_workers_info = mock_get_workers_info
-        # Mock interpolation calls
-        planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
-        planner.decode_interpolator.interpolate_itl.return_value = 10.0
-        # Run calculation
        asyncio.run(planner.make_adjustments())
        prefill_replicas = _replica_count(
@@ -801,14 +606,14 @@ class TestReplicaCalculation:
        decode_replicas = _replica_count(
            planner.last_target_replicas, "VllmDecodeWorker"
        )
-        # Verify total GPU usage doesn't exceed budget
        total_gpus = (
            prefill_replicas * planner.config.prefill_engine_num_gpu
            + decode_replicas * planner.config.decode_engine_num_gpu
        )
        print(
-            f"Complex GPU budget test: P={prefill_replicas}, D={decode_replicas}, Total GPUs={total_gpus}"
+            f"Complex GPU budget test: P={prefill_replicas}, D={decode_replicas}, "
+            f"Total GPUs={total_gpus}"
        )
        assert (
@@ -820,6 +625,3 @@ class TestReplicaCalculation:
        assert (
            decode_replicas >= planner.config.min_endpoint
        ), "Should respect min_endpoint for decode"
-# No need for unittest.main() with pytest!
--- a/components/src/dynamo/planner/tests/unit/test_sla_planner_scaling.py
+++ b/components/src/dynamo/planner/tests/unit/test_sla_planner_scaling.py
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-import argparse
 import asyncio
 import math
 import os
-from unittest.mock import Mock, patch
+from unittest.mock import MagicMock, Mock, patch
 import pytest
@@ -15,7 +14,6 @@ from dynamo.planner.core.decode import DecodePlanner
 from dynamo.planner.core.prefill import PrefillPlanner
 from dynamo.planner.core.state import PlannerSharedState
 from dynamo.planner.errors import DeploymentValidationError
-from dynamo.planner.offline.dryrun import run_sla_planner_dryrun
 pytestmark = [
    pytest.mark.gpu_0,
@@ -24,6 +22,10 @@ pytestmark = [
    pytest.mark.planner,
 ]
+PREFILL_ENGINE_RPS = 10.0
+DECODE_ENGINE_RPS = 5.0
+DECODE_ACTUAL_ITL_MS = 40.0
 @pytest.fixture(autouse=True)
 def mock_prometheus_metrics():
@@ -43,7 +45,6 @@ def _build_config():
        itl=50.0,
        backend="vllm",
        no_operation=True,
-        no_correction=True,
        metric_pulling_prometheus_endpoint="http://localhost:9090",
        metric_reporting_prometheus_port=0,
        load_predictor="constant",
@@ -90,6 +91,20 @@ def _build_planners(config, prometheus_client):
    prefill_planner.model_name = "test-model"
    decode_planner.model_name = "test-model"
+    prefill_planner.ttft_regression = MagicMock()
+    prefill_planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
+        PREFILL_ENGINE_RPS,
+        75.0,
+    )
+    prefill_planner.ttft_regression.has_sufficient_data.return_value = True
+    decode_planner.itl_regression = MagicMock()
+    decode_planner.itl_regression.find_best_engine_decode_rps.return_value = (
+        DECODE_ENGINE_RPS,
+        DECODE_ACTUAL_ITL_MS,
+    )
+    decode_planner.itl_regression.has_sufficient_data.return_value = True
    async def mock_get_workers_info(require_prefill=True, require_decode=True):
        return (
            1 if require_prefill else 0,
@@ -103,32 +118,20 @@ def _build_planners(config, prometheus_client):
 def _expected_prefill(config, prefill_planner, sample):
-    pred_prefill_throughput = (
+    demand_rps = sample["num_req"] / config.throughput_adjustment_interval
-        sample["num_req"] * sample["isl"] / config.throughput_adjustment_interval
+    engine_rps, _ = prefill_planner.ttft_regression.find_best_engine_prefill_rps(
-    )
+        ttft_sla=config.ttft, isl=sample["isl"]
-    thpt_per_gpu = prefill_planner.prefill_interpolator.interpolate_thpt_per_gpu(
-        sample["isl"]
-    )
-    expected = math.ceil(
-        pred_prefill_throughput / thpt_per_gpu / config.prefill_engine_num_gpu
    )
+    expected = math.ceil(demand_rps / engine_rps)
    return max(expected, config.min_endpoint)
 def _expected_decode(config, decode_planner, sample):
-    (
+    demand_rps = sample["num_req"] / config.throughput_adjustment_interval
-        pred_decode_thpt_per_gpu,
+    engine_rps, _ = decode_planner.itl_regression.find_best_engine_decode_rps(
-        _,
-        _,
-    ) = decode_planner.decode_interpolator.find_best_throughput_per_gpu(
        itl=config.itl, context_length=sample["isl"] + sample["osl"] / 2
    )
-    pred_decode_throughput = (
+    expected = math.ceil(demand_rps / engine_rps)
-        sample["num_req"] * sample["osl"] / config.throughput_adjustment_interval
-    )
-    expected = math.ceil(
-        pred_decode_throughput / pred_decode_thpt_per_gpu / config.decode_engine_num_gpu
-    )
    return max(expected, config.min_endpoint)
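Reviewer note: after this change both helpers reduce to the same arithmetic. The predicted request rate is divided by the per-engine request rate that the regression reports at the SLA, rounded up, and floored at `min_endpoint`. A minimal standalone sketch of that calculation follows. `expected_replicas` is a hypothetical name for illustration, not a function in the planner, and the 60-second interval in the usage note is an assumed value.

```python
import math


def expected_replicas(
    num_req: float,
    interval_s: float,
    engine_rps: float,
    min_endpoint: int = 1,
) -> int:
    """Replicas needed to serve `num_req` requests per interval.

    `engine_rps` is the request rate a single engine can sustain while
    meeting the SLA (in the tests above this comes from the mocked regression).
    """
    if interval_s <= 0 or engine_rps <= 0:
        raise ValueError("interval_s and engine_rps must be positive")
    demand_rps = num_req / interval_s
    # Round up to whole replicas, then apply the configured floor.
    return max(math.ceil(demand_rps / engine_rps), min_endpoint)
```

With the mocked rates from this file and an assumed 60 s interval, 1800 predicted requests give a demand of 30 requests/s, which works out to 3 prefill replicas at `PREFILL_ENGINE_RPS = 10.0` and 6 decode replicas at `DECODE_ENGINE_RPS = 5.0`.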
@@ -210,128 +213,114 @@ def test_disagg_scale_down():
    assert low_d < high_d
-# Tests for _initialize_gpu_counts
 class TestInitializeGpuCounts:
+    @staticmethod
+    def _make_config(**overrides):
+        defaults = dict(prefill_engine_num_gpu=None, decode_engine_num_gpu=None)
+        defaults.update(overrides)
+        return PlannerConfig.model_construct(**defaults)
    def test_kubernetes_mode_reads_from_dgd(self):
        """Test that GPU counts are read from DGD in Kubernetes mode"""
-        args = argparse.Namespace()
+        config = self._make_config()
-        args.prefill_engine_num_gpu = None
-        args.decode_engine_num_gpu = None
        connector = Mock()
        connector.get_gpu_counts = Mock(return_value=(2, 4))
        _initialize_gpu_counts(
-            args, connector, require_prefill=True, require_decode=True
+            config, connector, require_prefill=True, require_decode=True
        )
-        assert args.prefill_engine_num_gpu == 2
+        assert config.prefill_engine_num_gpu == 2
-        assert args.decode_engine_num_gpu == 4
+        assert config.decode_engine_num_gpu == 4
        connector.get_gpu_counts.assert_called_once_with(
            require_prefill=True, require_decode=True
        )
    def test_kubernetes_mode_prefill_only(self):
        """Test GPU count initialization for prefill-only mode"""
-        args = argparse.Namespace()
+        config = self._make_config()
-        args.prefill_engine_num_gpu = None
-        args.decode_engine_num_gpu = None
        connector = Mock()
        connector.get_gpu_counts = Mock(return_value=(2, 0))
        _initialize_gpu_counts(
-            args, connector, require_prefill=True, require_decode=False
+            config, connector, require_prefill=True, require_decode=False
        )
-        assert args.prefill_engine_num_gpu == 2
+        assert config.prefill_engine_num_gpu == 2
-        assert args.decode_engine_num_gpu == 0
+        assert config.decode_engine_num_gpu == 0
        connector.get_gpu_counts.assert_called_once_with(
            require_prefill=True, require_decode=False
        )
    def test_virtual_mode_uses_cli_args(self):
-        """Test that GPU counts come from CLI args in virtual mode"""
+        """Test that GPU counts come from config in virtual mode"""
-        args = argparse.Namespace()
+        config = self._make_config(prefill_engine_num_gpu=2, decode_engine_num_gpu=4)
-        args.prefill_engine_num_gpu = 2
-        args.decode_engine_num_gpu = 4
-        # Virtual connector doesn't have get_gpu_counts method
        connector = Mock(spec=[])
        _initialize_gpu_counts(
-            args, connector, require_prefill=True, require_decode=True
+            config, connector, require_prefill=True, require_decode=True
        )
-        # Values should remain unchanged
+        assert config.prefill_engine_num_gpu == 2
-        assert args.prefill_engine_num_gpu == 2
+        assert config.decode_engine_num_gpu == 4
-        assert args.decode_engine_num_gpu == 4
    def test_virtual_mode_missing_prefill_raises_error(self):
-        """Test that missing prefill GPU flag raises error in virtual mode"""
+        """Test that missing prefill GPU config raises error in virtual mode"""
-        args = argparse.Namespace()
+        config = self._make_config(decode_engine_num_gpu=4)
-        args.prefill_engine_num_gpu = None
-        args.decode_engine_num_gpu = 4
        connector = Mock(spec=[])
        with pytest.raises(DeploymentValidationError) as exc_info:
            _initialize_gpu_counts(
-                args, connector, require_prefill=True, require_decode=True
+                config, connector, require_prefill=True, require_decode=True
            )
        assert "prefill_engine_num_gpu" in str(exc_info.value)
    def test_virtual_mode_missing_decode_raises_error(self):
-        """Test that missing decode GPU flag raises error in virtual mode"""
+        """Test that missing decode GPU config raises error in virtual mode"""
-        args = argparse.Namespace()
+        config = self._make_config(prefill_engine_num_gpu=2)
-        args.prefill_engine_num_gpu = 2
-        args.decode_engine_num_gpu = None
        connector = Mock(spec=[])
        with pytest.raises(DeploymentValidationError) as exc_info:
            _initialize_gpu_counts(
-                args, connector, require_prefill=True, require_decode=True
+                config, connector, require_prefill=True, require_decode=True
            )
        assert "decode_engine_num_gpu" in str(exc_info.value)
    def test_virtual_mode_missing_both_raises_error_with_both_messages(self):
-        """Test that missing both GPU flags shows both error messages"""
+        """Test that missing both GPU configs shows both error messages"""
-        args = argparse.Namespace()
+        config = self._make_config()
-        args.prefill_engine_num_gpu = None
-        args.decode_engine_num_gpu = None
        connector = Mock(spec=[])
        with pytest.raises(DeploymentValidationError) as exc_info:
            _initialize_gpu_counts(
-                args, connector, require_prefill=True, require_decode=True
+                config, connector, require_prefill=True, require_decode=True
            )
        assert len(exc_info.value.errors) == 2
    def test_virtual_mode_decode_only_no_prefill_error(self):
-        """Test decode-only mode doesn't require prefill GPU flag"""
+        """Test decode-only mode doesn't require prefill GPU config"""
-        args = argparse.Namespace()
+        config = self._make_config(decode_engine_num_gpu=4)
-        args.prefill_engine_num_gpu = None
-        args.decode_engine_num_gpu = 4
        connector = Mock(spec=[])
-        # Should not raise - prefill not required
        _initialize_gpu_counts(
-            args, connector, require_prefill=False, require_decode=True
+            config, connector, require_prefill=False, require_decode=True
        )
-        assert args.decode_engine_num_gpu == 4
+        assert config.decode_engine_num_gpu == 4
    def test_kubernetes_mode_fallback_to_cli_on_dgd_error(self):
-        """Test that K8s mode falls back to CLI flags when DGD parsing fails"""
+        """Test that K8s mode falls back to config when DGD parsing fails"""
-        args = argparse.Namespace()
+        config = self._make_config(prefill_engine_num_gpu=2, decode_engine_num_gpu=4)
-        args.prefill_engine_num_gpu = 2
-        args.decode_engine_num_gpu = 4
        connector = Mock()
        connector.get_gpu_counts = Mock(
@@ -339,18 +328,15 @@ class TestInitializeGpuCounts:
        )
        _initialize_gpu_counts(
-            args, connector, require_prefill=True, require_decode=True
+            config, connector, require_prefill=True, require_decode=True
        )
-        # Should use CLI flag values after fallback
+        assert config.prefill_engine_num_gpu == 2
-        assert args.prefill_engine_num_gpu == 2
+        assert config.decode_engine_num_gpu == 4
-        assert args.decode_engine_num_gpu == 4
    def test_kubernetes_mode_fallback_missing_cli_flags_raises_error(self):
-        """Test that K8s fallback raises error when CLI flags are also missing"""
+        """Test that K8s fallback raises error when config also missing"""
-        args = argparse.Namespace()
+        config = self._make_config()
-        args.prefill_engine_num_gpu = None
-        args.decode_engine_num_gpu = None
        connector = Mock()
        connector.get_gpu_counts = Mock(
@@ -359,16 +345,14 @@ class TestInitializeGpuCounts:
        with pytest.raises(DeploymentValidationError) as exc_info:
            _initialize_gpu_counts(
-                args, connector, require_prefill=True, require_decode=True
+                config, connector, require_prefill=True, require_decode=True
            )
        assert len(exc_info.value.errors) == 2
    def test_kubernetes_mode_fallback_partial_cli_flags(self):
-        """Test K8s fallback with only one CLI flag provided"""
+        """Test K8s fallback with only one config value provided"""
-        args = argparse.Namespace()
+        config = self._make_config(prefill_engine_num_gpu=2)
-        args.prefill_engine_num_gpu = 2
-        args.decode_engine_num_gpu = None
        connector = Mock()
        connector.get_gpu_counts = Mock(
@@ -377,73 +361,7 @@ class TestInitializeGpuCounts:
        with pytest.raises(DeploymentValidationError) as exc_info:
            _initialize_gpu_counts(
-                args, connector, require_prefill=True, require_decode=True
+                config, connector, require_prefill=True, require_decode=True
            )
        assert "decode_engine_num_gpu" in str(exc_info.value)
-# Tests for dryrun GPU defaults
-class TestDryrunGpuDefaults:
-    @staticmethod
-    def _build_dryrun_config(**overrides) -> PlannerConfig:
-        defaults = dict(
-            throughput_adjustment_interval=60,
-            prefill_engine_num_gpu=1,
-            decode_engine_num_gpu=1,
-            min_endpoint=1,
-            max_gpu_budget=-1,
-            ttft=500.0,
-            itl=50.0,
-            backend="vllm",
-            no_operation=True,
-            no_correction=True,
-            metric_pulling_prometheus_endpoint="http://localhost:9090",
-            metric_reporting_prometheus_port=0,
-            load_predictor="constant",
-            load_predictor_warmup_trace=None,
-            load_predictor_log1p=False,
-            profile_results_dir=os.path.join(
-                os.path.dirname(__file__),
-                "..",
-                "data",
-                "profiling_results",
-                "H200_TP1P_TP1D",
-            ),
-            environment="kubernetes",
-            namespace="test-namespace",
-            mode="disagg",
-            enable_throughput_scaling=True,
-            enable_load_scaling=False,
-        )
-        defaults.update(overrides)
-        return PlannerConfig.model_construct(**defaults)
-    def test_dryrun_defaults_gpu_counts_when_none(self):
-        """Test that dryrun sets default GPU counts of 1 when None"""
-        config = self._build_dryrun_config(
-            prefill_engine_num_gpu=None, decode_engine_num_gpu=None
-        )
-        try:
-            run_sla_planner_dryrun(config, dataset="nonexistent.jsonl")
-        except (FileNotFoundError, ValueError):
-            pass
-        assert config.prefill_engine_num_gpu == 1
-        assert config.decode_engine_num_gpu == 1
-    def test_dryrun_preserves_cli_gpu_counts(self):
-        """Test that dryrun preserves GPU counts provided via config"""
-        config = self._build_dryrun_config(
-            prefill_engine_num_gpu=2, decode_engine_num_gpu=4
-        )
-        try:
-            run_sla_planner_dryrun(config, dataset="nonexistent.jsonl")
-        except (FileNotFoundError, ValueError):
-            pass
-        assert config.prefill_engine_num_gpu == 2
-        assert config.decode_engine_num_gpu == 4
--- a/container/deps/requirements.planner.txt
+++ b/container/deps/requirements.planner.txt
@@ -12,4 +12,4 @@ pmdarima==2.1.1
 prometheus-api-client==0.6.0
 prophet==1.2.1
 scikit-learn==1.7.2
-scipy<1.14.0  # Upper bound for pmdarima compatibility
+scipy>=1.14.0,<2.0
--- a/docs/components/planner/README.md
+++ b/docs/components/planner/README.md
@@ -25,8 +25,8 @@ The Dynamo **Planner** is an autoscaler purpose-built for these constraints. It
 The Planner supports two scaling modes that can run independently or together:
-- **Throughput-based scaling**: Uses pre-deployment profiling data and traffic prediction to compute the replica count needed to meet TTFT and ITL targets. Adjusts on a longer interval (default 180s). This is the primary mode for production deployments.
+- **Throughput-based scaling**: Uses pre-deployment engine performance data (from self-benchmark or profiler) and traffic prediction to compute the replica count needed to meet TTFT and ITL targets. Adjusts on a longer interval (default 180s). This is the primary mode for production deployments.
-- **Load-based scaling**: Uses ForwardPassMetrics (FPM) from the Dynamo event plane and fits an online linear regression to make scaling decisions. No profiling data or KV Router required. Adjusts on a short interval (default 5s) to respond quickly to bursts.
+- **Load-based scaling**: Uses ForwardPassMetrics (FPM) from the Dynamo event plane and fits an online linear regression to make scaling decisions. No pre-deployment data or KV Router required. Adjusts on a short interval (default 5s) to respond quickly to bursts.
 When both modes are enabled, throughput-based scaling provides a capacity floor (long-term planning) while load-based scaling handles real-time adjustments above that floor.
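Reviewer note: one way to read the "capacity floor" wording, offered as a sketch under that reading and not as the planner's actual logic, is that the load-based target is clamped so it never drops below the latest throughput-based target. `combine_targets` is a hypothetical name.

```python
def combine_targets(throughput_floor: int, load_target: int) -> int:
    """Return the replica count when both scaling modes are enabled.

    Load-based scaling may add replicas above the throughput-based floor,
    but under this reading it never scales below it.
    """
    return max(throughput_floor, load_target)
```

For example, a floor of 3 with a load-based target of 5 yields 5, while a load-based target of 1 is held at the floor of 3.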
@@ -36,12 +36,12 @@ When both modes are enabled, throughput-based scaling provides a capacity floor
 |---------|:----------------:|:-------------------------:|
 | **Deployment** | | |
 | Disaggregated | Supported | Supported |
-| Aggregated | Unsupported | Supported |
+| Aggregated | Supported | Supported |
 | **LLM Framework** | | |
 | SGLang | Supported | Supported |
 | TensorRT-LLM | Supported | Supported |
 | vLLM | Supported | Supported |
-| **Requires Profiling Data** | Yes | No |
+| **Requires Pre-deployment Data** | Yes (self-benchmark or profiler) | No |
 | **Load Predictors** | ARIMA, Prophet, Kalman, Constant | N/A |
 | **Router** | | |
 | Any (round-robin, random, etc.) | Supported | Not supported |
@@ -52,8 +52,8 @@ When both modes are enabled, throughput-based scaling provides a capacity floor
 ## When to Use Which Mode
-- **Throughput-based scaling** should be enabled whenever engine profiling data is available (through pre-deployment profiling). It provides stable, prediction-based capacity planning.
+- **Throughput-based scaling** should be enabled whenever engine performance data is available (through self-benchmark or pre-deployment profiling). It provides stable, prediction-based capacity planning.
-- **Load-based scaling** should be enabled when traffic is bursty or hard to predict. It reacts quickly to real-time load changes without requiring profiling data.
+- **Load-based scaling** should be enabled when traffic is bursty or hard to predict. It reacts quickly to real-time load changes without requiring pre-deployment data.
 - **Both modes together**: For the best of both worlds, enable both. Throughput-based scaling provides a lower bound (long-term capacity), while load-based scaling handles bursts above that floor. When both are enabled, use a longer `--adjustment-interval` for throughput-based scaling.
 ## Quick Start
@@ -63,7 +63,7 @@ When both modes are enabled, throughput-based scaling provides a capacity floor
 - Dynamo platform installed on Kubernetes ([Installation Guide](../../kubernetes/installation-guide.md))
 - kube-prometheus-stack installed ([Metrics Setup](../../kubernetes/observability/metrics.md))
-For throughput-based scaling, pre-deployment profiling is also required ([Profiling Guide](../profiler/profiler-guide.md)).
+For throughput-based scaling, pre-deployment engine performance data is also required (via self-benchmark mode or [Profiling Guide](../profiler/profiler-guide.md)).
 ### Throughput-Based Scaling (with DGDR)
@@ -141,13 +141,11 @@ Load-based scaling has the following known limitations. Throughput-based scaling
 | `--adjustment-interval` | `180` | Seconds between throughput-based scaling decisions |
 | `--profile-results-dir` | `profiling_results` | Path to profiling data (NPZ/JSON) |
 | `--load-predictor` | `arima` | Prediction model (`arima`, `prophet`, `kalman`, `constant`) |
-| `--no-correction` | `true` | Disable correction factors (auto-disabled when load-based scaling is on) |
 | **Load-based scaling** | | |
 | `--enable-loadbased-scaling` | `false` | Enable load-based scaling |
-| `--disable-throughput-scaling` | `false` | Disable throughput-based scaling (required for `agg` mode) |
+| `--loadbased-adjustment-interval` | `5` | Seconds between FPM regression updates and load-based scaling decisions |
-| `--loadbased-router-metrics-url` | auto-discovered | URL to router's `/metrics` endpoint |
+| `--max-num-fpm-samples` | `64` | Maximum retained FPM observations for regression |
-| `--loadbased-adjustment-interval` | `5` | Seconds between load-based scaling decisions |
+| `--fpm-sample-bucket-size` | `16` | Number of buckets for observation retirement (must be perfect square) |
-| `--loadbased-learning-window` | `50` | Sliding window size for regression model |
 | `--loadbased-scaling-down-sensitivity` | `80` | Scale-down sensitivity 0-100 (0=never, 100=aggressive) |
 | `--loadbased-metric-samples` | `10` | Number of metric samples per adjustment interval |
 | `--loadbased-min-observations` | `5` | Minimum observations before regression activates |
@@ -175,7 +173,7 @@ The dashboard shows:
 - Worker counts and GPU usage over time
 - Observed TTFT, ITL, request rate, sequence lengths
 - Predicted load and recommended replica counts
-- Correction factors (actual vs. expected performance)
+- FPM regression model status
 ### Prometheus Metrics

--- a/docs/components/planner/planner-guide.md
+++ b/docs/components/planner/planner-guide.md
@@ -12,12 +12,12 @@ For a quick overview, see the [Planner overview](README.md). For architecture in
 The planner supports two scaling modes that can be used independently or together:
-- **Throughput-based scaling** (`enable_throughput_scaling: true`): Uses pre-deployment engine interpolation data and traffic prediction to plan capacity. Best for stable, predictable workloads. Requires profiling data generated by the [Profiler](../profiler/profiler-guide.md).
+- **Throughput-based scaling** (`enable_throughput_scaling: true`): Uses pre-deployment engine performance data (from self-benchmark or profiler) and traffic prediction to plan capacity. Best for stable, predictable workloads.
-- **Load-based scaling** (`enable_load_scaling: true`): Uses real-time per-worker engine metrics and online regression. Best for bursty or unpredictable traffic. Does not require profiling data. Requires the [KV Router](../router/README.md) — see [Current Limitations](README.md#current-limitations).
+- **Load-based scaling** (`enable_load_scaling: true`): Uses real-time ForwardPassMetrics (FPM) from the Dynamo event plane and online regression to make scaling decisions. Best for bursty or unpredictable traffic. Does not require pre-deployment data.
 **When to use which:**
-- Enable **throughput-based scaling** whenever profiling data is available. It provides stable, prediction-based capacity planning.
+- Enable **throughput-based scaling** whenever pre-deployment performance data is available (via self-benchmark or profiler). It provides stable, prediction-based capacity planning.
 - Enable **load-based scaling** when traffic is bursty. It reacts quickly to real-time load changes.
 - Enable **both** for the best of both worlds: throughput-based provides a capacity floor, load-based handles bursts above it. When both are enabled, use a longer `throughput_adjustment_interval`.
@@ -39,8 +39,8 @@ features:
 | Field | Type | Default | Description |
 |-------|------|---------|-------------|
-| `enable_throughput_scaling` | bool | `true` | Enable throughput-based scaling (requires pre-deployment profiling data). |
+| `enable_throughput_scaling` | bool | `true` | Enable throughput-based scaling (requires pre-deployment performance data). |
-| `enable_load_scaling` | bool | `false` | Enable load-based scaling (no pre-deployment profiling data required). |
+| `enable_load_scaling` | bool | `false` | Enable load-based scaling. |
 At least one scaling mode must be enabled.
@@ -48,9 +48,9 @@ At least one scaling mode must be enabled.
 | Field | Type | Default | Description |
 |-------|------|---------|-------------|
-| `pre_deployment_sweeping_mode` | string | `rapid` | How to generate engine interpolation data: `rapid` (AIC simulation, ~30s), `thorough` (real GPUs, 2-4h), or `none` (skip). |
+| `pre_deployment_sweeping_mode` | string | `rapid` | How to generate engine performance data: `rapid` (AIC simulation, ~30s), `thorough` (real GPUs, 2-4h), or `none` (skip). |
-When throughput-based scaling is enabled, the planner needs interpolation curves that map ISL to TTFT (prefill) and KV-cache utilization to ITL (decode). The profiler generates this data based on the `pre_deployment_sweeping_mode` setting. See the [Profiler Guide](../profiler/profiler-guide.md) for details on how this data is produced.
+When throughput-based scaling is enabled, the planner needs engine performance data. At startup, it first tries to fetch self-benchmark results from the `get_perf_metrics` Dynamo endpoint (see PR #7779). If unavailable, it falls back to profiler-generated data (npz or JSON) at `profile_results_dir`. Both sources are converted to ForwardPassMetrics and fed into the FPM regression model.
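Reviewer note: the startup order described in the added paragraph (self-benchmark first, profiler data as fallback) can be sketched as below. This is an illustration only. `load_perf_data` and both callables are hypothetical stand-ins, not the planner's API, and the real code may distinguish error types more carefully than the broad `except` used here.

```python
from typing import Callable, Optional, Sequence, Tuple


def load_perf_data(
    fetch_self_benchmark: Callable[[], Optional[Sequence[dict]]],
    load_profiler_data: Callable[[], Sequence[dict]],
) -> Tuple[str, Sequence[dict]]:
    """Return (source_name, samples), preferring self-benchmark results.

    Both callables are assumptions: the first stands in for a query to the
    `get_perf_metrics` endpoint, the second for reading `profile_results_dir`.
    """
    try:
        samples = fetch_self_benchmark()
    except Exception:
        # Treat any failure to reach the endpoint as "unavailable".
        samples = None
    if samples:
        return "self_benchmark", samples
    # Fall back to profiler-generated data when nothing was returned.
    return "profiler", load_profiler_data()
```

Either source would then be converted to ForwardPassMetrics before being fed to the regression, as the paragraph above describes. That conversion is outside this sketch.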
 ### Throughput-Based Scaling Settings
@@ -61,14 +61,14 @@ When throughput-based scaling is enabled, the planner needs interpolation curves
 | `max_gpu_budget` | int | `8` | Maximum total GPUs the planner may allocate. |
 | `ttft` | float | `500.0` | TTFT SLA target (ms) for scaling decisions. |
 | `itl` | float | `50.0` | ITL SLA target (ms) for scaling decisions. |
-| `no_correction` | bool | `true` | Disable latency correction factor. Auto-disabled when load-based scaling is on. |
 ### Load-Based Scaling Settings
 | Field | Type | Default | Description |
 |-------|------|---------|-------------|
-| `load_adjustment_interval` | int | `5` | Seconds between load-based scaling decisions. Must be shorter than `throughput_adjustment_interval`. |
+| `load_adjustment_interval` | int | `5` | Seconds between FPM regression updates and load-based scaling decisions. Even when only throughput scaling is enabled, live FPM observations are fed into the regression at this interval. Must be shorter than `throughput_adjustment_interval`. |
-| `load_learning_window` | int | `50` | Sliding window size for regression model. |
+| `max_num_fpm_samples` | int | `64` | Maximum retained FPM observations for regression. |
+| `fpm_sample_bucket_size` | int | `16` | Number of buckets for observation retirement (must be a perfect square). |
 | `load_scaling_down_sensitivity` | int | `80` | Scale-down sensitivity 0–100 (0=never, 100=aggressive). |
 | `load_metric_samples` | int | `10` | Number of metric samples to collect per decision. |
 | `load_min_observations` | int | `5` | Minimum observations before making scaling decisions. |
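Reviewer note: the table states that `fpm_sample_bucket_size` must be a perfect square but does not say how that is enforced. A minimal check, assuming validation happens at config-load time, could look like the following. `validate_bucket_size` is a hypothetical helper, not part of `PlannerConfig`.

```python
import math


def validate_bucket_size(fpm_sample_bucket_size: int) -> int:
    """Return the integer square root if the value is a perfect square.

    Raises ValueError otherwise, so a bad config fails fast at startup.
    """
    if fpm_sample_bucket_size <= 0:
        raise ValueError("fpm_sample_bucket_size must be positive")
    side = math.isqrt(fpm_sample_bucket_size)
    if side * side != fpm_sample_bucket_size:
        raise ValueError(
            f"fpm_sample_bucket_size={fpm_sample_bucket_size} is not a perfect square"
        )
    return side
```

The default of `16` passes and returns `4`. A value such as `15` is rejected.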
@@ -105,8 +105,8 @@ When throughput-based scaling is enabled, the planner needs interpolation curves
 When the profiler runs with planner enabled, it:
 1. Selects the best prefill and decode engine configurations
-2. Generates interpolation curves (TTFT vs ISL, ITL vs KV-cache utilization)
+2. Generates engine performance data (prefill TTFT vs ISL, decode ITL vs KV-cache utilization)
-3. Saves the `PlannerConfig` and profiling data into separate Kubernetes ConfigMaps
+3. Saves the `PlannerConfig` and performance data into separate Kubernetes ConfigMaps
 4. Adds the planner service to the generated DGD, configured to read from those ConfigMaps
 The planner receives its config via `--config /path/to/planner_config.json` which is mounted from the `planner-config-XXXX` ConfigMap. Profiling data is mounted from the `planner-profile-data-XXXX` ConfigMap.