@@ -19,52 +15,13 @@ You have two options to obtain the pre-deployment profiling data:
### Option A: Use Test Configuration (Quickstart)
Use the pre-configured test deployment with sample profiling data. We provide the results and the deployment configuration for the following model and hardware configurations:
- `nvidia/Llama-3.1-8B-Instruct-FP8` on H200 with max context length 16384, TP1 Prefill, and TP1 Decode. At ISL/OSL 3000/150, it achieves 40k tokens/s/gpu prefill with 80ms TTFT and 10k tokens/s/gpu decode with 10ms ITL. See `../tests/data/profiling_results/H200_TP1P_TP1D/`.
### Option B: Use Your Own Profiling Results
1. Run pre-deployment profiling for your specific setup. See the [pre-deployment profiling documentation](../../../../../../docs/components/profiler/profiler-guide.md) for detailed instructions.
## Interpolator Testing
SLA planner uses two interpolators to estimate the performance of prefill and decode. You can test the interpolators with the following command:
```
Using profile results from components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D/
Interpolating prefill performance ...
Estimated TTFT=60.00ms <= target TTFT=200.00ms. Requests can queue 140.00ms maximally while meeting TTFT SLA.
Estimated throughput: 49481.09 tokens/s/gpu. Request rate at 16.49 requests/s will saturate one GPU.
Interpolating decode performance ...
Average context length: isl + osl/2 = 3150.
Estimated ITL=9.70ms <= target ITL=10.00ms at 16.16% active kv usage.
Estimated throughput: 4555.68 token/s/gpu. Request rate at 15.19 requests/s will saturate one GPU.
```
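The figures in this output follow from simple arithmetic. The sketch below reproduces them under two assumptions that the output itself does not state: that the run used ISL 3000 and OSL 300 (inferred from the reported average context length of 3150), and that the saturating request rate is the per-GPU throughput divided by the tokens each request contributes.

```python
# Assumed inputs: ISL/OSL are inferred from "isl + osl/2 = 3150",
# not taken from the (elided) command line.
isl, osl = 3000, 300

# Values copied from the sample output above.
prefill_throughput = 49481.09  # tokens/s/gpu
decode_throughput = 4555.68    # tokens/s/gpu
estimated_ttft_ms, target_ttft_ms = 60.00, 200.00

# Decode sees the full prompt plus, on average, half of the generated tokens.
avg_context_length = isl + osl / 2

# Each request contributes `isl` prefill tokens and `osl` decode tokens.
prefill_saturation_rate = prefill_throughput / isl
decode_saturation_rate = decode_throughput / osl

# Time a request may wait in the queue and still meet the TTFT target.
ttft_queue_budget_ms = target_ttft_ms - estimated_ttft_ms

print(f"average context length: {avg_context_length:.0f}")
print(f"prefill saturates one GPU at {prefill_saturation_rate:.2f} requests/s")
print(f"decode saturates one GPU at {decode_saturation_rate:.2f} requests/s")
print(f"TTFT queueing budget: {ttft_queue_budget_ms:.2f} ms")
```

Under these assumptions the computed rates round to the 16.49 and 15.19 requests/s shown in the output.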
## Generating Load Dataset
We provide a tool to generate a load dataset with a varying request rate. More details can be found in [sin_load_generator](../../../../../../benchmarks/sin_load_generator/README.md).
The dataset starts at 5 requests/s, increases to 45 requests/s at t=300s, decreases back to 5 requests/s at t=600s, and repeats.
The total duration is 30 minutes (1800 seconds).
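As an illustration of this load shape, the sketch below models the request rate as a pure sinusoid with a 600-second period. It is an assumption about the shape based on the tool's name and the endpoints quoted above, not the generator's actual implementation; `request_rate` and the constants are hypothetical names.

```python
import math

MIN_RATE = 5.0     # requests/s at t = 0, 600, 1200, ...
MAX_RATE = 45.0    # requests/s at t = 300, 900, 1500
PERIOD_S = 600.0   # one full low -> high -> low cycle
DURATION_S = 1800  # 30 minutes, i.e. three cycles

def request_rate(t: float) -> float:
    """Assumed sinusoidal rate: MIN_RATE at t=0, MAX_RATE half a period later."""
    midpoint = (MIN_RATE + MAX_RATE) / 2
    amplitude = (MAX_RATE - MIN_RATE) / 2
    return midpoint - amplitude * math.cos(2 * math.pi * t / PERIOD_S)

# Sample the rate once per second over the whole run.
rates = [request_rate(t) for t in range(DURATION_S + 1)]
print(f"rate at t=0s: {rates[0]:.1f}, t=300s: {rates[300]:.1f}, t=600s: {rates[600]:.1f}")
```

With this shape the rate averages 25 requests/s over each full cycle, which is a useful sanity check when sizing a static deployment against the same dataset.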
## Planner Dry Run
Before testing the SLA planner on real deployments, you can use the dry run feature to test its autoscaling behavior on a given dataset. Specifically, in dry run mode:
- The load predictor is tested. However, the load metrics differ from those of a real deployment because the actual OSL is known only after the requests are processed.
- There are no SLA predictions. Instead, the SLA planner shows the safe throughput limit that ensures requests can be processed within the SLA.
- The correction factor is disabled because there are no SLA metrics to use as a reference.
The first plot shows the actual request rate and the predicted request rate (in units of requests/adjustment_interval).
The second plot shows the actual ISL/OSL and the predicted ISL/OSL. The first two plots are useful when tuning the performance of the load predictor.
The third plot shows the actual prefill throughput, the number of prefill workers that the planner scales to, and the safe throughput limit for that number of prefill workers. If the actual throughput is below the safe throughput limit, the deployment has the capacity to adhere to the TTFT SLA. Note that in a real deployment, other factors such as queueing, load balancing, KV cache transfer latency, and ISL variance mean there is no guarantee that the deployment will adhere to the TTFT SLA.
The fourth plot, similar to the third, shows the actual decode throughput, the number of decode workers that the planner scales to, and the safe throughput limit for that number of decode workers. If the actual throughput is below the safe throughput limit, the deployment has the capacity to adhere to the ITL SLA. Note that in a real deployment, other factors such as load balancing and OSL variance mean there is no guarantee that the deployment will adhere to the ITL SLA.
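The capacity check behind the third and fourth plots can be sketched as follows. This is a minimal illustration, assuming the safe throughput limit scales linearly with the number of workers; the function names and the numbers are hypothetical and are not taken from the planner's code.

```python
import math

def safe_throughput_limit(num_workers: int, per_worker_limit: float) -> float:
    """Assumed model: each worker contributes the same safe throughput."""
    return num_workers * per_worker_limit

def has_capacity(actual_throughput: float, num_workers: int,
                 per_worker_limit: float) -> bool:
    """True when the observed throughput is at or below the safe limit."""
    return actual_throughput <= safe_throughput_limit(num_workers, per_worker_limit)

def min_workers_for(actual_throughput: float, per_worker_limit: float) -> int:
    """Smallest worker count whose safe limit covers the observed throughput."""
    return max(1, math.ceil(actual_throughput / per_worker_limit))

# Hypothetical numbers, for illustration only.
per_worker_limit = 40_000.0  # tokens/s that one worker can serve within the SLA
actual = 95_000.0            # tokens/s observed in one adjustment interval

print(has_capacity(actual, 2, per_worker_limit))   # two workers cover 80,000 tokens/s
print(min_workers_for(actual, per_worker_limit))   # workers needed for 95,000 tokens/s
```

Reading the plots amounts to applying `has_capacity` at every adjustment interval: intervals where it would return false are the ones where the SLA is at risk even before real-world effects are considered.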
## Scaling Tests
This directory contains comprehensive tests for validating the SLA planner's scaling behavior. The tests validate both the replica calculation logic and end-to-end scaling behavior. The scaling test uses a graduated load approach rather than dataset files, as it proved more reliable for metric generation and scaling triggers.
...
@@ -132,6 +59,7 @@ This directory contains comprehensive tests for validating the SLA planner's sca
### Quick Start for Unit Tests and End-to-End Tests
#### Run Unit Tests Only
Test the replica calculation logic without requiring Kubernetes:
In this test, we compare performance (goodput and goodput/GPU) across the following four deployments, using the aforementioned 8B FP8 model on H200 and the dataset used in the dry run:
- Config 1 with inefficient P/D ratio: 3xTP1P_1xTP1D_4GPU
`./perf_test_configs/disagg_8b_3p1d.yaml`
- Config 2 with best static deployment: 2xTP1P_2xTP1D_4GPU
...
@@ -214,12 +143,13 @@ aiperf profile \
#### E2E Perf Test Results

Results
The table below shows the performance improvement of SLA planner across different deployment configurations:
@@ -25,8 +25,8 @@ The Dynamo **Planner** is an autoscaler purpose-built for these constraints. It
The Planner supports two scaling modes that can run independently or together:
- **Throughput-based scaling**: Uses pre-deployment profiling data and traffic prediction to compute the replica count needed to meet TTFT and ITL targets. Adjusts on a longer interval (default 180s). This is the primary mode for production deployments.
- **Throughput-based scaling**: Uses pre-deployment engine performance data (from self-benchmark or profiler) and traffic prediction to compute the replica count needed to meet TTFT and ITL targets. Adjusts on a longer interval (default 180s). This is the primary mode for production deployments.
- **Load-based scaling**: Uses ForwardPassMetrics (FPM) from the Dynamo event plane and fits an online linear regression to make scaling decisions. No profiling data or KV Router required. Adjusts on a short interval (default 5s) to respond quickly to bursts.
- **Load-based scaling**: Uses ForwardPassMetrics (FPM) from the Dynamo event plane and fits an online linear regression to make scaling decisions. No pre-deployment data or KV Router required. Adjusts on a short interval (default 5s) to respond quickly to bursts.
When both modes are enabled, throughput-based scaling provides a capacity floor (long-term planning) while load-based scaling handles real-time adjustments above that floor.
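One way to picture the interaction of the two modes is the sketch below. It is a hedged illustration, not the Planner's implementation: `combine_replicas` is a hypothetical helper that treats the throughput-based result as a lower bound and lets the load-based result raise the replica count above it.

```python
def combine_replicas(throughput_floor: int, load_based_target: int) -> int:
    """Hypothetical combination rule: never drop below the long-term floor."""
    return max(throughput_floor, load_based_target)

# Hypothetical replica counts, for illustration only.
print(combine_replicas(throughput_floor=2, load_based_target=1))  # quiet period: floor holds
print(combine_replicas(throughput_floor=2, load_based_target=5))  # burst: load-based wins
```

The floor is recomputed on the long interval and the load-based target on the short one, so between throughput-based adjustments only the second argument changes.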
...
@@ -36,12 +36,12 @@ When both modes are enabled, throughput-based scaling provides a capacity floor
| Any (round-robin, random, etc.) | Supported | Not supported |
...
@@ -52,8 +52,8 @@ When both modes are enabled, throughput-based scaling provides a capacity floor
## When to Use Which Mode
- **Throughput-based scaling** should be enabled whenever engine profiling data is available (through pre-deployment profiling). It provides stable, prediction-based capacity planning.
- **Throughput-based scaling** should be enabled whenever engine performance data is available (through self-benchmark or pre-deployment profiling). It provides stable, prediction-based capacity planning.
- **Load-based scaling** should be enabled when traffic is bursty or hard to predict. It reacts quickly to real-time load changes without requiring profiling data.
- **Load-based scaling** should be enabled when traffic is bursty or hard to predict. It reacts quickly to real-time load changes without requiring pre-deployment data.
- **Both modes together**: For the best of both worlds, enable both. Throughput-based scaling provides a lower bound (long-term capacity), while load-based scaling handles bursts above that floor. When both are enabled, use a longer `--adjustment-interval` for throughput-based scaling.
## Quick Start
...
@@ -63,7 +63,7 @@ When both modes are enabled, throughput-based scaling provides a capacity floor
- Dynamo platform installed on Kubernetes ([Installation Guide](../../kubernetes/installation-guide.md))
For throughput-based scaling, pre-deployment profiling is also required ([Profiling Guide](../profiler/profiler-guide.md)).
For throughput-based scaling, pre-deployment engine performance data is also required (via self-benchmark mode or [Profiling Guide](../profiler/profiler-guide.md)).
### Throughput-Based Scaling (with DGDR)
...
@@ -141,13 +141,11 @@ Load-based scaling has the following known limitations. Throughput-based scaling
@@ -12,12 +12,12 @@ For a quick overview, see the [Planner overview](README.md). For architecture in
The planner supports two scaling modes that can be used independently or together:
- **Throughput-based scaling** (`enable_throughput_scaling: true`): Uses pre-deployment engine interpolation data and traffic prediction to plan capacity. Best for stable, predictable workloads. Requires profiling data generated by the [Profiler](../profiler/profiler-guide.md).
- **Throughput-based scaling** (`enable_throughput_scaling: true`): Uses pre-deployment engine performance data (from self-benchmark or profiler) and traffic prediction to plan capacity. Best for stable, predictable workloads.
- **Load-based scaling** (`enable_load_scaling: true`): Uses real-time per-worker engine metrics and online regression. Best for bursty or unpredictable traffic. Does not require profiling data. Requires the [KV Router](../router/README.md); see [Current Limitations](README.md#current-limitations).
- **Load-based scaling** (`enable_load_scaling: true`): Uses real-time ForwardPassMetrics (FPM) from the Dynamo event plane and online regression to make scaling decisions. Best for bursty or unpredictable traffic. Does not require pre-deployment data.
**When to use which:**
- Enable **throughput-based scaling** whenever profiling data is available. It provides stable, prediction-based capacity planning.
- Enable **throughput-based scaling** whenever pre-deployment performance data is available (via self-benchmark or profiler). It provides stable, prediction-based capacity planning.
- Enable **load-based scaling** when traffic is bursty. It reacts quickly to real-time load changes.
- Enable **both** for the best of both worlds: throughput-based provides a capacity floor, load-based handles bursts above it. When both are enabled, use a longer `throughput_adjustment_interval`.
@@ -48,9 +48,9 @@ At least one scaling mode must be enabled.
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `pre_deployment_sweeping_mode` | string | `rapid` | How to generate engine interpolation data: `rapid` (AIC simulation, ~30s), `thorough` (real GPUs, 2-4h), or `none` (skip). |
| `pre_deployment_sweeping_mode` | string | `rapid` | How to generate engine performance data: `rapid` (AIC simulation, ~30s), `thorough` (real GPUs, 2-4h), or `none` (skip). |
When throughput-based scaling is enabled, the planner needs interpolation curves that map ISL to TTFT (prefill) and KV-cache utilization to ITL (decode). The profiler generates this data based on the `pre_deployment_sweeping_mode` setting. See the [Profiler Guide](../profiler/profiler-guide.md) for details on how this data is produced.
When throughput-based scaling is enabled, the planner needs engine performance data. At startup, it first tries to fetch self-benchmark results from the `get_perf_metrics` Dynamo endpoint (see PR #7779). If unavailable, it falls back to profiler-generated data (npz or JSON) at `profile_results_dir`. Both sources are converted to ForwardPassMetrics and fed into the FPM regression model.
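The startup order described here (self-benchmark results first, profiler-generated files as the fallback) can be sketched as below. Everything in the block is hypothetical: `load_perf_data`, the two callables, and the returned labels are illustrative stand-ins rather than the planner's real functions, and the actual `get_perf_metrics` call is represented only by a stub.

```python
from typing import Callable, Optional, Tuple

PerfData = dict  # placeholder type for whatever the real sources return

def load_perf_data(
    fetch_self_benchmark: Callable[[], Optional[PerfData]],
    load_profiler_results: Callable[[str], PerfData],
    profile_results_dir: str,
) -> Tuple[str, PerfData]:
    """Prefer self-benchmark results; fall back to profiler output on disk."""
    try:
        data = fetch_self_benchmark()
    except Exception:
        # Treat any failure to reach the endpoint as "unavailable".
        data = None
    if data is not None:
        return "self_benchmark", data
    return "profiler", load_profiler_results(profile_results_dir)

# Stubs standing in for the real endpoint call and file loader.
def endpoint_unavailable() -> Optional[PerfData]:
    return None

def read_profile_dir(path: str) -> PerfData:
    return {"loaded_from": path}

source, data = load_perf_data(endpoint_unavailable, read_profile_dir, "/data/profiling_results")
print(source, data)
```

Passing the two sources in as callables keeps the fallback order testable without a running deployment; the conversion to ForwardPassMetrics would happen after this step on whichever source was selected.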
### Throughput-Based Scaling Settings
...
@@ -61,14 +61,14 @@ When throughput-based scaling is enabled, the planner needs interpolation curves
| `max_gpu_budget` | int | `8` | Maximum total GPUs the planner may allocate. |
| `no_correction` | bool | `true` | Disable latency correction factor. Auto-disabled when load-based scaling is on. |
### Load-Based Scaling Settings
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `load_adjustment_interval` | int | `5` | Seconds between load-based scaling decisions. Must be shorter than `throughput_adjustment_interval`. |
| `load_adjustment_interval` | int | `5` | Seconds between FPM regression updates and load-based scaling decisions. Even when only throughput scaling is enabled, live FPM observations are fed into the regression at this interval. Must be shorter than `throughput_adjustment_interval`. |
| `load_learning_window` | int | `50` | Sliding window size for regression model. |
| `max_num_fpm_samples` | int | `64` | Maximum retained FPM observations for regression. |
| `fpm_sample_bucket_size` | int | `16` | Number of buckets for observation retirement (must be a perfect square). |
| `load_metric_samples` | int | `10` | Number of metric samples to collect per decision. |
| `load_min_observations` | int | `5` | Minimum observations before making scaling decisions. |
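The constraints stated in this table can be checked mechanically. The sketch below is a hypothetical validator, not part of the planner: it uses the field names and defaults from the table, takes the 180-second `throughput_adjustment_interval` from the default quoted in the overview, and checks the two documented rules (the load interval must be shorter than the throughput interval, and the bucket count must be a perfect square).

```python
import math
from typing import List

def validate_load_settings(cfg: dict) -> List[str]:
    """Return a list of human-readable violations (empty when the config is valid)."""
    errors = []
    if cfg["load_adjustment_interval"] >= cfg["throughput_adjustment_interval"]:
        errors.append("load_adjustment_interval must be shorter than throughput_adjustment_interval")
    buckets = cfg["fpm_sample_bucket_size"]
    if buckets < 1 or math.isqrt(buckets) ** 2 != buckets:
        errors.append("fpm_sample_bucket_size must be a perfect square")
    return errors

# Defaults as listed in the table; the throughput interval is an assumed default.
defaults = {
    "throughput_adjustment_interval": 180,
    "load_adjustment_interval": 5,
    "max_num_fpm_samples": 64,
    "fpm_sample_bucket_size": 16,
    "load_metric_samples": 10,
    "load_min_observations": 5,
}

print(validate_load_settings(defaults))
print(validate_load_settings({**defaults, "fpm_sample_bucket_size": 15}))
```

Running a check like this before deployment surfaces configuration mistakes earlier than waiting for the planner to reject or misbehave on them.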
...
@@ -105,8 +105,8 @@ When throughput-based scaling is enabled, the planner needs interpolation curves
When the profiler runs with planner enabled, it:
1. Selects the best prefill and decode engine configurations
2. Generates interpolation curves (TTFT vs ISL, ITL vs KV-cache utilization)
2. Generates engine performance data (prefill TTFT vs ISL, decode ITL vs KV-cache utilization)
3. Saves the `PlannerConfig` and profiling data into separate Kubernetes ConfigMaps
3. Saves the `PlannerConfig` and performance data into separate Kubernetes ConfigMaps
4. Adds the planner service to the generated DGD, configured to read from those ConfigMaps
The planner receives its config via `--config /path/to/planner_config.json`, which is mounted from the `planner-config-XXXX` ConfigMap. Profiling data is mounted from the `planner-profile-data-XXXX` ConfigMap.