This is a virtual connector that lets the planner output scaling decisions to non-native environments.
This virtual connector does not actually scale the deployment; instead, it communicates with the non-native environment through ETCD.
The deployment environment needs to read from ETCD to receive the scaling decisions and update ETCD to report scaling status.
The prefix for the ETCD keys is /{dynamo_namespace}/planner/
To output the scaling decisions, the planner writes to three keys:
- num_prefill_workers: an integer (stored as string), specifying how many prefill workers the deployment should have in the last scaling decision
- num_decode_workers: an integer (stored as string), specifying how many decode workers the deployment should have in the last scaling decision
- decision_id: an integer (stored as string), specifying an incremental id for the last scaling decision; if there is no scaling decision yet, the value is -1
For the planner to receive the status of the scaling decisions, the deployment environment needs to write to this key:
- scaled_decision_id: an integer (stored as string), specifying the newest decision_id that has been scaled
@@ -108,6 +108,8 @@ Finally, SLA planner applies the change by scaling up/down the number of prefill
...
## Deploying
### K8s Deployment
For detailed deployment instructions including setup, configuration, troubleshooting, and architecture overview, see the [SLA Planner Deployment Guide](../guides/dynamo_deploy/sla_planner_deployment.md).
> The SLA planner requires a frontend that reports metrics at the `/metrics` HTTP endpoint with the number of requests, ISL, OSL, TTFT, and ITL in the correct format. The dynamo frontend provides these metrics automatically.
### Virtual Deployment
The SLA planner supports virtual deployment mode for customized environments (e.g., customized cluster) through the `VirtualConnector`. This connector enables the planner to communicate scaling decisions via ETCD without directly managing the deployment infrastructure.
The `VirtualConnector` acts as a bridge between the SLA planner and external deployment environments. Instead of directly scaling Kubernetes resources, it writes scaling decisions to ETCD and waits for the deployment environment to acknowledge completion.
#### ETCD Communication Protocol
The `VirtualConnector` uses the following ETCD key structure under `/{dynamo_namespace}/planner/`:
**Planner Output Keys** (written by the planner):
- `num_prefill_workers`: Integer (stored as string) specifying the target number of prefill workers
- `num_decode_workers`: Integer (stored as string) specifying the target number of decode workers
- `decision_id`: Integer (stored as string) with an incremental ID for each scaling decision (-1 if no decision has been made)
**Deployment Environment Input Key** (written by the deployment environment):
- `scaled_decision_id`: Integer (stored as string) specifying the newest `decision_id` that has been successfully scaled
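
As a rough illustration of the key layout above, the following sketch builds the three planner output keys and their string-encoded values. The helper names `planner_prefix` and `encode_decision` are hypothetical and not part of the planner's API, and the sketch assumes each full key is simply the prefix followed by the key name.

```python
from typing import Dict


def planner_prefix(dynamo_namespace: str) -> str:
    """Return the ETCD prefix documented for planner keys."""
    return f"/{dynamo_namespace}/planner/"


def encode_decision(
    dynamo_namespace: str,
    num_prefill_workers: int,
    num_decode_workers: int,
    decision_id: int,
) -> Dict[str, str]:
    """Map each planner output key to its string-encoded integer value.

    Assumption: the full key is the prefix concatenated with the key name.
    """
    prefix = planner_prefix(dynamo_namespace)
    return {
        prefix + "num_prefill_workers": str(num_prefill_workers),
        prefix + "num_decode_workers": str(num_decode_workers),
        prefix + "decision_id": str(decision_id),
    }
```

A consumer would decode each value with `int(...)` after reading it back, since all values are stored as strings.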
#### Scaling Decision Flow
1. **Decision Generation**: The planner calculates optimal worker counts and writes them to ETCD with an incremented `decision_id`
2. **Change Detection**: The planner skips scaling if the target counts match current counts, logging: `"No scaling needed (prefill=X, decode=Y), skipping ETCD update"`
3. **Readiness Check**: Before making new decisions, the planner verifies that previous scaling operations have completed by checking if `scaled_decision_id >= decision_id`
4. **Timeout Handling**: If a scaling decision isn't acknowledged within 30 minutes (1800 seconds), the planner proceeds with new decisions anyway
5. **Completion Tracking**: The planner can optionally wait for scaling completion confirmation (blocking mode)
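
The change-detection, readiness, and timeout rules in this flow can be sketched as two small predicates. This is a minimal illustration, not the planner's actual implementation: the function names, the `decision_written_at` timestamp, and the choice to treat `decision_id == -1` as "nothing pending" are assumptions made for the example.

```python
import time
from typing import Optional, Tuple

SCALING_TIMEOUT_S = 1800  # 30 minutes, as stated in the flow above


def needs_scaling(current: Tuple[int, int], target: Tuple[int, int]) -> bool:
    """Change detection: skip the ETCD update when (prefill, decode) is unchanged."""
    return current != target


def ready_for_new_decision(
    decision_id: int,
    scaled_decision_id: int,
    decision_written_at: float,
    now: Optional[float] = None,
) -> bool:
    """Return True when the planner may issue a new scaling decision."""
    if decision_id < 0:
        # Assumption: -1 means no decision has been made, so nothing is pending.
        return True
    if scaled_decision_id >= decision_id:
        # Readiness check: the previous decision has been acknowledged.
        return True
    # Timeout handling: proceed anyway once the acknowledgement window has elapsed.
    now = time.monotonic() if now is None else now
    return now - decision_written_at >= SCALING_TIMEOUT_S
```

Passing `now` explicitly keeps the timeout logic easy to exercise in tests without waiting for real time to pass.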
#### Configuration
To use virtual deployment mode:
```yaml
environment: "virtual"
backend: "vllm"  # or "sglang"
```
#### Deployment Environment Requirements
The external deployment environment must:
1. **Monitor ETCD**: Continuously watch the `/{dynamo_namespace}/planner/` prefix for scaling decisions
2. **Parse Decisions**: Read `num_prefill_workers`, `num_decode_workers`, and `decision_id` values
3. **Execute Scaling**: Apply the scaling decisions to the actual deployment infrastructure
4. **Acknowledge Completion**: Write the completed `decision_id` to `scaled_decision_id` when scaling is finished
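
One pass of the deployment-side responsibilities listed above might look like the sketch below. To stay self-contained it uses a plain `dict` as a stand-in for ETCD; a real integration would read, watch, and write through an ETCD client instead. The function `apply_pending_decision` and the `scale_fn` callback are hypothetical names introduced for this example.

```python
from typing import Callable, Dict


def apply_pending_decision(
    store: Dict[str, str],
    dynamo_namespace: str,
    scale_fn: Callable[[int, int], None],
) -> bool:
    """Apply the latest unacknowledged decision, if any; return True if one was applied.

    `store` is an in-memory stand-in for ETCD (an assumption of this sketch).
    """
    prefix = f"/{dynamo_namespace}/planner/"

    # Parse decisions: all values are integers stored as strings.
    decision_id = int(store.get(prefix + "decision_id", "-1"))
    scaled_id = int(store.get(prefix + "scaled_decision_id", "-1"))
    if decision_id < 0 or scaled_id >= decision_id:
        return False  # no decision yet, or the latest one is already acknowledged

    num_prefill = int(store[prefix + "num_prefill_workers"])
    num_decode = int(store[prefix + "num_decode_workers"])

    # Execute scaling on the actual infrastructure (environment-specific).
    scale_fn(num_prefill, num_decode)

    # Acknowledge completion so the planner's readiness check passes.
    store[prefix + "scaled_decision_id"] = str(decision_id)
    return True
```

In this sketch the acknowledgement is written only after `scale_fn` returns, so a failed scaling attempt leaves `scaled_decision_id` unchanged and the decision is retried on the next pass.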