This is a virtual connector for planner to output scaling decisions to non-native environments
This virtual connector does not actually scale the deployment; instead, it communicates with the non-native environment through ETCD.
This virtual connector does not actually scale the deployment; instead, it communicates with the non-native environment through dynamo-runtime's VirtualConnectorCoordinator.
The deployment environment needs to read from ETCD to receive the scaling decisions and update ETCD to report scaling status.
The deployment environment needs to use VirtualConnectorClient (in the Rust/Python bindings) to read the scaling decisions and report scaling status.
The prefix for the ETCD key is /{dynamo_namespace}/planner/
To output the scaling decisions, the planner writes to three keys:
- num_prefill_workers: an integer (stored as string), specifying how many prefill workers the deployment should have in the last scaling decision
- num_decode_workers: an integer (stored as string), specifying how many decode workers the deployment should have in the last scaling decision
- decision_id: an integer (stored as string), specifying an incremental id for the last scaling decision; if there is no scaling decision, the value is -1
To report the status of the scaling decisions, the deployment environment needs to write to this key:
- scaled_decision_id: an integer (stored as string), specifying the newest decision_id that has been scaled
    This handles the case where a worker is already initialized (common in CI)
    by using the detached() method to reuse the existing runtime.
    """
    try:
        # Try to use existing runtime (common in CI where tests run in same process)
        _runtime_instance = DistributedRuntime.detached()
    except Exception:
        # If no existing runtime, create a new one
        loop = asyncio.get_running_loop()
        _runtime_instance = DistributedRuntime(loop, False)
    return _runtime_instance
# Fails in CI after 30+ minutes with:
# pyo3_runtime.PanicException: Cannot drop a runtime in a context where blocking is not allowed. This happens when a runtime is dropped from within an asynchronous context.
# Disabling until we have a faster CI to iterate with.
@pytest.mark.skip("See comment in source")
def test_main():
    """
    Connect a VirtualConnector (Dynamo Planner) and a VirtualConnectorClient (customer), and scale.
    """
    asyncio.run(async_internal(get_runtime()))
async def next_scaling_decision(c):
    """Move the second decision into a separate task so we can `.wait` for it."""
The SLA planner supports virtual deployment mode for customized environments (e.g., customized cluster) through the `VirtualConnector`. This connector enables the planner to communicate scaling decisions via ETCD without directly managing the deployment infrastructure.
The SLA planner supports virtual deployment mode for customized environments (e.g., customized cluster) through the `VirtualConnector`. This connector enables the planner to communicate scaling decisions without directly managing the deployment infrastructure.
The `VirtualConnector` acts as a bridge between the SLA planner and external deployment environments. Instead of directly scaling Kubernetes resources, it writes scaling decisions to ETCD and waits for the deployment environment to acknowledge completion.
The `VirtualConnector` acts as a bridge between the SLA planner and external deployment environments. Instead of directly scaling Kubernetes resources, it writes scaling decisions and waits for the deployment environment to acknowledge completion.
#### ETCD Communication Protocol
The VirtualConnector uses the following ETCD key structure under `/{dynamo_namespace}/planner/`:
**Planner Output Keys** (written by the planner):
- `num_prefill_workers`: Integer (stored as string) specifying the target number of prefill workers
- `num_decode_workers`: Integer (stored as string) specifying the target number of decode workers
- `decision_id`: Integer (stored as string) with incremental ID for each scaling decision (-1 if no decisions made)
**Deployment Environment Input Key** (written by the deployment environment):
- `scaled_decision_id`: Integer (stored as string) specifying the newest decision_id that has been successfully scaled
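
As a rough illustration of the key layout described above, the sketch below parses the three planner output keys and writes the acknowledgement key. It is not the Dynamo implementation: a plain `dict` stands in for the ETCD store, the helper names (`planner_key`, `read_decision`, `acknowledge`) are hypothetical, and joining the key name directly onto the `/{dynamo_namespace}/planner/` prefix is an assumption.

```python
from dataclasses import dataclass


def planner_key(namespace: str, name: str) -> str:
    """Build a key under the /{dynamo_namespace}/planner/ prefix (assumed layout)."""
    return f"/{namespace}/planner/{name}"


@dataclass
class ScalingDecision:
    num_prefill_workers: int
    num_decode_workers: int
    decision_id: int


def read_decision(store: dict, namespace: str):
    """Parse the planner output keys; return None if no decision has been made."""
    # All values are integers stored as strings; decision_id is -1 when unset.
    decision_id = int(store.get(planner_key(namespace, "decision_id"), "-1"))
    if decision_id == -1:
        return None
    return ScalingDecision(
        num_prefill_workers=int(store[planner_key(namespace, "num_prefill_workers")]),
        num_decode_workers=int(store[planner_key(namespace, "num_decode_workers")]),
        decision_id=decision_id,
    )


def acknowledge(store: dict, namespace: str, decision: ScalingDecision) -> None:
    """Report completion by writing the decision's id to scaled_decision_id."""
    store[planner_key(namespace, "scaled_decision_id")] = str(decision.decision_id)
```

A deployment environment would call `read_decision`, apply the worker counts to its own infrastructure, and then call `acknowledge`; with a real ETCD client, the dictionary reads and writes would become get and put calls.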
#### Scaling Decision Flow
1. **Decision Generation**: The planner calculates optimal worker counts and writes them to ETCD with an incremented `decision_id`
1. **Decision Generation**: The planner calculates optimal worker counts
2. **Change Detection**: The planner skips scaling if the target counts match current counts, logging: `"No scaling needed (prefill=X, decode=Y), skipping ETCD update"`
2. **Change Detection**: The planner skips scaling if the target counts match current counts, logging: `"No scaling needed (prefill=X, decode=Y)"`
3. **Readiness Check**: Before making new decisions, the planner verifies that previous scaling operations have completed by checking if `scaled_decision_id >= decision_id`
4. **Timeout Handling**: If a scaling decision isn't acknowledged within 30 minutes (1800 seconds), the planner proceeds with new decisions anyway
5. **Completion Tracking**: The planner can optionally wait for scaling completion confirmation (blocking mode)
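
The change-detection, readiness, and timeout steps above can be sketched as two small pure functions. This is a minimal illustration under stated assumptions, not the planner's actual code: the function names, the `sent_at`/`now` timestamp parameters, and the choice to treat exactly 1800 seconds as timed out are all hypothetical.

```python
SCALING_TIMEOUT_SECONDS = 1800  # 30 minutes, as stated in the timeout step


def needs_scaling(current: tuple, target: tuple) -> bool:
    """Change detection: skip scaling when (prefill, decode) targets match current counts."""
    return current != target


def ready_for_new_decision(
    decision_id: int,
    scaled_decision_id: int,
    sent_at: float,
    now: float,
    timeout: float = SCALING_TIMEOUT_SECONDS,
) -> bool:
    """Readiness check with timeout fallback.

    sent_at and now are timestamps in seconds (hypothetical parameters).
    """
    if decision_id < 0:
        # No decision has been issued yet (-1), so nothing is pending.
        return True
    if scaled_decision_id >= decision_id:
        # The deployment environment has acknowledged the latest decision.
        return True
    # Unacknowledged: proceed anyway only once the timeout has elapsed.
    return (now - sent_at) >= timeout
```

For example, `ready_for_new_decision(3, 2, sent_at=0.0, now=10.0)` is false because decision 3 is still pending, while the same call with `now=1800.0` is true because the timeout has elapsed.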
...
@@ -158,31 +146,23 @@ backend: "vllm" # or "sglang"
#### Deployment Environment Requirements
The external deployment environment must:
The external deployment environment must use `VirtualConnectorClient`:
1. **Monitor ETCD**: Continuously watch the `/{dynamo_namespace}/planner/` prefix for scaling decisions
2. **Parse Decisions**: Read `num_prefill_workers`, `num_decode_workers`, and `decision_id` values
3. **Execute Scaling**: Apply the scaling decisions to the actual deployment infrastructure
4. **Acknowledge Completion**: Write the completed `decision_id` to `scaled_decision_id` when scaling is finished
#### Example Integration
```
from dynamo._core import DistributedRuntime, VirtualConnectorClient