1. `local` - uses circus to start/stop worker subprocesses
2. `kubernetes` - uses the kubernetes API to adjust replicas of the DynamoGraphDeployment resource, which automatically scales the corresponding worker pods up or down
## Local Backend (LocalPlanner)
### Testing
For manual testing, you can use the `controller_test.py` file to add or remove components after you've run a serve command on a Dynamo pipeline where the planner is linked.
## Kubernetes Backend
The Kubernetes backend works by updating the replicas count of the DynamoGraphDeployment custom resource. When the planner determines that workers need to be scaled up or down based on workload metrics, it uses the Kubernetes API to patch the DynamoGraphDeployment resource specification, changing the replicas count for the appropriate worker component. The Kubernetes operator then reconciles this change by creating or terminating the necessary pods. This provides a seamless autoscaling experience in Kubernetes environments without requiring manual intervention.
The Kubernetes backend is used automatically by the planner when your pipeline is deployed with `dynamo deployment create`. By default, the planner runs in no-op mode: it monitors metrics but takes no scaling actions. To enable actual scaling, also specify `--Planner.no-operation=false`.
There are two additional rules set by the planner to prevent over-compensation:
1. We do not scale up prefill workers if the prefill queue size is estimated to drop below `--prefill-queue-scale-up-threshold` within the next `NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD=3` adjustment intervals, following the trend observed in the current adjustment interval.
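As a sketch, this buffer-period rule amounts to a simple linear trend projection. The function and variable names below are illustrative assumptions, not the planner's actual implementation:

```python
# Hypothetical sketch of the prefill scale-up guard described above.
# Names and the linear-trend model are assumptions, not the planner's code.

NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD = 3

def should_scale_up_prefill(queue_size: float,
                            queue_delta_per_interval: float,
                            scale_up_threshold: float) -> bool:
    """Skip scale-up if the current trend would drain the queue below
    the threshold within the buffer period anyway."""
    if queue_size <= scale_up_threshold:
        return False  # not above the threshold in the first place
    projected = (queue_size
                 + queue_delta_per_interval * NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD)
    if projected < scale_up_threshold:
        return False  # queue is already trending down fast enough
    return True
```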
## Usage
The planner is started automatically as part of Dynamo pipelines when running `dynamo serve`. You can configure the planner like any other component in your pipeline, either via YAML configuration or through CLI arguments.
Usage:
```bash
# Configure the planner through YAML configuration
dynamo serve graphs.disagg:Frontend -f disagg.yaml
# disagg.yaml
# ...
# Planner:
# environment: local
# no-operation: false
# log-dir: log/planner
# Configure the planner through CLI arguments
dynamo serve graphs.disagg:Frontend -f disagg.yaml --Planner.environment=local --Planner.no-operation=false --Planner.log-dir=log/planner
```
The planner accepts the following configuration options:
* `namespace` (str, default: "dynamo"): Namespace planner will look at
* `served-model-name` (str, default: "vllm"): Model name that is being served
* `no-operation` (bool, default: false): Do not make any adjustments, just observe the metrics and log to tensorboard.
* `adjustment-interval` (int, default: 30): Interval in seconds between scaling adjustments
* `metric-pulling-interval` (int, default: 1): Interval in seconds between metric pulls
* `max-gpu-budget` (int, default: 8): Maximum number of GPUs to use, planner will not scale up more than this number of GPUs for prefill plus decode workers
* `min-gpu-budget` (int, default: 1): Minimum number of GPUs to use, planner will not scale down below this number of GPUs for prefill or decode workers
* `decode-kv-scale-up-threshold` (float, default: 0.9): KV cache utilization threshold to scale up decode workers
* `decode-kv-scale-down-threshold` (float, default: 0.5): KV cache utilization threshold to scale down decode workers
* `prefill-queue-scale-up-threshold` (float, default: 0.5): Queue utilization threshold to scale up prefill workers
* `prefill-queue-scale-down-threshold` (float, default: 0.2): Queue utilization threshold to scale down prefill workers
* `decode-engine-num-gpu` (int, default: 1): Number of GPUs per decode engine
* `prefill-engine-num-gpu` (int, default: 1): Number of GPUs per prefill engine
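To illustrate how the KV-cache utilization thresholds above interact, here is a minimal sketch of a threshold-based decision; the function name and structure are assumptions for illustration, not the planner's real code:

```python
# Illustrative only: how the decode KV-cache thresholds drive a scaling
# decision. The function name and return convention are assumptions.

def decode_scaling_decision(kv_utilization: float,
                            scale_up_threshold: float = 0.9,
                            scale_down_threshold: float = 0.5) -> int:
    """Return +1 to add a decode worker, -1 to remove one, 0 to hold."""
    if kv_utilization > scale_up_threshold:
        return 1
    if kv_utilization < scale_down_threshold:
        return -1
    return 0
```

Utilization between the two thresholds is a deliberate dead band: it prevents the planner from oscillating between scale-up and scale-down on small metric fluctuations.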
Alternatively, you can run the planner as a standalone Python process. The configuration options above can be passed in directly as CLI arguments:
* `--adjustment-interval` (int, default: 30): Interval in seconds between scaling adjustments
* `--metric-pulling-interval` (int, default: 1): Interval in seconds between metric pulls
* `--max-gpu-budget` (int, default: 8): Maximum number of GPUs to use, planner will not scale up more than this number of GPUs for prefill plus decode workers
* `--min-gpu-budget` (int, default: 1): Minimum number of GPUs to use, planner will not scale down below this number of GPUs for prefill or decode workers
* `--decode-kv-scale-up-threshold` (float, default: 0.9): KV cache utilization threshold to scale up decode workers
* `--decode-kv-scale-down-threshold` (float, default: 0.5): KV cache utilization threshold to scale down decode workers
* `--prefill-queue-scale-up-threshold` (float, default: 0.5): Queue utilization threshold to scale up prefill workers
* `--prefill-queue-scale-down-threshold` (float, default: 0.2): Queue utilization threshold to scale down prefill workers
* `--decode-engine-num-gpu` (int, default: 1): Number of GPUs per decode engine
* `--prefill-engine-num-gpu` (int, default: 1): Number of GPUs per prefill engine
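The GPU budget options bound every scaling decision. A minimal sketch of that constraint check, with illustrative names (not the actual planner logic):

```python
# Hypothetical sketch of how the GPU budget flags above constrain scaling.
# Names and the exact check are assumptions; the real planner may differ.

def fits_gpu_budget(prefill_workers: int, decode_workers: int,
                    prefill_engine_num_gpu: int = 1,
                    decode_engine_num_gpu: int = 1,
                    max_gpu_budget: int = 8,
                    min_gpu_budget: int = 1) -> bool:
    """Return True if the proposed worker counts fit the GPU budget:
    prefill plus decode GPUs must not exceed max_gpu_budget, and each
    worker type must keep at least min_gpu_budget GPUs."""
    prefill_gpus = prefill_workers * prefill_engine_num_gpu
    decode_gpus = decode_workers * decode_engine_num_gpu
    if prefill_gpus + decode_gpus > max_gpu_budget:
        return False
    if prefill_gpus < min_gpu_budget or decode_gpus < min_gpu_budget:
        return False
    return True
```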
1. `local` - uses circus to start/stop worker subprocesses
2. `kubernetes` - uses kubernetes to scale up/down the number of worker pods by updating the replicas count of the DynamoGraphDeployment resource
### Local Backend
> [!NOTE]
> At the moment, the planner works best if your initial replicas per worker is 1. This is because if you specify replicas > 1 when you initially start `dynamo serve`, the current implementation in `serving.py` starts each process in the same watcher.
### Kubernetes Backend
The Kubernetes backend works by updating the replicas count of the DynamoGraphDeployment custom resource. When the planner detects the need to scale up or down a specific worker type, it uses the Kubernetes API to patch the DynamoGraphDeployment resource, modifying the replicas count for the appropriate component. The Kubernetes operator then reconciles this change by creating or removing the necessary pods. This provides a seamless scaling experience in Kubernetes environments without requiring manual intervention.
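As a rough sketch, the patch described above could be built and applied with the official Kubernetes Python client. The CRD group, version, resource name, and the `services` field path below are assumptions about the DynamoGraphDeployment schema, not documented values:

```python
# Illustrative sketch of the replicas patch described above. The resource
# group/version/plural and the "services" field path are assumptions about
# the DynamoGraphDeployment CRD, not a documented schema.

def build_replicas_patch(component: str, replicas: int) -> dict:
    """Build a merge-patch body setting the replicas count for one component."""
    return {"spec": {"services": {component: {"replicas": replicas}}}}

# Applying it with the official Kubernetes Python client might look like:
#
#   from kubernetes import client, config
#   config.load_kube_config()
#   api = client.CustomObjectsApi()
#   api.patch_namespaced_custom_object(
#       group="nvidia.com",              # assumed group
#       version="v1alpha1",              # assumed version
#       namespace="dynamo",
#       plural="dynamographdeployments",
#       name="my-deployment",            # hypothetical deployment name
#       body=build_replicas_patch("VllmWorker", 3),
#   )
```

The operator's reconciliation loop then observes the changed replicas count and creates or terminates pods to match.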
```
> [!NOTE]
> The planner component is enabled by default for all deployment architectures but is set to no-op mode. This means the planner observes metrics but doesn't take scaling actions. To enable active scaling, you can add `--Planner.no-operation=false` to your `dynamo serve` command. For more details, see the [Planner documentation](../../components/planner/README.md).
### Example architectures
_Note_: For a non-dockerized deployment, first export `DYNAMO_HOME` to point to the dynamo repository root, e.g. `export DYNAMO_HOME=$(pwd)`
dynamo deployment create $DYNAMO_TAG -n $DEPLOYMENT_NAME -f ./configs/agg.yaml
```
**Note**: Optionally add `--Planner.no-operation=false` at the end of the deployment command to enable the planner component to take scaling actions on your deployment.
### Testing the Deployment
Once the deployment is complete, you can test it using: