This document covers the load-based planner in `examples/llm/components/planner.py`.
> [!WARNING]
> Bare-metal deployment with the local connector is deprecated. The load-based planner can now be deployed only via Kubernetes (k8s). We will update the examples in this document soon.
@@ -7,6 +7,9 @@ The SLA (Service Level Agreement)-based planner is an intelligent autoscaling sy
...
> [!NOTE]
> Currently, the SLA-based planner supports only disaggregated setups.
> [!WARNING]
> Bare-metal deployment with the local connector is deprecated. The SLA-based planner can now be deployed only via Kubernetes (k8s). We will update the examples in this document soon.
## Features
* **SLA-driven scaling**: Automatically scales prefill/decode workers to meet TTFT and ITL targets
This guide walks through benchmarking `LocalPlanner` performance with synthetic data. The example uses 8x H100 SXM GPUs and the `deepseek-ai/DeepSeek-R1-Distill-Llama-8B` model with TP1 prefill and decode engines.
> [!WARNING]
> Bare-metal deployment with the local connector is deprecated. The planner can now be deployed only via Kubernetes (k8s). We will update the examples in this document soon.
## Synthetic Data Generation
We first generate synthetic data with a request rate that varies from 0.75 to 3, using the provided `generate_synthetic_data.py` script.
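
To make "varying request rate" concrete, here is a minimal standalone sketch of one way such a trace could be produced. It is not the implementation of `generate_synthetic_data.py`: the sinusoidal shape, the 600-second period, the unit (requests per second), and the names `request_rate` and `generate_arrivals` are all assumptions made for illustration.

```python
import math
import random

# All constants below are illustrative assumptions, not values taken from
# generate_synthetic_data.py.
LOW_RATE = 0.75   # assumed minimum request rate (requests/second)
HIGH_RATE = 3.0   # assumed maximum request rate (requests/second)
PERIOD = 600.0    # assumed length of one load cycle (seconds)


def request_rate(t):
    """Hypothetical sinusoidal rate oscillating between LOW_RATE and HIGH_RATE."""
    mid = (LOW_RATE + HIGH_RATE) / 2
    amplitude = (HIGH_RATE - LOW_RATE) / 2
    return mid + amplitude * math.sin(2 * math.pi * t / PERIOD)


def generate_arrivals(duration, seed=0):
    """Sample arrival timestamps from a non-homogeneous Poisson process.

    Uses thinning: draw candidate arrivals at the peak rate, then keep each
    candidate with probability request_rate(t) / HIGH_RATE.
    """
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    while True:
        t += rng.expovariate(HIGH_RATE)  # gap to the next candidate arrival
        if t >= duration:
            break
        if rng.random() < request_rate(t) / HIGH_RATE:
            arrivals.append(t)
    return arrivals


if __name__ == "__main__":
    timestamps = generate_arrivals(duration=PERIOD)
    print(f"generated {len(timestamps)} requests over {PERIOD:.0f} s")
```

A trace like this exercises both scale-up and scale-down within one cycle, which is the behavior a planner benchmark needs to observe. The seeded `random.Random` keeps the trace reproducible across runs.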