docs: add Planner Quickstart doc (#3358)

Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: hhzhang16 <54051230+hhzhang16@users.noreply.github.com>
Co-authored-by: Hongkuan Zhou <tedzhouhk@gmail.com>

Unverified Commit 89cf9107 authored Oct 03, 2025 by hhzhang16 Committed by GitHub Oct 03, 2025
12 changed files
--- a/components/backends/vllm/deploy/README.md
+++ b/components/backends/vllm/deploy/README.md
@@ -237,7 +237,7 @@ args:
 - **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/kubernetes/create_deployment.md)
 - **Quickstart**: [Deployment Quickstart](../../../../docs/kubernetes/README.md)
 - **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/kubernetes/installation_guide.md)
- **SLA Planner**: [SLA Planner Deployment Guide](../../../../docs/kubernetes/sla_planner_deployment.md)
+- **SLA Planner**: [SLA Planner Quickstart Guide](../../../../docs/kubernetes/sla_planner_quickstart.md)
 - **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
 - **Architecture Docs**: [Disaggregated Serving](../../../../docs/architecture/disagg_serving.md), [KV-Aware Routing](../../../../docs/architecture/kv_cache_routing.md)

--- a/docs/architecture/planner_intro.rst
+++ b/docs/architecture/planner_intro.rst
@@ -23,10 +23,16 @@ Currently, the planner can scale the number of vllm workers up and down based on
 Key features include:
-* **Load-based scaling** that monitors KV cache utilization and prefill queue size to make scaling decisions
 * **SLA-based scaling** that uses predictive modeling and performance interpolation to proactively meet TTFT and ITL targets
 * **Graceful scaling** that ensures no requests are dropped during scale-down operations
+.. admonition:: 🚀 Quick Start
+   :class: seealso
+   **New to SLA Planner?** Start with the [SLA Planner Quick Start Guide](/docs/kubernetes/sla_planner_quickstart.md) for a complete, step-by-step workflow.
+   **Prerequisites**: The SLA-based planner requires pre-deployment profiling (2-4 hours on real silicon, or a few minutes using the simulator) before deployment. The Quick Start guide includes everything you need.
 .. list-table::
   :widths: 20 5 75
   :header-rows: 1
@@ -35,7 +41,7 @@ Key features include:
     -
     - Feature
   * - **Backend**
-     - ✅
+     - ❌
     - Local
   * -
     - ✅
@@ -47,7 +53,7 @@ Key features include:
     - ✅
     - TensorRT-LLM
   * -
-     - ❌
+     - ✅
     - SGLang
   * - **Serving Type**
     - ✅
@@ -56,7 +62,7 @@ Key features include:
     - ✅
     - Disaggregated
   * - **Planner Actions**
-     - ✅
+     - ❌
     - Load-based scaling up/down prefill/decode workers
   * -
     - ✅
@@ -71,6 +77,6 @@ Key features include:
   :hidden:
   Overview <self>
+   SLA Planner Quick Start <../kubernetes/sla_planner_quickstart>
   Pre-Deployment Profiling <../benchmarks/pre_deployment_profiling.md>
-   Load-based Planner <load_planner.md>
   SLA-based Planner <sla_planner.md>
--- a/docs/architecture/sla_planner.md
+++ b/docs/architecture/sla_planner.md
 # SLA-based Planner
-This document covers SLA-based planner in `examples/common/utils/planner_core.py`.
+> [!TIP]
+> **New to SLA Planner?** For a complete workflow including profiling and deployment, see the [SLA Planner Quick Start Guide](/docs/kubernetes/sla_planner_quickstart.md).
+This document describes the SLA-based planner in `examples/common/utils/planner_core.py`.
 The SLA (Service Level Agreement)-based planner is an intelligent autoscaling system that monitors system performance and adjusts the number of prefill and decode workers to meet specified TTFT and ITL targets. Unlike the load-based planner that scales based on resource utilization thresholds, the SLA planner uses predictive modeling and performance interpolation to proactively scale the workers.
@@ -10,6 +13,24 @@ The SLA (Service Level Agreement)-based planner is an intelligent autoscaling sy
 > [!WARNING]
 > Bare metal deployment with local connector is deprecated. Please deploy the SLA planner in k8s.
+## Architecture Overview
+**Components:**
+- **Frontend**: Serves requests and exposes `/metrics`
+- **Prometheus**: Scrapes frontend metrics every 5s (by default, can be updated in the podmonitor manifest)
+- **Planner**: Queries Prometheus and adjusts worker scaling every adjustment interval
+- **Workers**: prefill and backend workers handle inference
+The adjustment interval can be defined in the planner manifest as an argument. The default interval value can be found in this [file](/components/planner/src/dynamo/planner/defaults.py).
+```mermaid
+flowchart LR
+  Frontend --"/metrics"--> Prometheus
+  Planner --"query API"--> Prometheus
+  Planner --"scaling decisions"--> Workers
+  Frontend -.->|"requests"| Workers
+```
 ## Features
 * **SLA-driven scaling**: Automatically scales prefill/decode workers to meet TTFT and ITL targets
@@ -108,15 +129,7 @@ Finally, SLA planner applies the change by scaling up/down the number of prefill
 ## Deploying
-### K8s Deployment
+For complete deployment instructions, see the [SLA Planner Quick Start Guide](/docs/kubernetes/sla_planner_quickstart.md).
-For detailed deployment instructions including setup, configuration, troubleshooting, and architecture overview, see the [SLA Planner Deployment Guide](../kubernetes/sla_planner_deployment.md).
-**To deploy SLA Planner:**
-```bash
-cd components/backends/vllm/deploy
-kubectl apply -f disagg_planner.yaml -n {$NAMESPACE}
-```
 > [!NOTE]
 > The SLA planner requires a frontend that reports metrics at the `/metrics` HTTP endpoint with the number of requests, ISL, OSL, TTFT, and ITL in the correct format. The dynamo frontend provides these metrics automatically.
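
The planner-to-Prometheus query path described in the Architecture Overview can be sketched in Python as follows. This is a minimal illustration, not the planner's actual code: the helper names `build_query_url` and `first_sample_value` are hypothetical, the base URL and the `up` query are borrowed from the troubleshooting command in the removed deployment guide, and the canned payload (including its `job` label) only mimics the standard Prometheus instant-query response shape.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API (hypothetical helper)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def first_sample_value(response_text: str) -> float:
    """Return the value of the first sample in an instant-query response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    results = payload["data"]["result"]
    if not results:
        raise ValueError("query returned no samples")
    # Each vector sample carries "value": [unix_timestamp, "string_value"].
    _timestamp, value = results[0]["value"]
    return float(value)


# Canned response for illustration only; a real client would fetch it over HTTP.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {"job": "frontend"}, "value": [1700000000.0, "1"]}],
    },
})

url = build_query_url("http://localhost:9090", "up")
value = first_sample_value(SAMPLE_RESPONSE)
```

Keeping URL construction and response parsing separate from the network call makes both easy to check without a running Prometheus server.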

--- a/docs/benchmarks/pre_deployment_profiling.md
+++ b/docs/benchmarks/pre_deployment_profiling.md
 # Pre-Deployment Profiling
+> [!TIP]
+> **New to SLA Planner?** For a complete workflow including profiling and deployment, see the [SLA Planner Quick Start Guide](/docs/kubernetes/sla_planner_quickstart.md).
 ## Profiling Script
 To ensure Dynamo deployments comply with the SLA, we provide a pre-deployment script to profile the model performance with different parallelization mappings and recommend the parallelization mapping for prefill and decode workers and planner configurations. To use this script, the user needs to provide the target ISL, OSL, TTFT SLA, and ITL SLA.
@@ -93,40 +96,16 @@ After suggesting the optimal TP configuration, two `.npz` files that describe th
 SLA planner can work with any interpolation data that follows the above format. For best results, use fine-grained and high coverage interpolation data for the prefill and decode engines.
-## Running the Profiling Script in Kubernetes
+## Detailed Kubernetes Profiling Instructions
-Set up your Kubernetes namespace for profiling (one-time per namespace). First ensure Dynamo Cloud platform is installed by following the [main installation guide](/docs/kubernetes/installation_guide.md), then set up profiling resources using [deploy/utils/README](/deploy/utils/README.md). If your namespace is already set up, skip this step.
-**Prerequisites**: Ensure all dependencies are installed. If you ran the setup script above, dependencies are already installed. Otherwise, install them manually:
-```bash
-pip install -r deploy/utils/requirements.txt
-```
-**Step 1: Inject your DGD configuration**
+> [!TIP]
+> For a complete step-by-step workflow, see the [SLA Planner Quick Start Guide](/docs/kubernetes/sla_planner_quickstart.md).
-Use the injector utility to place your DGD manifest into the PVC. The profiling job will read the path you specify.
+This section provides detailed technical information for advanced users who need to customize the profiling process.
-   ```bash
-   # Use default disagg.yaml config
-   python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src components/backends/vllm/deploy/disagg.yaml --dest /data/configs/disagg.yaml
-   # Or use a custom disagg config file
-   python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src my-custom-disagg.yaml --dest /data/configs/disagg.yaml
-   # Or specify a custom target path in the PVC
-   python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src my-custom-disagg.yaml --dest /data/profiling_results/my-disagg.yaml
-   ```
-   > **Note**: All paths must start with `/data/` for security reasons. If you forget this prefix, the script will show a helpful error message with the correct path.
+### Configuration Options
-**Step 2: Set SLA target**
+**For dense models**, configure `$DYNAMO_HOME/benchmarks/profiler/deploy/profile_sla_job.yaml`:
-For dense models, edit `$DYNAMO_HOME/benchmarks/profiler/deploy/profile_sla_job.yaml` to set the target ISL, OSL, TTFT, and ITL. Also, set the backend type to match the dynamo deployment in the `DGD_CONFIG_FILE`.
-For MoE models, edit `$DYNAMO_HOME/benchmarks/profiler/deploy/profile_sla_moe_job.yaml` to set the target TEP, DEP, TTFT, and ITL.
-> [!NOTE]
-> If the model is too large to be downloaded every time, you can create a multi-attach PVC to cache the model. Refer to [recipes](../../recipes/README.md) for more details.
 ```yaml
 spec:
@@ -147,36 +126,13 @@ spec:
            - <vllm/sglang>
 ```
-**Step 3: Define the container image and config path**
+**For MoE models**, use `profile_sla_moe_job.yaml` with TEP/DEP configuration instead.
-1. **Set the container image:**
-   ```bash
-   export DOCKER_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
-   ```
-2. **Set the config path for the profiling job:**
-   ```bash
-   export DGD_CONFIG_FILE=/data/configs/disagg.yaml # should be the same path you set for --dest in Step 1
-   ```
-**Step 4: Run profiling (required)**
-```bash
-# for dense models
-envsubst < benchmarks/profiler/deploy/profile_sla_job.yaml | kubectl apply -f -
-# for MoE models
+### Advanced Configuration
-envsubst < benchmarks/profiler/deploy/profile_sla_moe_job.yaml | kubectl apply -f -
-# using aiconfigurator instead of real sweeping (see below for more details)
+- **Model caching**: For large models, create a multi-attach PVC to cache the model. See [recipes](../../recipes/README.md) for details.
-envsubst < benchmarks/profiler/deploy/profile_sla_aic_job.yaml | kubectl apply -f -
+- **Custom configurations**: Use the manifest injector to place custom DGD configurations in the PVC.
-```
+- **Resource allocation**: Modify the job YAML to adjust GPU and memory requirements.
-**Step 5: Wait for profiling to complete**
-```bash
-kubectl get jobs -n $NAMESPACE
-kubectl logs job/profile-sla -n $NAMESPACE
-```
 ### Viewing Profiling Results
@@ -265,53 +221,54 @@ If you see `ErrImagePull` or `ImagePullBackOff` errors with 401 unauthorized mes
 3. The service account should show `imagePullSecrets` containing `nvcr-imagepullsecret`.
-## Running the Profiling Script with `aiconfigurator`
+## Running the Profiling Script with AI Configurator
-The profiling script can be run much quicker by using `aiconfigurator` to estimate perf numbers instead of running and benchmarking real dynamo deployments. To enable estimation using `aiconfigurator`, pass the `--use-ai-configurator` flag to the profiling script.
+> [!NOTE]
+> **TensorRT-LLM Only**: AI Configurator currently supports TensorRT-LLM only. Support for vLLM and SGLang is coming soon.
+The profiling script can be run much faster using AI Configurator to estimate performance numbers instead of running real Dynamo deployments. This completes profiling in 20-30 seconds using performance simulation.
 **Advantages** of `--use-ai-configurator`:
-* Script will finish in seconds rather than hours.
+* Script completes in seconds rather than hours
-* No k8s or GPU access is required.
+* No Kubernetes or GPU access required
+* Ideal for rapid prototyping and testing
 **Disadvantages**:
-* Estimated perf could contain some error, especially when the input dimensions out-of-distribution compared to the sampled values in aiconfigurator.
+* Estimated performance may contain errors, especially for out-of-distribution input dimensions
-* `aiconfigurator` has a limited list of supported models.
+* Limited list of supported models, systems, and backends
-* `aiconfigurator`'s database has a limited list of systems and backends.
+* Less accurate than real deployment profiling
 ### Prerequisites
-You will need a virtual environment with `dynamo` installed. Either use the local dev environment or the docker images. If using local environment, install the required dependencies:
+Install AI Configurator:
 ```bash
-pip install -r deploy/utils/requirements.txt
+pip install aiconfigurator
 ```
-Additionally, install `aiconfigurator`:
+If using local environment, also install:
 ```bash
-pip install aiconfigurator
+pip install -r deploy/utils/requirements.txt
 ```
-### Available Models, Systems, and Backends
+### Check Support Matrix
-`aiconfigurator` supports a limited list of models, systems, and backends.
-You can use the `aiconfigurator` CLI to see the support matrix:
+View supported models, systems, and backends:
 ```bash
 aiconfigurator cli --help
 ```
-This will display:
-```
+**Supported configurations:**
-...options...
-  --model {GPT_7B,GPT_13B,GPT_30B,GPT_66B,GPT_175B,LLAMA2_7B,LLAMA2_13B,LLAMA2_70B,LLAMA3.1_8B,LLAMA3.1_70B,LLAMA3.1_405B,MOE_Mixtral8x7B,MOE_Mixtral8x22B,DEEPSEEK_V3,KIMI_K2,QWEN2.5_1.5B,QWEN2.5_7B,QWEN2.5_32B,QWEN2.5_72B,QWEN3_32B,QWEN3_235B,QWEN3_480B,Nemotron_super_v1.1}
-                        Model name
-  --system {h100_sxm,h200_sxm}
-                        System name
-  --backend {trtllm,sglang,vllm}
-                        Backend name, suport trtllm for now
-  --version VERSION     Version, 0.20.0,1.0.0rc3 for trtllm
-...more options...
 ```
+Models: GPT_7B, GPT_13B, GPT_30B, GPT_66B, GPT_175B, LLAMA2_7B, LLAMA2_13B, LLAMA2_70B, LLAMA3.1_8B, LLAMA3.1_70B, LLAMA3.1_405B, MOE_Mixtral8x7B, MOE_Mixtral8x22B, DEEPSEEK_V3, KIMI_K2, QWEN2.5_1.5B, QWEN2.5_7B, QWEN2.5_32B, QWEN2.5_72B, QWEN3_32B, QWEN3_235B, QWEN3_480B, Nemotron_super_v1.1
-### Running the Script
+Systems: h100_sxm, h200_sxm
-In addition to passing the `--use-ai-configurator` flag, you must also provide the `--aic-system`, `--aic-model-name`, and `--backend-version` arguments.
+Backends: trtllm (vllm and sglang support coming soon)
+```
-Example command:
+### Running Fast Profiling
+Example command for TensorRT-LLM:
 ```bash
 python3 -m benchmarks.profiler.profile_sla \
   --config ./components/backends/trtllm/deploy/disagg.yaml \
@@ -319,6 +276,11 @@ python3 -m benchmarks.profiler.profile_sla \
   --aic-system h200_sxm \
   --aic-model-name QWEN3_32B \
   --backend trtllm \
-   --backend-version 0.20.0
+   --backend-version 0.20.0 \
+   --isl 3000 \
+   --osl 150 \
+   --ttft 0.2 \
+   --itl 0.02
 ```
-The output will be written to `./profiling_results/`.
+The output will be written to `./profiling_results/` and can be used directly with SLA planner deployment.
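
As a rough sanity check on the SLA targets passed to the example command above, the end-to-end latency they imply for a single request can be estimated from TTFT, ITL, and OSL. The sketch below is illustrative only: `estimate_request_latency` is a hypothetical helper, it assumes the `--ttft` and `--itl` values are expressed in seconds, and it uses the common approximation that the first token costs one TTFT and every remaining output token costs one ITL.

```python
def estimate_request_latency(ttft_s: float, itl_s: float, osl: int) -> float:
    """Approximate per-request latency in seconds (hypothetical helper).

    Assumes the first token arrives after ttft_s and each of the
    remaining (osl - 1) tokens arrives itl_s after the previous one.
    """
    if osl < 1:
        raise ValueError("osl must be at least 1")
    return ttft_s + itl_s * (osl - 1)


# Targets from the example command, assumed to be in seconds.
latency_s = estimate_request_latency(ttft_s=0.2, itl_s=0.02, osl=150)
print(f"estimated request latency: {latency_s:.2f} s")
```

If the estimate is far from the latency your application can tolerate, revisit the TTFT and ITL targets before spending time on profiling.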
--- a/docs/guides/planner_benchmark/OLD_disagg_1p1d.yml
+++ b/docs/guides/planner_benchmark/OLD_disagg_1p1d.yml
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-Common:
-  model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-  block-size: 64
-  max-model-len: 16384
-  kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
-  router: kv-load
-Frontend:
-  served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-  endpoint: dynamo.Processor.chat/completions
-  port: 8000
-Processor:
-  common-configs: [model, block-size, router]
-Router:
-  min-workers: 1
-  common-configs: [model, block-size, router]
-VllmWorker:
-  remote-prefill: true
-  conditional-disagg: false
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 1
-  common-configs: [model, block-size, max-model-len, kv-transfer-config]
-PrefillWorker:
-  max-num-batched-tokens: 16384
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 1
-  common-configs: [model, block-size, max-model-len, kv-transfer-config]
-Planner:
-  environment: local
-  no-operation: false
-  metric-pulling-interval: 1
-  adjustment-interval: 10
-  prefill-queue-scale-down-threshold: 0.2
-  prefill-queue-scale-up-threshold: 10
-  decode-kv-scale-down-threshold: 0.3
-  decode-kv-scale-up-threshold: 0.6
-  log-dir: log/planner
--- a/docs/guides/planner_benchmark/OLD_disagg_2p2d.yaml
+++ b/docs/guides/planner_benchmark/OLD_disagg_2p2d.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-Common:
-  model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-  block-size: 64
-  max-model-len: 16384
-  kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
-  router: kv-load
-Frontend:
-  served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-  endpoint: dynamo.Processor.chat/completions
-  port: 8000
-Processor:
-  router: kv-load
-  common-configs: [model, block-size, router]
-Router:
-  min-workers: 1
-  common-configs: [model, block-size, router]
-VllmWorker:
-  remote-prefill: true
-  conditional-disagg: false
-  ServiceArgs:
-    workers: 2
-    resources:
-      gpu: 1
-  common-configs: [model, block-size, max-model-len, kv-transfer-config]
-PrefillWorker:
-  max-num-batched-tokens: 16384
-  ServiceArgs:
-    workers: 2
-    resources:
-      gpu: 1
-  common-configs: [model, block-size, max-model-len, kv-transfer-config]
-Planner:
-  environment: local
-  no-operation: true
-  log-dir: log/2p2d
--- a/docs/guides/planner_benchmark/README.md
+++ b/docs/guides/planner_benchmark/README.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-->
-# Planner Benchmark Example
-This guide shows an example of benchmarking `LocalPlanner` performance with synthetic data. In this example, we focus on 8x H100 SXM GPU and `deepseek-ai/DeepSeek-R1-Distill-Llama-8B` model with TP1 prefill and decode engine.
-> [!WARNING]
-> Bare metal deployment with local connector is deprecated. The only option to deploy planner is via k8s. We will update the examples in this document soon.
-## Synthetic Data Generation
-We first generate synthetic data with varying request rate from 0.75 to 3 using the provided `generate_synthetic_data.py` script.
-```bash
-python sin_synth.py \
-    --time-duration 600 \
-    --request-rate-min 5 \
-    --request-rate-max 20 \
-    --request-rate-period 150 \
-    --isl1 3000 \
-    --osl1 150 \
-    --isl2 3000 \
-    --osl2 150
-```
-This generates a [mooncake style trace](https://github.com/kvcache-ai/Mooncake) with
-* duration = 600 seconds
-* isl/osl = 3000/150
-* request rate varies sinusoidally from 0.75 to 3 requests with a period of 150 seconds
-For other models and GPU SKUs, adjust the request rate ranges accordingly to match the load.
-## Run the Benchmark
-To measure the performance of dynamo with planner, we start from a 1p1d deployment and set planner to make adjustments every 10 seconds:
-```bash
-# Start Kubernetes with one frontend node, one prefill and one decode worker
-# TODO
-# in terminal 2
-genai-perf profile \
-    --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
-    -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
-    --endpoint-type chat \
-    --url http://localhost:8000 \
-    --streaming \
-    --input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl
-```
-To view the performance metrics and planner decisions, launch tensorboard with
-```bash
-tensorboard --logdir log
-```
-and open `http://localhost:6006` in your browser. The following metrics are available:
-* `average_kv_load`: the average KV load in decode workers
-* `prefill_queue_size`: the size of the prefill queue
-* `num_queued_request`: the number of requests queued in decode workers
-* `num_prefill_workers`: the number of prefill workers
-* `num_decode_workers`: the number of decode workers
-* `num_gpu`: the total number of GPUs used
-The benchmark results are printed out in terminal 3 that runs the `genai-perf` command.
-In this example, we use a fixed 2p2d engine as baseline. Planner provides a `--no-operation` flag to watch and log the metrics without making any adjustments:
-```bash
-# in terminal 1
-# Start Kubernetes with one frontend node, two prefill and two decode workers
-# TODO
-# in terminal 2
-genai-perf profile --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B --service-kind openai --endpoint-type chat --url http://localhost:8000 --streaming --input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl
-```
-## Results
-The below two figures show the performance comparison between planner and the baseline 2p2d deployment. Planner achieves 1.5x speedup while using 7.4% less GPU resources.
-![Two bar charts comparing 2P2D and Planner. Planner shows lower GPU usage and lower average sequence latency.](../../images/planner_perf.png)
-![Planner Tensorboard; four line graphs comparing two runs: 2p2d_rr5-20_2 and planner_rr5-20.](../../images/planner_tensorboard.png)
--- a/docs/hidden_toctree.rst
+++ b/docs/hidden_toctree.rst
@@ -33,7 +33,6 @@
   kubernetes/model_caching_with_fluid.md
   kubernetes/README.md
   guides/dynamo_run.md
-   kubernetes/sla_planner_deployment.md
   guides/metrics.md
   guides/run_kvbm_in_vllm.md
   guides/run_kvbm_in_trtllm.md

--- a/docs/index.rst
+++ b/docs/index.rst
@@ -72,7 +72,7 @@ Quickstart
   :caption: Developer Guide
   Benchmarking Guide <benchmarks/benchmarking.md>
-   Planner Benchmark Example <guides/planner_benchmark/README.md>
+   SLA Planner (Autoscaling) Quickstart <kubernetes/sla_planner_quickstart>
   Logging <guides/logging.md>
   Health Checks <guides/health_check.md>
   Tuning Disaggregated Serving Performance <guides/disagg_perf_tuning.md>

--- a/docs/kubernetes/installation_guide.md
+++ b/docs/kubernetes/installation_guide.md
@@ -188,7 +188,7 @@ kubectl get pods -n ${NAMESPACE}
 3. **Optional:**
   - [Set up Prometheus & Grafana](metrics.md)
-   - [SLA Planner Deployment Guide](sla_planner_deployment.md) (for advanced SLA-aware scheduling and autoscaling)
+   - [SLA Planner Quickstart Guide](sla_planner_quickstart.md) (for SLA-aware scheduling and autoscaling)
 ## Troubleshooting

--- a/docs/kubernetes/sla_planner_deployment.md
+++ b/docs/kubernetes/sla_planner_deployment.md
-# SLA Planner Deployment Guide
-Quick deployment guide for the disaggregated planner with automatic scaling.
-> [!NOTE]
-> For high-level architecture and concepts, see [SLA-based Planner](/docs/architecture/sla_planner.md).
-## Architecture Overview
-**Components:**
- **Frontend**: Serves requests and exposes `/metrics`
- **Prometheus**: Scrapes frontend metrics every 5s (by default, can be updated in the podmonitor manifest)
- **Planner**: Queries Prometheus and adjusts worker scaling every adjustment interval
- **Workers**: prefill and backend workers handle inference
-The adjustment interval can be defined in the planner manifest as an argument. The default interval value can be found in this [file](/components/planner/src/dynamo/planner/defaults.py).
-```mermaid
-flowchart LR
-  Frontend --"/metrics"--> Prometheus
-  Planner --"query API"--> Prometheus
-  Planner --"scaling decisions"--> Workers
-  Frontend -.->|"requests"| Workers
-```
-## Prerequisites
- Kubernetes cluster with GPU nodes
- [Pre-Deployment Profiling](/docs/benchmarks/pre_deployment_profiling.md) completed and its results saved to `dynamo-pvc` PVC.
- Prefill and decode worker uses the best parallelization mapping suggested by the pre-deployment profiling script.
- [kube-prometheus-stack](/docs/kubernetes/metrics.md) installed and running. By default, the prometheus server is not deployed in the `monitoring` namespace. If it is deployed to a different namespace, set `dynamo-operator.dynamo.metrics.prometheusEndpoint="http://prometheus-kube-prometheus-prometheus.<namespace>.svc.cluster.local:9090"`.
-> [!NOTE]
-> **Important**: The profiling that occurs before Planner deployment requires additional Kubernetes manifests (ServiceAccount, Role, RoleBinding, PVC) that are not included in standard Dynamo deployments. Apply these manifests in the same namespace as `$NAMESPACE`. For a complete setup, start with the [Quick Start guide](/deploy/utils/README.md#quick-start), which provides a fully encapsulated deployment including all required manifests.
-```bash
-export NAMESPACE=your-namespace
-```
-## 1. Deploy the System
-We use vllm as the backend engine in this guide. SLA planner also supports SGLang and TensorRT-LLM. Checkout `disagg_planner.yaml` in their example deployment folders for more details. The deployment is the same for all backends.
-```bash
-# Apply the disaggregated planner deployment
-kubectl apply -f components/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE # for vllm
-kubectl apply -f components/backends/sglang/deploy/disagg_planner.yaml -n $NAMESPACE # for sglang
-kubectl apply -f components/backends/trtllm/deploy/disagg_planner.yaml -n $NAMESPACE # for trtllm
-# Check deployment status
-kubectl get pods -n $NAMESPACE
-```
-Expected pods (all should be `1/1 Running`):
-```
-# For vLLM:
-vllm-disagg-planner-frontend-*            1/1 Running
-vllm-disagg-planner-planner-*             1/1 Running
-vllm-disagg-planner-backend-*             1/1 Running
-vllm-disagg-planner-prefill-*             1/1 Running
-```
-## 2. Test the System
-```bash
-# Port forward to frontend
-kubectl port-forward -n $NAMESPACE deployment/vllm-disagg-planner-frontend 8000:8000
-# Send a request
-curl -N http://localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "Qwen/Qwen3-0.6B",
-    "messages": [
-    {
-        "role": "user",
-        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
-    }
-    ],
-    "stream":true,
-    "max_tokens": 30
-  }'
-```
-## 3. Monitor Scaling
-```bash
-# Check planner logs for scaling decisions
-kubectl logs -n $NAMESPACE deployment/vllm-disagg-planner-planner --tail=10
-# Expected successful output (after streaming requests):
-# New adjustment interval started!
-# Observed num_req: X.XXX isl: X.XXX osl: X.XXX
-# Observed ttft: X.XXXs itl: X.XXXs
-# Number of prefill workers: 1, number of decode workers: 1
-```
-### Metrics Requirements
- **Basic metrics** (request count): Available with any request type
- **Latency metrics** (TTFT/ITL): Available for both streaming and non-streaming requests
- **Scaling decisions**: Require sufficient request volume
-## 4. Troubleshooting
-**Connection Issues:**
-```bash
-# Verify Prometheus is accessible
-kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n monitoring 9090:9090
-curl "http://localhost:9090/api/v1/query?query=up"
-```
-**Missing Metrics:**
-```bash
-# Check frontend metrics
-kubectl port-forward -n $NAMESPACE deployment/vllm-disagg-planner-frontend 8000:8000
-curl http://localhost:8000/metrics | grep nv_llm_http_service
-```
-**Worker Issues:**
- Large models can take 10+ minutes to initialize
- Check worker logs: `kubectl logs -n $NAMESPACE deployment/vllm-disagg-planner-backend`
- Ensure GPU resources are available for workers
-**Unknown Field subComponentType:**
-If you encounter the following error when attempting to apply the deployment:
-```bash
-Error from server (BadRequest): error when creating "components/backends/vllm/deploy/disagg.yaml": DynamoGraphDeployment in version "v1alpha1" cannot be handled as a DynamoGraphDeployment: strict decoding error: unknown field "spec.services.DecodeWorker.subComponentType", unknown field "spec.services.PrefillWorker.subComponentType"
-```
-This is because the `subComponentType` field has only been added in newer versions of the DynamoGraphDeployment CRD (> 0.5.0). You can upgrade the CRD version by following the instructions [here](/docs/kubernetes/installation_guide.md).
--- a/docs/kubernetes/sla_planner_quickstart.md
+++ b/docs/kubernetes/sla_planner_quickstart.md
+# SLA Planner Quick Start Guide
+Complete workflow to deploy SLA-based autoscaling for Dynamo deployments. This guide consolidates all necessary steps into a clear, sequential process.
+> [!IMPORTANT]
+> **Prerequisites**: This guide assumes you have a Kubernetes cluster with GPU nodes and have completed the [Dynamo Platform installation](/docs/kubernetes/installation_guide.md).
+## Overview
+The SLA Planner automatically scales prefill and decode workers to meet your TTFT (Time To First Token) and ITL (Inter-Token Latency) targets.
+The deployment process consists of two mandatory phases:
+1. **Pre-Deployment Profiling** (2-4 hours) - Generates performance data
+2. **SLA Planner Deployment** (5-10 minutes) - Enables autoscaling
+> [!TIP]
+> **Fast Profiling with AI Configurator**: For TensorRT-LLM users, AI Configurator (AIC) can complete profiling in 20-30 seconds by using performance simulation instead of real deployments. Support for vLLM and SGLang is coming soon. See the [AI Configurator section](/docs/benchmarks/pre_deployment_profiling.md#running-the-profiling-script-with-ai-configurator) in the Profiling Guide.
+```mermaid
+flowchart TD
+    A[Start Setup] --> B{Profiling Done?}
+    B -->|No| C[Run Profiling<br/>2-4 hours]
+    C --> D[Verify Results]
+    D --> E[Deploy Planner<br/>5-10 minutes]
+    B -->|Yes| E
+    E --> F[Test System]
+    F --> G[Ready!]
+    style A fill:#e1f5fe
+    style C fill:#fff3e0
+    style E fill:#e8f5e8
+    style G fill:#f3e5f5
+    style B fill:#fff8e1
+```
+## Phase 1: Pre-Deployment Profiling (REQUIRED)
+> [!WARNING]
+> **MANDATORY**: Pre-deployment profiling must be completed before deploying SLA planner. This process analyzes your model's performance characteristics to determine optimal tensor parallelism configurations and scaling parameters.
+### Step 1.1: Set Up Profiling Environment
+Set up your Kubernetes namespace for profiling (one-time per namespace). If your namespace is already set up, skip this step.
+```bash
+export NAMESPACE=your-namespace
+```
+**Prerequisites**: Ensure all dependencies are installed:
+```bash
+pip install -r deploy/utils/requirements.txt
+```
+### Step 1.2: Inject Your Configuration
+Use the injector utility to place your DynamoGraphDeployment (DGD) manifest into the PVC:
+```bash
+# Use default disagg.yaml config
+python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src components/backends/vllm/deploy/disagg.yaml --dest /data/configs/disagg.yaml
+# Or use a custom disagg config file
+python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src my-custom-disagg.yaml --dest /data/configs/disagg.yaml
+```
+> **Note**: All destination paths (`--dest`) must start with `/data/` for security reasons.
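Because the destination must live under `/data/`, it can help to check the path before calling the injector. The following is a minimal sketch; `check_dest_path` is a hypothetical helper written for this guide, not part of `deploy.utils`.

```shell
# Hypothetical helper (not part of deploy.utils): succeed only if the
# destination path starts with /data/ and names something beneath it.
check_dest_path() {
  case "$1" in
    /data/?*) return 0 ;;
    *) echo "destination must start with /data/: $1" >&2; return 1 ;;
  esac
}

# Example: validate the destination used in this guide.
DEST=/data/configs/disagg.yaml
check_dest_path "$DEST" && echo "ok: $DEST"
```

A check like this only catches typos early; the injector remains the source of truth for which paths are accepted.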
+### Step 1.3: Configure SLA Targets
+For dense models, edit `$DYNAMO_HOME/benchmarks/profiler/deploy/profile_sla_job.yaml`:
+```yaml
+spec:
+  template:
+    spec:
+      containers:
+        - name: profile-sla
+          args:
+            - --isl
+            - "3000" # average ISL is 3000 tokens
+            - --osl
+            - "150" # average OSL is 150 tokens
+            - --ttft
+            - "200" # target TTFT is 200ms
+            - --itl
+            - "20" # target ITL is 20ms
+            - --backend
+            - <vllm/sglang>
+```
+For MoE models, edit `$DYNAMO_HOME/benchmarks/profiler/deploy/profile_sla_moe_job.yaml` instead.
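As a rough sanity check on the example targets, a common back-of-the-envelope estimate for the end-to-end latency of one streamed request is `TTFT + (OSL - 1) * ITL`. This is an approximation for intuition only, and the profiler does not necessarily use it; the variable names below are illustrative.

```shell
# Values taken from the example args in this step (times in milliseconds).
OSL=150
TTFT_MS=200
ITL_MS=20

# The first token arrives after TTFT; each of the remaining OSL-1 tokens
# adds roughly one ITL.
E2E_MS=$(( TTFT_MS + (OSL - 1) * ITL_MS ))
echo "Approximate end-to-end latency per request: ${E2E_MS} ms"
```

With the example values this comes to about 3.2 seconds per request, which is a useful number to compare against what your application can tolerate before committing to a multi-hour profiling run.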
+### Step 1.4: Run Profiling
+Set the container image and config path:
+```bash
+export DOCKER_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+export DGD_CONFIG_FILE=/data/configs/disagg.yaml
+```
+Run profiling:
+```bash
+# Run exactly one of the following commands.
+# For dense models:
+envsubst < benchmarks/profiler/deploy/profile_sla_job.yaml | kubectl apply -f -
+# For MoE models:
+envsubst < benchmarks/profiler/deploy/profile_sla_moe_job.yaml | kubectl apply -f -
+# Using AI Configurator instead of a real sweep (currently TensorRT-LLM only; see the tip in the Overview):
+envsubst < benchmarks/profiler/deploy/profile_sla_aic_job.yaml | kubectl apply -f -
+```
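`envsubst` replaces unset variables with empty strings, so a missing export can silently produce a broken manifest. The sketch below guards against that; `require_vars` is a hypothetical helper, and it assumes the job template references the variables exported in this guide.

```shell
# Hypothetical helper: fail if any named variable is unset or empty.
require_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      return 1
    fi
  done
}

# Local demonstration with one of the variables from this step.
DGD_CONFIG_FILE=/data/configs/disagg.yaml
require_vars DGD_CONFIG_FILE && echo "DGD_CONFIG_FILE is set"

# Usage against a cluster (not run here):
# require_vars NAMESPACE DOCKER_IMAGE DGD_CONFIG_FILE &&
#   envsubst < benchmarks/profiler/deploy/profile_sla_job.yaml | kubectl apply -f -
```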
+### Step 1.5: Monitor Profiling Progress
+```bash
+kubectl get jobs -n $NAMESPACE
+kubectl logs job/profile-sla -n $NAMESPACE
+```
+> [!NOTE]
+> **Time Investment**: This profiling process is comprehensive and typically takes **2-4 hours** to complete. The script systematically tests multiple tensor parallelism configurations and load conditions to find optimal performance settings.
+### Step 1.6: Download Profiling Results (Optional)
+If you want to view the profiling results and performance plots:
+```bash
+# Download to directory
+python3 -m deploy.utils.download_pvc_results --namespace $NAMESPACE --output-dir ./results --folder /data/profiling_results
+```
+For detailed information about the output structure, performance plots, and how to analyze the results, see the [Viewing Profiling Results](/docs/benchmarks/pre_deployment_profiling.md#viewing-profiling-results) section in the Profiling Guide.
+**Verify Success**: Look for terminal output like:
+```
+Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
+Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
+```
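To pull the suggested tensor parallelism sizes out of a longer log, a small filter is enough. This is a sketch that assumes the log lines match the sample output shown in this step; `extract_suggested_tp` is a hypothetical helper name.

```shell
# Hypothetical helper: read profiler log text on stdin and print the
# suggested TP size per worker type as "prefill=N" / "decode=N".
extract_suggested_tp() {
  sed -nE 's/.*Suggested (prefill|decode) TP:([0-9]+).*/\1=\2/p'
}

# Example using the sample output from this step.
printf '%s\n' \
  'Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)' \
  'Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)' |
  extract_suggested_tp
```

Against a live job, the same filter could be fed from `kubectl logs job/profile-sla -n $NAMESPACE`.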
+## Phase 2: Deploy SLA Planner
+### Step 2.1: Verify Prerequisites
+Before deploying the SLA planner, ensure:
+- **Pre-deployment profiling completed successfully** (from Phase 1)
+- **Profiling results saved to `dynamo-pvc` PVC**
+- **[kube-prometheus-stack](/docs/kubernetes/metrics.md) installed and running.** By default, the Prometheus server is assumed to be deployed in the `monitoring` namespace. If it is deployed to a different namespace, set `dynamo-operator.dynamo.metrics.prometheusEndpoint="http://prometheus-kube-prometheus-prometheus.<namespace>.svc.cluster.local:9090"`.
+- **Dynamo platform installed** (see [Installation Guide](/docs/kubernetes/installation_guide.md))
+- **Prefill and decode workers use the best parallelization mapping from profiling**
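If Prometheus runs in a non-default namespace, the endpoint value follows the URL pattern given in the list above. The sketch below just builds that string; `prometheus_endpoint` is a hypothetical helper and `observability` is a made-up namespace used for illustration.

```shell
# Hypothetical helper: build the Prometheus endpoint URL for a namespace,
# following the service URL pattern from the prerequisites list.
prometheus_endpoint() {
  echo "http://prometheus-kube-prometheus-prometheus.$1.svc.cluster.local:9090"
}

# Example with an illustrative namespace name.
prometheus_endpoint observability
```

The resulting URL is what you would supply as the value of `dynamo-operator.dynamo.metrics.prometheusEndpoint`.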
+### Step 2.2: Deploy the System
+We use vLLM as the backend engine in this guide. The SLA planner also supports SGLang and TensorRT-LLM.
+```bash
+# Apply the disaggregated planner deployment for your backend (run only one of these)
+kubectl apply -f components/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE # for vLLM
+kubectl apply -f components/backends/sglang/deploy/disagg_planner.yaml -n $NAMESPACE # for SGLang
+kubectl apply -f components/backends/trtllm/deploy/disagg_planner.yaml -n $NAMESPACE # for TensorRT-LLM
+# Check deployment status
+kubectl get pods -n $NAMESPACE
+```
+**Expected pods** (all should be `1/1 Running`):
+```
+vllm-disagg-planner-frontend-*            1/1 Running
+vllm-disagg-planner-planner-*             1/1 Running
+vllm-disagg-planner-backend-*             1/1 Running
+vllm-disagg-planner-prefill-*             1/1 Running
+```
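Rather than scanning the pod list by eye, you can filter it for pods that are not yet ready. This is a sketch that assumes the default `kubectl get pods` column layout (`NAME READY STATUS RESTARTS AGE`); `not_ready_pods` is a hypothetical helper, and the pod name suffixes in the example are invented.

```shell
# Hypothetical helper: read `kubectl get pods` table output on stdin and
# print the name of every pod that is not fully ready and Running.
not_ready_pods() {
  awk '$1 == "NAME" { next }       # skip the header row if present
  {
    split($2, r, "/")              # READY column, e.g. "1/1"
    if (r[1] != r[2] || $3 != "Running") print $1
  }'
}

# Example with sample rows; the pod name suffixes are invented.
printf '%s\n' \
  'NAME READY STATUS RESTARTS AGE' \
  'vllm-disagg-planner-frontend-abc 1/1 Running 0 5m' \
  'vllm-disagg-planner-prefill-xyz 0/1 ContainerCreating 0 5m' |
  not_ready_pods
```

In a cluster, pipe `kubectl get pods -n $NAMESPACE` into the function; empty output means every listed pod is ready.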
+### Step 2.3: Test the System
+```bash
+# Port forward to frontend
+kubectl port-forward -n $NAMESPACE deployment/vllm-disagg-planner-frontend 8000:8000
+# Send a request
+curl -N http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen3-0.6B",
+    "messages": [
+    {
+        "role": "user",
+        "content": "Hello, how are you?"
+    }
+    ],
+    "stream":true,
+    "max_tokens": 30
+  }'
+```
+### Step 2.4: Monitor Scaling
+```bash
+# Check planner logs for scaling decisions
+kubectl logs -n $NAMESPACE deployment/vllm-disagg-planner-planner --tail=10
+```
+**Expected successful output** (after streaming requests):
+```
+New adjustment interval started!
+Observed num_req: X.XXX isl: X.XXX osl: X.XXX
+Observed ttft: X.XXXs itl: X.XXXs
+Number of prefill workers: 1, number of decode workers: 1
+```
+## Phase 3: Production Readiness
+### Monitoring Metrics
+- **Basic metrics** (request count): Available with any request type
+- **Latency metrics** (TTFT/ITL): Available for both streaming and non-streaming requests
+- **Scaling decisions**: Require sufficient request volume
+### Troubleshooting
+**Connection Issues:**
+```bash
+# Verify Prometheus is accessible
+kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n monitoring 9090:9090
+curl "http://localhost:9090/api/v1/query?query=up"
+```
+**Missing Metrics:**
+```bash
+# Check frontend metrics
+kubectl port-forward -n $NAMESPACE deployment/vllm-disagg-planner-frontend 8000:8000
+curl http://localhost:8000/metrics | grep nv_llm_http_service
+```
+**Worker Issues:**
+- Large models can take 10+ minutes to initialize
+- Check worker logs: `kubectl logs -n $NAMESPACE deployment/vllm-disagg-planner-backend`
+- Ensure GPU resources are available for workers
+**Unknown Field subComponentType:**
+If you encounter the following error when applying the deployment:
+```
+Error from server (BadRequest): error when creating "components/backends/vllm/deploy/disagg.yaml": DynamoGraphDeployment in version "v1alpha1" cannot be handled as a DynamoGraphDeployment: strict decoding error: unknown field "spec.services.DecodeWorker.subComponentType", unknown field "spec.services.PrefillWorker.subComponentType"
+```
+This happens because the `subComponentType` field was only added in newer versions of the DynamoGraphDeployment CRD (> 0.5.0). To fix it, upgrade the CRD by following the [Installation Guide](/docs/kubernetes/installation_guide.md).
+## Next Steps
+- **Architecture Details**: See [SLA-based Planner Architecture](/docs/architecture/sla_planner.md) for technical details
+- **Performance Tuning**: See [Pre-Deployment Profiling Guide](/docs/benchmarks/pre_deployment_profiling.md) for advanced profiling options
+- **Load Testing**: See [SLA Planner Load Test](/tests/planner/README.md) for comprehensive testing tools
+## Quick Reference
+| Phase | Duration | Purpose | Status Check |
+|-------|----------|---------|--------------|
+| Profiling | 2-4 hours | Generate performance data | `kubectl logs job/profile-sla` |
+| Deployment | 5-10 minutes | Enable autoscaling | `kubectl get pods` |
+| Testing | 5 minutes | Verify functionality | `kubectl logs deployment/vllm-disagg-planner-planner` |
+---
+> [!TIP]
+> **Need Help?** If you encounter issues, check the [troubleshooting section](#troubleshooting) or refer to the detailed guides linked in [Next Steps](#next-steps).
\ No newline at end of file