fix: typo in planner doc and log (#1165)

Commit 3d697d4d, authored May 22, 2025 by Hongkuan Zhou; committed by GitHub May 22, 2025

Showing 2 changed files with 7 additions and 6 deletions

docs/guides/planner.md +6 -5

examples/llm/utils/profile_sla.py +1 -1

--- a/docs/guides/planner.md
+++ b/docs/guides/planner.md
@@ -70,8 +70,8 @@ The script will first detect the number of available GPUs on the current nodes (
 After the profiling finishes, two plots will be generated in the `output-dir`. For example, here are the profiling results for `examples/llm/configs/disagg.yaml`:
-![Prefill Performance](images/h100_prefill_performance.png)
+![Prefill Performance](../images/h100_prefill_performance.png)
-![Decode Performance](images/h100_decode_performance.png)
+![Decode Performance](../images/h100_decode_performance.png)
 For the prefill performance, the script will plot the TTFT for different TP sizes and select the best TP size that meet the target TTFT SLA and delivers the best throughput per GPU. Based on how close the TTFT of the selected TP size is to the SLA, the script will also recommend the upper and lower bounds of the prefill queue size to be used in planner.
@@ -83,7 +83,7 @@ The following information will be printed out in the terminal:
 2025-05-16 15:20:24 - __main__ - INFO - Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
 2025-05-16 15:20:24 - __main__ - INFO - Suggested planner upper/lower bound for prefill queue size: 0.24/0.10
 2025-05-16 15:20:24 - __main__ - INFO - Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
-2025-05-16 15:20:24 - __main__ - INFO - Suggested planner upper/lower bound for decode kv cache utilization: 0.10/0.2
+2025-05-16 15:20:24 - __main__ - INFO - Suggested planner upper/lower bound for decode kv cache utilization: 0.20/0.10
 ```
 ## Usage
@@ -107,8 +107,9 @@ dynamo serve graphs.disagg:Frontend -f disagg.yaml --Planner.environment=local -
 The planner accepts the following configuration options:
 * `namespace` (str, default: "dynamo"): Namespace planner will look at
-* `served-model-name` (str, default: "vllm"): Model name that is being served`
+* `environment` (str, default: "local"): Environment to run the planner in (local, kubernetes)
-* `no-operation` (bool, default: false): Do not make any adjustments, just observe the metrics and log to tensorboard.
+* `served-model-name` (str, default: "vllm"): Model name that is being served
+* `no-operation` (bool, default: false): Do not make any adjustments, just observe the metrics and log to tensorboard
 * `log-dir` (str, default: None): Tensorboard logging directory
 * `adjustment-interval` (int, default: 30): Interval in seconds between scaling adjustments
 * `metric-pulling-interval` (int, default: 1): Interval in seconds between metric pulls

--- a/examples/llm/utils/profile_sla.py
+++ b/examples/llm/utils/profile_sla.py
@@ -643,5 +643,5 @@ if __name__ == "__main__":
    )
    # set a +- 20% range for the kv cache utilization
    logger.info(
-        f"Suggested planner upper/lower bound for decode kv cache utilization: {max(0.1, selected_decode_kv_cache_utilization - 0.2):.2f}/{min(1, selected_decode_kv_cache_utilization + 0.2):.2f}"
+        f"Suggested planner upper/lower bound for decode kv cache utilization: {min(1, selected_decode_kv_cache_utilization + 0.2):.2f}/{max(0.1, selected_decode_kv_cache_utilization - 0.2):.2f}"
    )
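
For reviewers, the corrected ordering in the log line can be checked in isolation. The sketch below is not part of the commit: the helper names `kv_cache_utilization_bounds` and `format_bounds` are hypothetical, and the 0.2 margin, the 0.1 floor, and the 1.0 ceiling are taken from the f-string in the hunk above. It only shows that the upper bound is now printed first, matching the "upper/lower" label.

```python
from typing import Tuple


def kv_cache_utilization_bounds(selected: float, margin: float = 0.2) -> Tuple[float, float]:
    """Return (upper, lower) bounds around the selected KV cache utilization.

    Hypothetical helper: mirrors the expressions in the log statement,
    clamping the upper bound to 1.0 and the lower bound to 0.1.
    """
    upper = min(1.0, selected + margin)
    lower = max(0.1, selected - margin)
    return upper, lower


def format_bounds(selected: float) -> str:
    """Format the bounds as 'upper/lower' with two decimals, as in the log."""
    upper, lower = kv_cache_utilization_bounds(selected)
    return f"{upper:.2f}/{lower:.2f}"


if __name__ == "__main__":
    print(format_bounds(0.5))  # → 0.70/0.30
```

Before this change the same two expressions were emitted in the opposite order, so the log showed the lower bound under the "upper" label.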