Commit 3d697d4d authored by Hongkuan Zhou, committed by GitHub

fix: typo in planner doc and log (#1165)

parent 6d5da821
...@@ -70,8 +70,8 @@ The script will first detect the number of available GPUs on the current nodes ( ...@@ -70,8 +70,8 @@ The script will first detect the number of available GPUs on the current nodes (
After the profiling finishes, two plots will be generated in the `output-dir`. For example, here are the profiling results for `examples/llm/configs/disagg.yaml`: After the profiling finishes, two plots will be generated in the `output-dir`. For example, here are the profiling results for `examples/llm/configs/disagg.yaml`:
![Prefill Performance](images/h100_prefill_performance.png) ![Prefill Performance](../images/h100_prefill_performance.png)
![Decode Performance](images/h100_decode_performance.png) ![Decode Performance](../images/h100_decode_performance.png)
For the prefill performance, the script will plot the TTFT for different TP sizes and select the best TP size that meets the target TTFT SLA and delivers the best throughput per GPU. Based on how close the TTFT of the selected TP size is to the SLA, the script will also recommend the upper and lower bounds of the prefill queue size to be used in planner. For the prefill performance, the script will plot the TTFT for different TP sizes and select the best TP size that meets the target TTFT SLA and delivers the best throughput per GPU. Based on how close the TTFT of the selected TP size is to the SLA, the script will also recommend the upper and lower bounds of the prefill queue size to be used in planner.
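The selection rule described in this paragraph (keep the TP sizes that meet the TTFT SLA, then take the one with the best throughput per GPU) can be sketched as below. This is a minimal illustration, not the profiling script's actual code: `PrefillProfile`, `select_prefill_tp`, and the `ttft_sla_ms` parameter are hypothetical names, and any measurements passed in are the caller's own.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PrefillProfile:
    """One profiled prefill configuration (hypothetical container)."""
    tp_size: int
    ttft_ms: float              # measured time to first token
    throughput_per_gpu: float   # tokens/s/GPU


def select_prefill_tp(
    profiles: Sequence[PrefillProfile], ttft_sla_ms: float
) -> Optional[PrefillProfile]:
    """Return the SLA-compliant profile with the highest throughput per GPU.

    Returns None when no profiled TP size meets the TTFT SLA.
    """
    # Step 1: drop every TP size whose TTFT exceeds the SLA.
    eligible = [p for p in profiles if p.ttft_ms <= ttft_sla_ms]
    if not eligible:
        return None
    # Step 2: among the remaining sizes, prefer the best throughput per GPU.
    return max(eligible, key=lambda p: p.throughput_per_gpu)
```

Returning `None` when nothing meets the SLA is a design choice of this sketch; how the real script handles that case is not shown in the diff.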
...@@ -83,7 +83,7 @@ The following information will be printed out in the terminal: ...@@ -83,7 +83,7 @@ The following information will be printed out in the terminal:
2025-05-16 15:20:24 - __main__ - INFO - Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU) 2025-05-16 15:20:24 - __main__ - INFO - Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
2025-05-16 15:20:24 - __main__ - INFO - Suggested planner upper/lower bound for prefill queue size: 0.24/0.10 2025-05-16 15:20:24 - __main__ - INFO - Suggested planner upper/lower bound for prefill queue size: 0.24/0.10
2025-05-16 15:20:24 - __main__ - INFO - Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU) 2025-05-16 15:20:24 - __main__ - INFO - Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
2025-05-16 15:20:24 - __main__ - INFO - Suggested planner upper/lower bound for decode kv cache utilization: 0.10/0.2 2025-05-16 15:20:24 - __main__ - INFO - Suggested planner upper/lower bound for decode kv cache utilization: 0.20/0.10
``` ```
## Usage ## Usage
...@@ -107,8 +107,9 @@ dynamo serve graphs.disagg:Frontend -f disagg.yaml --Planner.environment=local - ...@@ -107,8 +107,9 @@ dynamo serve graphs.disagg:Frontend -f disagg.yaml --Planner.environment=local -
The planner accepts the following configuration options: The planner accepts the following configuration options:
* `namespace` (str, default: "dynamo"): Namespace planner will look at * `namespace` (str, default: "dynamo"): Namespace planner will look at
* `served-model-name` (str, default: "vllm"): Model name that is being served` * `environment` (str, default: "local"): Environment to run the planner in (local, kubernetes)
* `no-operation` (bool, default: false): Do not make any adjustments, just observe the metrics and log to tensorboard. * `served-model-name` (str, default: "vllm"): Model name that is being served
* `no-operation` (bool, default: false): Do not make any adjustments, just observe the metrics and log to tensorboard
* `log-dir` (str, default: None): Tensorboard logging directory * `log-dir` (str, default: None): Tensorboard logging directory
* `adjustment-interval` (int, default: 30): Interval in seconds between scaling adjustments * `adjustment-interval` (int, default: 30): Interval in seconds between scaling adjustments
* `metric-pulling-interval` (int, default: 1): Interval in seconds between metric pulls * `metric-pulling-interval` (int, default: 1): Interval in seconds between metric pulls
......
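For reference, the options visible in this hunk and their documented defaults can be captured in a small container like the one below. This is a sketch under stated assumptions: `PlannerOptions` and `to_overrides` are hypothetical names, the validation rules are inferred from the option descriptions, and the `--Planner.<name>=<value>` form is confirmed by the source only for `environment`; applying it to the other options, and rendering booleans as lowercase `true`/`false`, are assumptions. Options elided from the hunk are not covered.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class PlannerOptions:
    """Defaults copied from the option list in the planner doc."""
    namespace: str = "dynamo"
    environment: str = "local"
    served_model_name: str = "vllm"
    no_operation: bool = False
    log_dir: Optional[str] = None
    adjustment_interval: int = 30      # seconds between scaling adjustments
    metric_pulling_interval: int = 1   # seconds between metric pulls

    def __post_init__(self) -> None:
        # The doc lists exactly these two environments.
        if self.environment not in ("local", "kubernetes"):
            raise ValueError(f"unknown environment: {self.environment!r}")
        # Assumption: intervals must be positive to be meaningful.
        if self.adjustment_interval <= 0 or self.metric_pulling_interval <= 0:
            raise ValueError("intervals must be positive")

    def to_overrides(self) -> List[str]:
        """Render non-default options as `--Planner.<name>=<value>` flags.

        Assumption: every option follows the pattern the doc shows for
        `environment`; this is not confirmed for the other options.
        """
        flags = []
        for f in fields(self):
            value = getattr(self, f.name)
            if value == f.default:
                continue  # only emit options that differ from the default
            name = f.name.replace("_", "-")
            if isinstance(value, bool):
                value = str(value).lower()
            flags.append(f"--Planner.{name}={value}")
        return flags
```

Emitting only the non-default options keeps the resulting command line short and makes the overrides easy to audit.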
...@@ -643,5 +643,5 @@ if __name__ == "__main__": ...@@ -643,5 +643,5 @@ if __name__ == "__main__":
) )
# set a +- 20% range for the kv cache utilization # set a +- 20% range for the kv cache utilization
logger.info( logger.info(
f"Suggested planner upper/lower bound for decode kv cache utilization: {max(0.1, selected_decode_kv_cache_utilization - 0.2):.2f}/{min(1, selected_decode_kv_cache_utilization + 0.2):.2f}" f"Suggested planner upper/lower bound for decode kv cache utilization: {min(1, selected_decode_kv_cache_utilization + 0.2):.2f}/{max(0.1, selected_decode_kv_cache_utilization - 0.2):.2f}"
) )
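The corrected log line prints the upper bound first and the lower bound second, matching its "upper/lower" label. The same arithmetic can be isolated into a standalone sketch. The function names and the `margin`, `floor`, and `ceiling` parameters are hypothetical, but the constants (0.2, 0.1, 1) are the ones in the f-string above.

```python
from typing import Tuple


def kv_cache_utilization_bounds(
    selected: float, margin: float = 0.2, floor: float = 0.1, ceiling: float = 1.0
) -> Tuple[float, float]:
    """Return (upper, lower) planner bounds around the selected utilization.

    The upper bound is capped at `ceiling` and the lower bound is raised
    to `floor`, mirroring min(1, x + 0.2) and max(0.1, x - 0.2).
    """
    upper = min(ceiling, selected + margin)
    lower = max(floor, selected - margin)
    return upper, lower


def format_bounds(selected: float) -> str:
    """Format the bounds in the same upper/lower order as the log message."""
    upper, lower = kv_cache_utilization_bounds(selected)
    return f"{upper:.2f}/{lower:.2f}"
```

Returning the pair in the same order as the label makes it harder to reintroduce the swap that this commit fixes.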