docs: add docs for profile_sla npz file format (#1859)

fde25fef · Hongkuan Zhou · GitHub · 382fa2c2 · fde25fef
Unverified Commit fde25fef authored Jul 10, 2025 by Hongkuan Zhou Committed by GitHub Jul 10, 2025

Showing with 16 additions and 1 deletion

docs/architecture/sla_planner.md +16 -1

--- a/docs/architecture/sla_planner.md
+++ b/docs/architecture/sla_planner.md
@@ -54,9 +54,24 @@ In decode engine, decode requests are added inflight and iteration time (or ITL)
 ![images](../images/itl_interpolation.png)
 The script profiles the selected decode TP configuration across different numbers of active KV blocks and average context lengths.
+### Output Format of Interpolation Data
+After suggesting the optimal TP configuration, the script generates two `.npz` files that describe the performance characteristics of the prefill and decode engines in their suggested parallel configurations. The two `.npz` files are:
+* `${benchmark_result_dir}/selected_prefill_interpolation/raw_data.npz`
+  * `prefill_isl`: a 1D NumPy array that stores the ISLs used to profile the prefill engine.
+  * `prefill_ttft`: a 1D NumPy array that stores the TTFTs at the corresponding ISLs when the prefill engine runs each prefill request exclusively (i.e., with a batch size of 1). The unit is milliseconds.
+  * `prefill_thpt_per_gpu`: a 1D NumPy array that stores the prefill throughput per GPU at the corresponding ISLs. The unit is tokens per second per GPU.
+* `${benchmark_result_dir}/selected_decode_interpolation/raw_data.npz`
+  * `max_kv_tokens`: a 1D NumPy array with a single element that stores the total number of KV tokens in the decode engine.
+  * `x_kv_usage`: a 1D NumPy array that stores the fraction of active KV blocks (in the range [0, 1]) used to profile the decode engine. The number of active KV blocks can be controlled by varying `(ISL + OSL / 2) * concurrency`.
+  * `y_context_length`: a 1D NumPy array that stores the average context length (ISL + OSL / 2) used to profile the decode engine.
+  * `z_itl`: a 1D NumPy array that stores the ITLs at the corresponding active KV usage and context length. To skip the prefill stage while maintaining the context length, the benchmark can turn on KV reuse and warm up the engine with the prompts before running the actual profiling. The unit is milliseconds.
+  * `z_thpt_per_gpu`: a 1D NumPy array that stores the decode throughput per GPU at the corresponding active KV usage and context length. The unit is tokens per second per GPU.
+The SLA planner can work with any interpolation data that follows the above format. For best results, use fine-grained, high-coverage interpolation data for the prefill and decode engines.
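
A minimal sketch, not part of the change above, of how files in the documented layout might be written and checked with NumPy. The helper names `write_example_files` and `load_interpolation_data` are hypothetical, every number is a synthetic placeholder rather than a measurement, and the equal-length check assumes that "corresponding" in the field descriptions means the per-sample arrays are index-aligned.

```python
import os
import tempfile

import numpy as np

# Array names taken from the format description above.
PREFILL_KEYS = ("prefill_isl", "prefill_ttft", "prefill_thpt_per_gpu")
DECODE_KEYS = (
    "max_kv_tokens",
    "x_kv_usage",
    "y_context_length",
    "z_itl",
    "z_thpt_per_gpu",
)


def write_example_files(benchmark_result_dir):
    """Write two raw_data.npz files filled with synthetic placeholder values."""
    prefill_dir = os.path.join(benchmark_result_dir, "selected_prefill_interpolation")
    decode_dir = os.path.join(benchmark_result_dir, "selected_decode_interpolation")
    os.makedirs(prefill_dir, exist_ok=True)
    os.makedirs(decode_dir, exist_ok=True)

    prefill_path = os.path.join(prefill_dir, "raw_data.npz")
    np.savez(
        prefill_path,
        prefill_isl=np.array([1000, 2000, 4000]),
        prefill_ttft=np.array([50.0, 110.0, 250.0]),  # milliseconds
        prefill_thpt_per_gpu=np.array([20000.0, 18000.0, 16000.0]),  # tokens/s/GPU
    )

    decode_path = os.path.join(decode_dir, "raw_data.npz")
    np.savez(
        decode_path,
        max_kv_tokens=np.array([100000]),  # single-element array
        x_kv_usage=np.array([0.1, 0.5, 0.1, 0.5]),  # fraction in [0, 1]
        y_context_length=np.array([1000.0, 1000.0, 4000.0, 4000.0]),
        z_itl=np.array([10.0, 14.0, 11.0, 16.0]),  # milliseconds
        z_thpt_per_gpu=np.array([900.0, 3000.0, 250.0, 800.0]),  # tokens/s/GPU
    )
    return prefill_path, decode_path


def load_interpolation_data(path, expected_keys):
    """Load an .npz file and check it against the documented layout."""
    with np.load(path) as data:
        missing = [key for key in expected_keys if key not in data.files]
        if missing:
            raise KeyError(f"{path} is missing arrays: {missing}")
        arrays = {key: np.asarray(data[key]) for key in expected_keys}

    for name, arr in arrays.items():
        if arr.ndim != 1:
            raise ValueError(f"{name} must be 1D, got shape {arr.shape}")

    if "max_kv_tokens" in arrays and arrays["max_kv_tokens"].shape != (1,):
        raise ValueError("max_kv_tokens must contain exactly one element")

    # Assumption: all per-sample arrays are index-aligned, so they share a length.
    lengths = {arr.shape[0] for name, arr in arrays.items() if name != "max_kv_tokens"}
    if len(lengths) != 1:
        raise ValueError(f"per-sample arrays differ in length: {sorted(lengths)}")
    return arrays


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as result_dir:
        prefill_path, decode_path = write_example_files(result_dir)
        prefill = load_interpolation_data(prefill_path, PREFILL_KEYS)
        decode = load_interpolation_data(decode_path, DECODE_KEYS)
        print(sorted(prefill), sorted(decode))
```

A check like this can be run on hand-built interpolation data before it is handed to the planner; it validates only names and shapes, not whether the sampled points cover the operating range well.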
 ## Load Prediction
 The SLA planner uses a load predictor to predict the number of requests, ISL, and OSL in the next adjustment interval. Currently, three load prediction models are supported: