[Doc] use power of 2 (#23172)

Unverified commit 2c3f557f authored Aug 19, 2025 by Tialo, committed by GitHub Aug 19, 2025

Showing with 1 addition and 1 deletion
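The change replaces 8096, which is not a power of two, with 8192 = 2^13, in line with the commit title. The minimal Python sketch below checks both values and rounds an arbitrary token budget up to the next power of two. The helpers `is_power_of_two` and `next_power_of_two` are hypothetical illustrations and are not part of vLLM.

```python
def is_power_of_two(n: int) -> bool:
    """Return True if n is a positive power of two (exactly one bit set)."""
    # n & (n - 1) clears the lowest set bit; the result is 0 only when n had a single bit set.
    return n > 0 and (n & (n - 1)) == 0


def next_power_of_two(n: int) -> int:
    """Return the smallest power of two that is >= n (n must be positive)."""
    if n <= 0:
        raise ValueError("n must be positive")
    # (n - 1).bit_length() is the number of bits needed to represent n - 1,
    # so 1 shifted left by that amount is the first power of two not below n.
    return 1 << (n - 1).bit_length()


if __name__ == "__main__":
    # Budgets taken from the docs text: the small example value, the old value, and the new value.
    for budget in (2048, 8096, 8192):
        print(budget, is_power_of_two(budget), next_power_of_two(budget))
    # Prints:
    # 2048 True 2048
    # 8096 False 8192
    # 8192 True 8192
```

8096 sits only 96 tokens below 8192 but 4000 above 4096, so 8192 is the nearest power of two to the old value.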

docs/configuration/optimization.md +1 -1

--- a/docs/configuration/optimization.md
+++ b/docs/configuration/optimization.md
@@ -48,7 +48,7 @@ You can tune the performance by adjusting `max_num_batched_tokens`:

 - Smaller values (e.g., 2048) achieve better inter-token latency (ITL) because there are fewer prefills slowing down decodes.
 - Higher values achieve better time to first token (TTFT) as you can process more prefill tokens in a batch.
- For optimal throughput, we recommend setting `max_num_batched_tokens > 8096` especially for smaller models on large GPUs.
+- For optimal throughput, we recommend setting `max_num_batched_tokens > 8192` especially for smaller models on large GPUs.
 - If `max_num_batched_tokens` is the same as `max_model_len`, that's almost the equivalent to the V0 default scheduling policy (except that it still prioritizes decodes).

 ```python