Unverified commit 1ff119c7 authored by Hongkuan Zhou, committed by GitHub

docs: fix typo in disagg perf tuning guide (#859)


Signed-off-by: Hongkuan Zhou <tedzhouhk@gmail.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
parent 94702c79
@@ -95,4 +95,4 @@ At high load where KV cache capacity is the bottleneck, disaggregation has the f
 * Decrease the total amount of KV cache:
 * Some GPUs are configured as prefill engines whose KV cache is not used in the decode phase.
-Since Dynamo current allocates the KV blocks immediately when the decode engine get the requests, it is advisable to use as few decode engines as possible (even no prefill engine) to maximize the KV cache utilization. To prevent queueing at prefill engines, users can set a large `max-local-prefill-length` and piggyback more prefill requests at decode engines.
+Since Dynamo currently allocates the KV blocks immediately when the decode engine get the requests, it is advisable to use as few prefill engines as possible (even no prefill engine) to maximize the available KV cache in decode engines. To prevent queueing at prefill engines, users can set a large `max-local-prefill-length` and piggyback more prefill requests at decode engines.
\ No newline at end of file
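The corrected paragraph makes two tuning points: GPUs configured as prefill engines do not contribute KV cache to the decode phase, so fewer prefill engines leave more KV cache for decode; and a large `max-local-prefill-length` keeps more prefills on the decode engines themselves. A minimal sketch of both ideas follows. It is a hypothetical Python model, not Dynamo code: `Deployment`, `decode_kv_blocks`, `prefill_locally`, and the assumption that every engine owns the same number of KV blocks are illustrative only, and the use of `<=` against `max-local-prefill-length` is an assumed comparison.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    # Hypothetical model: every engine (prefill or decode) owns the same
    # number of KV blocks. Not a Dynamo API.
    num_prefill_engines: int
    num_decode_engines: int
    kv_blocks_per_engine: int

    def decode_kv_blocks(self) -> int:
        # Only decode engines hold KV blocks during the decode phase;
        # the KV cache of prefill engines is not reused there.
        return self.num_decode_engines * self.kv_blocks_per_engine


def prefill_locally(prompt_len: int, max_local_prefill_length: int) -> bool:
    # Hypothetical routing rule (assumed `<=`): prompts no longer than the
    # threshold are prefilled on the decode engine (piggybacked) instead of
    # being sent to a remote prefill engine.
    return prompt_len <= max_local_prefill_length


if __name__ == "__main__":
    # Same 8-engine budget, split two different ways.
    with_prefill = Deployment(2, 6, 1000)
    decode_only = Deployment(0, 8, 1000)
    print(with_prefill.decode_kv_blocks(), decode_only.decode_kv_blocks())
    print(prefill_locally(512, 1000), prefill_locally(4096, 1000))
```

Under this model, moving two of eight engines from decode to prefill shrinks the decode-side KV pool from 8000 to 6000 blocks, which is the trade-off the paragraph warns about at high load; raising the local-prefill threshold lets more requests skip the prefill engines entirely.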