docs: add trouble shooting section in benchmarking guide. Add known pitfall (#1503)

Signed-off-by: GuanLuo <41310872+GuanLuo@users.noreply.github.com> Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

docs: add trouble shooting section in benchmarking guide. Add known pitfall (#1503)
Signed-off-by: GuanLuo <41310872+GuanLuo@users.noreply.github.com> Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
e621f249 · GuanLuo · GitHub · ae7e08a3 · e621f249
Unverified Commit e621f249 authored Jun 13, 2025 by GuanLuo Committed by GitHub Jun 13, 2025
Hide whitespace changes
Inline Side-by-side

Showing with 40 additions and 0 deletions

examples/llm/benchmarks/README.md examples/llm/benchmarks/README.md +40 -0

No files found.
--- a/examples/llm/benchmarks/README.md
+++ b/examples/llm/benchmarks/README.md
@@ -392,3 +392,43 @@ when the script is invoked, it will:
 For instructions on how to acquire per worker metrics and visualize them using Grafana,
 please see the provided [Visualization with Prometheus and Grafana](../../../deploy/metrics/README.md).
+## Troubleshooting
+When benchmarking disaggregation performance, there can be cases where the latency and
+throughput number don't match the expectation within some margin. Below is a list of scenarios
+that have been encountered, and details on observations and resolutions.
+### Interconnect Configuration
+Even if the nodes have faster interconnect hardware available, there can be misconfiguration such that
+the fastest route may not be selected by NIXL ([example regression](https://github.com/ai-dynamo/dynamo/pull/1314)). NIXL simplifies the interconnect but also hides
+selection detail. Therefore this can be the cause if you observe abnormal TTFT increase when
+splitting prefill workers and decode workers to different nodes. For example, we have seen instances of ~2 second overhead added to TTFT when TCP is selected over RDMA for KV Cache transfer due to a misconfigured environment.
+Currently NIXL doesn't provide utility for reporting which transport is selected. Therefore
+you will need to verify if that is the cause by using backend specific debug options.
+In the case of UCX backend, you can use `ucx_info -d` to check if the desired interconnect
+devices are being recognized. At runtime, `UCX_LOG_LEVEL=debug` and `UCX_PROTO_INFO=y`
+can be set as environment variables to provide detailed logs on UCX activities. This will
+reveal whether the desired transport is being used.
+### The Full Deployment is Configured Correctly
+As benchmarking often focuses on configurations where multiple workers are being used,
+one may mistakenly consider a deployment ready for benchmarking while there are only a
+subset of workers taking requests. For example, in the aggregated baseline benchmarking,
+a user can miss updating the ip address to the other node in upstream section of `nginx.conf`.
+This could lead to only one of the nodes serving requests. In such a case,
+the benchmark can still run to completion, but the result will not reflect the deployment
+capacity, because not all the compute resources are being utilized.
+Therefore, it is important to verify that the requests can be routed to all workers before
+performing the benchmark:
+- **Framework-only benchmark** The simplest way is to send sample requests and check
+the logs of all workers. Each framework may provide utilities for readiness checks, so please
+refer to the framework's documentation for those details.
+- **Dynamo based benchmark** Once you start the deployment, you can follow
+the instructions in [monitor benchmark startup status](#Monitor-Benchmark-Startup-Status),
+which will periodically poll the workers exposed to specific endpoints
+and return HTTP 200 code when the expected number of workers are met.