Commit e621f249 authored by GuanLuo, committed by GitHub

docs: add trouble shooting section in benchmarking guide. Add known pitfall (#1503)


Signed-off-by: GuanLuo <41310872+GuanLuo@users.noreply.github.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
parent ae7e08a3
@@ -392,3 +392,43 @@ when the script is invoked, it will:
For instructions on how to acquire per worker metrics and visualize them using Grafana,
please see the provided [Visualization with Prometheus and Grafana](../../../deploy/metrics/README.md).
## Troubleshooting
When benchmarking disaggregation performance, the latency and throughput numbers
may not match expectations within a reasonable margin. Below is a list of scenarios
that have been encountered, with details on observations and resolutions.
### Interconnect Configuration
Even if the nodes have faster interconnect hardware available, a misconfiguration can prevent
NIXL from selecting the fastest route ([example regression](https://github.com/ai-dynamo/dynamo/pull/1314)).
NIXL simplifies interconnect usage but also hides the details of transport selection.
This can therefore be the cause if you observe an abnormal TTFT increase when
splitting prefill workers and decode workers across different nodes. For example, we have seen
instances of ~2 seconds of overhead added to TTFT when TCP is selected over RDMA for KV cache
transfer due to a misconfigured environment.

Currently NIXL doesn't provide a utility for reporting which transport is selected, so
you will need to use backend-specific debug options to verify whether this is the cause.
For the UCX backend, you can use `ucx_info -d` to check whether the desired interconnect
devices are recognized. At runtime, set the environment variables `UCX_LOG_LEVEL=debug`
and `UCX_PROTO_INFO=y` to get detailed logs on UCX activity. These logs
reveal whether the desired transport is being used.
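
As a minimal sketch of the runtime step, the two UCX variables can be injected into the environment of a single process instead of being exported globally. The helper names `ucx_debug_env` and `run_with_ucx_debug` are hypothetical and not part of Dynamo or NIXL; only the variable names and values come from the text above.

```python
import os
import subprocess


def ucx_debug_env(base=None):
    """Return a copy of the environment with UCX debug logging enabled.

    `base` defaults to the current process environment. The two variables
    set here are the ones named in this guide; nothing else is changed.
    """
    env = dict(os.environ if base is None else base)
    env["UCX_LOG_LEVEL"] = "debug"  # verbose UCX logging
    env["UCX_PROTO_INFO"] = "y"     # report protocol/transport selection
    return env


def run_with_ucx_debug(command):
    """Run `command` (a list of arguments) with UCX debug logging enabled.

    Hypothetical helper: pass whatever command starts your worker.
    Returns the exit code of the process.
    """
    return subprocess.run(command, env=ucx_debug_env(), check=False).returncode
```

For example, `run_with_ucx_debug(["ucx_info", "-d"])` lists the devices UCX recognizes, assuming `ucx_info` is on the `PATH`. Scoping the variables to one process keeps the debug output from affecting other workers on the same node.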
### The Full Deployment is Configured Correctly
Because benchmarking often focuses on configurations with multiple workers,
it is easy to consider a deployment ready for benchmarking while only a
subset of the workers is taking requests. For example, in the aggregated baseline benchmark,
a user can forget to update the IP address of the other node in the `upstream` section of `nginx.conf`,
which can lead to only one of the nodes serving requests. In such a case,
the benchmark can still run to completion, but the result will not reflect the capacity of
the deployment, because not all of the compute resources are being utilized.
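
One way to catch a stale `upstream` section before running the benchmark is to list the servers it declares and compare them with the nodes you expect. The sketch below is an assumption-laden illustration: `upstream_servers` is a hypothetical helper, the regular expressions only cover simple `upstream` blocks with one `server` directive per line, and the addresses in the usage note are placeholders.

```python
import re


def upstream_servers(conf_text):
    """Map each upstream block name in an nginx config to its server addresses.

    Only handles simple blocks of the form `upstream NAME { server ADDR; ... }`.
    Lines commented out with `#` are ignored because they do not start
    with the `server` directive.
    """
    servers = {}
    for match in re.finditer(r"upstream\s+(\S+)\s*\{([^}]*)\}", conf_text):
        name, body = match.group(1), match.group(2)
        servers[name] = re.findall(
            r"^\s*server\s+([^\s;]+)", body, flags=re.MULTILINE
        )
    return servers
```

For instance, calling `upstream_servers(open("nginx.conf").read())` and checking that every expected node address (for example the placeholder `10.0.0.2:8000`) appears in the result flags a node that was never added to the load balancer.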

Therefore, it is important to verify that requests can be routed to all workers before
performing the benchmark:
- **Framework-only benchmark:** The simplest way is to send sample requests and check
the logs of all workers. Each framework may provide utilities for readiness checks, so please
refer to the framework's documentation for those details.
- **Dynamo based benchmark:** Once you start the deployment, you can follow
the instructions in [monitor benchmark startup status](#Monitor-Benchmark-Startup-Status),
which periodically polls the workers exposed at specific endpoints
and returns an HTTP 200 code once the expected number of workers is reached.
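
The polling pattern described for the Dynamo based benchmark can be sketched as follows. This is not the script referenced above; `http_status` and `wait_until_ready` are hypothetical helpers, and the URL you pass in is whatever readiness endpoint your deployment exposes.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout_s=5.0):
    """Return the HTTP status code for `url`, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (for example 503).
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar errors.
        return None


def wait_until_ready(get_status, timeout_s=600.0, interval_s=5.0):
    """Call `get_status()` until it returns 200 or `timeout_s` elapses.

    Returns True if a 200 was observed, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status() == 200:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A call such as `wait_until_ready(lambda: http_status(readiness_url))`, where `readiness_url` is a placeholder for your endpoint, blocks until the deployment reports ready, so the benchmark is not started against a partial set of workers.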