chore: add utilities for benchmarking (#1371)

Signed-off-by: GuanLuo <41310872+GuanLuo@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

Commit fcfc21f2 authored Jun 12, 2025 by GuanLuo, committed by GitHub Jun 12, 2025
4 changed files
--- a/examples/llm/benchmarks/README.md
+++ b/examples/llm/benchmarks/README.md
@@ -337,6 +337,57 @@ Regardless of the deployment mechanism, the GenAI-Perf tool will report the same
 - [Dynamo vLLM Deployments](../../../docs/examples/llm_deployment.md)
+## Monitor Benchmark Startup Status
+When running a dynamo deployment, you may have multiple instances of the same worker kind for a particular benchmark run.
+The deployment can process the workflow as soon as at least one worker is ready. If the benchmark starts as soon as dynamo
+responds to inference requests, the remaining workers may still be starting, which can skew the results at the beginning of
+the benchmark. In that case, you can additionally deploy the benchmark watcher, which signals whether the full deployment
+is ready. For instance, if you expect a total of 10 prefill and decode workers, you can run the commands below to start
+the watcher, which will exit if fewer than 10 workers have registered when the timeout expires. The watcher also starts
+an HTTP server, on port 7001 by default, that answers GET requests with 200 once all workers are ready and 400 otherwise. You can use it as a readiness check when building an external benchmarking workflow.
+```bash
+# start your benchmark deployment
+...
+# start monitor separately, or it can be part of the deployment above
+dynamo serve --service-name Watcher benchmark_watcher:Watcher --Watcher.total-workers=10 --Watcher.timeout=10
+# send a GET request to check readiness; the line after each curl is the watcher's access log
+curl localhost:7001
+127.0.0.1 - - [12/Jun/2025 23:31:52] "GET / HTTP/1.1" 400 -
+...
+curl localhost:7001
+127.0.0.1 - - [12/Jun/2025 23:32:46] "GET / HTTP/1.1" 200 -
+```
+## Utility for Setting Up Environment
+### vLLM
+- `vllm_multinode_setup.sh` is a helper script that configures a node for a dynamo deployment with
+vLLM. Depending on whether the environment variables `HEAD_NODE_IP` and `RAY_LEADER_NODE_IP` are set
+when the script is invoked, it will:
+  - start the NATS server and etcd on the current node if `HEAD_NODE_IP` is not set; otherwise
+  set the `NATS_SERVER` and `ETCD_ENDPOINTS` environment variables expected by dynamo to point at `HEAD_NODE_IP`.
+  - join the Ray cluster whose head node is `RAY_LEADER_NODE_IP` if that variable is set; otherwise start
+  a new Ray cluster with the current node as the head node.
+  - print the command, with `HEAD_NODE_IP` and `RAY_LEADER_NODE_IP` set, that can be run on
+  another node to set up connectivity with the current node.
+  ```bash
+  # On node 0
+  source vllm_multinode_setup.sh
+  ... # starting nats server, etcd and ray cluster
+  # the script prints the command to run on the other nodes:
+  HEAD_NODE_IP=NODE_0_IP RAY_LEADER_NODE_IP=NODE_0_IP source vllm_multinode_setup.sh
+  # On node 1
+  HEAD_NODE_IP=NODE_0_IP RAY_LEADER_NODE_IP=NODE_0_IP source vllm_multinode_setup.sh
+  ... # connecting to Ray cluster
+  ```
 ## Metrics and Visualization
 For instructions on how to acquire per worker metrics and visualize them using Grafana,
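
The readiness endpoint described in the README hunk above can also be polled from a script instead of by hand with `curl`. The following is a minimal sketch, assuming the watcher answers 200 when ready and 400 otherwise, as the README log lines show. The helper name `wait_until_ready` and its parameters are hypothetical and not part of this commit.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url: str, timeout: float = 600.0, poll_interval: float = 1.0) -> bool:
    """Poll `url` with GET until it answers 200 or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except urllib.error.HTTPError:
            # Non-2xx answers land here (the watcher sends 400 while waiting).
            pass
        except (urllib.error.URLError, OSError):
            # The watcher's HTTP server is not accepting connections yet.
            pass
        time.sleep(poll_interval)
    return False
```

An external workflow could call `wait_until_ready("http://localhost:7001")` and start the benchmark only when it returns `True`.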

--- a/examples/llm/benchmarks/benchmark_watcher.py
+++ b/examples/llm/benchmarks/benchmark_watcher.py
+# type: ignore  # Ignore all mypy errors in this file
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+import asyncio
+import logging
+import threading
+import time
+from argparse import Namespace
+from http.server import BaseHTTPRequestHandler, HTTPServer
+from dynamo.sdk import async_on_start, dynamo_context, service
+from dynamo.sdk.lib.config import ServiceConfig
+logger = logging.getLogger(__name__)
+def start_server(server):
+    # Blocks until server.shutdown() is called, so this runs in a background thread.
+    server.serve_forever()
+class HealthServer(HTTPServer):
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.ready = False
+    def set_ready(self, ready: bool):
+        self.ready = ready
+class RequestHandler(BaseHTTPRequestHandler):
+    def do_GET(self):
+        if self.server.ready:
+            self.send_response(200)
+            self.end_headers()
+            self.wfile.write(b"Ready.")
+        else:
+            self.send_response(400)
+            self.end_headers()
+            self.wfile.write(b"Not Ready")
+            return
+def parse_args(service_name, prefix) -> Namespace:
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--total-workers",
+        type=int,
+        default=1,
+        help="Total number of workers to be registered",
+    )
+    parser.add_argument(
+        "--worker-components",
+        nargs="+",
+        default=["VllmWorker", "PrefillWorker"],
+        help="Components whose worker readiness is tracked",
+    )
+    parser.add_argument(
+        "--component-endpoints",
+        nargs="+",
+        default=["generate", "mock"],
+        help="Endpoint of each tracked component, in the same order as --worker-components",
+    )
+    parser.add_argument(
+        "--timeout",
+        type=int,
+        default=600,
+        help="Timeout (seconds) for waiting for workers to be ready",
+    )
+    parser.add_argument(
+        "--port",
+        type=int,
+        default=7001,
+        help="Port of the HTTP server that answers readiness checks",
+    )
+    config = ServiceConfig.get_instance()
+    config_args = config.as_args(service_name, prefix=prefix)
+    args = parser.parse_args(config_args)
+    if len(args.worker_components) != len(args.component_endpoints):
+        parser.error(
+            "--worker-components and --component-endpoints must have the same number "
+            f"of items, but got {args.worker_components} and {args.component_endpoints}"
+        )
+    return args
+# Use dynamo style to have access to clients
+@service(
+    dynamo={
+        "namespace": "dynamo",
+    },
+    resources={"cpu": "1", "memory": "1Gi"},
+    workers=1,
+)
+class Watcher:
+    def __init__(self):
+        self.args = parse_args(self.__class__.__name__, "")
+    @async_on_start
+    async def async_init(self):
+        self.runtime = dynamo_context["runtime"]
+        self.workers_clients = []
+        for component, endpoint in zip(
+            self.args.worker_components, self.args.component_endpoints
+        ):
+            self.workers_clients.append(
+                await self.runtime.namespace("dynamo")
+                .component(component)
+                .endpoint(endpoint)
+                .client()
+            )
+            logger.info(f"Component {component}/{endpoint} is registered")
+        logger.info(f"Total number of workers to wait for: {self.args.total_workers}")
+        logger.info(f"Timeout for waiting for workers to be ready: {self.args.timeout}")
+        self.server = HealthServer(("0.0.0.0", self.args.port), RequestHandler)
+        print(f"Serving on 0.0.0.0:{self.args.port}, listening to readiness check...")
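+        # Daemon thread, so a TimeoutError below does not leave the process alive serving HTTP.
+        self._server_thread = threading.Thread(
+            target=start_server, args=(self.server,), daemon=True
+        )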
+        self._server_thread.start()
+        await check_required_workers(
+            self.workers_clients, self.args.total_workers, self.args.timeout
+        )
+        self.server.set_ready(True)
+        logger.info("All workers are ready.")
+async def check_required_workers(
+    workers_clients, required_workers: int, timeout: int, poll_interval=1
+):
+    """Wait until the minimum number of workers are ready."""
+    start_time = time.time()
+    num_workers = 0
+    while num_workers < required_workers and time.time() - start_time < timeout:
+        num_workers = sum(map(lambda wc: len(wc.instance_ids()), workers_clients))
+        if num_workers < required_workers:
+            logger.info(
+                f"Waiting for more workers to be ready.\n"
+                f" Current: {num_workers},"
+                f" Required: {required_workers}"
+            )
+            await asyncio.sleep(poll_interval)
+    if num_workers < required_workers:
+        raise TimeoutError(
+            f"Timed out waiting for {required_workers} workers to be ready."
+        )
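
The polling loop in `check_required_workers` can be exercised without a running deployment. The sketch below is not part of this commit: `FakeClient` is a hypothetical stand-in that only mimics the `instance_ids()` call used above, and `wait_for_workers` restates the same loop with a monotonic deadline so that the timeout path is easy to trigger.

```python
import asyncio
import time


class FakeClient:
    """Hypothetical stand-in for an endpoint client; returns scripted snapshots."""

    def __init__(self, ids_per_poll):
        self._ids_per_poll = list(ids_per_poll)

    def instance_ids(self):
        # Hand out the next snapshot; repeat the last one once exhausted.
        if len(self._ids_per_poll) > 1:
            return self._ids_per_poll.pop(0)
        return self._ids_per_poll[0]


async def wait_for_workers(clients, required, timeout, poll_interval=0.01):
    """Poll all clients until `required` instances are registered or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        count = sum(len(client.instance_ids()) for client in clients)
        if count >= required:
            return count
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Only {count} of {required} workers are ready.")
        await asyncio.sleep(poll_interval)
```

For example, `asyncio.run(wait_for_workers([FakeClient([[], [1], [1, 2]])], 2, timeout=5))` returns once the scripted client reports two instances.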
--- a/examples/llm/benchmarks/vllm_multinode_setup.sh
+++ b/examples/llm/benchmarks/vllm_multinode_setup.sh
+#!/usr/bin/env bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# start nats and etcd
+if [[ -z "${HEAD_NODE_IP}" ]]; then
+    nats-server -js &
+    etcd --advertise-client-urls http://0.0.0.0:2379 --listen-client-urls http://0.0.0.0:2379 &
+    HEAD_NODE_IP=`hostname -i`
+else
+    export NATS_SERVER=nats://${HEAD_NODE_IP}:4222
+    export ETCD_ENDPOINTS=${HEAD_NODE_IP}:2379
+fi
+# start ray cluster
+if [[ -z "${RAY_LEADER_NODE_IP}" ]]; then
+    ray start --head --port=6379 --disable-usage-stats
+    RAY_LEADER_NODE_IP=`hostname -i`
+else
+    ray start --address=${RAY_LEADER_NODE_IP}:6379
+fi
+echo "HEAD_NODE_IP=${HEAD_NODE_IP} RAY_LEADER_NODE_IP=${RAY_LEADER_NODE_IP} source ${BASH_SOURCE[0]}"
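
The two branches of the script can be summarized as a small decision function. This is an illustrative Python restatement of the shell logic above, not code from this commit. The helper name `plan_node_setup` and its return shape are hypothetical, and `local_ip` stands in for the output of `hostname -i`.

```python
from typing import Optional


def plan_node_setup(
    local_ip: str,
    head_node_ip: Optional[str] = None,
    ray_leader_node_ip: Optional[str] = None,
) -> dict:
    """Return the commands, exported variables, and join hint the script would produce."""
    commands, env = [], {}
    if not head_node_ip:
        # No head node given: this node hosts NATS and etcd.
        commands.append("nats-server -js")
        commands.append(
            "etcd --advertise-client-urls http://0.0.0.0:2379"
            " --listen-client-urls http://0.0.0.0:2379"
        )
        head_node_ip = local_ip
    else:
        # Head node given: point dynamo at its NATS and etcd services.
        env["NATS_SERVER"] = f"nats://{head_node_ip}:4222"
        env["ETCD_ENDPOINTS"] = f"{head_node_ip}:2379"
    if not ray_leader_node_ip:
        # No Ray leader given: this node becomes the Ray head.
        commands.append("ray start --head --port=6379 --disable-usage-stats")
        ray_leader_node_ip = local_ip
    else:
        commands.append(f"ray start --address={ray_leader_node_ip}:6379")
    join_hint = (
        f"HEAD_NODE_IP={head_node_ip} RAY_LEADER_NODE_IP={ray_leader_node_ip}"
        " source vllm_multinode_setup.sh"
    )
    return {"commands": commands, "env": env, "join_hint": join_hint}
```

Calling it with only `local_ip` models node 0, and calling it with both IPs set models a node that joins the existing cluster.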
--- a/examples/llm/configs/mutinode_disagg_r1.yaml
+++ b/examples/llm/configs/mutinode_disagg_r1.yaml
@@ -18,6 +18,7 @@ Common:
  max-model-len: 16384
  kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
  tensor-parallel-size: 16
+  disable-log-requests: true
 Frontend:
  served_model_name: deepseek-ai/DeepSeek-R1
@@ -35,7 +36,7 @@ VllmWorker:
    workers: 1
    resources:
      gpu: '16'
-  common-configs: [model, block-size, max-model-len, kv-transfer-config, tensor-parallel-size]
+  common-configs: [model, block-size, max-model-len, kv-transfer-config, tensor-parallel-size, disable-log-requests]
 PrefillWorker:
  max-num-batched-tokens: 16384
@@ -43,4 +44,4 @@ PrefillWorker:
    workers: 1
    resources:
      gpu: '16'
-  common-configs: [model, block-size, max-model-len, kv-transfer-config, tensor-parallel-size]
+  common-configs: [model, block-size, max-model-len, kv-transfer-config, tensor-parallel-size, disable-log-requests]
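
The `common-configs` lists in this config name the keys each service takes from the `Common` section, which is why `disable-log-requests` is added both to `Common` and to each list. The sketch below shows how such a list could be resolved, under two assumptions: listed keys are copied from `Common`, and service-level values win on conflict. `resolve_service_config` is a hypothetical helper, not dynamo's actual config loader.

```python
def resolve_service_config(config: dict, service: str) -> dict:
    """Merge the keys named in a service's `common-configs` into that service's settings."""
    common = config.get("Common", {})
    section = dict(config.get(service, {}))
    shared_keys = section.pop("common-configs", [])
    resolved = {key: common[key] for key in shared_keys if key in common}
    # Assumption: a value set on the service itself overrides the shared one.
    resolved.update(section)
    return resolved


# Example values taken from the hunks above; keys outside the diff are omitted.
example_config = {
    "Common": {
        "max-model-len": 16384,
        "tensor-parallel-size": 16,
        "disable-log-requests": True,
    },
    "PrefillWorker": {
        "max-num-batched-tokens": 16384,
        "common-configs": ["max-model-len", "tensor-parallel-size", "disable-log-requests"],
    },
}
prefill_settings = resolve_service_config(example_config, "PrefillWorker")
```

Under these assumptions, a key that is present in `Common` but missing from a service's `common-configs` list would not reach that service.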