feat: add mooncake-style benchmarking for router (#3068)

Signed-off-by: PeaBrane <yanrpei@gmail.com>

Commit 38300f22 authored Sep 16, 2025 by Yan Ru Pei; committed by GitHub Sep 16, 2025
3 changed files
--- a/benchmarks/router/README.md
+++ b/benchmarks/router/README.md
@@ -26,6 +26,7 @@ This directory contains scripts for benchmarking the Dynamo router with prefix c
  - `dynamo` package (with vllm and frontend modules)
  - `genai-perf` for benchmarking
  - `matplotlib` for plotting results
+  - `data-generator` package (install with `pip install -e ./benchmarks` from repo root)

 ### Setting up etcd and NATS

@@ -43,6 +44,7 @@ This will start both etcd and NATS with the required configurations in the backg
 - **`run_engines.sh`** - Launches multiple vLLM worker instances
 - **`ping.sh`** - Simple test script to verify the setup is working
 - **`prefix_ratio_benchmark.py`** - Main benchmarking script that sweeps prefix ratios
+- **`real_data_benchmark.py`** - Benchmarking script that uses real mooncake-style trace data
 - **`plot_prefix_ratio_comparison.py`** - Generates comparison plots from benchmark results

 ## Usage Instructions
@@ -160,20 +162,39 @@ python prefix_ratio_benchmark.py --url http://localhost:8000 http://localhost:80
 python prefix_ratio_benchmark.py --output-dir results/experiment1
 ```

-### Benchmark Output
+### Step 4 (Alternative): Run Benchmarks with Real Trace Data

-The benchmark script generates:
+Instead of synthetic benchmarks with controlled prefix ratios, you can benchmark using real trace data in [mooncake-style format](https://github.com/kvcache-ai/Mooncake/blob/d21da178bae8db9651cf18a76824c084145fc725/mooncake_trace.jsonl). This approach replays request patterns from production traces, optionally modified by the synthesis parameters listed below.

-1. **Performance plots** (`prefix_ratio_performance.png`):
-   - TTFT (Time to First Token) vs Prefix Ratio
-   - Throughput (tokens/s) vs Prefix Ratio
+```bash
+python real_data_benchmark.py --input-file mooncake_trace.jsonl
+```
+
+The script can modify the original trace to simulate different workload conditions. It accepts the same synthesis parameters as the [prefix data generator](../prefix_data_generator/README.md):

-2. **Results summary** (`results_summary.json`):
-   - Raw data for all prefix ratios tested
-   - Configuration parameters used
+**Key parameters:**
+- `--num-requests`: Number of requests to synthesize from the trace (default: use all)
+- `--speedup-ratio`: Speed up request arrival times (e.g., 2.0 makes requests arrive 2x faster)
+- `--prefix-len-multiplier`: Scale the length of shared prefixes (e.g., 2.0 doubles prefix lengths)
+- `--prefix-root-multiplier`: Replicate the prefix tree structure N times with different roots
+- `--prompt-len-multiplier`: Scale the length of unique user prompts (e.g., 0.5 for shorter prompts)
+- `--max-isl`: Filter out requests exceeding this input sequence length

-3. **Detailed artifacts** (in subdirectories):
-   - Full genai-perf profiling data for each run
+Examples:
+
+```bash
+# Use original trace file as-is (no synthesis parameters specified)
+python real_data_benchmark.py --input-file trace.jsonl
+
+# Speed up request rate by 2x and use only first 1000 requests
+python real_data_benchmark.py --input-file trace.jsonl --num-requests 1000 --speedup-ratio 2.0
+
+# Double prefix lengths to test cache efficiency with longer shared contexts
+python real_data_benchmark.py --input-file trace.jsonl --prefix-len-multiplier 2.0
+
+# Create more diverse workload by replicating prefix tree 3 times
+python real_data_benchmark.py --input-file trace.jsonl --prefix-root-multiplier 3
+```

 ## Troubleshooting


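Reviewer note: the sketch below illustrates what `--speedup-ratio` and `--max-isl` describe in the README text above. It assumes each trace line is a JSON object with `timestamp` (taken here to be milliseconds) and `input_length` fields, as in the linked Mooncake trace. The helper `rescale_trace` is hypothetical; it is not the `Synthesizer` implementation, which this diff does not show.

```python
import json


def rescale_trace(lines, speedup_ratio=1.0, max_isl=None):
    """Hypothetical helper: compress arrival times and drop over-long requests."""
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL input
        req = json.loads(line)
        # Assumption: input_length is the request's input sequence length (ISL).
        if max_isl is not None and req["input_length"] > max_isl:
            continue
        # A ratio of 2.0 halves every timestamp, so requests arrive 2x faster.
        req["timestamp"] = int(req["timestamp"] / speedup_ratio)
        out.append(req)
    return out


sample = [
    '{"timestamp": 0, "input_length": 100, "output_length": 10, "hash_ids": [0]}',
    '{"timestamp": 1000, "input_length": 5000, "output_length": 10, "hash_ids": [0, 1]}',
    '{"timestamp": 2000, "input_length": 200, "output_length": 10, "hash_ids": [0, 2]}',
]
rescaled = rescale_trace(sample, speedup_ratio=2.0, max_isl=1000)
print([r["timestamp"] for r in rescaled])  # [0, 1000]
```

The second sample request is dropped by the `max_isl` filter, and the third moves from 2000 to 1000. The prefix-related multipliers operate on the shared-prefix structure and are not modeled here.
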
--- a/benchmarks/router/prefix_ratio_benchmark.py
+++ b/benchmarks/router/prefix_ratio_benchmark.py
@@ -324,7 +324,7 @@ def main():
    parser.add_argument("--osl", type=int, default=200, help="Output sequence length")
    parser.add_argument("--requests", type=int, default=200, help="Number of requests")
    parser.add_argument("--concurrency", type=int, default=20, help="Concurrency level")
-    parser.add_argument("--seed", type=int, default=420, help="Initial random seed")
+    parser.add_argument("--seed", type=int, default=0, help="Initial random seed")
    parser.add_argument(
        "--prefix-ratios",
        type=float,

--- a/benchmarks/router/real_data_benchmark.py
+++ b/benchmarks/router/real_data_benchmark.py
+#!/usr/bin/env python3
+
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+import argparse
+import json
+import logging
+import os
+import subprocess
+
+import numpy as np
+from prefix_data_generator.synthesizer import Synthesizer
+
+# Setup logging
+logger = logging.getLogger(__name__)
+logger.setLevel(logging.INFO)
+console_handler = logging.StreamHandler()
+console_handler.setLevel(logging.INFO)
+formatter = logging.Formatter(
+    "%(asctime)s - %(name)s - %(levelname)s - %(message)s", "%Y-%m-%d %H:%M:%S"
+)
+console_handler.setFormatter(formatter)
+logger.addHandler(console_handler)
+
+
+def get_genai_perf_cmd_for_trace(
+    model,
+    tokenizer,
+    input_file,
+    artifact_dir,
+    seed,
+    url="http://localhost:8888",
+):
+    """Build genai-perf command for trace file input"""
+    return [
+        "genai-perf",
+        "profile",
+        "--model",
+        model,
+        "--tokenizer",
+        tokenizer,
+        "--endpoint-type",
+        "chat",
+        "--endpoint",
+        "v1/chat/completions",
+        "--streaming",
+        "--url",
+        url,
+        "--input-file",
+        input_file,
+        "--random-seed",
+        str(seed),
+        "--artifact-dir",
+        artifact_dir,
+        "--",
+        "-v",
+        "--max-threads",
+        "256",
+        "-H",
+        "Authorization: Bearer NOT USED",
+        "-H",
+        "Accept: text/event-stream",
+    ]
+
+
+def run_benchmark_with_trace(
+    model,
+    tokenizer,
+    trace_file,
+    artifact_dir,
+    url,
+    seed,
+):
+    """Run genai-perf benchmark with a trace file"""
+    genai_perf_cmd = get_genai_perf_cmd_for_trace(
+        model,
+        tokenizer,
+        trace_file,
+        artifact_dir,
+        seed,
+        url,
+    )
+
+    logger.info(f"Running genai-perf with trace file: {trace_file}")
+    logger.info(f"Command: {' '.join(genai_perf_cmd)}")
+
+    try:
+        # Run genai-perf and let it output directly to terminal
+        subprocess.run(genai_perf_cmd, check=True)
+
+        logger.info("Genai-perf profiling completed successfully")
+
+    except subprocess.CalledProcessError as e:
+        logger.error(f"Genai-perf failed with error code: {e.returncode}")
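+        # stderr is not captured (output streams to the terminal), so e.stderr is None here
+        logger.error("See the genai-perf output above for details")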
+        raise
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Benchmark with real or synthesized mooncake-style trace data"
+    )
+
+    # Model and server configuration
+    parser.add_argument(
+        "--model",
+        type=str,
+        default="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
+        help="Model name",
+    )
+    parser.add_argument(
+        "--tokenizer",
+        type=str,
+        default=None,
+        help="Tokenizer name (defaults to model)",
+    )
+    parser.add_argument(
+        "--url",
+        type=str,
+        default="http://localhost:8080",
+        help="Server URL",
+    )
+    parser.add_argument(
+        "--output-dir",
+        type=str,
+        default="real_data_benchmark_results",
+        help="Output directory for results",
+    )
+
+    # Trace file and synthesis configuration (similar to synthesizer.py)
+    parser.add_argument(
+        "--input-file",
+        type=str,
+        default="mooncake_trace.jsonl",
+        help="Path to the input mooncake-style trace file",
+    )
+    parser.add_argument(
+        "--num-requests",
+        type=int,
+        default=None,
+        help="Number of requests to synthesize (default: use all from input file)",
+    )
+    parser.add_argument(
+        "--speedup-ratio",
+        type=float,
+        default=1.0,
+        help="Factor to speed up request intervals (default: 1.0)",
+    )
+    parser.add_argument(
+        "--prefix-len-multiplier",
+        type=float,
+        default=1.0,
+        help="Multiplier for prefix lengths (default: 1.0)",
+    )
+    parser.add_argument(
+        "--prefix-root-multiplier",
+        type=int,
+        default=1,
+        help="Number of times to replicate the core radix tree (default: 1)",
+    )
+    parser.add_argument(
+        "--prompt-len-multiplier",
+        type=float,
+        default=1.0,
+        help="Multiplier for leaf path lengths (default: 1.0, use <1 for shorter prompts)",
+    )
+    parser.add_argument(
+        "--max-isl",
+        type=int,
+        default=None,
+        help="Maximum input sequence length to include in output (default: None, no filtering)",
+    )
+    parser.add_argument(
+        "--block-size",
+        type=int,
+        default=512,
+        help="Block size for prefilling and decoding (default: 512)",
+    )
+    parser.add_argument(
+        "--seed",
+        type=int,
+        default=0,
+        help="Random seed for reproducibility (default: 0)",
+    )
+
+    args = parser.parse_args()
+
+    # Use tokenizer from model if not specified
+    if args.tokenizer is None:
+        args.tokenizer = args.model
+
+    # Create output directory
+    os.makedirs(args.output_dir, exist_ok=True)
+
+    # Determine whether to use original or synthesized data
+    # Check if any synthesis parameters are non-default
+    needs_synthesis = (
+        args.num_requests is not None
+        or args.speedup_ratio != 1.0
+        or args.prefix_len_multiplier != 1.0
+        or args.prefix_root_multiplier != 1
+        or args.prompt_len_multiplier != 1.0
+        or args.max_isl is not None
+    )
+
+    if not needs_synthesis:
+        # No synthesis needed, use original file
+        trace_file_path = args.input_file
+        logger.info(
+            f"Using original trace file (no synthesis parameters modified): {trace_file_path}"
+        )
+    else:
+        # Generate synthetic data based on input file
+        logger.info("Generating synthetic trace data...")
+        logger.info(f"  Base file: {args.input_file}")
+        logger.info(
+            f"  Num requests: {args.num_requests if args.num_requests is not None else 'all'}"
+        )
+        logger.info(f"  Speedup ratio: {args.speedup_ratio}")
+        logger.info(f"  Prefix len multiplier: {args.prefix_len_multiplier}")
+        logger.info(f"  Prefix root multiplier: {args.prefix_root_multiplier}")
+        logger.info(f"  Prompt len multiplier: {args.prompt_len_multiplier}")
+        logger.info(
+            f"  Max ISL: {args.max_isl if args.max_isl is not None else 'no limit'}"
+        )
+        logger.info(f"  Random seed: {args.seed}")
+
+        # Set random seed for reproducibility
+        np.random.seed(args.seed)
+
+        # Create synthesizer
+        synthesizer = Synthesizer(
+            args.input_file,
+            block_size=args.block_size,
+            speedup_ratio=args.speedup_ratio,
+            prefix_len_multiplier=args.prefix_len_multiplier,
+            prefix_root_multiplier=args.prefix_root_multiplier,
+            prompt_len_multiplier=args.prompt_len_multiplier,
+        )
+
+        # Determine number of requests
+        if args.num_requests is None:
+            # Count requests in original file
+            with open(args.input_file, "r") as f:
+                num_requests = sum(1 for _ in f)
+            logger.info(f"Using all {num_requests} requests from input file")
+        else:
+            num_requests = args.num_requests
+
+        # Generate synthetic requests
+        requests = synthesizer.synthesize_requests(num_requests, args.max_isl)
+        logger.info(f"Generated {len(requests)} synthetic requests")
+
+        # Save synthetic data to a permanent file in output directory
+        synthetic_trace_filename = "synthetic_trace.jsonl"
+        trace_file_path = os.path.join(args.output_dir, synthetic_trace_filename)
+
+        # Write synthetic data to file
+        with open(trace_file_path, "w") as f:
+            for request in requests:
+                f.write(json.dumps(request) + "\n")
+
+        logger.info(f"Synthetic trace data saved to: {trace_file_path}")
+
+    # Run benchmark with the trace file
+    artifact_dir = os.path.join(args.output_dir, "genai_perf_artifacts")
+    os.makedirs(artifact_dir, exist_ok=True)
+
+    run_benchmark_with_trace(
+        args.model,
+        args.tokenizer,
+        trace_file_path,
+        artifact_dir,
+        args.url,
+        args.seed,
+    )
+
+    logger.info(f"Results saved to: {artifact_dir}")
+
+
+if __name__ == "__main__":
+    main()
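
Reviewer note: the gate in `main()` that chooses between the original trace and a synthesized one can be checked in isolation. The sketch below copies the condition from the diff into a standalone function; the name `needs_synthesis` as a function and the `defaults` dictionary are illustrative and mirror the argparse defaults shown above.

```python
import argparse


def needs_synthesis(args):
    """Return True if any synthesis parameter differs from its default."""
    return (
        args.num_requests is not None
        or args.speedup_ratio != 1.0
        or args.prefix_len_multiplier != 1.0
        or args.prefix_root_multiplier != 1
        or args.prompt_len_multiplier != 1.0
        or args.max_isl is not None
    )


# Defaults mirrored from the parser definitions in real_data_benchmark.py.
defaults = dict(
    num_requests=None,
    speedup_ratio=1.0,
    prefix_len_multiplier=1.0,
    prefix_root_multiplier=1,
    prompt_len_multiplier=1.0,
    max_isl=None,
    block_size=512,
    seed=0,
)

as_is = argparse.Namespace(**defaults)
sped_up = argparse.Namespace(**{**defaults, "speedup_ratio": 2.0})
seed_only = argparse.Namespace(**{**defaults, "seed": 7, "block_size": 64})

print(needs_synthesis(as_is), needs_synthesis(sped_up), needs_synthesis(seed_only))
# False True False
```

As written in the diff, `--seed` and `--block-size` are not part of the condition, so changing only those two still replays the original trace file unmodified.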