Commit 38300f22 authored by Yan Ru Pei, committed by GitHub

feat: add mooncake-style benchmarking for router (#3068)


Signed-off-by: PeaBrane <yanrpei@gmail.com>
parent bc0a7633
......@@ -26,6 +26,7 @@ This directory contains scripts for benchmarking the Dynamo router with prefix c
- `dynamo` package (with vllm and frontend modules)
- `genai-perf` for benchmarking
- `matplotlib` for plotting results
- `data-generator` package (install with `pip install -e ./benchmarks` from repo root)
### Setting up etcd and NATS
......@@ -43,6 +44,7 @@ This will start both etcd and NATS with the required configurations in the backg
- **`run_engines.sh`** - Launches multiple vLLM worker instances
- **`ping.sh`** - Simple test script to verify the setup is working
- **`prefix_ratio_benchmark.py`** - Main benchmarking script that sweeps prefix ratios
- **`real_data_benchmark.py`** - Benchmarking script that uses real mooncake-style trace data
- **`plot_prefix_ratio_comparison.py`** - Generates comparison plots from benchmark results
## Usage Instructions
......@@ -160,20 +162,39 @@ python prefix_ratio_benchmark.py --url http://localhost:8000 http://localhost:80
python prefix_ratio_benchmark.py --output-dir results/experiment1
```
### Benchmark Output

The benchmark script generates:

1. **Performance plots** (`prefix_ratio_performance.png`):
   - TTFT (Time to First Token) vs Prefix Ratio
   - Throughput (tokens/s) vs Prefix Ratio
2. **Results summary** (`results_summary.json`):
   - Raw data for all prefix ratios tested
   - Configuration parameters used
3. **Detailed artifacts** (in subdirectories):
   - Full genai-perf profiling data for each run

### Step 4 (Alternative): Run Benchmarks with Real Trace Data

Instead of synthetic benchmarks with controlled prefix ratios, you can benchmark using real trace data in [mooncake-style format](https://github.com/kvcache-ai/Mooncake/blob/d21da178bae8db9651cf18a76824c084145fc725/mooncake_trace.jsonl). This approach uses actual request patterns from production traces, potentially modified with synthesis parameters.

```bash
python real_data_benchmark.py --input-file mooncake_trace.jsonl
```

The script can apply various modifications on top of the original trace file to simulate different scenarios and workload conditions. This script accepts the same synthesis parameters as the [prefix data generator](../prefix_data_generator/README.md):

**Key parameters:**
- `--num-requests`: Number of requests to synthesize from the trace (default: use all)
- `--speedup-ratio`: Speed up request arrival times (e.g., 2.0 makes requests arrive 2x faster)
- `--prefix-len-multiplier`: Scale the length of shared prefixes (e.g., 2.0 doubles prefix lengths)
- `--prefix-root-multiplier`: Replicate the prefix tree structure N times with different roots
- `--prompt-len-multiplier`: Scale the length of unique user prompts (e.g., 0.5 for shorter prompts)
- `--max-isl`: Filter out requests exceeding this input sequence length

Examples:
```bash
# Use original trace file as-is (no synthesis parameters specified)
python real_data_benchmark.py --input-file trace.jsonl
# Speed up request rate by 2x and use only first 1000 requests
python real_data_benchmark.py --input-file trace.jsonl --num-requests 1000 --speedup-ratio 2.0
# Double prefix lengths to test cache efficiency with longer shared contexts
python real_data_benchmark.py --input-file trace.jsonl --prefix-len-multiplier 2.0
# Create more diverse workload by replicating prefix tree 3 times
python real_data_benchmark.py --input-file trace.jsonl --prefix-root-multiplier 3
```
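
As a rough illustration of what `--speedup-ratio` and `--max-isl` mean for a trace, here is a minimal sketch. It assumes each trace line is a JSON object with `timestamp` (milliseconds), `input_length`, `output_length`, and `hash_ids` fields, as in the public Mooncake trace. The sample values and the helpers `load_trace`, `speed_up`, and `filter_by_isl` are hypothetical and are not the `Synthesizer` implementation.

```python
import json

# Two records in the assumed mooncake-style layout (values are made up).
SAMPLE_TRACE = """\
{"timestamp": 0, "input_length": 1024, "output_length": 100, "hash_ids": [0, 1]}
{"timestamp": 3000, "input_length": 4096, "output_length": 50, "hash_ids": [0, 1, 2, 3, 4, 5, 6, 7]}
"""


def load_trace(text):
    """Parse JSONL text into a list of request dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def speed_up(records, ratio):
    """Hypothetical helper: compress arrival times by dividing timestamps by `ratio`."""
    return [{**r, "timestamp": r["timestamp"] / ratio} for r in records]


def filter_by_isl(records, max_isl):
    """Hypothetical helper: keep only requests whose input length is at most `max_isl`."""
    return [r for r in records if r["input_length"] <= max_isl]


records = load_trace(SAMPLE_TRACE)
faster = speed_up(records, 2.0)            # requests arrive twice as fast
short_only = filter_by_isl(records, 2048)  # drops the 4096-token request

print([r["timestamp"] for r in faster])
print(len(short_only))
```

The helpers return new lists rather than mutating the input, so the original records stay available for comparison.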
## Troubleshooting
......
......@@ -324,7 +324,7 @@ def main():
    parser.add_argument("--osl", type=int, default=200, help="Output sequence length")
    parser.add_argument("--requests", type=int, default=200, help="Number of requests")
    parser.add_argument("--concurrency", type=int, default=20, help="Concurrency level")
    parser.add_argument("--seed", type=int, default=420, help="Initial random seed")
    parser.add_argument("--seed", type=int, default=0, help="Initial random seed")
    parser.add_argument(
        "--prefix-ratios",
        type=float,
......
#!/usr/bin/env python3
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
import argparse
import json
import logging
import os
import subprocess
import numpy as np
from prefix_data_generator.synthesizer import Synthesizer
# Setup logging
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.INFO)
formatter = logging.Formatter(
    "%(asctime)s - %(name)s - %(levelname)s - %(message)s", "%Y-%m-%d %H:%M:%S"
)
console_handler.setFormatter(formatter)
logger.addHandler(console_handler)


def get_genai_perf_cmd_for_trace(
    model,
    tokenizer,
    input_file,
    artifact_dir,
    seed,
    url="http://localhost:8888",
):
    """Build genai-perf command for trace file input"""
    return [
        "genai-perf",
        "profile",
        "--model",
        model,
        "--tokenizer",
        tokenizer,
        "--endpoint-type",
        "chat",
        "--endpoint",
        "v1/chat/completions",
        "--streaming",
        "--url",
        url,
        "--input-file",
        input_file,
        "--random-seed",
        str(seed),
        "--artifact-dir",
        artifact_dir,
        "--",
        "-v",
        "--max-threads",
        "256",
        "-H",
        "Authorization: Bearer NOT USED",
        "-H",
        "Accept: text/event-stream",
    ]


def run_benchmark_with_trace(
    model,
    tokenizer,
    trace_file,
    artifact_dir,
    url,
    seed,
):
    """Run genai-perf benchmark with a trace file"""
    genai_perf_cmd = get_genai_perf_cmd_for_trace(
        model,
        tokenizer,
        trace_file,
        artifact_dir,
        seed,
        url,
    )

    logger.info(f"Running genai-perf with trace file: {trace_file}")
    logger.info(f"Command: {' '.join(genai_perf_cmd)}")

    try:
        # Run genai-perf and let it output directly to terminal
        subprocess.run(genai_perf_cmd, check=True)
        logger.info("Genai-perf profiling completed successfully")
    except subprocess.CalledProcessError as e:
        # Output is not captured, so e.stderr is always None here;
        # the error details have already been printed to the terminal.
        logger.error(f"Genai-perf failed with error code: {e.returncode}")
        raise


def main():
    parser = argparse.ArgumentParser(
        description="Benchmark with real or synthesized mooncake-style trace data"
    )

    # Model and server configuration
    parser.add_argument(
        "--model",
        type=str,
        default="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        help="Model name",
    )
    parser.add_argument(
        "--tokenizer",
        type=str,
        default=None,
        help="Tokenizer name (defaults to model)",
    )
    parser.add_argument(
        "--url",
        type=str,
        default="http://localhost:8080",
        help="Server URL",
    )
    parser.add_argument(
        "--output-dir",
        type=str,
        default="real_data_benchmark_results",
        help="Output directory for results",
    )

    # Trace file and synthesis configuration (similar to synthesizer.py)
    parser.add_argument(
        "--input-file",
        type=str,
        default="mooncake_trace.jsonl",
        help="Path to the input mooncake-style trace file",
    )
    parser.add_argument(
        "--num-requests",
        type=int,
        default=None,
        help="Number of requests to synthesize (default: use all from input file)",
    )
    parser.add_argument(
        "--speedup-ratio",
        type=float,
        default=1.0,
        help="Factor to speed up request intervals (default: 1.0)",
    )
    parser.add_argument(
        "--prefix-len-multiplier",
        type=float,
        default=1.0,
        help="Multiplier for prefix lengths (default: 1.0)",
    )
    parser.add_argument(
        "--prefix-root-multiplier",
        type=int,
        default=1,
        help="Number of times to replicate the core radix tree (default: 1)",
    )
    parser.add_argument(
        "--prompt-len-multiplier",
        type=float,
        default=1.0,
        help="Multiplier for leaf path lengths (default: 1.0, use <1 for shorter prompts)",
    )
    parser.add_argument(
        "--max-isl",
        type=int,
        default=None,
        help="Maximum input sequence length to include in output (default: None, no filtering)",
    )
    parser.add_argument(
        "--block-size",
        type=int,
        default=512,
        help="Block size for prefilling and decoding (default: 512)",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=0,
        help="Random seed for reproducibility (default: 0)",
    )

    args = parser.parse_args()

    # Use tokenizer from model if not specified
    if args.tokenizer is None:
        args.tokenizer = args.model

    # Create output directory
    os.makedirs(args.output_dir, exist_ok=True)

    # Determine whether to use original or synthesized data
    # Check if any synthesis parameters are non-default
    needs_synthesis = (
        args.num_requests is not None
        or args.speedup_ratio != 1.0
        or args.prefix_len_multiplier != 1.0
        or args.prefix_root_multiplier != 1
        or args.prompt_len_multiplier != 1.0
        or args.max_isl is not None
    )

    if not needs_synthesis:
        # No synthesis needed, use original file
        trace_file_path = args.input_file
        logger.info(
            f"Using original trace file (no synthesis parameters modified): {trace_file_path}"
        )
    else:
        # Generate synthetic data based on input file
        logger.info("Generating synthetic trace data...")
        logger.info(f"  Base file: {args.input_file}")
        logger.info(
            f"  Num requests: {args.num_requests if args.num_requests else 'all'}"
        )
        logger.info(f"  Speedup ratio: {args.speedup_ratio}")
        logger.info(f"  Prefix len multiplier: {args.prefix_len_multiplier}")
        logger.info(f"  Prefix root multiplier: {args.prefix_root_multiplier}")
        logger.info(f"  Prompt len multiplier: {args.prompt_len_multiplier}")
        logger.info(f"  Max ISL: {args.max_isl if args.max_isl else 'no limit'}")
        logger.info(f"  Random seed: {args.seed}")

        # Set random seed for reproducibility
        np.random.seed(args.seed)

        # Create synthesizer
        synthesizer = Synthesizer(
            args.input_file,
            block_size=args.block_size,
            speedup_ratio=args.speedup_ratio,
            prefix_len_multiplier=args.prefix_len_multiplier,
            prefix_root_multiplier=args.prefix_root_multiplier,
            prompt_len_multiplier=args.prompt_len_multiplier,
        )

        # Determine number of requests
        if args.num_requests is None:
            # Count requests in original file
            with open(args.input_file, "r") as f:
                num_requests = sum(1 for _ in f)
            logger.info(f"Using all {num_requests} requests from input file")
        else:
            num_requests = args.num_requests

        # Generate synthetic requests
        requests = synthesizer.synthesize_requests(num_requests, args.max_isl)
        logger.info(f"Generated {len(requests)} synthetic requests")

        # Save synthetic data to a permanent file in output directory
        synthetic_trace_filename = "synthetic_trace.jsonl"
        trace_file_path = os.path.join(args.output_dir, synthetic_trace_filename)

        # Write synthetic data to file
        with open(trace_file_path, "w") as f:
            for request in requests:
                f.write(json.dumps(request) + "\n")
        logger.info(f"Synthetic trace data saved to: {trace_file_path}")

    # Run benchmark with the trace file
    artifact_dir = os.path.join(args.output_dir, "genai_perf_artifacts")
    os.makedirs(artifact_dir, exist_ok=True)

    run_benchmark_with_trace(
        args.model,
        args.tokenizer,
        trace_file_path,
        artifact_dir,
        args.url,
        args.seed,
    )

    logger.info(f"Results saved to: {artifact_dir}")


if __name__ == "__main__":
    main()
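
A note on the error handling in `run_benchmark_with_trace`: the script lets genai-perf write straight to the terminal, and with `subprocess.run` the `stderr` attribute of `CalledProcessError` is only populated when stderr is actually captured. The sketch below shows the difference using a stand-in command that fails on purpose (it is not genai-perf). The helper names `run_streaming` and `run_captured` are hypothetical.

```python
import subprocess
import sys

# Stand-in for a failing benchmark command: writes to stderr, exits with code 3.
failing_cmd = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('boom'); sys.exit(3)",
]


def run_streaming(cmd):
    """Output goes straight to the terminal; nothing is captured."""
    try:
        subprocess.run(cmd, check=True)
    except subprocess.CalledProcessError as e:
        return e.returncode, e.stderr  # e.stderr is None: nothing was captured


def run_captured(cmd):
    """Output is captured as text, so it is available on the exception."""
    try:
        subprocess.run(cmd, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as e:
        return e.returncode, e.stderr  # e.stderr holds the child's stderr text


streaming_result = run_streaming(failing_cmd)
captured_result = run_captured(failing_cmd)
print(streaming_result)
print(captured_result)
```

Capturing makes the error text loggable but hides genai-perf's live progress output, so streaming to the terminal is a reasonable choice for a long-running benchmark.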