feat: a script to convert BurstGPT to mooncake format (#3479)

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

Commit 87825339 (unverified), authored Oct 07, 2025 by Hongkuan Zhou, committed by GitHub Oct 08, 2025

Showing with 397 additions and 0 deletions

benchmarks/burstgpt_loadgen/README.md +102 -0

benchmarks/burstgpt_loadgen/convert.py +295 -0

--- a/benchmarks/burstgpt_loadgen/README.md
+++ b/benchmarks/burstgpt_loadgen/README.md
+# BurstGPT Load Generator Converter
+A tool to convert CSV files containing ChatGPT/GPT-4 conversation logs into mooncake-style JSONL format for load testing and simulation.
+> [!NOTE]
+> The output does not currently model KV cache reuse: hash IDs are generated at random. We will update the script once [BurstGPT](https://github.com/HPMLL/BurstGPT) adds user session information.
+## Input Format
+The input CSV can be downloaded from [BurstGPT Release v1.1](https://github.com/HPMLL/BurstGPT/releases/tag/v1.1):
+- `Timestamp`: Request timestamp in seconds
+- `Model`: Model name (e.g., "ChatGPT", "GPT-4")
+- `Request tokens`: Number of input tokens
+- `Response tokens`: Number of output tokens
+- `Total tokens`: Total tokens (not used)
+- `Log Type`: Type of log (e.g., "Conversation log", "API log")
+Example:
+```csv
+Timestamp,Model,Request tokens,Response tokens,Total tokens,Log Type
+5,ChatGPT,472,18,490,Conversation log
+45,ChatGPT,1087,230,1317,Conversation log
+118,GPT-4,417,276,693,Conversation log
+```
+## Output Format
+The output is a JSONL file where each line is a JSON object:
+```json
+{"timestamp": 5000, "input_length": 472, "output_length": 18, "hash_ids": [123, 456, 789, ...]}
+```
+Fields:
+- `timestamp`: Request time in milliseconds (integer)
+- `input_length`: Number of input tokens
+- `output_length`: Number of output tokens
+- `hash_ids`: Array of random hash IDs simulating KV cache blocks
+## Usage
+### Basic Usage
+```bash
+python convert.py --input-file <BurstGPT CSV data>
+```
+If `--output-file` is not specified, the output will use the input filename with `.jsonl` extension.
+### Command Line Arguments
+#### Required Arguments
+- `--input-file`: Path to the input CSV file
+#### Optional Arguments
+**Filtering:**
+- `--model`: Keep only rows for one model (`ChatGPT` or `GPT-4`). Omit to keep all models.
+- `--log-type`: Keep only rows of one log type (`Conversation log` or `API log`). Omit to keep all log types.
+- `--num-prompt`: Keep at most this many rows, applied after the other filters. Omit to keep all rows.
+**Timestamp Adjustment:**
+- `--speed-ratio`: Adjust request timing (default: 1.0)
+  - Values > 1: Speed up (e.g., 2.0 = 2x faster)
+  - Values < 1: Slow down (e.g., 0.5 = 2x slower)
+  - Formula: `new_timestamp = old_timestamp / speed_ratio`
+**Hash Generation:**
+- `--block-size`: Block size, in tokens, used to size the hash array (default: 128). Each request gets `ceil(input_length / block_size)` hash IDs.
+- `--num-hash-blocks`: Maximum hash ID value (default: 10000). Each block's hash ID is drawn at random from 0 to this value, inclusive.
+**Output:**
+- `--output-file`: Path to output JSONL file (default: input filename with .jsonl extension)
+## Statistics Output
+After conversion, the script displays statistics about the generated workload:
+```
+============================================================
+STATISTICS
+============================================================
+Input Length (ISL):
+  Min: 37
+  Max: 1528
+  Avg: 705.89
+  Std: 524.33
+Output Length (OSL):
+  Min: 18
+  Max: 1656
+  Avg: 494.67
+  Std: 513.21
+Sequence Length (ISL + OSL):
+  Max: 3184
+Request Rate:
+  Total requests: 9
+  Duration: 405.00 seconds
+  Average RPS: 0.02
+============================================================
+```
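Reviewer note: the per-row mapping the README describes (seconds to milliseconds, the `--speed-ratio` division, and one hash ID per block) can be sketched in isolation as below. This is a minimal illustration, not the script's code: `convert_row` and its `rng` parameter are hypothetical names, and the seeded generator is an assumption added here only to make the sketch reproducible.

```python
import math
import random


def convert_row(timestamp_s, request_tokens, response_tokens,
                speed_ratio=1.0, block_size=128, num_hash_blocks=10000,
                rng=None):
    """Map one BurstGPT CSV row to a mooncake-style record (hypothetical helper)."""
    rng = rng or random.Random()
    # new_timestamp = old_timestamp / speed_ratio, then seconds -> integer milliseconds
    timestamp_ms = int(timestamp_s / speed_ratio * 1000)
    # One hash ID per block of input tokens: ceil(input_length / block_size)
    n_blocks = math.ceil(request_tokens / block_size)
    return {
        "timestamp": timestamp_ms,
        "input_length": int(request_tokens),
        "output_length": int(response_tokens),
        # randint is inclusive on both ends, so IDs fall in [0, num_hash_blocks]
        "hash_ids": [rng.randint(0, num_hash_blocks) for _ in range(n_blocks)],
    }


# Second row of the README's example CSV, replayed at 2x speed
record = convert_row(45, 1087, 230, speed_ratio=2.0, rng=random.Random(0))
print(record["timestamp"], len(record["hash_ids"]))  # → 22500 9
```

With the defaults, a 1087-token request spans `ceil(1087 / 128) = 9` blocks, and a speed ratio of 2.0 moves its arrival from 45 s to 22.5 s.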
--- a/benchmarks/burstgpt_loadgen/convert.py
+++ b/benchmarks/burstgpt_loadgen/convert.py
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+import argparse
+import math
+import os
+import random
+import pandas as pd
+from tqdm import tqdm
+def parse_args():
+    """Parse command line arguments."""
+    parser = argparse.ArgumentParser(description="Convert CSV file to mooncake format")
+    parser.add_argument("--input-file", type=str, required=True, help="Path to the input CSV file")
+    parser.add_argument(
+        "--output-file",
+        type=str,
+        default=None,
+        help="Path to the output mooncake-style jsonl file. If not provided, will use input file name but change extension from .csv to .jsonl",
+    )
+    parser.add_argument(
+        "--model",
+        type=str,
+        default=None,
+        choices=["ChatGPT", "GPT-4"],
+        help="Filter by model (ChatGPT or GPT-4). If not specified, no filtering is applied.",
+    )
+    parser.add_argument(
+        "--log-type",
+        type=str,
+        default=None,
+        choices=["Conversation log", "API log"],
+        help="Filter by log type (Conversation log or API log). If not specified, no filtering is applied.",
+    )
+    parser.add_argument(
+        "--num-prompt",
+        type=int,
+        default=None,
+        help="Limit the number of rows to output after filtering. If not specified, all rows are output.",
+    )
+    parser.add_argument(
+        "--speed-ratio",
+        type=float,
+        default=1.0,
+        help="Speed ratio to adjust timestamps. Values > 1 speed up requests, < 1 slow down. Default: 1.0 (no change)",
+    )
+    parser.add_argument(
+        "--block-size",
+        type=int,
+        default=128,
+        help="Block size for calculating hash array length: ceil(input_length / block_size)",
+    )
+    parser.add_argument(
+        "--num-hash-blocks",
+        type=int,
+        default=10000,
+        help="Maximum hash ID value for random hash generation. Default: 10000",
+    )
+    return parser.parse_args()
+def load_csv(filepath):
+    """Load CSV file into a pandas DataFrame."""
+    try:
+        df = pd.read_csv(filepath)
+        print(f"Successfully loaded {filepath}")
+        print(f"DataFrame shape: {df.shape}")
+        print(f"Columns: {list(df.columns)}")
+        print("First few rows:")
+        print(df.head())
+        return df
+    except FileNotFoundError:
+        print(f"Error: File {filepath} not found")
+        return None
+    except Exception as e:
+        print(f"Error loading CSV: {e}")
+        return None
+def apply_filters(df, model=None, log_type=None, num_prompt=None):
+    """
+    Apply filters to the DataFrame.
+    Args:
+        df: Input DataFrame
+        model: Model to filter by (ChatGPT or GPT-4)
+        log_type: Log type to filter by (Conversation log or API log)
+        num_prompt: Number of rows to keep after filtering
+    Returns:
+        Filtered DataFrame
+    """
+    filtered_df = df.copy()
+    # Apply model filter
+    if model is not None:
+        filtered_df = filtered_df[filtered_df["Model"] == model]
+        print(f"After model filter ({model}): {len(filtered_df)} rows")
+    # Apply log type filter
+    if log_type is not None:
+        filtered_df = filtered_df[filtered_df["Log Type"] == log_type]
+        print(f"After log type filter ({log_type}): {len(filtered_df)} rows")
+    # Apply num_prompt limit
+    if num_prompt is not None:
+        filtered_df = filtered_df.head(num_prompt)
+        print(f"After num_prompt limit ({num_prompt}): {len(filtered_df)} rows")
+    return filtered_df
+def apply_speed_ratio(df, speed_ratio):
+    """
+    Apply speed ratio to timestamps.
+    Args:
+        df: Input DataFrame
+        speed_ratio: Speed ratio to adjust timestamps (timestamp /= speed_ratio)
+    Returns:
+        DataFrame with adjusted timestamps
+    """
+    if speed_ratio == 1.0:
+        print("Speed ratio is 1.0, no timestamp adjustment needed")
+        return df
+    adjusted_df = df.copy()
+    adjusted_df["Timestamp"] = adjusted_df["Timestamp"] / speed_ratio
+    print(f"Applied speed ratio: {speed_ratio}")
+    print(
+        f"Original timestamps: {df['Timestamp'].min():.2f} - {df['Timestamp'].max():.2f}"
+    )
+    print(
+        f"Adjusted timestamps: {adjusted_df['Timestamp'].min():.2f} - {adjusted_df['Timestamp'].max():.2f}"
+    )
+    return adjusted_df
+def convert_to_mooncake(df, block_size, num_hash_blocks):
+    """
+    Convert DataFrame to mooncake format.
+    Args:
+        df: Input DataFrame with columns: Timestamp, Request tokens, Response tokens
+        block_size: Block size for calculating hash array length
+        num_hash_blocks: Maximum hash ID value for random generation
+    Returns:
+        DataFrame in mooncake format with columns: timestamp, input_length, output_length, hash_ids
+    """
+    mooncake_data = []
+    for _, row in tqdm(df.iterrows(), total=len(df)):
+        # Convert timestamp from seconds to milliseconds (integer)
+        timestamp_ms = int(row["Timestamp"] * 1000)
+        # Map request tokens to input_length and response tokens to output_length
+        input_length = int(row["Request tokens"])
+        output_length = int(row["Response tokens"])
+        # Calculate hash array length based on block size
+        hash_array_length = math.ceil(input_length / block_size)
+        # Generate random hash IDs
+        hash_ids = [
+            random.randint(0, num_hash_blocks) for _ in range(hash_array_length)
+        ]
+        mooncake_data.append(
+            {
+                "timestamp": timestamp_ms,
+                "input_length": input_length,
+                "output_length": output_length,
+                "hash_ids": hash_ids,
+            }
+        )
+    print(f"Converted {len(mooncake_data)} rows to mooncake format")
+    return pd.DataFrame(mooncake_data, columns=["timestamp", "input_length", "output_length", "hash_ids"])
+def print_statistics(df):
+    """
+    Print statistics about the converted mooncake data.
+    Args:
+        df: DataFrame in mooncake format
+    """
+    print("\n" + "=" * 60)
+    print("STATISTICS")
+    print("=" * 60)
+    # Input length statistics
+    isl_min = df["input_length"].min()
+    isl_max = df["input_length"].max()
+    isl_avg = df["input_length"].mean()
+    isl_std = df["input_length"].std()
+    print("\nInput Length (ISL):")
+    print(f"  Min: {isl_min}")
+    print(f"  Max: {isl_max}")
+    print(f"  Avg: {isl_avg:.2f}")
+    print(f"  Std: {isl_std:.2f}")
+    # Output length statistics
+    osl_min = df["output_length"].min()
+    osl_max = df["output_length"].max()
+    osl_avg = df["output_length"].mean()
+    osl_std = df["output_length"].std()
+    print("\nOutput Length (OSL):")
+    print(f"  Min: {osl_min}")
+    print(f"  Max: {osl_max}")
+    print(f"  Avg: {osl_avg:.2f}")
+    print(f"  Std: {osl_std:.2f}")
+    # Sequence length (ISL + OSL) - calculate without modifying df
+    max_seq_len = (df["input_length"] + df["output_length"]).max()
+    print("\nSequence Length (ISL + OSL):")
+    print(f"  Max: {max_seq_len}")
+    # RPS calculation
+    if len(df) > 1:
+        # Timestamps are in milliseconds, convert to seconds
+        min_timestamp_s = df["timestamp"].min() / 1000.0
+        max_timestamp_s = df["timestamp"].max() / 1000.0
+        duration_s = max_timestamp_s - min_timestamp_s
+        if duration_s > 0:
+            avg_rps = len(df) / duration_s
+            print("\nRequest Rate:")
+            print(f"  Total requests: {len(df)}")
+            print(f"  Duration: {duration_s:.2f} seconds")
+            print(f"  Average RPS: {avg_rps:.2f}")
+        else:
+            print("\nRequest Rate:")
+            print(f"  Total requests: {len(df)}")
+            print("  Duration: 0 seconds (all requests at same timestamp)")
+            print("  Average RPS: N/A")
+    else:
+        print("\nRequest Rate:")
+        print(f"  Total requests: {len(df)}")
+        print("  Average RPS: N/A (fewer than 2 requests)")
+    print("=" * 60)
+def main():
+    args = parse_args()
+    # Load the CSV file
+    df = load_csv(args.input_file)
+    if df is None:
+        return 1
+    # Apply filters
+    print("\nApplying filters...")
+    print(f"Initial rows: {len(df)}")
+    filtered_df = apply_filters(
+        df, model=args.model, log_type=args.log_type, num_prompt=args.num_prompt
+    )
+    # Apply speed ratio to timestamps
+    adjusted_df = apply_speed_ratio(filtered_df, args.speed_ratio)
+    # Convert to mooncake format
+    print("\nConverting to mooncake format...")
+    print(f"Block size: {args.block_size}")
+    print(f"Num hash blocks: {args.num_hash_blocks}")
+    mooncake_df = convert_to_mooncake(
+        adjusted_df, args.block_size, args.num_hash_blocks
+    )
+    # Print statistics
+    print_statistics(mooncake_df)
+    # Save to file
+    # Determine output file name
+    if args.output_file is None:
+        # Use input file name but change extension from .csv to .jsonl
+        base_name = os.path.splitext(args.input_file)[0]
+        args.output_file = base_name + ".jsonl"
+    mooncake_df.to_json(args.output_file, orient="records", lines=True)
+    print(f"\nSaved {len(mooncake_df)} rows to {args.output_file}")
+    return 0
+if __name__ == "__main__":
+    raise SystemExit(main())
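Reviewer note: one way to sanity-check a converted trace is to re-read the JSONL and confirm that each record satisfies the invariants the script implies, namely an integer millisecond timestamp and `ceil(input_length / block_size)` hash IDs. The sketch below uses only the standard library; `check_trace` is a hypothetical helper, not part of this commit, and the sample lines are illustrative records rather than real output.

```python
import json
import math


def check_trace(lines, block_size=128):
    """Validate mooncake-style JSONL records; return how many were checked."""
    count = 0
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        rec = json.loads(line)
        if not isinstance(rec["timestamp"], int) or rec["timestamp"] < 0:
            raise ValueError(f"line {lineno}: timestamp must be a non-negative integer")
        expected = math.ceil(rec["input_length"] / block_size)
        if len(rec["hash_ids"]) != expected:
            raise ValueError(
                f"line {lineno}: expected {expected} hash IDs, got {len(rec['hash_ids'])}"
            )
        count += 1
    return count


# Illustrative records shaped like the README's output example
sample = [
    '{"timestamp": 5000, "input_length": 472, "output_length": 18, "hash_ids": [1, 2, 3, 4]}',
    '{"timestamp": 45000, "input_length": 1087, "output_length": 230, '
    '"hash_ids": [5, 6, 7, 8, 9, 10, 11, 12, 13]}',
]
print(check_trace(sample))  # → 2
```

Because the function accepts any iterable of lines, an open file handle for the generated `.jsonl` can be passed directly, with `block_size` set to the value used during conversion.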