README.md 3.29 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
# BurstGPT Load Generator Converter

A tool to convert CSV files containing ChatGPT/GPT-4 conversation logs into mooncake-style JSONL format for load testing and simulation.

> [!NOTE]
> Currently, KV reuse is not considered in the output. We will update the script once [BurstGPT](https://github.com/HPMLL/BurstGPT) adds user session information.

## Input Format

The input CSV can be downloaded from [BurstGPT Release v1.1](https://github.com/HPMLL/BurstGPT/releases/tag/v1.1):
- `Timestamp`: Request timestamp in seconds
- `Model`: Model name (e.g., "ChatGPT", "GPT-4")
- `Request tokens`: Number of input tokens
- `Response tokens`: Number of output tokens
- `Total tokens`: Total tokens (not used)
- `Log Type`: Type of log (e.g., "Conversation log", "API log")

Example:
```csv
Timestamp,Model,Request tokens,Response tokens,Total tokens,Log Type
5,ChatGPT,472,18,490,Conversation log
45,ChatGPT,1087,230,1317,Conversation log
118,GPT-4,417,276,693,Conversation log
```

## Output Format

The output is a JSONL file where each line is a JSON object:
```json
{"timestamp": 5000, "input_length": 472, "output_length": 18, "hash_ids": [123, 456, 789, ...]}
```

Fields:
- `timestamp`: Request time in milliseconds (integer)
- `input_length`: Number of input tokens
- `output_length`: Number of output tokens
- `hash_ids`: Array of random hash IDs simulating KV cache blocks

## Usage

### Basic Usage

```bash
python convert.py --input-file <BurstGPT CSV data>
```

If `--output-file` is not specified, the output will use the input filename with `.jsonl` extension.

### Command Line Arguments

#### Required Arguments
- `--input-file`: Path to the input CSV file

#### Optional Arguments

**Filtering:**
- `--model`: Filter by model (`ChatGPT` or `GPT-4`), None for no filtering
- `--log-type`: Filter by log type (`Conversation log` or `API log`), None for no filtering
59
60
- `--skip-num-prompt`: Skip the first N rows after filtering (default: 0). Applied **before** `--num-prompt`.
- `--num-prompt`: Limit number of rows in the final output, None for no filtering (applied **after** `--skip-num-prompt`)
61
62
63
64
65
66

**Timestamp Adjustment:**
- `--speed-ratio`: Adjust request timing (default: 1.0)
  - Values > 1: Speed up (e.g., 2.0 = 2x faster)
  - Values < 1: Slow down (e.g., 0.5 = 2x slower)
  - Formula: `new_timestamp = old_timestamp / speed_ratio`
67
  - After filtering/skip/cap and speed-ratio adjustment, timestamps are shifted so the first kept request starts at `t=0`.
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104

**Hash Generation:**
- `--block-size`: Block size in mooncake traces (default: 128)
- `--num-hash-blocks`: Maximum hash ID value (default: 10000). Hash IDs are randomly chosen from 0 to this value for each block.
**Output:**
- `--output-file`: Path to output JSONL file (default: input filename with .jsonl extension)

## Statistics Output

After conversion, the script displays statistics about the generated workload:

```
============================================================
STATISTICS
============================================================

Input Length (ISL):
  Min: 37
  Max: 1528
  Avg: 705.89
  Std: 524.33

Output Length (OSL):
  Min: 18
  Max: 1656
  Avg: 494.67
  Std: 513.21

Sequence Length (ISL + OSL):
  Max: 3184

Request Rate:
  Total requests: 9
  Duration: 405.00 seconds
  Average RPS: 0.02
============================================================
```