# NAT Trace Converter

Convert NeMo Agent Toolkit (NAT) profiler traces to mooncake format for aiperf benchmarking.

## Overview

This converter transforms the `all_requests_profiler_traces.json` file produced by the NAT profiler into mooncake-style JSONL that aiperf can use for:
- Multi-turn conversation replay with `session_id` serialization
- Prefix cache benchmarking with `hash_ids`

## Obtaining Trace Data

Traces are generated by running evaluations with the [NeMo Agent Toolkit](https://github.com/NVIDIA/NeMo-Agent-Toolkit) profiler. See the [NAT documentation](https://docs.nvidia.com/nemo/agent-toolkit/latest/) for details on running agent workflows with profiling enabled.

An example trace file will be open-sourced soon to make benchmarking more accessible.

## Input Format

NAT profiler trace JSON (`all_requests_profiler_traces.json`):
```json
[
  {
    "request_number": 0,
    "intermediate_steps": [
      {
        "payload": {
          "event_type": "LLM_START",
          "metadata": {"chat_inputs": [{"content": "..."}]},
          "name": "llama-3.3-70b",
          "UUID": "..."
        }
      },
      {
        "payload": {
          "event_type": "LLM_END",
          "usage_info": {"token_usage": {"prompt_tokens": 9176, "completion_tokens": 142}},
          "UUID": "..."
        }
      }
    ]
  }
]
```
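As a rough illustration of how this structure can be read, the sketch below pairs each `LLM_START` event with its `LLM_END` event and pulls the token counts out of `usage_info`. It is not the code in `convert.py`: the function names are hypothetical, and it assumes that a start/end pair shares the same `UUID` and that the session ID is derived from `request_number`.

```python
def iter_llm_calls(trace):
    """Yield (request_number, start_payload, end_payload) for each LLM call."""
    for request in trace:
        pending = {}  # UUID -> LLM_START payload still waiting for its LLM_END
        for step in request.get("intermediate_steps", []):
            payload = step.get("payload", {})
            event = payload.get("event_type")
            uuid = payload.get("UUID")
            if event == "LLM_START":
                pending[uuid] = payload
            elif event == "LLM_END" and uuid in pending:
                # Assumption: matching start/end events carry the same UUID.
                yield request["request_number"], pending.pop(uuid), payload


def to_records(trace):
    """Build one record per LLM call, without hash_ids."""
    records = []
    for number, _start, end in iter_llm_calls(trace):
        usage = end.get("usage_info", {}).get("token_usage", {})
        records.append({
            # Assumption: one conversation per traced request.
            "session_id": f"conv_{number}",
            "input_length": usage.get("prompt_tokens", 0),
            "output_length": usage.get("completion_tokens", 0),
        })
    return records
```

Loading the trace with `json.load` and passing the resulting list to `to_records` would give one dictionary per LLM call, in the order the calls appear in the trace.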

## Output Format

Mooncake JSONL with session-based serialization:
```json
{"session_id": "conv_0", "input_length": 9176, "output_length": 142, "hash_ids": [1, 2, 3]}
{"session_id": "conv_0", "input_length": 9500, "output_length": 98, "hash_ids": [1, 2, 3, 4]}
{"session_id": "conv_1", "input_length": 8234, "output_length": 156, "hash_ids": [5, 6]}
```
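The `hash_ids` lists above share a prefix whenever two prompts start with the same tokens. One common way to get that behaviour is to split the token IDs into fixed-size blocks and hash each block together with everything before it. The sketch below shows that idea only; the class name, the use of SHA-256, and the handling of the trailing partial block are assumptions, and the converter itself may differ.

```python
import hashlib


class BlockHasher:
    """Assign small integer IDs to token blocks so shared prefixes share IDs."""

    def __init__(self, block_size=128):
        self.block_size = block_size
        self._ids = {}  # chained digest -> integer ID, shared across requests

    def hash_ids(self, token_ids):
        ids = []
        parent = b""
        for i in range(0, len(token_ids), self.block_size):
            block = token_ids[i:i + self.block_size]
            # Chain in the parent digest so an ID depends on the whole prefix,
            # not only on the tokens inside this block.
            parent = hashlib.sha256(parent + repr(block).encode()).digest()
            # New digests get the next unused integer; known ones are reused.
            ids.append(self._ids.setdefault(parent, len(self._ids) + 1))
        return ids
```

With a small block size the effect is easy to see: a second prompt that extends the first reuses the first prompt's IDs and only adds new ones for the extra tokens.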

## Usage

Basic conversion:
```bash
python convert.py --input-file /path/to/all_requests_profiler_traces.json
```

With custom tokenizer:
```bash
python convert.py \
    --input-file /path/to/all_requests_profiler_traces.json \
    --tokenizer meta-llama/Llama-3.3-70B-Instruct \
    --block-size 128
```

Limit number of requests:
```bash
python convert.py \
    --input-file /path/to/all_requests_profiler_traces.json \
    --num-requests 100 \
    --skip-requests 10
```

## Arguments

| Argument | Description | Default |
|----------|-------------|---------|
| `--input-file` | Path to NAT profiler JSON | Required |
| `--output-file` | Output JSONL path | `<input>_mooncake.jsonl` |
| `--tokenizer` | HuggingFace tokenizer name | Auto-inferred from trace |
| `--block-size` | Block size for hash generation | 128 |
| `--num-requests` | Max requests to process | All |
| `--skip-requests` | Skip first N requests | 0 |

---

# Telemetry Trace Converter

Convert OpenAI-style telemetry JSONL (e.g., from agentic research pipelines) to mooncake format for aiperf benchmarking.

## Overview

This converter transforms `telemetry.jsonl`, which contains `llm_call` and `tool_call` events, into mooncake-style JSONL. It identifies six agent types from the telemetry and tags each entry accordingly.

## Input Format

Telemetry JSONL with one event per line:
```json
{"event_type": "llm_call", "timestamp": "...", "session_id": "...", "latency_priority": "HIGH", "latency_ms": 738.22, "request_payload": {"messages": [...], "model": "gpt-5.2"}, "response_payload": {"usage": {"prompt_tokens": 256, "completion_tokens": 4}}}
{"event_type": "tool_call", "tool_name": "tavily_web_search", "session_id": "...", "start_time": "...", "end_time": "...", "duration_ms": 181.08}
```

Only `llm_call` events are processed; `tool_call` events are dropped.
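The filtering step can be sketched as follows. This is an illustration rather than the code in `convert_telemetry.py`; the function names are hypothetical, and the field names are taken from the sample events above.

```python
import json


def read_llm_calls(lines):
    """Yield parsed `llm_call` events from an iterable of JSONL lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between events
        event = json.loads(line)
        if event.get("event_type") == "llm_call":
            yield event  # tool_call and any other event types are skipped


def token_counts(event):
    """Return (prompt_tokens, completion_tokens) for one llm_call event."""
    usage = event.get("response_payload", {}).get("usage", {})
    return usage.get("prompt_tokens", 0), usage.get("completion_tokens", 0)
```

Because `read_llm_calls` accepts any iterable of strings, an open file handle can be passed to it directly.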

## Output Format

Mooncake JSONL with agent type and priority:
```json
{"session_id": "082e33c7-...", "agent_type": "deep_coordinator", "input_length": 2426, "output_length": 33, "hash_ids": [1, 2, 3], "priority": "HIGH"}
{"session_id": "082e33c7-...", "agent_type": "research_worker", "input_length": 4800, "output_length": 154, "hash_ids": [1, 2, 3, 4], "priority": "LOW"}
```

## Agent Types

Agents are identified by system prompt prefix matching:

| Agent Type | System Prompt Prefix |
|---|---|
| `deep_coordinator` | `You are a Deep Research agent` |
| `research_worker` | `Gather and synthesize comprehe` |
| `research_planner` | `For the given task, generate a` |
| `shallow_agent` | `Current date and time:` |
| `classifier` | None (no system message); user message contains `Classify` |
| `complexity_analyzer` | None (no system message); user message contains `complexity analyzer` |

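The matching rules in the table can be sketched in a few lines. This is a minimal illustration, not the converter's code: it assumes OpenAI-style messages with `role` and string `content` fields, case-sensitive matching, and a hypothetical `unknown` fallback for prompts that match no rule.

```python
SYSTEM_PREFIXES = {
    "deep_coordinator": "You are a Deep Research agent",
    "research_worker": "Gather and synthesize comprehe",
    "research_planner": "For the given task, generate a",
    "shallow_agent": "Current date and time:",
}


def _first_content(messages, role):
    """Return the text of the first message with this role, or None."""
    for message in messages:
        if message.get("role") == role:
            content = message.get("content")
            return content if isinstance(content, str) else ""
    return None


def classify_agent(messages):
    """Map a request's messages to an agent type using the table's rules."""
    system = _first_content(messages, "system")
    if system is not None:
        for agent_type, prefix in SYSTEM_PREFIXES.items():
            if system.startswith(prefix):
                return agent_type
        return "unknown"  # hypothetical fallback, not part of the table
    # No system message: fall back to keywords in the user message.
    user = _first_content(messages, "user") or ""
    if "complexity analyzer" in user:
        return "complexity_analyzer"
    if "Classify" in user:
        return "classifier"
    return "unknown"
```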
## Usage

Basic conversion:
```bash
python convert_telemetry.py --input-file /path/to/telemetry.jsonl
```

With custom tokenizer:
```bash
python convert_telemetry.py \
    --input-file /path/to/telemetry.jsonl \
    --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --block-size 128
```

## Arguments

| Argument | Description | Default |
|---|---|---|
| `--input-file` | Path to telemetry JSONL | Required |
| `--output-file` | Output JSONL path | `<input>_mooncake.jsonl` |
| `--tokenizer` | HuggingFace tokenizer name | `deepseek-ai/DeepSeek-R1-Distill-Llama-8B` |
| `--block-size` | Block size for hash generation | 64 |

---

## Running with aiperf

After conversion, pass the generated JSONL file to aiperf:
```bash
aiperf profile \
    --model <model-name> \
    --tokenizer <tokenizer> \
    --endpoint-type chat \
    --streaming \
    --url http://localhost:8000 \
    --input-file output_mooncake.jsonl \
    --custom-dataset-type mooncake_trace \
    --concurrency 10
```

The `session_id` field ensures:
- Turns within a conversation are serialized (executed in order)
- Different conversations run in parallel up to `--concurrency`
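
The two guarantees above can be pictured with a small `asyncio` sketch. This is not how aiperf is implemented; it only illustrates the described semantics. The `send` callback is a hypothetical stand-in for whatever issues a single request, and capping the number of active conversations with a semaphore is an assumption about what `--concurrency` limits.

```python
import asyncio
from collections import defaultdict


async def replay(records, send, concurrency=10):
    """Replay records in order within a session, with sessions in parallel."""
    sessions = defaultdict(list)
    for record in records:
        sessions[record["session_id"]].append(record)

    limit = asyncio.Semaphore(concurrency)

    async def run_session(turns):
        async with limit:  # one slot per active conversation
            for turn in turns:  # each turn waits for the previous one
                await send(turn)

    await asyncio.gather(*(run_session(turns) for turns in sessions.values()))
```

Calling `asyncio.run(replay(records, send))` with records loaded from the mooncake JSONL would send each conversation's turns strictly in file order while letting up to `concurrency` conversations progress at once.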