<!-- SPDX-License-Identifier: Apache-2.0 -->

# BPF Tracing Scripts

eBPF/bpftrace scripts for low-overhead kernel-level tracing of the Dynamo frontend. These scripts attach to kernel tracepoints and kprobes to measure scheduling, syscall, TCP, and context-switch behavior without modifying the application.

## Setup

```bash
# Full setup: install bpftrace, configure kernel, grant capabilities
sudo bash setup.sh

# Or step by step:
sudo bash setup.sh --install    # install bpftrace
sudo bash setup.sh --kernel     # set perf_event_paranoid=-1, kptr_restrict=0
sudo bash setup.sh --caps       # grant capabilities (run bpftrace without sudo)

# Check current state
sudo bash setup.sh --check

# Undo everything
sudo bash setup.sh --reset
```

After granting capabilities, bpftrace runs **without sudo**.
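
To confirm by hand that the kernel step took effect, a sketch like the following reads the two sysctls named above. The function name `bpf_sysctls_ready` and its optional directory argument are illustrative assumptions, not part of `setup.sh`; `setup.sh --check` remains the supported way to inspect state.

```bash
# Hypothetical helper (not part of setup.sh): verify the two sysctls that
# `setup.sh --kernel` is described as setting.
# $1 = directory holding the sysctl files (defaults to /proc/sys/kernel).
bpf_sysctls_ready() {
  local dir="${1:-/proc/sys/kernel}"
  local paranoid kptr
  paranoid=$(cat "$dir/perf_event_paranoid" 2>/dev/null) || return 2
  kptr=$(cat "$dir/kptr_restrict" 2>/dev/null) || return 2
  if [ "$paranoid" = "-1" ] && [ "$kptr" = "0" ]; then
    echo "ok: perf_event_paranoid=$paranoid kptr_restrict=$kptr"
  else
    echo "not ready: perf_event_paranoid=$paranoid kptr_restrict=$kptr"
    return 1
  fi
}
```

Calling `bpf_sysctls_ready` with no argument checks the live system; passing a directory makes the check easy to exercise against fixture files.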

## Quick Start

```bash
# Get the frontend PID from a running capture
FRONTEND_PID=$(pgrep -f "dynamo.frontend" | head -1)

# Run a single script
./run.sh --pid $FRONTEND_PID offcputime

# List all available scripts
./run.sh --list

# Check if BPF environment is ready
./run.sh --check
```
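
If `pgrep` matches nothing, `FRONTEND_PID` is empty and the trace would be started without a target. The guard below is a minimal sketch of one way to catch that before running a script; `require_pid` is a hypothetical name, not something `run.sh` provides.

```bash
# Hypothetical guard: accept only a numeric PID that refers to a live process.
require_pid() {
  local pid="${1:-}"
  case "$pid" in
    ''|*[!0-9]*)
      echo "error: expected a numeric PID, got '$pid'" >&2
      return 1 ;;
  esac
  # kill -0 delivers no signal; it fails if the process is gone
  # (or belongs to a user this shell may not signal).
  if ! kill -0 "$pid" 2>/dev/null; then
    echo "error: PID $pid is not running (or not signalable by this user)" >&2
    return 1
  fi
}

# Usage (left commented so that sourcing this snippet starts nothing):
# FRONTEND_PID=$(pgrep -f "dynamo.frontend" | head -1)
# require_pid "$FRONTEND_PID" && ./run.sh --pid "$FRONTEND_PID" offcputime
```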

## Scripts

| Script | What it measures | Attach to PID? |
|--------|-----------------|----------------|
| `offcputime.bt` | Kernel stacks of threads going off-CPU (>1ms). Shows why threads block (futex, epoll, I/O). | Yes |
| `syscall_latency.bt` | Slow syscalls (>10us) by syscall ID, filtered to Tokio workers. | Yes |
| `runqlat.bt` | Scheduler run-queue latency — how long threads wait to be scheduled after wakeup. | Yes |
| `context_switches.bt` | Context switch rate and overhead per thread. | Yes |
| `cpudist.bt` | On-CPU time distribution per thread (how long a thread runs before being preempted). | Yes |
| `funclatency.bt` | Latency histogram for a specific kernel/user function (template — edit the probe). | Yes |
| `transport_latency.bt` | Socket read/write latency via syscall tracepoints. | Yes |
| `tcplife.bt` | TCP connection lifetimes — shows short-lived connections wasting setup cost. | No (system-wide) |
| `tcpretrans.bt` | TCP retransmission events. | No (system-wide) |

## Recommended Order for Frontend Analysis

**1. Start with off-CPU analysis** — identifies what's blocking Tokio workers:
```bash
bpftrace -p $FRONTEND_PID offcputime.bt
```
Look for `futex_wait_queue` (mutex contention), `ep_poll` (normal I/O), `schedule_timeout` (timers).

**2. Syscall latency** — find the expensive syscalls:
```bash
bpftrace -p $FRONTEND_PID syscall_latency.bt
```
A high average latency for `futex` indicates lock contention; time spent in `writev` reflects TCP send overhead.

**3. Run-queue latency** — check if threads are starved for CPU:
```bash
bpftrace -p $FRONTEND_PID runqlat.bt
```
A p99 above 1 ms means CPU contention is contributing to tail latency.
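
bpftrace histograms report bucket counts rather than percentiles, so the p99 has to be read off the buckets. The sketch below assumes `runqlat.bt` prints a single standard bpftrace `hist()` map (lines such as `[2, 4)   16 |@@@ |`); that output shape is an assumption about the script, and `hist_p99_bucket` is a hypothetical helper name.

```bash
# Hypothetical helper: read bpftrace hist() output on stdin and print the
# label of the bucket that contains the 99th percentile.
# If several histograms are present, their buckets are simply read in order,
# so feed it one histogram at a time.
hist_p99_bucket() {
  awk '
    /^\[/ {
      split($0, parts, "|")             # parts[1] = "<bucket label> <count> "
      n = split(parts[1], fields, " ")  # last whitespace-separated field is the count
      label = parts[1]
      sub(/[ \t]+[0-9]+[ \t]*$/, "", label)  # strip the count, keep the label
      labels[++buckets] = label
      counts[buckets] = fields[n]
      total += fields[n]
    }
    END {
      if (total == 0) exit 1
      for (i = 1; i <= buckets; i++) {
        cum += counts[i]
        # Integer comparison avoids floating-point rounding at the boundary.
        if (cum * 100 >= total * 99) { print labels[i]; exit 0 }
      }
    }'
}
```

If the script records microseconds, a result of `[1K, 2K)` or higher corresponds to a p99 of at least 1 ms. Saving the trace output to a file and running `hist_p99_bucket < runqlat.out` is one way to apply it.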

**4. TCP connection lifetimes** — verify connection reuse:
```bash
bpftrace tcplife.bt
```
Many short-lived connections (< 100 ms) to localhost point to a connection pooling opportunity (Part 3 bottleneck).

**5. Context switches** — quantify scheduling overhead:
```bash
bpftrace -p $FRONTEND_PID context_switches.bt
```

## Interpreting Results

### Off-CPU Stacks
```
@blocked_us[tokio-runtime-w,
    futex_wait_queue        ← mutex/condvar contention
    futex / do_futex
    __x64_sys_futex
    entry_SYSCALL_64
]: [1ms, 10ms) = 4812
```
- `futex_wait_queue` → thread blocked on a mutex or condvar (check Prometheus registry, TCP endpoint table)
- `ep_poll` → epoll_wait (normal Tokio I/O loop — healthy)
- `schedule_timeout` → timer/sleep
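- `do_wait` → `wait4`/`waitid` on a child process (Tokio join handles and channel receives normally show up as `futex` or `ep_poll` waits instead)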
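
The mapping above can be applied mechanically when scanning many stacks. The following is a minimal sketch with a hypothetical helper name, `classify_offcpu_frame`; it matches on prefixes because the exact symbol can vary between kernels (some name the futex frame `futex_wait_queue_me`).

```bash
# Hypothetical helper: map a kernel frame name to a coarse wait category.
# Prefix matches tolerate symbol variants such as futex_wait_queue_me.
classify_offcpu_frame() {
  case "$1" in
    futex_wait_queue*) echo "lock" ;;     # mutex/condvar wait
    ep_poll*)          echo "io-wait" ;;  # epoll_wait, normal Tokio I/O loop
    schedule_timeout*) echo "timer" ;;    # timer or sleep
    do_wait*)          echo "wait" ;;     # the do_wait entry in the list above
    *)                 echo "other" ;;
  esac
}
```

A stack dominated by `lock` is worth investigating; one dominated by `io-wait` is the healthy idle case.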

### Syscall Latency
```
@slow[futex]: count=6504, avg=110ms, total=717s
```
A high `futex` total means lock contention dominates. Cross-reference with the off-CPU stacks.

### TCP Lifetimes
```
PID   COMM        LADDR     LPORT  RADDR     RPORT  TX_KB  RX_KB  MS
12345 tokio-run   127.0.0.1 43210  127.0.0.1 8081   0      128    45
```
Many connections to mocker ports with lifetimes under 100 ms indicate that connections are not being pooled.
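
To quantify this rather than eyeball it, short-lived connections can be counted per remote port. The sketch below assumes the nine-column layout shown above, with `RPORT` in column 6, `MS` in the last column, and no spaces inside `COMM`; `short_lived_by_port` is a hypothetical helper name.

```bash
# Hypothetical helper: count connections shorter than a threshold (default
# 100 ms), grouped by remote port. Reads tcplife-style rows on stdin.
short_lived_by_port() {
  local threshold_ms="${1:-100}"
  awk -v limit="$threshold_ms" '
    # Skip the header and any non-data lines: data rows start with a numeric PID.
    $1 ~ /^[0-9]+$/ && ($NF + 0) < limit { short[$6]++ }
    END { for (port in short) print port, short[port] }
  ' | sort -n
}
```

Each output line is `<remote port> <count>`; a large count against a single mocker port is the pattern described above.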

## Directory Layout

```
bpf/
├── run.sh          # Script runner with capability detection
├── setup.sh        # Install bpftrace, configure kernel, grant caps
├── README.md
└── traces/         # bpftrace probe scripts (.bt files)
    ├── runqlat.bt
    ├── cpudist.bt
    ├── offcputime.bt
    ├── funclatency.bt
    ├── transport_latency.bt
    ├── tcplife.bt
    ├── tcpretrans.bt
    ├── syscall_latency.bt
    └── context_switches.bt
```

## Integration with Capture Script

The main capture script can run BPF traces automatically:
```bash
# Include BPF traces in the full capture (requires root)
sudo bash benchmarks/frontend/scripts/full_observability_run_perf.sh \
  --skip-nsys --skip-perf \
  --model Qwen/Qwen3-0.6B --concurrency 64 --num-requests 4096

# Compare event planes with BPF tracing:
sudo bash full_observability_run_perf.sh \
  --skip-nsys --skip-perf \
  --model Qwen/Qwen3-0.6B --concurrency 64 --num-requests 4096 \
  --event-plane zmq

sudo bash full_observability_run_perf.sh \
  --skip-nsys --skip-perf \
  --model Qwen/Qwen3-0.6B --concurrency 64 --num-requests 4096 \
  --event-plane nats

# BPF output appears in artifacts/obs_<timestamp>/bpf/
```

## Requirements

- Linux kernel >= 4.18 (for BPF CO-RE support)
- `bpftrace` >= 0.16
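- Root, or the `CAP_BPF`, `CAP_PERFMON`, `CAP_NET_ADMIN`, and `CAP_SYS_PTRACE` capabilities (`CAP_BPF` and `CAP_PERFMON` exist only on Linux >= 5.8; on older kernels, run as root)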
- Kernel headers (for some kprobe scripts): `apt install linux-headers-$(uname -r)`
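
The version floors above can be checked with a small comparison helper. This is a sketch, not what `run.sh --check` does: `version_ge` is a hypothetical name, it relies on `sort -V` from GNU coreutils, and the commented usage assumes `bpftrace --version` prints something like `bpftrace v0.16.0`.

```bash
# Hypothetical helper: succeed (exit 0) when version $1 is >= version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Usage (left commented so that sourcing this snippet runs nothing):
# kernel=$(uname -r | cut -d- -f1)      # e.g. 5.15.0 from 5.15.0-91-generic
# version_ge "$kernel" 4.18 || echo "kernel older than 4.18"
# bt=$(bpftrace --version | awk '{print $2}' | sed 's/^v//')
# version_ge "$bt" 0.16 || echo "bpftrace older than 0.16"
```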