feat(perf): add BPF for frontend perf tracing (#6737)

Unverified Commit 35a796a3 authored Mar 02, 2026 by Biswa Panda Committed by GitHub Mar 02, 2026
12 changed files
--- a/benchmarks/frontend/scripts/bpf/README.md
+++ b/benchmarks/frontend/scripts/bpf/README.md
+<!-- SPDX-License-Identifier: Apache-2.0 -->
+
+# BPF Tracing Scripts
+
+eBPF/bpftrace scripts for low-overhead kernel-level tracing of the Dynamo frontend. These scripts attach to kernel tracepoints and kprobes to measure scheduling, syscall, TCP, and context-switch behavior without modifying the application.
+
+## Setup
+
+```bash
+# Full setup: install bpftrace, configure kernel, grant capabilities
+sudo bash setup.sh
+
+# Or step by step:
+sudo bash setup.sh --install    # install bpftrace
+sudo bash setup.sh --kernel     # set perf_event_paranoid=-1, kptr_restrict=0
+sudo bash setup.sh --caps       # grant capabilities (run bpftrace without sudo)
+
+# Check current state
+sudo bash setup.sh --check
+
+# Undo everything
+sudo bash setup.sh --reset
+```
+
+After granting capabilities, bpftrace runs **without sudo**.
+
+## Quick Start
+
+```bash
+# Get the frontend PID from a running capture
+FRONTEND_PID=$(pgrep -f "dynamo.frontend" | head -1)
+
+# Run a single script
+./run.sh --pid $FRONTEND_PID offcputime
+
+# List all available scripts
+./run.sh --list
+
+# Check if BPF environment is ready
+./run.sh --check
+```
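If no frontend process is running, the `pgrep` pipeline above leaves `FRONTEND_PID` empty. Because the variable is unquoted, `--pid` then consumes `offcputime` as its value and `run.sh` stops with "No script specified". A minimal guard, assuming a hypothetical helper named `require_pid` that is not part of this commit:

```bash
# Hypothetical helper (not part of this commit): validate a PID before tracing.
require_pid() {
    local pid="$1"
    # Reject empty or non-numeric values.
    case "$pid" in
        ''|*[!0-9]*)
            echo "ERROR: no frontend PID found (got '${pid}')" >&2
            return 1
            ;;
    esac
    # kill -0 delivers no signal; it only checks that the process exists.
    # It also fails when we lack permission to signal the process.
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "ERROR: process ${pid} is not running or not signalable" >&2
        return 1
    fi
    echo "$pid"
}
```

Possible usage: `FRONTEND_PID=$(require_pid "$(pgrep -f "dynamo.frontend" | head -1)") && ./run.sh --pid "$FRONTEND_PID" offcputime`.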
+
+## Scripts
+
+| Script | What it measures | Attach to PID? |
+|--------|-----------------|----------------|
+| `offcputime.bt` | Kernel stacks of threads going off-CPU (>1ms). Shows why threads block (futex, epoll, I/O). | Yes |
+| `syscall_latency.bt` | Slow syscalls (>10us) by syscall number, filtered to `tokio-runtime-w` and `dynamo-frontend` threads. | Yes |
+| `runqlat.bt` | Scheduler run-queue latency — how long threads wait to be scheduled after wakeup. | Yes |
+| `context_switches.bt` | Voluntary and involuntary context-switch counts per thread, reported every 5 s. | Yes |
+| `cpudist.bt` | On-CPU time distribution per thread (how long a thread runs before being preempted). | Yes |
+| `funclatency.bt` | Latency histogram for a specific kernel/user function (template — edit the probe). | Yes |
+| `transport_latency.bt` | Socket read/write latency via syscall tracepoints. | Yes |
+| `tcplife.bt` | TCP connection lifetimes — shows short-lived connections wasting setup cost. | No (system-wide) |
+| `tcpretrans.bt` | TCP retransmission events. | No (system-wide) |
+
+## Recommended Order for Frontend Analysis
+
+**1. Start with off-CPU analysis** — identifies what's blocking Tokio workers:
+```bash
+bpftrace -p $FRONTEND_PID traces/offcputime.bt
+```
+Look for `futex_wait_queue` (mutex contention), `ep_poll` (normal I/O), `schedule_timeout` (timers).
+
+**2. Syscall latency** — find the expensive syscalls:
+```bash
+bpftrace -p $FRONTEND_PID traces/syscall_latency.bt
+```
+`futex` with high avg latency = lock contention. `writev` = TCP send overhead.
+
+**3. Run-queue latency** — check if threads are starved for CPU:
+```bash
+bpftrace -p $FRONTEND_PID traces/runqlat.bt
+```
+p99 > 1ms means CPU contention is contributing to tail latency.
+
+**4. TCP connection lifetimes** — verify connection reuse:
+```bash
+bpftrace traces/tcplife.bt
+```
+Many short-lived connections (< 100ms) to localhost = connection pooling opportunity (Part 3 bottleneck).
+
+**5. Context switches** — quantify scheduling overhead:
+```bash
+bpftrace -p $FRONTEND_PID traces/context_switches.bt
+```
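Step 3 above uses a p99 threshold, but `runqlat.bt` prints a power-of-two histogram rather than percentiles. The sketch below, assuming a hypothetical helper named `p99_bucket` and input containing a single bpftrace `hist()` block, reports the bucket that holds the 99th percentile. Because `runqlat.bt` records microseconds, a result of `[1K, 2K)` or higher corresponds to the 1 ms threshold mentioned above.

```bash
# Hypothetical helper (not part of this commit).
# Reads one bpftrace hist() block on stdin and prints the label of the
# bucket containing the 99th percentile. Feed it a single interval's
# histogram; mixing several intervals would skew the cumulative count.
p99_bucket() {
    awk '
        /^\[/ {
            split($0, parts, "|")           # parts[1] = "[lo, hi)   count "
            n = split(parts[1], tok, " ")   # the last token is the count
            count[++buckets] = tok[n]
            label[buckets] = tok[1]
            for (i = 2; i < n; i++) label[buckets] = label[buckets] " " tok[i]
            total += tok[n]
        }
        END {
            if (total == 0) exit 1
            for (b = 1; b <= buckets; b++) {
                cum += count[b]
                # Integer comparison avoids floating-point rounding at the boundary.
                if (cum * 100 >= 99 * total) { print label[b]; exit 0 }
            }
        }
    '
}
```

Possible usage: save one printed histogram to a file, then run `p99_bucket < histogram.txt`.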
+
+## Interpreting Results
+
+### Off-CPU Stacks
+```
+@blocked_us[tokio-runtime-w,
+    futex_wait_queue        ← mutex/condvar contention
+    futex / do_futex
+    __x64_sys_futex
+    entry_SYSCALL_64
+]: [1ms, 10ms) = 4812
+```
+- `futex_wait_queue` → mutex blocked (check Prometheus registry, TCP endpoint table)
+- `ep_poll` → epoll_wait (normal Tokio I/O loop — healthy)
+- `schedule_timeout` → timer/sleep
+- `do_wait` → waiting for a child process to change state (`wait4`/`waitid`)
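The frame-to-cause mapping in the list above can be applied mechanically when skimming many stacks. A minimal sketch, assuming a hypothetical helper named `classify_frame`; it covers only the first three frames from the list and accepts the `+offset` suffix that bpftrace appends to kernel symbols:

```bash
# Hypothetical helper (not part of this commit): map the top kernel frame
# of an off-CPU stack to the likely cause listed in this README.
classify_frame() {
    case "$1" in
        futex_wait_queue*) echo "lock contention (mutex/condvar)" ;;
        ep_poll*)          echo "idle in epoll_wait (healthy I/O loop)" ;;
        schedule_timeout*) echo "timer or sleep" ;;
        *)                 echo "unclassified" ;;
    esac
}
```

Possible usage: `classify_frame "futex_wait_queue+143"`.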
+
+### Syscall Latency
+```
+@slow[futex]: count=6504, avg=110ms, total=717s
+```
+High `futex` total = lock contention dominates. Cross-reference with off-CPU stacks.
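`syscall_latency.bt` keys its maps by the raw syscall number (`args.id`), so its output shows numbers rather than names such as `futex`. The sketch below, assuming hypothetical helpers named `syscall_name_x86_64` and `avg_ms`, translates the numbers for the syscalls named in this README and reproduces the average from a count and a total. The numbers are valid for x86_64 only; other architectures use different tables.

```bash
# Hypothetical helpers (not part of this commit).

# Translate an x86_64 syscall number into a name for the syscalls discussed here.
syscall_name_x86_64() {
    case "$1" in
        1)   echo write ;;
        20)  echo writev ;;
        46)  echo sendmsg ;;
        202) echo futex ;;
        232) echo epoll_wait ;;
        *)   echo "syscall_$1" ;;
    esac
}

# Average latency in milliseconds from a call count and a total in seconds.
avg_ms() {
    awk -v c="$1" -v t="$2" 'BEGIN { if (c == 0) exit 1; printf "%.0f\n", t * 1000 / c }'
}
```

With the figures from the example above, `avg_ms 6504 717` gives 110, which matches the reported `avg=110ms`.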
+
+### TCP Lifetimes
+```
+PID   COMM        LADDR     LPORT  RADDR     RPORT  TX_KB  RX_KB  MS
+12345 tokio-run   127.0.0.1 43210  127.0.0.1 8081   0      128    45
+```
+Many connections with lifetime < 100ms to mocker ports = no connection pooling.
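The `tcplife.bt` script in this commit prints one line per closed connection ending in `lifetime=<N>ms`. Counting the short-lived ones can be sketched as follows, assuming a hypothetical helper named `count_short_lived` and the 100 ms threshold used above as its default:

```bash
# Hypothetical helper (not part of this commit).
# Reads tcplife.bt output on stdin and counts connections whose lifetime
# is below the threshold in milliseconds (default: 100).
count_short_lived() {
    local threshold_ms="${1:-100}"
    awk -v limit="$threshold_ms" '
        match($0, /lifetime=[0-9]+ms/) {
            # Skip the 9 characters of "lifetime=" and drop the trailing "ms".
            ms = substr($0, RSTART + 9, RLENGTH - 11) + 0
            total++
            if (ms < limit) nshort++
        }
        END { printf "%d of %d connections lived < %d ms\n", nshort, total, limit }
    '
}
```

Possible usage: `count_short_lived 100 < tcplife.txt`, where `tcplife.txt` is the file written by batch mode.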
+
+## Directory Layout
+
+```
+bpf/
+├── run.sh          # Script runner with capability detection
+├── setup.sh        # Install bpftrace, configure kernel, grant caps
+├── README.md
+└── traces/         # bpftrace probe scripts (.bt files)
+    ├── runqlat.bt
+    ├── cpudist.bt
+    ├── offcputime.bt
+    ├── funclatency.bt
+    ├── transport_latency.bt
+    ├── tcplife.bt
+    ├── tcpretrans.bt
+    ├── syscall_latency.bt
+    └── context_switches.bt
+```
+
+## Integration with Capture Script
+
+The main capture script can run BPF traces automatically:
+```bash
+# Include BPF traces in the full capture (requires root)
+sudo bash benchmarks/frontend/scripts/full_observability_run_perf.sh \
+  --skip-nsys --skip-perf \
+  --model Qwen/Qwen3-0.6B --concurrency 64 --num-requests 4096
+
+# Compare event planes with BPF tracing:
+sudo bash full_observability_run_perf.sh \
+  --skip-nsys --skip-perf \
+  --model Qwen/Qwen3-0.6B --concurrency 64 --num-requests 4096 \
+  --event-plane zmq
+
+sudo bash full_observability_run_perf.sh \
+  --skip-nsys --skip-perf \
+  --model Qwen/Qwen3-0.6B --concurrency 64 --num-requests 4096 \
+  --event-plane nats
+
+# BPF output appears in artifacts/obs_<timestamp>/bpf/
+```
+
+## Requirements
+
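+- Linux kernel >= 4.18 (the minimum that `run.sh --check` accepts); running without root needs >= 5.8, where `CAP_BPF` and `CAP_PERFMON` were introduced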
+- `bpftrace` >= 0.16
+- Root or `CAP_BPF + CAP_PERFMON + CAP_NET_ADMIN + CAP_SYS_PTRACE` capabilities
+- Kernel headers (for some kprobe scripts): `apt install linux-headers-$(uname -r)`
--- a/benchmarks/frontend/scripts/bpf/run.sh
+++ b/benchmarks/frontend/scripts/bpf/run.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# BPF script runner with capability detection.
+#
+# Usage:
+#   ./run.sh --pid 12345 runqlat             # run specific script
+#   ./run.sh --pid 12345 --batch --output-dir /tmp/bpf --duration 30
+#   ./run.sh --list                          # list available scripts
+#   ./run.sh --check                         # check capabilities
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+PID=""
+DURATION="30"
+BATCH=false
+OUTPUT_DIR=""
+
+# Default set of scripts for batch mode (safe subset that works with --pid)
+BATCH_SCRIPTS=(runqlat syscall_latency transport_latency context_switches offcputime cpudist)
+
+# Available scripts
+SCRIPTS=(
+    "runqlat:CPU run queue latency"
+    "cpudist:On-CPU time distribution"
+    "offcputime:Off-CPU stack traces"
+    "funclatency:Function latency (template)"
+    "transport_latency:Socket read/write latency"
+    "tcplife:TCP connection lifetimes"
+    "tcpretrans:TCP retransmissions"
+    "syscall_latency:Top slow syscalls"
+    "context_switches:Context switch counts"
+)
+
+check_capabilities() {
+    echo "=== BPF Capability Check ==="
+    local ok=true
+
+    if ! command -v bpftrace &>/dev/null; then
+        echo "FAIL: bpftrace not found (install: apt install bpftrace)"
+        ok=false
+    else
+        echo "OK:   bpftrace $(bpftrace --version 2>/dev/null | head -1)"
+    fi
+
+    if [[ $(id -u) -ne 0 ]]; then
+        echo "WARN: Not running as root. BPF scripts require CAP_BPF + CAP_PERFMON."
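+        # Capabilities can come from the current process (the "Current:" line
+        # of capsh) or from file capabilities that setup.sh --caps sets on the
+        # bpftrace binary (reported by getcap). Accept either source.
+        # The IAB line uses !cap_xxx for *denied* caps, so grepping the full
+        # capsh output would false-positive on those negated entries.
+        local current_caps="" file_caps="" bpftrace_bin cap
+        if command -v capsh &>/dev/null; then
+            current_caps=$(capsh --print 2>/dev/null | grep '^Current:' || true)
+        fi
+        bpftrace_bin=$(command -v bpftrace 2>/dev/null || true)
+        if [[ -n "$bpftrace_bin" ]] && command -v getcap &>/dev/null; then
+            file_caps=$(getcap "$(readlink -f "$bpftrace_bin")" 2>/dev/null || true)
+        fi
+        if ! command -v capsh &>/dev/null && ! command -v getcap &>/dev/null; then
+            echo "WARN: neither capsh nor getcap found, cannot check capabilities"
+        else
+            for cap in cap_bpf cap_perfmon; do
+                if [[ "$current_caps $file_caps" == *"$cap"* ]]; then
+                    echo "OK:   ${cap^^} available"
+                else
+                    echo "FAIL: ${cap^^} not available"
+                    ok=false
+                fi
+            done
+        fi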
+    else
+        echo "OK:   Running as root"
+    fi
+
+    local kernel_ver
+    kernel_ver=$(uname -r | cut -d. -f1-2)
+    local major minor
+    major=$(echo "$kernel_ver" | cut -d. -f1)
+    minor=$(echo "$kernel_ver" | cut -d. -f2)
+    if [[ $major -gt 4 ]] || { [[ $major -eq 4 ]] && [[ $minor -ge 18 ]]; }; then
+        echo "OK:   Kernel $(uname -r) (>= 4.18 required)"
+    else
+        echo "FAIL: Kernel $(uname -r) (>= 4.18 required)"
+        ok=false
+    fi
+
+    if [[ "$ok" == true ]]; then
+        echo ""
+        echo "All checks passed. Ready to trace."
+    else
+        echo ""
+        echo "Some checks failed. Fix issues above before tracing."
+        return 1
+    fi
+}
+
+list_scripts() {
+    echo "Available BPF scripts:"
+    echo ""
+    for entry in "${SCRIPTS[@]}"; do
+        local name="${entry%%:*}"
+        local desc="${entry#*:}"
+        printf "  %-25s %s\n" "$name" "$desc"
+    done
+}
+
+run_script() {
+    local name=$1
+    local script="${SCRIPT_DIR}/traces/${name}.bt"
+
+    if [[ ! -f "$script" ]]; then
+        echo "ERROR: Script not found: $script"
+        echo "Available scripts:"
+        list_scripts
+        return 1
+    fi
+
+    local args=()
+    if [[ -n "$PID" ]]; then
+        args+=(-p "$PID")
+    fi
+
+    echo "Running: bpftrace ${args[*]} $script"
+    echo "Press Ctrl-C to stop."
+    echo ""
+    exec bpftrace "${args[@]}" "$script"
+}
+
+run_batch() {
+    # Run multiple BPF scripts in parallel, capturing output to files.
+    # Called by run_perf.sh as a single background job.
+    # This function waits for all children internally and handles signal
+    # forwarding (TERM→INT) so bpftrace flushes its aggregation maps.
+    if [[ -z "$OUTPUT_DIR" ]]; then
+        echo "ERROR: --batch requires --output-dir" >&2
+        exit 1
+    fi
+    if [[ -z "$PID" ]]; then
+        echo "ERROR: --batch requires --pid" >&2
+        exit 1
+    fi
+
+    mkdir -p "$OUTPUT_DIR"
+
+    local pids=()
+
+    # Forward SIGTERM/SIGINT to all bpftrace children as SIGINT so they flush.
+    # Then wait for them to finish writing output before exiting.
+    trap 'for p in "${pids[@]}"; do kill -INT "$p" 2>/dev/null || true; done; wait 2>/dev/null; exit 0' TERM INT
+
+    for script_name in "${BATCH_SCRIPTS[@]}"; do
+        local bt_file="${SCRIPT_DIR}/traces/${script_name}.bt"
+        if [[ ! -f "$bt_file" ]]; then
+            echo "  [bpf] SKIP $script_name (not found)" >&2
+            continue
+        fi
+        echo "  [bpf] $script_name (duration=${DURATION}s)" >&2
+        # SIGINT (not SIGTERM) so bpftrace flushes aggregation maps before exiting
+        timeout --signal=INT "$DURATION" bpftrace -p "$PID" "$bt_file" \
+            > "$OUTPUT_DIR/${script_name}.txt" 2>&1 &
+        pids+=($!)
+    done
+
+    # Also run system-wide scripts (no --pid) that are safe
+    for script_name in tcplife tcpretrans; do
+        local bt_file="${SCRIPT_DIR}/traces/${script_name}.bt"
+        if [[ -f "$bt_file" ]]; then
+            echo "  [bpf] $script_name (system-wide, duration=${DURATION}s)" >&2
+            timeout --signal=INT "$DURATION" bpftrace "$bt_file" \
+                > "$OUTPUT_DIR/${script_name}.txt" 2>&1 &
+            pids+=($!)
+        fi
+    done
+
+    # Wait for all bpftrace processes to complete (timeout or signal).
+    # This keeps run.sh alive so the caller can kill us to stop everything.
+    for p in "${pids[@]}"; do
+        wait "$p" 2>/dev/null || true
+    done
+    trap - TERM INT
+}
+
+# Parse arguments
+while [[ $# -gt 0 ]]; do
+    case $1 in
+        --pid|-p)        PID="$2"; shift 2 ;;
+        --duration|-d)   DURATION="$2"; shift 2 ;;
+        --output-dir|-o) OUTPUT_DIR="$2"; shift 2 ;;
+        --batch)         BATCH=true; shift ;;
+        --check)         check_capabilities; exit $? ;;
+        --list|-l)       list_scripts; exit 0 ;;
+        -h|--help)
+            echo "Usage: $0 [OPTIONS] [SCRIPT_NAME]"
+            echo ""
+            echo "Single script mode:"
+            echo "  $0 --pid PID SCRIPT_NAME"
+            echo ""
+            echo "Batch mode (for capture integration):"
+            echo "  $0 --batch --pid PID --output-dir DIR [--duration SECS]"
+            echo ""
+            echo "Options:"
+            echo "  --pid PID          Attach to specific process"
+            echo "  --batch            Run all scripts in parallel"
+            echo "  --output-dir DIR   Write output files to DIR (batch mode)"
+            echo "  --duration SECS    Timeout per script (default: 30)"
+            echo "  --check            Check BPF capabilities"
+            echo "  --list             List available scripts"
+            echo ""
+            list_scripts
+            exit 0
+            ;;
+        *)  break ;;
+    esac
+done
+
+if [[ "$BATCH" == true ]]; then
+    run_batch
+    exit 0
+fi
+
+if [[ $# -eq 0 ]]; then
+    echo "ERROR: No script specified."
+    echo ""
+    list_scripts
+    exit 1
+fi
+
+run_script "$1"
--- a/benchmarks/frontend/scripts/bpf/setup.sh
+++ b/benchmarks/frontend/scripts/bpf/setup.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# BPF tracing setup script.
+#
+# Installs bpftrace, configures kernel permissions, and optionally grants
+# capabilities so bpftrace can run without sudo.
+#
+# Usage:
+#   sudo bash setup.sh              # full setup (install + kernel + capabilities)
+#   sudo bash setup.sh --check      # check current state only
+#   sudo bash setup.sh --install    # install bpftrace only
+#   sudo bash setup.sh --kernel     # configure kernel permissions only
+#   sudo bash setup.sh --caps       # grant bpftrace capabilities only
+#   sudo bash setup.sh --reset      # remove capabilities and restore kernel defaults
+
+set -euo pipefail
+
+# ─── Colors ─────────────────────────────────────────────────────────────────
+
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[0;33m'
+CYAN='\033[0;36m'
+NC='\033[0m' # No Color
+
+ok()   { echo -e "  ${GREEN}✓${NC} $*"; }
+warn() { echo -e "  ${YELLOW}⚠${NC} $*"; }
+fail() { echo -e "  ${RED}✗${NC} $*"; }
+info() { echo -e "  ${CYAN}→${NC} $*"; }
+
+# ─── Check functions ────────────────────────────────────────────────────────
+
+check_root() {
+    if [[ $EUID -ne 0 ]]; then
+        fail "This script must be run as root (sudo bash setup.sh)"
+        exit 1
+    fi
+}
+
+check_bpftrace_installed() {
+    if command -v bpftrace &>/dev/null; then
+        local ver
+        ver=$(bpftrace --version 2>/dev/null | head -1)
+        ok "bpftrace installed: $ver"
+        return 0
+    else
+        fail "bpftrace not installed"
+        return 1
+    fi
+}
+
+check_kernel_permissions() {
+    local paranoid kptr issues=0
+
+    paranoid=$(cat /proc/sys/kernel/perf_event_paranoid 2>/dev/null || echo "unknown")
+    kptr=$(cat /proc/sys/kernel/kptr_restrict 2>/dev/null || echo "unknown")
+
+    if [[ "$paranoid" == "-1" ]]; then
+        ok "perf_event_paranoid = $paranoid (unrestricted)"
+    elif [[ "$paranoid" -le 1 ]] 2>/dev/null; then
+        warn "perf_event_paranoid = $paranoid (limited — set to -1 for full access)"
+        issues=1
+    else
+        fail "perf_event_paranoid = $paranoid (restricted — BPF probes may fail)"
+        issues=1
+    fi
+
+    if [[ "$kptr" == "0" ]]; then
+        ok "kptr_restrict = $kptr (kernel symbols visible)"
+    else
+        warn "kptr_restrict = $kptr (kernel symbols hidden — set to 0 for stack symbolization)"
+        issues=1
+    fi
+
+    return $issues
+}
+
+check_capabilities() {
+    local bpftrace_path
+    bpftrace_path=$(command -v bpftrace 2>/dev/null || true)
+
+    if [[ -z "$bpftrace_path" ]]; then
+        fail "bpftrace not found — cannot check capabilities"
+        return 1
+    fi
+
+    local caps
+    caps=$(getcap "$bpftrace_path" 2>/dev/null || true)
+
+    if [[ -n "$caps" ]]; then
+        ok "Capabilities set: $caps"
+        return 0
+    else
+        warn "No capabilities set on $bpftrace_path (requires sudo to run)"
+        return 1
+    fi
+}
+
+check_debugfs() {
+    if mountpoint -q /sys/kernel/debug 2>/dev/null; then
+        ok "debugfs mounted at /sys/kernel/debug"
+        return 0
+    else
+        warn "debugfs not mounted (some tracepoints may be unavailable)"
+        return 1
+    fi
+}
+
+check_tracefs() {
+    if [[ -d /sys/kernel/tracing ]] || [[ -d /sys/kernel/debug/tracing ]]; then
+        ok "tracefs available"
+        return 0
+    else
+        warn "tracefs not found (tracepoint-based scripts may fail)"
+        return 1
+    fi
+}
+
+run_check() {
+    echo ""
+    echo "BPF Tracing Environment Check"
+    echo "=============================="
+    echo ""
+
+    echo "Installation:"
+    check_bpftrace_installed || true
+    echo ""
+
+    echo "Kernel permissions:"
+    check_kernel_permissions || true
+    echo ""
+
+    echo "Capabilities:"
+    check_capabilities || true
+    echo ""
+
+    echo "Kernel interfaces:"
+    check_debugfs || true
+    check_tracefs || true
+    echo ""
+
+    # Quick smoke test
+    echo "Smoke test:"
+    if command -v bpftrace &>/dev/null; then
+        if bpftrace -e 'BEGIN { printf("ok\n"); exit(); }' &>/dev/null; then
+            ok "bpftrace can execute probes"
+        else
+            fail "bpftrace probe execution failed (check permissions)"
+        fi
+    else
+        fail "Cannot run smoke test — bpftrace not installed"
+    fi
+    echo ""
+}
+
+# ─── Install ────────────────────────────────────────────────────────────────
+
+install_bpftrace() {
+    echo ""
+    echo "Installing bpftrace"
+    echo "==================="
+    echo ""
+
+    if command -v bpftrace &>/dev/null; then
+        ok "bpftrace already installed ($(bpftrace --version 2>/dev/null | head -1))"
+        return 0
+    fi
+
+    # Detect package manager
+    if command -v apt-get &>/dev/null; then
+        info "Using apt (Debian/Ubuntu)"
+        apt-get update -qq
+        apt-get install -y -qq bpftrace linux-headers-"$(uname -r)" 2>/dev/null || \
+            apt-get install -y -qq bpftrace
+        ok "bpftrace installed via apt"
+    elif command -v dnf &>/dev/null; then
+        info "Using dnf (Fedora/RHEL)"
+        dnf install -y bpftrace kernel-devel-"$(uname -r)" 2>/dev/null || \
+            dnf install -y bpftrace
+        ok "bpftrace installed via dnf"
+    elif command -v yum &>/dev/null; then
+        info "Using yum (CentOS/RHEL)"
+        yum install -y bpftrace kernel-devel-"$(uname -r)" 2>/dev/null || \
+            yum install -y bpftrace
+        ok "bpftrace installed via yum"
+    elif command -v pacman &>/dev/null; then
+        info "Using pacman (Arch)"
+        pacman -S --noconfirm bpftrace
+        ok "bpftrace installed via pacman"
+    else
+        fail "No supported package manager found. Install bpftrace manually:"
+        echo "    https://github.com/bpftrace/bpftrace/blob/master/INSTALL.md"
+        return 1
+    fi
+}
+
+# ─── Kernel permissions ─────────────────────────────────────────────────────
+
+configure_kernel() {
+    echo ""
+    echo "Configuring kernel permissions"
+    echo "=============================="
+    echo ""
+
+    info "These settings are temporary and revert on reboot."
+    echo ""
+
+    # perf_event_paranoid
+    local current_paranoid
+    current_paranoid=$(cat /proc/sys/kernel/perf_event_paranoid)
+    if [[ "$current_paranoid" != "-1" ]]; then
+        echo -1 > /proc/sys/kernel/perf_event_paranoid
+        ok "perf_event_paranoid: $current_paranoid → -1"
+    else
+        ok "perf_event_paranoid already -1"
+    fi
+
+    # kptr_restrict
+    local current_kptr
+    current_kptr=$(cat /proc/sys/kernel/kptr_restrict)
+    if [[ "$current_kptr" != "0" ]]; then
+        echo 0 > /proc/sys/kernel/kptr_restrict
+        ok "kptr_restrict: $current_kptr → 0"
+    else
+        ok "kptr_restrict already 0"
+    fi
+
+    # Mount debugfs if needed
+    if ! mountpoint -q /sys/kernel/debug 2>/dev/null; then
+        mount -t debugfs none /sys/kernel/debug 2>/dev/null && \
+            ok "Mounted debugfs" || \
+            warn "Could not mount debugfs"
+    fi
+
+    echo ""
+    info "To make persistent across reboots, add to /etc/sysctl.conf:"
+    echo "    kernel.perf_event_paranoid = -1"
+    echo "    kernel.kptr_restrict = 0"
+}
+
+# ─── Capabilities ───────────────────────────────────────────────────────────
+
+grant_capabilities() {
+    echo ""
+    echo "Granting bpftrace capabilities"
+    echo "==============================="
+    echo ""
+
+    local bpftrace_path
+    bpftrace_path=$(command -v bpftrace 2>/dev/null || true)
+
+    if [[ -z "$bpftrace_path" ]]; then
+        fail "bpftrace not found — install it first"
+        return 1
+    fi
+
+    # Resolve symlinks to get the real binary
+    bpftrace_path=$(readlink -f "$bpftrace_path")
+    info "Binary: $bpftrace_path"
+
+    # Required capabilities for BPF tracing
+    local caps="cap_bpf,cap_perfmon,cap_net_admin,cap_sys_ptrace+ep"
+
+    setcap "$caps" "$bpftrace_path"
+    ok "Capabilities granted: $caps"
+    echo ""
+    info "bpftrace can now run WITHOUT sudo."
+    info "Capabilities persist until the binary is updated/reinstalled."
+    echo ""
+    info "To verify:  getcap $bpftrace_path"
+    info "To remove:  sudo setcap -r $bpftrace_path"
+}
+
+# ─── Reset ──────────────────────────────────────────────────────────────────
+
+reset_all() {
+    echo ""
+    echo "Resetting BPF configuration"
+    echo "==========================="
+    echo ""
+
+    # Remove capabilities
+    local bpftrace_path
+    bpftrace_path=$(command -v bpftrace 2>/dev/null || true)
+    if [[ -n "$bpftrace_path" ]]; then
+        bpftrace_path=$(readlink -f "$bpftrace_path")
+        setcap -r "$bpftrace_path" 2>/dev/null && \
+            ok "Capabilities removed from $bpftrace_path" || \
+            warn "No capabilities to remove"
+    fi
+
+    # Restore restrictive values. 4 and 1 are the Ubuntu defaults; upstream
+    # kernels default to perf_event_paranoid=2 and kptr_restrict=0.
+    echo 4 > /proc/sys/kernel/perf_event_paranoid 2>/dev/null && \
+        ok "perf_event_paranoid → 4 (Ubuntu default)" || \
+        warn "Could not restore perf_event_paranoid"
+
+    echo 1 > /proc/sys/kernel/kptr_restrict 2>/dev/null && \
+        ok "kptr_restrict → 1 (Ubuntu default)" || \
+        warn "Could not restore kptr_restrict"
+
+    echo ""
+    ok "Reset complete. bpftrace now requires sudo again."
+}
+
+# ─── Full setup ─────────────────────────────────────────────────────────────
+
+full_setup() {
+    echo ""
+    echo "╔══════════════════════════════════════════════════╗"
+    echo "║       BPF Tracing Setup                         ║"
+    echo "╚══════════════════════════════════════════════════╝"
+
+    install_bpftrace
+    configure_kernel
+    grant_capabilities
+
+    echo ""
+    echo "Setup complete. Running verification..."
+    run_check
+}
+
+# ─── Usage ──────────────────────────────────────────────────────────────────
+
+usage() {
+    cat <<'EOF'
+BPF Tracing Setup
+
+Usage:
+  sudo bash setup.sh              Full setup (install + kernel + capabilities)
+  sudo bash setup.sh --check      Check current environment
+  sudo bash setup.sh --install    Install bpftrace only
+  sudo bash setup.sh --kernel     Configure kernel permissions (temporary)
+  sudo bash setup.sh --caps       Grant bpftrace capabilities (run without sudo)
+  sudo bash setup.sh --reset      Remove capabilities and restore kernel defaults
+
+After setup, run BPF scripts without sudo:
+  bpftrace -p <PID> scripts/bpf/traces/offcputime.bt
+  bpftrace -p <PID> scripts/bpf/traces/syscall_latency.bt
+  ./scripts/bpf/run.sh --pid <PID> offcputime
+EOF
+}
+
+# ─── Main ───────────────────────────────────────────────────────────────────
+
+main() {
+    case "${1:-}" in
+        --check)
+            run_check
+            ;;
+        --install)
+            check_root
+            install_bpftrace
+            ;;
+        --kernel)
+            check_root
+            configure_kernel
+            ;;
+        --caps)
+            check_root
+            grant_capabilities
+            ;;
+        --reset)
+            check_root
+            reset_all
+            ;;
+        -h|--help)
+            usage
+            ;;
+        "")
+            check_root
+            full_setup
+            ;;
+        *)
+            fail "Unknown option: $1"
+            echo ""
+            usage
+            exit 1
+            ;;
+    esac
+}
+
+main "$@"
--- a/benchmarks/frontend/scripts/bpf/traces/context_switches.bt
+++ b/benchmarks/frontend/scripts/bpf/traces/context_switches.bt
+#!/usr/bin/env bpftrace
+// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+//
+// Context switch counts per thread (voluntary vs. involuntary).
+// High context switch rates on tokio workers indicate contention or starvation.
+//
+// Usage: sudo bpftrace context_switches.bt
+//        sudo bpftrace -p <PID> context_switches.bt
+
+tracepoint:sched:sched_switch
+/comm == "tokio-runtime-w" || args.prev_comm == "tokio-runtime-w"/
+{
+    if (args.prev_state == 0) {
+        // Involuntary: was TASK_RUNNING but got preempted
+        @involuntary[args.prev_comm] = count();
+    } else {
+        // Voluntary: thread yielded or blocked
+        @voluntary[args.prev_comm] = count();
+    }
+}
+
+interval:s:5
+{
+    printf("\n--- Context Switches (last 5s) ---\n");
+    printf("\nVoluntary (thread yielded/blocked):\n");
+    print(@voluntary, 10);
+    printf("\nInvoluntary (thread preempted while running):\n");
+    print(@involuntary, 10);
+    clear(@voluntary);
+    clear(@involuntary);
+}
+
+END
+{
+    clear(@voluntary);
+    clear(@involuntary);
+}
--- a/benchmarks/frontend/scripts/bpf/traces/cpudist.bt
+++ b/benchmarks/frontend/scripts/bpf/traces/cpudist.bt
+#!/usr/bin/env bpftrace
+// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+//
+// On-CPU time distribution per thread.
+// Shows how long threads run before being preempted or yielding.
+//
+// Usage: sudo bpftrace cpudist.bt
+
+tracepoint:sched:sched_switch
+{
+    // Record when new thread starts running
+    @start[args.next_pid] = nsecs;
+
+    // Calculate how long prev thread was on-CPU
+    $prev_start = @start[args.prev_pid];
+    if ($prev_start) {
+        @usecs[comm] = hist((nsecs - $prev_start) / 1000);
+        delete(@start[args.prev_pid]);
+    }
+}
+
+interval:s:10
+{
+    printf("\n--- On-CPU Time Distribution (us) by thread ---\n");
+    print(@usecs);
+    clear(@usecs);
+}
+
+END
+{
+    clear(@start);
+}
--- a/benchmarks/frontend/scripts/bpf/traces/funclatency.bt
+++ b/benchmarks/frontend/scripts/bpf/traces/funclatency.bt
+#!/usr/bin/env bpftrace
+// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+//
+// Latency of specific functions (requires debug symbols / profiling build).
+// Measures function entry-to-exit time.
+//
+// Usage:
+//   # Trace a specific function by name (requires DWARF symbols):
+//   sudo bpftrace -p <PID> -e '
+//     uprobe:/path/to/binary:function_name { @start[tid] = nsecs; }
+//     uretprobe:/path/to/binary:function_name /@start[tid]/ {
+//       @latency_us = hist((nsecs - @start[tid]) / 1000);
+//       delete(@start[tid]);
+//     }'
+//
+//   # Or uncomment and edit the example probes further down in this script, then run:
+//   sudo bpftrace -p <PID> funclatency.bt
+//
+// Build target with symbols:
+//   cargo build --profile profiling --features nvtx
+
+BEGIN
+{
+    printf("Tracing function latency... Hit Ctrl-C to end.\n");
+    printf("Note: this script is a template. Edit the probe target in the file before use.\n");
+    printf("bpftrace does not substitute environment variables into probe specs.\n");
+}
+
+// Generic uprobe placeholder — edit the probe target as needed.
+// bpftrace doesn't support env var substitution in probe specs,
+// so this serves as a template. Use the one-liner above for actual tracing.
+
+// Example for tokenizer:
+// uprobe:target/profiling/dynamo-frontend:*gather_tokens* { @start[tid] = nsecs; }
+// uretprobe:target/profiling/dynamo-frontend:*gather_tokens* /@start[tid]/ {
+//   @latency_us["gather_tokens"] = hist((nsecs - @start[tid]) / 1000);
+//   delete(@start[tid]);
+// }
+
+interval:s:1
+{
+    printf(".");
+}
+
+END
+{
+    printf("\nDone.\n");
+}
--- a/benchmarks/frontend/scripts/bpf/traces/offcputime.bt
+++ b/benchmarks/frontend/scripts/bpf/traces/offcputime.bt
+#!/usr/bin/env bpftrace
+// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+//
+// Stack traces of sleeping threads (mutex, I/O, futex, socket waits).
+// Identifies what threads are blocked on.
+//
+// Usage: sudo bpftrace offcputime.bt
+// Filter to specific PID: sudo bpftrace -p <PID> offcputime.bt
+
+tracepoint:sched:sched_switch
+{
+    if (args.prev_state != 0) {
+        // Thread is going off-CPU (not TASK_RUNNING). This probe runs in the
+        // context of the outgoing thread, so kstack here is its blocking stack.
+        @off[args.prev_pid, args.prev_comm] = nsecs;
+        @off_stack[args.prev_pid] = kstack;
+    }
+}
+
+tracepoint:sched:sched_switch
+{
+    $key = @off[args.next_pid, args.next_comm];
+    if ($key) {
+        $delta = nsecs - $key;
+        // Only record if off-CPU > 1ms (filter noise)
+        if ($delta > 1000000) {
+            // Key by the stack saved at switch-out time. Calling kstack here
+            // would capture the outgoing thread's stack, not the waking one's.
+            @blocked_us[args.next_comm, @off_stack[args.next_pid]] = hist($delta / 1000);
+        }
+        delete(@off[args.next_pid, args.next_comm]);
+        delete(@off_stack[args.next_pid]);
+    }
+}
+
+interval:s:15
+{
+    printf("\n--- Off-CPU Time (us) by thread + kernel stack ---\n");
+    print(@blocked_us, 10);
+    clear(@blocked_us);
+}
+
+END
+{
+    clear(@off);
+    clear(@off_stack);
+}
--- a/benchmarks/frontend/scripts/bpf/traces/runqlat.bt
+++ b/benchmarks/frontend/scripts/bpf/traces/runqlat.bt
+#!/usr/bin/env bpftrace
+// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+//
+// CPU run queue latency histogram.
+// Proves OS-level CPU contention. Healthy: <10us p99.
+//
+// Usage: sudo bpftrace runqlat.bt
+
+tracepoint:sched:sched_wakeup,
+tracepoint:sched:sched_wakeup_new
+{
+    @qtime[args.pid] = nsecs;
+}
+
+tracepoint:sched:sched_switch
+{
+    if (args.prev_state == 0) {
+        // Involuntary switch: prev is still runnable (TASK_RUNNING), so it re-enters the run queue now
+        @qtime[args.prev_pid] = nsecs;
+    }
+
+    $ns = @qtime[args.next_pid];
+    if ($ns) {
+        @usecs = hist((nsecs - $ns) / 1000);
+        delete(@qtime[args.next_pid]);
+    }
+}
+
+interval:s:5
+{
+    printf("\n--- Run Queue Latency (us) ---\n");
+    print(@usecs);
+    clear(@usecs);
+}
+
+END
+{
+    clear(@qtime);
+}
--- a/benchmarks/frontend/scripts/bpf/traces/syscall_latency.bt
+++ b/benchmarks/frontend/scripts/bpf/traces/syscall_latency.bt
+#!/usr/bin/env bpftrace
+// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+//
+// Top slow syscalls (write, epoll_wait, futex, sendmsg).
+// Identifies which syscalls are consuming time in the frontend.
+//
+// Usage: sudo bpftrace syscall_latency.bt
+//        sudo bpftrace -p <PID> syscall_latency.bt
+
+tracepoint:raw_syscalls:sys_enter
+/comm == "tokio-runtime-w" || comm == "dynamo-frontend"/
+{
+    @start[tid] = nsecs;
+    @syscall_id[tid] = args.id;
+}
+
+tracepoint:raw_syscalls:sys_exit
+/@start[tid]/
+{
+    $delta = nsecs - @start[tid];
+    $id = @syscall_id[tid];
+
+    // Only record syscalls taking > 10us
+    if ($delta > 10000) {
+        @latency_us[$id] = hist($delta / 1000);
+        @slow_count[$id] = count();
+    }
+
+    delete(@start[tid]);
+    delete(@syscall_id[tid]);
+}
+
+interval:s:10
+{
+    printf("\n--- Slow Syscall Latency (us) by syscall number ---\n");
+    print(@latency_us);
+    printf("\n--- Slow Syscall Count ---\n");
+    print(@slow_count, 10);
+    clear(@latency_us);
+    clear(@slow_count);
+}
+
+END
+{
+    clear(@start);
+    clear(@syscall_id);
+}
--- a/benchmarks/frontend/scripts/bpf/traces/tcplife.bt
+++ b/benchmarks/frontend/scripts/bpf/traces/tcplife.bt
+#!/usr/bin/env bpftrace
+// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+//
+// TCP connection lifetimes (pool health).
+// Short-lived connections indicate pool churn; long-lived indicate healthy pooling.
+//
+// Usage: sudo bpftrace tcplife.bt
+
+#include <net/sock.h>
+
+kprobe:tcp_set_state
+{
+    $sk = (struct sock *)arg0;
+    $newstate = arg1;
+
+    if ($newstate == 1) {
+        // TCP_ESTABLISHED
+        @birth[$sk] = nsecs;
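+        // IPv4 addresses only; IPv6 sockets would need skc_v6_rcv_saddr / skc_v6_daddr
+        @saddr[$sk] = ntop($sk->__sk_common.skc_rcv_saddr);
+        @daddr[$sk] = ntop($sk->__sk_common.skc_daddr);
+        // skc_dport is in network byte order; swap bytes to host order
+        // (assumes a little-endian host)
+        @dport[$sk] = ($sk->__sk_common.skc_dport >> 8) |
+            (($sk->__sk_common.skc_dport & 0xff) << 8);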
+    }
+
+    if ($newstate == 7) {
+        // TCP_CLOSE
+        $start = @birth[$sk];
+        if ($start) {
+            $delta_ms = (nsecs - $start) / 1000000;
+            printf("%-16s %-6d %-16s -> %-16s:%-5d lifetime=%dms\n",
+                comm, pid,
+                @saddr[$sk], @daddr[$sk], @dport[$sk],
+                $delta_ms);
+            @lifetime_ms = hist($delta_ms);
+        }
+        delete(@birth[$sk]);
+        delete(@saddr[$sk]);
+        delete(@daddr[$sk]);
+        delete(@dport[$sk]);
+    }
+}
+
+END
+{
+    printf("\n--- TCP Connection Lifetime (ms) ---\n");
+    print(@lifetime_ms);
+    clear(@birth);
+    clear(@saddr);
+    clear(@daddr);
+    clear(@dport);
+}
--- a/benchmarks/frontend/scripts/bpf/traces/tcpretrans.bt
+++ b/benchmarks/frontend/scripts/bpf/traces/tcpretrans.bt
+#!/usr/bin/env bpftrace
+// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+//
+// TCP retransmissions (network backpressure indicator).
+// High retransmit rates indicate network congestion or packet loss.
+//
+// Usage: sudo bpftrace tcpretrans.bt
+
+tracepoint:tcp:tcp_retransmit_skb
+{
+    printf("%-8d %-16s %s:%d -> %s:%d state=%d\n",
+        pid, comm,
+        ntop(args.saddr), args.sport,
+        ntop(args.daddr), args.dport,
+        args.state);
+    @retrans[ntop(args.daddr), args.dport] = count();
+    @total = count();
+}
+
+interval:s:10
+{
+    printf("\n--- TCP Retransmissions (last 10s) ---\n");
+    // print() already labels the map ("@total: N"), so no extra prefix is needed
+    print(@total);
+    print(@retrans);
+    clear(@retrans);
+    clear(@total);
+}
--- a/benchmarks/frontend/scripts/bpf/traces/transport_latency.bt
+++ b/benchmarks/frontend/scripts/bpf/traces/transport_latency.bt
+#!/usr/bin/env bpftrace
+// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+//
+// Socket read/write latencies (transport-agnostic: traces sendmsg/sendto/write
+// and recvmsg/recvfrom/read). Works for both the TCP and NATS transports, since
+// both go through kernel sockets. Note: read/write also cover non-socket fds.
+//
+// Usage: sudo bpftrace transport_latency.bt
+//        sudo bpftrace -p <PID> transport_latency.bt
+
+tracepoint:syscalls:sys_enter_sendmsg,
+tracepoint:syscalls:sys_enter_sendto,
+tracepoint:syscalls:sys_enter_write
+/comm == "tokio-runtime-w" || comm == "dynamo-frontend"/
+{
+    @send_start[tid] = nsecs;
+}
+
+tracepoint:syscalls:sys_exit_sendmsg,
+tracepoint:syscalls:sys_exit_sendto,
+tracepoint:syscalls:sys_exit_write
+/@send_start[tid]/
+{
+    $delta = nsecs - @send_start[tid];
+    @send_latency_us = hist($delta / 1000);
+    delete(@send_start[tid]);
+}
+
+tracepoint:syscalls:sys_enter_recvmsg,
+tracepoint:syscalls:sys_enter_recvfrom,
+tracepoint:syscalls:sys_enter_read
+/comm == "tokio-runtime-w" || comm == "dynamo-frontend"/
+{
+    @recv_start[tid] = nsecs;
+}
+
+tracepoint:syscalls:sys_exit_recvmsg,
+tracepoint:syscalls:sys_exit_recvfrom,
+tracepoint:syscalls:sys_exit_read
+/@recv_start[tid]/
+{
+    $delta = nsecs - @recv_start[tid];
+    @recv_latency_us = hist($delta / 1000);
+    delete(@recv_start[tid]);
+}
+
+interval:s:5
+{
+    printf("\n--- Send Latency (us) ---\n");
+    print(@send_latency_us);
+    printf("\n--- Recv Latency (us) ---\n");
+    print(@recv_latency_us);
+    clear(@send_latency_us);
+    clear(@recv_latency_us);
+}
+
+END
+{
+    clear(@send_start);
+    clear(@recv_start);
+}