Commit 35a796a3, authored by Biswa Panda, committed by GitHub

feat(perf): add BPF for frontend perf tracing (#6737)

parent 63245237
<!-- SPDX-License-Identifier: Apache-2.0 -->
# BPF Tracing Scripts
eBPF/bpftrace scripts for low-overhead kernel-level tracing of the Dynamo frontend. These scripts attach to kernel tracepoints and kprobes to measure scheduling, syscall, TCP, and context-switch behavior without modifying the application.
## Setup
```bash
# Full setup: install bpftrace, configure kernel, grant capabilities
sudo bash setup.sh
# Or step by step:
sudo bash setup.sh --install # install bpftrace
sudo bash setup.sh --kernel # set perf_event_paranoid=-1, kptr_restrict=0
sudo bash setup.sh --caps # grant capabilities (run bpftrace without sudo)
# Check current state
sudo bash setup.sh --check
# Undo everything
sudo bash setup.sh --reset
```
After granting capabilities, bpftrace runs **without sudo**.
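To confirm the grant took effect, the file capabilities on the bpftrace binary can be compared against the four capabilities that `setup.sh --caps` sets. The sketch below is a hypothetical helper (`missing_caps` is not part of this commit) that takes one line of `getcap` output and prints whichever capabilities are absent. It assumes `getcap` prints names such as `cap_bpf` literally.

```bash
# Hypothetical helper, not part of setup.sh or run.sh.
# Prints the capabilities from the setup.sh grant list that do not appear
# in the given line of getcap output (an empty line means none are missing).
missing_caps() {
  local getcap_line=$1 cap
  local missing=""
  for cap in cap_bpf cap_perfmon cap_net_admin cap_sys_ptrace; do
    if [[ "$getcap_line" != *"$cap"* ]]; then
      missing="${missing:+$missing }$cap"
    fi
  done
  printf '%s\n' "$missing"
}

# Usage:
#   missing_caps "$(getcap "$(readlink -f "$(command -v bpftrace)")")"
```

Empty output suggests the binary carries all four capabilities; any names printed are the ones to re-grant.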
## Quick Start
```bash
# Get the frontend PID from a running capture
FRONTEND_PID=$(pgrep -f "dynamo.frontend" | head -1)
# Run a single script
./run.sh --pid $FRONTEND_PID offcputime
# List all available scripts
./run.sh --list
# Check if BPF environment is ready
./run.sh --check
```
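If the frontend is not running, `pgrep` prints nothing and `$FRONTEND_PID` ends up empty, in which case `run.sh` attaches without a PID. A small guard can catch that early. The helper name `first_pid` below is hypothetical and only illustrates the idea.

```bash
# Hypothetical helper: read PIDs (one per line) on stdin and print the first,
# failing when the input is empty or not numeric.
first_pid() {
  local pid
  IFS= read -r pid || true
  [[ "$pid" =~ ^[0-9]+$ ]] || return 1
  printf '%s\n' "$pid"
}

# Usage, assuming the frontend command line matches "dynamo.frontend":
#   FRONTEND_PID=$(pgrep -f "dynamo.frontend" | first_pid) || {
#     echo "frontend is not running" >&2; exit 1; }
#   ./run.sh --pid "$FRONTEND_PID" offcputime
```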
## Scripts
| Script | What it measures | Attach to PID? |
|--------|-----------------|----------------|
| `offcputime.bt` | Kernel stacks of threads going off-CPU (>1ms). Shows why threads block (futex, epoll, I/O). | Yes |
| `syscall_latency.bt` | Slow syscalls (>10us) by syscall ID, filtered to Tokio workers. | Yes |
| `runqlat.bt` | Scheduler run-queue latency — how long threads wait to be scheduled after wakeup. | Yes |
| `context_switches.bt` | Voluntary and involuntary context-switch counts per thread, printed every 5 s. | Yes |
| `cpudist.bt` | On-CPU time distribution per thread (how long a thread runs before being preempted). | Yes |
| `funclatency.bt` | Latency histogram for a specific kernel/user function (template — edit the probe). | Yes |
| `transport_latency.bt` | Socket read/write latency via syscall tracepoints. | Yes |
| `tcplife.bt` | TCP connection lifetimes — shows short-lived connections wasting setup cost. | No (system-wide) |
| `tcpretrans.bt` | TCP retransmission events. | No (system-wide) |
## Recommended Order for Frontend Analysis
**1. Start with off-CPU analysis** — identifies what's blocking Tokio workers:
```bash
bpftrace -p $FRONTEND_PID offcputime.bt
```
Look for `futex_wait_queue` (mutex contention), `ep_poll` (normal I/O), `schedule_timeout` (timers).
**2. Syscall latency** — find the expensive syscalls:
```bash
bpftrace -p $FRONTEND_PID syscall_latency.bt
```
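The script keys its output by syscall number, not name, so map the numbers back using the syscall table for your architecture. A `futex` entry with high average latency indicates lock contention; `writev` indicates TCP send overhead.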
**3. Run-queue latency** — check if threads are starved for CPU:
```bash
bpftrace -p $FRONTEND_PID runqlat.bt
```
p99 > 1ms means CPU contention is contributing to tail latency.
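One way to check this threshold without reading the bars by eye is to estimate the percentile from the histogram text. The sketch below assumes bpftrace's usual `hist()` layout: one power-of-two bucket per line, written as `[lo, hi)` or `[n]`, followed by a count, with `K`/`M`/`G` suffixes meaning powers of 1024. `hist_percentile` is a hypothetical helper, and its answer is only as precise as the bucket boundaries.

```bash
# Hypothetical helper: print the upper bound of the bucket that contains
# percentile PCT, reading bpftrace hist() output on stdin.
# Buckets with the same bounds are summed, so several print windows merge.
hist_percentile() {
  awk -v pct="$1" '
    # Convert a bucket bound such as "512", "2K" or "1M" to a number.
    function bound(s,   n) {
      n = s + 0
      if (s ~ /K$/) n *= 1024
      else if (s ~ /M$/) n *= 1024 * 1024
      else if (s ~ /G$/) n *= 1024 * 1024 * 1024
      return n
    }
    /^\[/ {
      close_at = index($0, ")")
      if (close_at == 0) close_at = index($0, "]")
      n = split(substr($0, 2, close_at - 2), range, /, */)
      split(substr($0, close_at + 1), rest, " ")
      u = bound(range[n])
      count[u] += rest[1]
      total += rest[1]
      if (u > max) max = u
    }
    END {
      if (total == 0) exit 1
      seen = count[0]
      if (seen * 100 >= total * pct) { print 0; exit 0 }
      # Walk the power-of-two bounds in ascending order.
      for (b = 1; b <= max; b *= 2) {
        seen += count[b]
        if (seen * 100 >= total * pct) { print b; exit 0 }
      }
    }
  '
}
```

For example, `hist_percentile 99 < runqlat.txt` printing a value above 1000 would correspond to the p99 > 1ms condition, since `runqlat.bt` reports microseconds.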
**4. TCP connection lifetimes** — verify connection reuse:
```bash
bpftrace tcplife.bt
```
Many short-lived connections (< 100ms) to localhost = connection pooling opportunity (Part 3 bottleneck).
**5. Context switches** — quantify scheduling overhead:
```bash
bpftrace -p $FRONTEND_PID context_switches.bt
```
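A rough way to quantify the result is the share of involuntary switches, since `context_switches.bt` counts a switch as involuntary when the thread was still runnable. The sketch below assumes the count maps print as `@voluntary[comm]: N` and `@involuntary[comm]: N`; `involuntary_pct` is a hypothetical helper that sums every window in the captured output.

```bash
# Hypothetical helper: print the percentage of context switches that were
# involuntary, reading context_switches.bt output on stdin.
involuntary_pct() {
  awk '
    /^@voluntary\[/   { v += $NF }
    /^@involuntary\[/ { i += $NF }
    END {
      if (v + i == 0) exit 1
      printf "%d\n", 100 * i / (v + i)
    }
  '
}

# Usage:
#   involuntary_pct < context_switches.txt
```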
## Interpreting Results
### Off-CPU Stacks
```
@blocked_us[tokio-runtime-w,
futex_wait_queue ← mutex/condvar contention
futex / do_futex
__x64_sys_futex
entry_SYSCALL_64
]: [1ms, 10ms) = 4812
```
- `futex_wait_queue` → mutex blocked (check Prometheus registry, TCP endpoint table)
- `ep_poll` → epoll_wait (normal Tokio I/O loop — healthy)
- `schedule_timeout` → timer/sleep
- `do_wait` → join handle or channel receive
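The frame-to-cause mapping above can be applied mechanically when there are many stacks to sort through. This is a minimal sketch: `classify_stack` and its category labels are hypothetical, and it simply reports the first frame that matches one of the four patterns listed.

```bash
# Hypothetical helper: read kernel stack frames (one per line) on stdin and
# print a coarse category based on the first recognised frame.
classify_stack() {
  local frame
  while IFS= read -r frame; do
    case "$frame" in
      *futex_wait_queue*) echo "lock-contention"; return 0 ;;
      *ep_poll*)          echo "epoll-wait";      return 0 ;;
      *schedule_timeout*) echo "timer-sleep";     return 0 ;;
      *do_wait*)          echo "do-wait";         return 0 ;;
    esac
  done
  echo "other"
}
```

Stacks classified as `epoll-wait` can usually be set aside, which leaves the contention and timer stacks for closer inspection.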
### Syscall Latency
```
@slow[futex]: count=6504, avg=110ms, total=717s
```
High `futex` total = lock contention dominates. Cross-reference with off-CPU stacks.
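Because `syscall_latency.bt` keys its maps by `args.id`, the raw output shows syscall numbers rather than names. A small lookup makes it easier to read. The numbers below are the x86_64 values for the syscalls this README mentions; other architectures use different numbers, and `syscall_name` is a hypothetical helper.

```bash
# Hypothetical helper: translate an x86_64 syscall number to its name.
# Only the syscalls relevant to this README are listed.
syscall_name() {
  case "$1" in
    0)   echo read ;;
    1)   echo write ;;
    20)  echo writev ;;
    44)  echo sendto ;;
    45)  echo recvfrom ;;
    46)  echo sendmsg ;;
    47)  echo recvmsg ;;
    202) echo futex ;;
    232) echo epoll_wait ;;
    281) echo epoll_pwait ;;
    *)   echo "syscall_$1" ;;
  esac
}
```

On a machine with the audit tools installed, `ausyscall --dump` gives the full table.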
### TCP Lifetimes
```
PID COMM LADDR LPORT RADDR RPORT TX_KB RX_KB MS
12345 tokio-run 127.0.0.1 43210 127.0.0.1 8081 0 128 45
```
Many connections with lifetime < 100ms to mocker ports = no connection pooling.
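To put a number on "many", the per-connection lines can be counted against the threshold. The sketch below assumes the `lifetime=<N>ms` suffix that the `printf` in `tcplife.bt` emits for each closed connection; `count_short_lived` is a hypothetical helper and 100 ms is just the threshold used above.

```bash
# Hypothetical helper: print "<short>/<total>" connections, where short means
# a lifetime below LIMIT_MS (default 100), reading tcplife.bt output on stdin.
count_short_lived() {
  awk -v limit="${1:-100}" '
    match($0, /lifetime=[0-9]+ms/) {
      # Skip the 9-character "lifetime=" prefix and the trailing "ms".
      ms = substr($0, RSTART + 9, RLENGTH - 11) + 0
      total++
      if (ms < limit) brief++
    }
    END { printf "%d/%d\n", brief, total }
  '
}

# Usage:
#   count_short_lived 100 < tcplife.txt
```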
## Directory Layout
```
bpf/
├── run.sh                    # Script runner with capability detection
├── setup.sh                  # Install bpftrace, configure kernel, grant caps
├── README.md
└── traces/                   # bpftrace probe scripts (.bt files)
    ├── runqlat.bt
    ├── cpudist.bt
    ├── offcputime.bt
    ├── funclatency.bt
    ├── transport_latency.bt
    ├── tcplife.bt
    ├── tcpretrans.bt
    ├── syscall_latency.bt
    └── context_switches.bt
```
## Integration with Capture Script
The main capture script can run BPF traces automatically:
```bash
# Include BPF traces in the full capture (requires root)
sudo bash benchmarks/frontend/scripts/full_observability_run_perf.sh \
--skip-nsys --skip-perf \
--model Qwen/Qwen3-0.6B --concurrency 64 --num-requests 4096
# Compare event planes with BPF tracing:
sudo bash full_observability_run_perf.sh \
--skip-nsys --skip-perf \
--model Qwen/Qwen3-0.6B --concurrency 64 --num-requests 4096 \
--event-plane zmq
sudo bash full_observability_run_perf.sh \
--skip-nsys --skip-perf \
--model Qwen/Qwen3-0.6B --concurrency 64 --num-requests 4096 \
--event-plane nats
# BPF output appears in artifacts/obs_<timestamp>/bpf/
```
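In batch mode `run.sh` writes one `<script>.txt` file per trace into the output directory, so a quick pass over that directory shows which traces produced data. The sketch below is a hypothetical helper (`summarize_bpf_outputs` is not part of the capture script) that flags empty files, which often point to a probe that failed to attach.

```bash
# Hypothetical helper: list each trace output in DIR and whether it is empty.
summarize_bpf_outputs() {
  local dir=$1 f
  for f in "$dir"/*.txt; do
    [[ -e "$f" ]] || continue          # no matches: the glob stays literal
    if [[ -s "$f" ]]; then
      printf '%-5s %s\n' ok "$(basename "$f" .txt)"
    else
      printf '%-5s %s\n' empty "$(basename "$f" .txt)"
    fi
  done
}

# Usage (the timestamped directory name is a placeholder):
#   summarize_bpf_outputs "artifacts/obs_<timestamp>/bpf"
```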
## Requirements
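- Linux kernel >= 4.18 (the minimum that `run.sh --check` accepts); running without root relies on `CAP_BPF` and `CAP_PERFMON`, which were introduced in Linux 5.8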
- `bpftrace` >= 0.16
- Root or `CAP_BPF + CAP_PERFMON + CAP_NET_ADMIN + CAP_SYS_PTRACE` capabilities
- Kernel headers (for some kprobe scripts): `apt install linux-headers-$(uname -r)`
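The kernel requirement can be checked with a comparison on the release string, similar to what `run.sh --check` does inline. This is a minimal sketch: `kernel_at_least` is a hypothetical helper that takes the release string as an argument so it can be tried against any value.

```bash
# Hypothetical helper: succeed when RELEASE (for example the output of
# `uname -r`) is at least MAJOR.MINOR.
kernel_at_least() {
  local release=$1 want_major=$2 want_minor=$3 major minor
  major=${release%%.*}          # "5.15.0-91-generic" -> "5"
  minor=${release#*.}           # -> "15.0-91-generic"
  minor=${minor%%[!0-9]*}       # -> "15"
  (( major > want_major || (major == want_major && minor >= want_minor) ))
}

# Usage:
#   kernel_at_least "$(uname -r)" 4 18 || echo "kernel too old" >&2
```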
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# BPF script runner with capability detection.
#
# Usage:
# ./run.sh --pid 12345 runqlat # run specific script
# ./run.sh --pid 12345 --batch --output-dir /tmp/bpf --duration 30
# ./run.sh --list # list available scripts
# ./run.sh --check # check capabilities
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PID=""
DURATION="30"
BATCH=false
OUTPUT_DIR=""
# Default set of scripts for batch mode (safe subset that works with --pid)
BATCH_SCRIPTS=(runqlat syscall_latency transport_latency context_switches offcputime cpudist)
# Available scripts
SCRIPTS=(
"runqlat:CPU run queue latency"
"cpudist:On-CPU time distribution"
"offcputime:Off-CPU stack traces"
"funclatency:Function latency (template)"
"transport_latency:Socket read/write latency"
"tcplife:TCP connection lifetimes"
"tcpretrans:TCP retransmissions"
"syscall_latency:Top slow syscalls"
"context_switches:Context switch histograms"
)
check_capabilities() {
echo "=== BPF Capability Check ==="
local ok=true
if ! command -v bpftrace &>/dev/null; then
echo "FAIL: bpftrace not found (install: apt install bpftrace)"
ok=false
else
echo "OK: bpftrace $(bpftrace --version 2>/dev/null | head -1)"
fi
if [[ $(id -u) -ne 0 ]]; then
echo "WARN: Not running as root. BPF scripts require CAP_BPF + CAP_PERFMON."
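# Capabilities can come from two places: this shell's own effective set
# (the "Current:" line of capsh) or file capabilities on the bpftrace
# binary, which is what `setup.sh --caps` grants via setcap.
# Only the "Current:" line of capsh is used; the IAB line lists *denied*
# caps as !cap_xxx, so grepping the full output would false-positive.
local current_caps="" file_caps="" bpftrace_bin cap
if command -v capsh &>/dev/null; then
current_caps=$(capsh --print 2>/dev/null | grep '^Current:' || true)
fi
bpftrace_bin=$(command -v bpftrace 2>/dev/null || true)
if [[ -n "$bpftrace_bin" ]] && command -v getcap &>/dev/null; then
file_caps=$(getcap "$(readlink -f "$bpftrace_bin")" 2>/dev/null || true)
fi
if [[ -z "$current_caps$file_caps" ]]; then
echo "WARN: neither capsh nor getcap reported capabilities, cannot check"
else
for cap in cap_bpf cap_perfmon; do
if echo "$current_caps $file_caps" | grep -q "$cap"; then
echo "OK: ${cap^^} available"
else
echo "FAIL: ${cap^^} not available"
ok=false
fi
done
fi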
else
echo "OK: Running as root"
fi
local kernel_ver
kernel_ver=$(uname -r | cut -d. -f1-2)
local major minor
major=$(echo "$kernel_ver" | cut -d. -f1)
minor=$(echo "$kernel_ver" | cut -d. -f2)
if [[ $major -gt 4 ]] || { [[ $major -eq 4 ]] && [[ $minor -ge 18 ]]; }; then
echo "OK: Kernel $(uname -r) (>= 4.18 required)"
else
echo "FAIL: Kernel $(uname -r) (>= 4.18 required)"
ok=false
fi
if [[ "$ok" == true ]]; then
echo ""
echo "All checks passed. Ready to trace."
else
echo ""
echo "Some checks failed. Fix issues above before tracing."
return 1
fi
}
list_scripts() {
echo "Available BPF scripts:"
echo ""
for entry in "${SCRIPTS[@]}"; do
local name="${entry%%:*}"
local desc="${entry#*:}"
printf " %-25s %s\n" "$name" "$desc"
done
}
run_script() {
local name=$1
local script="${SCRIPT_DIR}/traces/${name}.bt"
if [[ ! -f "$script" ]]; then
echo "ERROR: Script not found: $script"
echo "Available scripts:"
list_scripts
return 1
fi
local args=()
if [[ -n "$PID" ]]; then
args+=(-p "$PID")
fi
echo "Running: bpftrace ${args[*]} $script"
echo "Press Ctrl-C to stop."
echo ""
exec bpftrace "${args[@]}" "$script"
}
run_batch() {
# Run multiple BPF scripts in parallel, capturing output to files.
# Called by run_perf.sh as a single background job.
# This function waits for all children internally and handles signal
# forwarding (TERM→INT) so bpftrace flushes its aggregation maps.
if [[ -z "$OUTPUT_DIR" ]]; then
echo "ERROR: --batch requires --output-dir" >&2
exit 1
fi
if [[ -z "$PID" ]]; then
echo "ERROR: --batch requires --pid" >&2
exit 1
fi
mkdir -p "$OUTPUT_DIR"
local pids=()
# Forward SIGTERM/SIGINT to all bpftrace children as SIGINT so they flush.
# Then wait for them to finish writing output before exiting.
trap 'for p in "${pids[@]}"; do kill -INT "$p" 2>/dev/null || true; done; wait 2>/dev/null; exit 0' TERM INT
for script_name in "${BATCH_SCRIPTS[@]}"; do
local bt_file="${SCRIPT_DIR}/traces/${script_name}.bt"
if [[ ! -f "$bt_file" ]]; then
echo " [bpf] SKIP $script_name (not found)" >&2
continue
fi
echo " [bpf] $script_name (duration=${DURATION}s)" >&2
# SIGINT (not SIGTERM) so bpftrace flushes aggregation maps before exiting
timeout --signal=INT "$DURATION" bpftrace -p "$PID" "$bt_file" \
> "$OUTPUT_DIR/${script_name}.txt" 2>&1 &
pids+=($!)
done
# Also run system-wide scripts (no --pid) that are safe
for script_name in tcplife tcpretrans; do
local bt_file="${SCRIPT_DIR}/traces/${script_name}.bt"
if [[ -f "$bt_file" ]]; then
echo " [bpf] $script_name (system-wide, duration=${DURATION}s)" >&2
timeout --signal=INT "$DURATION" bpftrace "$bt_file" \
> "$OUTPUT_DIR/${script_name}.txt" 2>&1 &
pids+=($!)
fi
done
# Wait for all bpftrace processes to complete (timeout or signal).
# This keeps run.sh alive so the caller can kill us to stop everything.
for p in "${pids[@]}"; do
wait "$p" 2>/dev/null || true
done
trap - TERM INT
}
# Parse arguments
while [[ $# -gt 0 ]]; do
case $1 in
--pid|-p) PID="$2"; shift 2 ;;
--duration|-d) DURATION="$2"; shift 2 ;;
--output-dir|-o) OUTPUT_DIR="$2"; shift 2 ;;
--batch) BATCH=true; shift ;;
--check) check_capabilities; exit $? ;;
--list|-l) list_scripts; exit 0 ;;
-h|--help)
echo "Usage: $0 [OPTIONS] [SCRIPT_NAME]"
echo ""
echo "Single script mode:"
echo " $0 --pid PID SCRIPT_NAME"
echo ""
echo "Batch mode (for capture integration):"
echo " $0 --batch --pid PID --output-dir DIR [--duration SECS]"
echo ""
echo "Options:"
echo " --pid PID Attach to specific process"
echo " --batch Run all scripts in parallel"
echo " --output-dir DIR Write output files to DIR (batch mode)"
echo " --duration SECS Timeout per script (default: 30)"
echo " --check Check BPF capabilities"
echo " --list List available scripts"
echo ""
list_scripts
exit 0
;;
*) break ;;
esac
done
if [[ "$BATCH" == true ]]; then
run_batch
exit 0
fi
if [[ $# -eq 0 ]]; then
echo "ERROR: No script specified."
echo ""
list_scripts
exit 1
fi
run_script "$1"
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# BPF tracing setup script.
#
# Installs bpftrace, configures kernel permissions, and optionally grants
# capabilities so bpftrace can run without sudo.
#
# Usage:
# sudo bash setup.sh # full setup (install + kernel + capabilities)
# sudo bash setup.sh --check # check current state only
# sudo bash setup.sh --install # install bpftrace only
# sudo bash setup.sh --kernel # configure kernel permissions only
# sudo bash setup.sh --caps # grant bpftrace capabilities only
# sudo bash setup.sh --reset # remove capabilities and restore kernel defaults
set -euo pipefail
# ─── Colors ─────────────────────────────────────────────────────────────────
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[0;33m'
CYAN='\033[0;36m'
NC='\033[0m' # No Color
ok() { echo -e " ${GREEN}[OK]${NC} $*"; }
warn() { echo -e " ${YELLOW}[WARN]${NC} $*"; }
fail() { echo -e " ${RED}[FAIL]${NC} $*"; }
info() { echo -e " ${CYAN}[INFO]${NC} $*"; }
# ─── Check functions ────────────────────────────────────────────────────────
check_root() {
if [[ $EUID -ne 0 ]]; then
fail "This script must be run as root (sudo bash setup.sh)"
exit 1
fi
}
check_bpftrace_installed() {
if command -v bpftrace &>/dev/null; then
local ver
ver=$(bpftrace --version 2>/dev/null | head -1)
ok "bpftrace installed: $ver"
return 0
else
fail "bpftrace not installed"
return 1
fi
}
check_kernel_permissions() {
local paranoid kptr issues=0
paranoid=$(cat /proc/sys/kernel/perf_event_paranoid 2>/dev/null || echo "unknown")
kptr=$(cat /proc/sys/kernel/kptr_restrict 2>/dev/null || echo "unknown")
if [[ "$paranoid" == "-1" ]]; then
ok "perf_event_paranoid = $paranoid (unrestricted)"
elif [[ "$paranoid" -le 1 ]] 2>/dev/null; then
warn "perf_event_paranoid = $paranoid (limited — set to -1 for full access)"
issues=1
else
fail "perf_event_paranoid = $paranoid (restricted — BPF probes may fail)"
issues=1
fi
if [[ "$kptr" == "0" ]]; then
ok "kptr_restrict = $kptr (kernel symbols visible)"
else
warn "kptr_restrict = $kptr (kernel symbols hidden — set to 0 for stack symbolization)"
issues=1
fi
return $issues
}
check_capabilities() {
local bpftrace_path
bpftrace_path=$(command -v bpftrace 2>/dev/null || true)
if [[ -z "$bpftrace_path" ]]; then
fail "bpftrace not found — cannot check capabilities"
return 1
fi
local caps
caps=$(getcap "$bpftrace_path" 2>/dev/null || true)
if [[ -n "$caps" ]]; then
ok "Capabilities set: $caps"
return 0
else
warn "No capabilities set on $bpftrace_path (requires sudo to run)"
return 1
fi
}
check_debugfs() {
if mountpoint -q /sys/kernel/debug 2>/dev/null; then
ok "debugfs mounted at /sys/kernel/debug"
return 0
else
warn "debugfs not mounted (some tracepoints may be unavailable)"
return 1
fi
}
check_tracefs() {
if [[ -d /sys/kernel/tracing ]] || [[ -d /sys/kernel/debug/tracing ]]; then
ok "tracefs available"
return 0
else
warn "tracefs not found (tracepoint-based scripts may fail)"
return 1
fi
}
run_check() {
echo ""
echo "BPF Tracing Environment Check"
echo "=============================="
echo ""
echo "Installation:"
check_bpftrace_installed || true
echo ""
echo "Kernel permissions:"
check_kernel_permissions || true
echo ""
echo "Capabilities:"
check_capabilities || true
echo ""
echo "Kernel interfaces:"
check_debugfs || true
check_tracefs || true
echo ""
# Quick smoke test
echo "Smoke test:"
if command -v bpftrace &>/dev/null; then
if bpftrace -e 'BEGIN { printf("ok\n"); exit(); }' &>/dev/null; then
ok "bpftrace can execute probes"
else
fail "bpftrace probe execution failed (check permissions)"
fi
else
fail "Cannot run smoke test — bpftrace not installed"
fi
echo ""
}
# ─── Install ────────────────────────────────────────────────────────────────
install_bpftrace() {
echo ""
echo "Installing bpftrace"
echo "==================="
echo ""
if command -v bpftrace &>/dev/null; then
ok "bpftrace already installed ($(bpftrace --version 2>/dev/null | head -1))"
return 0
fi
# Detect package manager
if command -v apt-get &>/dev/null; then
info "Using apt (Debian/Ubuntu)"
apt-get update -qq
apt-get install -y -qq bpftrace linux-headers-"$(uname -r)" 2>/dev/null || \
apt-get install -y -qq bpftrace
ok "bpftrace installed via apt"
elif command -v dnf &>/dev/null; then
info "Using dnf (Fedora/RHEL)"
dnf install -y bpftrace kernel-devel-"$(uname -r)" 2>/dev/null || \
dnf install -y bpftrace
ok "bpftrace installed via dnf"
elif command -v yum &>/dev/null; then
info "Using yum (CentOS/RHEL)"
yum install -y bpftrace kernel-devel-"$(uname -r)" 2>/dev/null || \
yum install -y bpftrace
ok "bpftrace installed via yum"
elif command -v pacman &>/dev/null; then
info "Using pacman (Arch)"
pacman -S --noconfirm bpftrace
ok "bpftrace installed via pacman"
else
fail "No supported package manager found. Install bpftrace manually:"
echo " https://github.com/bpftrace/bpftrace/blob/master/INSTALL.md"
return 1
fi
}
# ─── Kernel permissions ─────────────────────────────────────────────────────
configure_kernel() {
echo ""
echo "Configuring kernel permissions"
echo "=============================="
echo ""
info "These settings are temporary and revert on reboot."
echo ""
# perf_event_paranoid
local current_paranoid
current_paranoid=$(cat /proc/sys/kernel/perf_event_paranoid)
if [[ "$current_paranoid" != "-1" ]]; then
echo -1 > /proc/sys/kernel/perf_event_paranoid
ok "perf_event_paranoid: $current_paranoid → -1"
else
ok "perf_event_paranoid already -1"
fi
# kptr_restrict
local current_kptr
current_kptr=$(cat /proc/sys/kernel/kptr_restrict)
if [[ "$current_kptr" != "0" ]]; then
echo 0 > /proc/sys/kernel/kptr_restrict
ok "kptr_restrict: $current_kptr → 0"
else
ok "kptr_restrict already 0"
fi
# Mount debugfs if needed
if ! mountpoint -q /sys/kernel/debug 2>/dev/null; then
mount -t debugfs none /sys/kernel/debug 2>/dev/null && \
ok "Mounted debugfs" || \
warn "Could not mount debugfs"
fi
echo ""
info "To make persistent across reboots, add to /etc/sysctl.conf:"
echo " kernel.perf_event_paranoid = -1"
echo " kernel.kptr_restrict = 0"
}
# ─── Capabilities ───────────────────────────────────────────────────────────
grant_capabilities() {
echo ""
echo "Granting bpftrace capabilities"
echo "==============================="
echo ""
local bpftrace_path
bpftrace_path=$(command -v bpftrace 2>/dev/null || true)
if [[ -z "$bpftrace_path" ]]; then
fail "bpftrace not found — install it first"
return 1
fi
# Resolve symlinks to get the real binary
bpftrace_path=$(readlink -f "$bpftrace_path")
info "Binary: $bpftrace_path"
# Required capabilities for BPF tracing
local caps="cap_bpf,cap_perfmon,cap_net_admin,cap_sys_ptrace+ep"
setcap "$caps" "$bpftrace_path"
ok "Capabilities granted: $caps"
echo ""
info "bpftrace can now run WITHOUT sudo."
info "Capabilities persist until the binary is updated/reinstalled."
echo ""
info "To verify: getcap $bpftrace_path"
info "To remove: sudo setcap -r $bpftrace_path"
}
# ─── Reset ──────────────────────────────────────────────────────────────────
reset_all() {
echo ""
echo "Resetting BPF configuration"
echo "==========================="
echo ""
# Remove capabilities
local bpftrace_path
bpftrace_path=$(command -v bpftrace 2>/dev/null || true)
if [[ -n "$bpftrace_path" ]]; then
bpftrace_path=$(readlink -f "$bpftrace_path")
setcap -r "$bpftrace_path" 2>/dev/null && \
ok "Capabilities removed from $bpftrace_path" || \
warn "No capabilities to remove"
fi
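# Restore restrictive values. 4 and 1 match the defaults of some
# distributions; upstream kernels default to perf_event_paranoid=2 and
# kptr_restrict=0, so adjust these if your system shipped other values.
echo 4 > /proc/sys/kernel/perf_event_paranoid 2>/dev/null && \
ok "perf_event_paranoid → 4 (restricted)" || \
warn "Could not restore perf_event_paranoid"
echo 1 > /proc/sys/kernel/kptr_restrict 2>/dev/null && \
ok "kptr_restrict → 1 (restricted)" || \
warn "Could not restore kptr_restrict"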
echo ""
ok "Reset complete. bpftrace now requires sudo again."
}
# ─── Full setup ─────────────────────────────────────────────────────────────
full_setup() {
echo ""
echo "╔══════════════════════════════════════════════════╗"
echo "║ BPF Tracing Setup ║"
echo "╚══════════════════════════════════════════════════╝"
install_bpftrace
configure_kernel
grant_capabilities
echo ""
echo "Setup complete. Running verification..."
run_check
}
# ─── Usage ──────────────────────────────────────────────────────────────────
usage() {
cat <<'EOF'
BPF Tracing Setup
Usage:
sudo bash setup.sh Full setup (install + kernel + capabilities)
sudo bash setup.sh --check Check current environment
sudo bash setup.sh --install Install bpftrace only
sudo bash setup.sh --kernel Configure kernel permissions (temporary)
sudo bash setup.sh --caps Grant bpftrace capabilities (run without sudo)
sudo bash setup.sh --reset Remove capabilities and restore kernel defaults
After setup, run BPF scripts without sudo:
bpftrace -p <PID> scripts/bpf/traces/offcputime.bt
bpftrace -p <PID> scripts/bpf/traces/syscall_latency.bt
./scripts/bpf/run.sh --pid <PID> offcputime
EOF
}
# ─── Main ───────────────────────────────────────────────────────────────────
main() {
case "${1:-}" in
--check)
run_check
;;
--install)
check_root
install_bpftrace
;;
--kernel)
check_root
configure_kernel
;;
--caps)
check_root
grant_capabilities
;;
--reset)
check_root
reset_all
;;
-h|--help)
usage
;;
"")
check_root
full_setup
;;
*)
fail "Unknown option: $1"
echo ""
usage
exit 1
;;
esac
}
main "$@"
#!/usr/bin/env bpftrace
// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
// SPDX-License-Identifier: Apache-2.0
//
// Context switch counts per thread, split into voluntary and involuntary.
// High context switch rates on tokio workers indicate contention or starvation.
//
// Usage: sudo bpftrace context_switches.bt
// sudo bpftrace -p <PID> context_switches.bt
tracepoint:sched:sched_switch
/comm == "tokio-runtime-w" || args.prev_comm == "tokio-runtime-w"/
{
if (args.prev_state == 0) {
// Involuntary: was TASK_RUNNING but got preempted
@involuntary[args.prev_comm] = count();
} else {
// Voluntary: thread yielded or blocked
@voluntary[args.prev_comm] = count();
}
}
interval:s:5
{
printf("\n--- Context Switches (last 5s) ---\n");
printf("\nVoluntary (thread yielded/blocked):\n");
print(@voluntary, 10);
printf("\nInvoluntary (thread preempted while running):\n");
print(@involuntary, 10);
clear(@voluntary);
clear(@involuntary);
}
END
{
clear(@voluntary);
clear(@involuntary);
}
#!/usr/bin/env bpftrace
// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
// SPDX-License-Identifier: Apache-2.0
//
// On-CPU time distribution per thread.
// Shows how long threads run before being preempted or yielding.
//
// Usage: sudo bpftrace cpudist.bt
tracepoint:sched:sched_switch
{
// Record when new thread starts running
@start[args.next_pid] = nsecs;
// Calculate how long prev thread was on-CPU
$prev_start = @start[args.prev_pid];
if ($prev_start) {
@usecs[comm] = hist((nsecs - $prev_start) / 1000);
delete(@start[args.prev_pid]);
}
}
interval:s:10
{
printf("\n--- On-CPU Time Distribution (us) by thread ---\n");
print(@usecs);
clear(@usecs);
}
END
{
clear(@start);
}
#!/usr/bin/env bpftrace
// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
// SPDX-License-Identifier: Apache-2.0
//
// Latency of specific functions (requires debug symbols / profiling build).
// Measures function entry-to-exit time.
//
// Usage:
// # Trace a specific function by name (requires an unstripped binary;
// # uprobes resolve names from the ELF symbol table, DWARF is not needed):
// sudo bpftrace -p <PID> -e '
// uprobe:/path/to/binary:function_name { @start[tid] = nsecs; }
// uretprobe:/path/to/binary:function_name /@start[tid]/ {
// @latency_us = hist((nsecs - @start[tid]) / 1000);
// delete(@start[tid]);
// }'
//
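// # This script is a template. bpftrace does not substitute environment
// # variables into probe specs, so fill in the commented uprobe/uretprobe
// # pair below with the binary path and function name, then run:
// sudo bpftrace funclatency.bt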
//
// Build target with symbols:
// cargo build --profile profiling --features nvtx
BEGIN
{
printf("Tracing function latency... Hit Ctrl-C to end.\n");
printf("Note: this script is a template. Edit the uprobe/uretprobe pair in the file before use.\n");
printf("bpftrace does not expand BINARY or FUNC environment variables in probe specs.\n");
}
// Generic uprobe placeholder — edit the probe target as needed.
// bpftrace doesn't support env var substitution in probe specs,
// so this serves as a template. Use the one-liner above for actual tracing.
// Example for tokenizer:
// uprobe:target/profiling/dynamo-frontend:*gather_tokens* { @start[tid] = nsecs; }
// uretprobe:target/profiling/dynamo-frontend:*gather_tokens* /@start[tid]/ {
// @latency_us["gather_tokens"] = hist((nsecs - @start[tid]) / 1000);
// delete(@start[tid]);
// }
interval:s:1
{
printf(".");
}
END
{
printf("\nDone.\n");
}
#!/usr/bin/env bpftrace
// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
// SPDX-License-Identifier: Apache-2.0
//
// Stack traces of sleeping threads (mutex, I/O, futex, socket waits).
// Identifies what threads are blocked on.
//
// Usage: sudo bpftrace offcputime.bt
// Filter to specific PID: sudo bpftrace -p <PID> offcputime.bt
tracepoint:sched:sched_switch
{
if (args.prev_state != 0) {
// Thread is going off-CPU (not TASK_RUNNING)
@off[args.prev_pid, args.prev_comm] = nsecs;
}
}
tracepoint:sched:sched_switch
{
$key = @off[args.next_pid, args.next_comm];
if ($key) {
$delta = nsecs - $key;
// Only record if off-CPU > 1ms (filter noise)
if ($delta > 1000000) {
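// NOTE: sched_switch fires in the context of the outgoing task, so kstack
// here is the stack of the thread being switched out, not of next_pid.
// To attribute stacks to the blocked thread itself, capture kstack in the
// switch-out probe above instead.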
@blocked_us[args.next_comm, kstack] = hist($delta / 1000);
}
delete(@off[args.next_pid, args.next_comm]);
}
}
interval:s:15
{
printf("\n--- Off-CPU Time (us) by thread + kernel stack ---\n");
print(@blocked_us, 10);
clear(@blocked_us);
}
END
{
clear(@off);
}
#!/usr/bin/env bpftrace
// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
// SPDX-License-Identifier: Apache-2.0
//
// CPU run queue latency histogram.
// Proves OS-level CPU contention. Healthy: <10us p99.
//
// Usage: sudo bpftrace runqlat.bt
tracepoint:sched:sched_wakeup,
tracepoint:sched:sched_wakeup_new
{
@qtime[args.pid] = nsecs;
}
tracepoint:sched:sched_switch
{
if (args.prev_state == 0) {
// Involuntary switch: prev is still TASK_RUNNING (preempted), so it goes
// back on the run queue. Treat this as an enqueue event.
@qtime[args.prev_pid] = nsecs;
}
$ns = @qtime[args.next_pid];
if ($ns) {
@usecs = hist((nsecs - $ns) / 1000);
delete(@qtime[args.next_pid]);
}
}
interval:s:5
{
printf("\n--- Run Queue Latency (us) ---\n");
print(@usecs);
clear(@usecs);
}
END
{
clear(@qtime);
}
#!/usr/bin/env bpftrace
// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
// SPDX-License-Identifier: Apache-2.0
//
// Top slow syscalls (write, epoll_wait, futex, sendmsg).
// Identifies which syscalls are consuming time in the frontend.
//
// Usage: sudo bpftrace syscall_latency.bt
// sudo bpftrace -p <PID> syscall_latency.bt
tracepoint:raw_syscalls:sys_enter
/comm == "tokio-runtime-w" || comm == "dynamo-frontend"/
{
@start[tid] = nsecs;
@syscall_id[tid] = args.id;
}
tracepoint:raw_syscalls:sys_exit
/@start[tid]/
{
$delta = nsecs - @start[tid];
$id = @syscall_id[tid];
// Only record syscalls taking > 10us
if ($delta > 10000) {
@latency_us[$id] = hist($delta / 1000);
@slow_count[$id] = count();
}
delete(@start[tid]);
delete(@syscall_id[tid]);
}
interval:s:10
{
printf("\n--- Slow Syscall Latency (us) by syscall number ---\n");
print(@latency_us);
printf("\n--- Slow Syscall Count ---\n");
print(@slow_count, 10);
clear(@latency_us);
clear(@slow_count);
}
END
{
clear(@start);
clear(@syscall_id);
}
#!/usr/bin/env bpftrace
// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
// SPDX-License-Identifier: Apache-2.0
//
// TCP connection lifetimes (pool health).
// Short-lived connections indicate pool churn; long-lived indicate healthy pooling.
//
// Usage: sudo bpftrace tcplife.bt
#include <net/sock.h>
kprobe:tcp_set_state
{
$sk = (struct sock *)arg0;
$newstate = arg1;
if ($newstate == 1) {
// TCP_ESTABLISHED
@birth[$sk] = nsecs;
@saddr[$sk] = ntop($sk->__sk_common.skc_rcv_saddr);
@daddr[$sk] = ntop($sk->__sk_common.skc_daddr);
@dport[$sk] = ($sk->__sk_common.skc_dport >> 8) |
(($sk->__sk_common.skc_dport & 0xff) << 8);
}
if ($newstate == 7) {
// TCP_CLOSE
$start = @birth[$sk];
if ($start) {
$delta_ms = (nsecs - $start) / 1000000;
printf("%-16s %-6d %-16s -> %-16s:%-5d lifetime=%dms\n",
comm, pid,
@saddr[$sk], @daddr[$sk], @dport[$sk],
$delta_ms);
@lifetime_ms = hist($delta_ms);
}
delete(@birth[$sk]);
delete(@saddr[$sk]);
delete(@daddr[$sk]);
delete(@dport[$sk]);
}
}
END
{
printf("\n--- TCP Connection Lifetime (ms) ---\n");
print(@lifetime_ms);
clear(@birth);
clear(@saddr);
clear(@daddr);
clear(@dport);
}
#!/usr/bin/env bpftrace
// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
// SPDX-License-Identifier: Apache-2.0
//
// TCP retransmissions (network backpressure indicator).
// High retransmit rates indicate network congestion or packet loss.
//
// Usage: sudo bpftrace tcpretrans.bt
tracepoint:tcp:tcp_retransmit_skb
{
printf("%-8d %-16s %s:%d -> %s:%d state=%d\n",
pid, comm,
ntop(args.saddr), args.sport,
ntop(args.daddr), args.dport,
args.state);
@retrans[ntop(args.daddr), args.dport] = count();
@total = count();
}
interval:s:10
{
printf("\n--- TCP Retransmissions (last 10s) ---\n");
printf("Total: ");
print(@total);
print(@retrans);
clear(@retrans);
clear(@total);
}
#!/usr/bin/env bpftrace
// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
// SPDX-License-Identifier: Apache-2.0
//
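// Socket read/write latencies (transport-agnostic: traces sendmsg/sendto/write
// and recvmsg/recvfrom/read).
// Works for both TCP and NATS since both go through kernel sockets.
// Note: write/read also match non-socket file descriptors (pipes, eventfd, files).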
//
// Usage: sudo bpftrace transport_latency.bt
// sudo bpftrace -p <PID> transport_latency.bt
tracepoint:syscalls:sys_enter_sendmsg,
tracepoint:syscalls:sys_enter_sendto,
tracepoint:syscalls:sys_enter_write
/comm == "tokio-runtime-w" || comm == "dynamo-frontend"/
{
@send_start[tid] = nsecs;
}
tracepoint:syscalls:sys_exit_sendmsg,
tracepoint:syscalls:sys_exit_sendto,
tracepoint:syscalls:sys_exit_write
/@send_start[tid]/
{
$delta = nsecs - @send_start[tid];
@send_latency_us = hist($delta / 1000);
delete(@send_start[tid]);
}
tracepoint:syscalls:sys_enter_recvmsg,
tracepoint:syscalls:sys_enter_recvfrom,
tracepoint:syscalls:sys_enter_read
/comm == "tokio-runtime-w" || comm == "dynamo-frontend"/
{
@recv_start[tid] = nsecs;
}
tracepoint:syscalls:sys_exit_recvmsg,
tracepoint:syscalls:sys_exit_recvfrom,
tracepoint:syscalls:sys_exit_read
/@recv_start[tid]/
{
$delta = nsecs - @recv_start[tid];
@recv_latency_us = hist($delta / 1000);
delete(@recv_start[tid]);
}
interval:s:5
{
printf("\n--- Send Latency (us) ---\n");
print(@send_latency_us);
printf("\n--- Recv Latency (us) ---\n");
print(@recv_latency_us);
clear(@send_latency_us);
clear(@recv_latency_us);
}
END
{
clear(@send_start);
clear(@recv_start);
}