Unverified commit 585b4df7 authored by Biswa Panda, committed by GitHub

feat(perf): add script for profiling frontend perf with nsys and flamegraph (#6748)

parent d2d97f18
<!-- SPDX-License-Identifier: Apache-2.0 -->
# Flame Graph Scripts
Scripts for generating CPU, off-CPU, and differential flame graphs from the Dynamo frontend. Each script checks which profiling tools are installed and uses the first one available in a fixed priority order (see Tool Priority).
## Scripts
| Script | What it does | Requires root? |
|--------|-------------|----------------|
| `cpu_flamegraph.sh` | On-CPU sampling profile. Tries cargo-flamegraph, then samply, then falls back to `perf record` + flamegraph.pl/inferno. The samply path writes a Firefox Profiler `.json.gz` profile instead of an SVG. | No (but `perf` needs `CAP_PERFMON` or `perf_event_paranoid=-1`) |
| `offcpu_flamegraph.sh` | Off-CPU flame graph via BPF. Shows where threads block: mutexes, I/O, futexes, socket waits. | Yes (BPF needs root, or `CAP_BPF` plus `CAP_PERFMON`) |
| `diff_flamegraph.sh` | Differential flame graph comparing two profiles. Red = regression, blue = improvement. | No |
## Quick Start
```bash
# Find the PID of the running Dynamo frontend
FRONTEND_PID=$(pgrep -f "dynamo.frontend" | head -1)
# CPU flame graph (30 s sample)
./cpu_flamegraph.sh --pid "$FRONTEND_PID" --duration 30
# Off-CPU flame graph (what threads block on)
sudo ./offcpu_flamegraph.sh --pid "$FRONTEND_PID" --duration 30
# Differential: compare profiles taken before and after an optimization
./diff_flamegraph.sh before.perf.data after.perf.data
```
## Tool Priority
`cpu_flamegraph.sh` tries tools in order:
1. **cargo-flamegraph**: simplest, one-step SVG. The script uses it only when launching a new binary (`-- <binary>`), never with `--pid`.
2. **samply**: records a Firefox Profiler profile (`.json.gz`, open it with `samply load`) rather than an SVG; supports `--pid`.
3. **perf record** + **flamegraph.pl** (preferred) or **inferno**: the most common fallback
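The selection logic amounts to "use the first tool that `command -v` can find". A minimal sketch of that pattern, using a hypothetical `pick_tool` helper that is not part of the scripts (the real `cpu_flamegraph.sh` additionally skips cargo-flamegraph when `--pid` is given):

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the scripts): print the first candidate that
# resolves via `command -v` and return 0, or return 1 if none is installed.
pick_tool() {
  local tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      return 0
    fi
  done
  return 1
}

# Same order as cpu_flamegraph.sh in launch mode (-- <binary>).
if tool=$(pick_tool flamegraph samply perf); then
  echo "would use: $tool"
else
  echo "no profiler found" >&2
fi
```

Because the order is fixed, having samply installed changes the output format even when perf is also present: the script stops at samply and writes a `.json.gz` profile instead of an SVG.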
`offcpu_flamegraph.sh` tries:
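1. **bpftrace**: inline BPF script that records kernel stacks at `sched_switch` (kernel frames only, no user-space frames)
2. **bcc `offcputime-bpfcc`**: fallback when bpftrace is not installed; emits folded stacks directly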
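The bpftrace path writes bpftrace's native map output, which `offcpu_flamegraph.sh` then folds into `comm;root;...;leaf value` lines with awk. A condensed sketch of that folding step, wrapped in a hypothetical `fold_bpftrace` function and fed a hand-written sample in the expected shape rather than real captured output:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the same awk logic offcpu_flamegraph.sh uses:
# turn "@stacks[ <frames leaf-first> , comm]: value" into folded lines.
fold_bpftrace() {
  awk '
    /^@stacks\[/ { n = 0; next }                       # start of a map entry
    /^[[:space:]]+[a-zA-Z_]/ {                          # one kernel frame
      gsub(/^[[:space:]]+/, ""); sub(/\+[0-9]+$/, "")  # drop indent and +offset
      frames[n++] = $0; next
    }
    /^, / {                                            # ", comm]: value" ends the entry
      sub(/^, /, "")
      idx = index($0, "]: ")
      comm = substr($0, 1, idx - 1)
      val = substr($0, idx + 3) + 0
      if (n > 0 && val > 0) {
        printf "%s", comm
        for (i = n - 1; i >= 0; i--) printf ";%s", frames[i]  # root first
        printf " %d\n", val
      }
      n = 0
    }
  '
}

# Hand-written sample (not real bpftrace output); prints:
# tokio-runtime-w;futex_wait_queue;schedule;__schedule 5000
printf '%s\n' \
  '@stacks[' \
  '    __schedule+1234' \
  '    schedule+56' \
  '    futex_wait_queue+78' \
  ', tokio-runtime-w]: 5000' | fold_bpftrace
```

bpftrace prints each stack leaf-first, while the folded format expected by flamegraph.pl and inferno is root-first, which is why the frames are emitted in reverse.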
## Options
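All scripts share a common option style. Defaults can also be set through environment variables (`DURATION`, `OUTPUT_DIR`, `FREQ`, `MIN_US`); a command-line flag takes precedence.
| Option | Description | Applies to | Default |
|--------|-------------|------------|---------|
| `--pid PID` | Attach to a running process. Required for off-CPU; `cpu_flamegraph.sh` also accepts `-- <binary> [args...]` instead | CPU, off-CPU | none |
| `--duration N` | Capture duration in seconds | CPU, off-CPU | 30 |
| `--output-dir DIR` | Output directory | All | `.` |
| `--output NAME` | Output file base name | All | `<kind>_flamegraph_<YYYYmmdd_HHMMSS>` |
| `--freq HZ` | Sampling frequency | CPU | 99 |
| `--min-us N` | Minimum off-CPU time to record, in microseconds | Off-CPU | 1000 |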
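Each script reads its defaults with `${VAR:-default}` before parsing flags, so a flag beats an environment variable, which beats the built-in default. A minimal sketch of that precedence for `--duration`, using a hypothetical `resolve_duration` function (unlike the real scripts, it skips unknown arguments instead of stopping at them):

```bash
#!/usr/bin/env bash
# Hypothetical helper mirroring the scripts' pattern:
# flag > environment variable > built-in default.
resolve_duration() {
  # Expanded before `local` runs, so this sees the caller's DURATION.
  local DURATION="${DURATION:-30}"
  while [[ $# -gt 0 ]]; do
    case $1 in
      --duration|-d) DURATION="$2"; shift 2 ;;  # flag overrides env/default
      *) shift ;;                               # ignore anything else here
    esac
  done
  echo "$DURATION"
}

DURATION= resolve_duration            # 30 (empty/unset falls back to default)
DURATION=60 resolve_duration          # 60 (environment)
DURATION=60 resolve_duration -d 10    # 10 (flag)
```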
## Interpreting Results
### CPU Flame Graph
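- Frame width is proportional to the number of samples in which that function was on the stack, i.e. its share of CPU time including its callees
- Look for hot paths in `tokio-runtime-worker` threads
- Wide frames at the top of a stack (plateaus) are where CPU time is spent directly and are the main optimization targets; tall, narrow towers are just deep call chains that cost little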
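When you have folded stacks (for example from `perf script | inferno-collapse-perf`), the plateaus can also be ranked numerically by summing samples per leaf frame. A rough sketch, assuming the standard folded format `frame;frame;...;leaf count`; `top_self` is a hypothetical helper and the sample stacks are made up:

```bash
#!/usr/bin/env bash
# Hypothetical helper: rank functions by "self" samples, i.e. samples where
# the function is the leaf (top) frame. Reads folded stacks on stdin.
top_self() {
  awk '{
    count = $NF                        # last field is the sample count
    stack = $0
    sub(/ [0-9]+$/, "", stack)         # strip the count, keep the frames
    n = split(stack, frames, ";")
    self_samples[frames[n]] += count   # attribute samples to the leaf
  } END {
    for (f in self_samples) printf "%d %s\n", self_samples[f], f
  }' | sort -rn
}

# Made-up sample stacks
printf '%s\n' \
  'tokio-runtime-w;main;parse_request;serde_json::from_slice 40' \
  'tokio-runtime-w;main;route;serde_json::from_slice 25' \
  'tokio-runtime-w;main;route;hash_lookup 10' | top_self
```

Splitting only on `;` keeps Rust symbols that contain spaces (such as `<T as Trait>::method`) intact.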
### Off-CPU Flame Graph
- `futex_wait_queue` (`futex_wait_queue_me` on older kernels) → mutex/condvar contention
- `ep_poll` → epoll_wait (normal Tokio I/O loop)
- `schedule_timeout` → timer/sleep
- `tcp_sendmsg` / `tcp_recvmsg` → socket I/O blocking
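The folded `.stacks` file written by `offcpu_flamegraph.sh` (values in microseconds) can be bucketed by these frames for a quick text breakdown. A sketch with a hypothetical `offcpu_breakdown` helper; the patterns are illustrative, kernel function names vary across versions, and the sample lines are invented:

```bash
#!/usr/bin/env bash
# Hypothetical helper: sum off-CPU microseconds per wait category.
# Socket patterns are checked before schedule_timeout because a blocking
# tcp_recvmsg may also pass through schedule_timeout.
offcpu_breakdown() {
  awk '{
    us = $NF
    if ($0 ~ /futex_wait/)                    cat = "futex (mutex/condvar)"
    else if ($0 ~ /ep_poll/)                  cat = "epoll_wait (Tokio I/O loop)"
    else if ($0 ~ /tcp_sendmsg|tcp_recvmsg/)  cat = "socket I/O"
    else if ($0 ~ /schedule_timeout/)         cat = "timer/sleep"
    else                                      cat = "other"
    total[cat] += us
  } END {
    for (c in total) printf "%d %s\n", total[c], c
  }' | sort -rn
}

# Invented folded stacks (comm;root;...;leaf microseconds)
printf '%s\n' \
  'tokio-runtime-w;futex_wait_queue;schedule;__schedule 3000' \
  'tokio-runtime-w;ep_poll;schedule_hrtimeout_range;schedule;__schedule 9000' \
  'tokio-runtime-w;tcp_recvmsg;sk_wait_data;schedule_timeout;schedule;__schedule 2000' \
  | offcpu_breakdown
```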
### Differential Flame Graph
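- **Red** frames have more samples in the *after* profile (regression)
- **Blue** frames have fewer samples in the *after* profile (improvement)
- Frame widths follow the *after* profile; color saturation shows the magnitude of the change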
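Before rendering, `difffolded.pl` and `inferno-diff-folded` emit one line per stack with two counts, `stack before after`. A sketch that lists the stacks whose sample counts grew the most, using a hypothetical `top_regressions` helper on made-up input. Raw counts are only comparable when both captures used the same duration and frequency; otherwise consider the tools' normalization option (`-n` for `difffolded.pl`).

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "stack before after" lines and print
# "<increase> <stack>" for stacks whose count grew, largest first.
top_regressions() {
  awk '{
    delta = $NF - $(NF - 1)            # after minus before
    stack = $0
    sub(/ [0-9]+ [0-9]+$/, "", stack)  # strip both counts
    if (delta > 0) printf "%d %s\n", delta, stack
  }' | sort -rn
}

# Made-up diff-folded lines
printf '%s\n' \
  'main;route;serde_json::from_slice 40 70' \
  'main;route;hash_lookup 30 12' \
  'main;parse_request 5 9' | top_regressions
```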
## Integration with Capture Script
The main capture script, `run_perf.sh`, generates flame graphs automatically from its `perf record` data:
```bash
sudo bash benchmarks/frontend/scripts/run_perf.sh \
--skip-nsys \
--model Qwen/Qwen3-0.6B --concurrency 64 --num-requests 4096
# Flame graph SVGs appear in artifacts/obs_<timestamp>/perf/
```
## Requirements
- **CPU**: `perf` (`apt install linux-tools-$(uname -r)`) or `cargo install flamegraph` or `cargo install samply`
- **Off-CPU**: `bpftrace` >= 0.16 or `bcc-tools`
- **SVG generation**: `cargo install inferno` (provides `inferno-collapse-perf`, `inferno-flamegraph`, `inferno-diff-folded`) or Brendan Gregg's [FlameGraph](https://github.com/brendangregg/FlameGraph) scripts
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# CPU flame graph generation.
# Tries cargo-flamegraph (launch mode only), then samply, then falls back to
# perf record + flamegraph.pl or inferno.
#
# Usage:
# ./cpu_flamegraph.sh --pid <PID> # attach to running process
# ./cpu_flamegraph.sh --pid <PID> --duration 30 # 30 second capture
# ./cpu_flamegraph.sh -- target/profiling/binary # launch and profile
set -euo pipefail
PID=""
DURATION="${DURATION:-30}"
OUTPUT_DIR="${OUTPUT_DIR:-.}"
OUTPUT_NAME="cpu_flamegraph_$(date +%Y%m%d_%H%M%S)"
FREQ="${FREQ:-99}"
while [[ $# -gt 0 ]]; do
case $1 in
--pid|-p) PID="$2"; shift 2 ;;
--duration|-d) DURATION="$2"; shift 2 ;;
--output-dir) OUTPUT_DIR="$2"; shift 2 ;;
--output) OUTPUT_NAME="$2"; shift 2 ;;
--freq) FREQ="$2"; shift 2 ;;
--) shift; break ;;
-h|--help)
echo "Usage: $0 [OPTIONS] [-- <binary> [args...]]"
echo ""
echo "Options:"
echo " --pid PID Profile running process"
echo " --duration N Capture duration in seconds (default: 30)"
echo " --output-dir DIR Output directory (default: .)"
echo " --output NAME Output file base name (default: cpu_flamegraph_<timestamp>)"
echo " --freq HZ Sampling frequency (default: 99)"
exit 0
;;
*) break ;;
esac
done
mkdir -p "$OUTPUT_DIR"
# Validate that we have a target to profile
if [[ -z "$PID" ]] && [[ $# -eq 0 ]]; then
echo "ERROR: Must specify --pid <PID> or provide a command after --"
echo "Usage: $0 --pid <PID> OR $0 -- <binary> [args...]"
exit 1
fi
# Try cargo-flamegraph first (simplest)
if command -v flamegraph &>/dev/null && [[ -z "$PID" ]] && [[ $# -gt 0 ]]; then
echo "Using cargo-flamegraph..."
flamegraph --freq "$FREQ" --output "${OUTPUT_DIR}/${OUTPUT_NAME}.svg" -- "$@"
echo "Flame graph: ${OUTPUT_DIR}/${OUTPUT_NAME}.svg"
exit 0
fi
# Try samply
if command -v samply &>/dev/null; then
echo "Using samply..."
if [[ -n "$PID" ]]; then
samply record --pid "$PID" --duration "$DURATION" \
--save-only --output "${OUTPUT_DIR}/${OUTPUT_NAME}.json.gz"
else
samply record --duration "$DURATION" \
--save-only --output "${OUTPUT_DIR}/${OUTPUT_NAME}.json.gz" -- "$@"
fi
echo "Profile: ${OUTPUT_DIR}/${OUTPUT_NAME}.json.gz"
echo "View with: samply load ${OUTPUT_DIR}/${OUTPUT_NAME}.json.gz"
exit 0
fi
# Fallback: perf record + flamegraph.pl
if ! command -v perf &>/dev/null; then
echo "ERROR: No profiling tool found. Install one of:"
echo " - cargo install flamegraph"
echo " - cargo install samply"
echo " - apt install linux-tools-\$(uname -r)"
exit 1
fi
echo "Using perf record..."
PERF_DATA="${OUTPUT_DIR}/${OUTPUT_NAME}.perf.data"
# perf record may exit non-zero (e.g. target exits, signal interrupts) but still
# produce valid data — don't let set -e abort before SVG generation.
if [[ -n "$PID" ]]; then
perf record -F "$FREQ" -g --pid "$PID" -o "$PERF_DATA" -- sleep "$DURATION" || true
else
perf record -F "$FREQ" -g -o "$PERF_DATA" -- "$@" || true
fi
if [[ ! -f "$PERF_DATA" ]]; then
echo "ERROR: perf record produced no data"
exit 1
fi
# Generate flamegraph SVG
if command -v flamegraph.pl &>/dev/null && command -v stackcollapse-perf.pl &>/dev/null; then
perf script -i "$PERF_DATA" 2>/dev/null | stackcollapse-perf.pl | \
flamegraph.pl > "${OUTPUT_DIR}/${OUTPUT_NAME}.svg"
echo "Flame graph: ${OUTPUT_DIR}/${OUTPUT_NAME}.svg"
elif command -v inferno-collapse-perf &>/dev/null && command -v inferno-flamegraph &>/dev/null; then
perf script -i "$PERF_DATA" 2>/dev/null | inferno-collapse-perf | \
inferno-flamegraph > "${OUTPUT_DIR}/${OUTPUT_NAME}.svg"
echo "Flame graph: ${OUTPUT_DIR}/${OUTPUT_NAME}.svg"
else
echo "Raw perf data: $PERF_DATA"
echo "Install flamegraph tools to generate SVG: cargo install inferno"
fi
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Differential flame graph (before/after comparison).
# Shows what changed between two profiles — red = regression, blue = improvement.
#
# Usage:
# ./diff_flamegraph.sh <before.perf.data> <after.perf.data>
# ./diff_flamegraph.sh <before.stacks> <after.stacks>
set -euo pipefail
OUTPUT_DIR="${OUTPUT_DIR:-.}"
OUTPUT_NAME="diff_flamegraph_$(date +%Y%m%d_%H%M%S)"
if [[ $# -lt 2 ]]; then
echo "Usage: $0 <before> <after> [--output-dir DIR] [--output NAME]"
echo ""
echo "Accepts .perf.data files or pre-folded .stacks files."
echo "Output: differential SVG flamegraph (red=regression, blue=improvement)"
exit 1
fi
BEFORE="$1"
AFTER="$2"
shift 2
while [[ $# -gt 0 ]]; do
case $1 in
--output-dir) OUTPUT_DIR="$2"; shift 2 ;;
--output) OUTPUT_NAME="$2"; shift 2 ;;
*) break ;;
esac
done
mkdir -p "$OUTPUT_DIR"
# Convert perf.data to folded stacks if needed
fold_perf_data() {
local input=$1
local output=$2
if [[ "$input" == *.perf.data ]]; then
if ! command -v perf &>/dev/null; then
echo "ERROR: perf not found. Install: apt install linux-tools-$(uname -r)"
exit 1
fi
if command -v stackcollapse-perf.pl &>/dev/null; then
perf script -i "$input" | stackcollapse-perf.pl > "$output"
elif command -v inferno-collapse-perf &>/dev/null; then
perf script -i "$input" | inferno-collapse-perf > "$output"
else
echo "ERROR: Need stackcollapse-perf.pl or inferno-collapse-perf"
exit 1
fi
else
cp "$input" "$output"
fi
}
BEFORE_FOLDED=$(mktemp)
AFTER_FOLDED=$(mktemp)
trap 'rm -f "$BEFORE_FOLDED" "$AFTER_FOLDED"' EXIT
fold_perf_data "$BEFORE" "$BEFORE_FOLDED"
fold_perf_data "$AFTER" "$AFTER_FOLDED"
if command -v difffolded.pl &>/dev/null && command -v flamegraph.pl &>/dev/null; then
difffolded.pl "$BEFORE_FOLDED" "$AFTER_FOLDED" | \
flamegraph.pl --title="Differential Flame Graph" > "${OUTPUT_DIR}/${OUTPUT_NAME}.svg"
echo "Diff flame graph: ${OUTPUT_DIR}/${OUTPUT_NAME}.svg"
elif command -v inferno-diff-folded &>/dev/null && command -v inferno-flamegraph &>/dev/null; then
inferno-diff-folded "$BEFORE_FOLDED" "$AFTER_FOLDED" | \
inferno-flamegraph --title "Differential Flame Graph" > "${OUTPUT_DIR}/${OUTPUT_NAME}.svg"
echo "Diff flame graph: ${OUTPUT_DIR}/${OUTPUT_NAME}.svg"
else
echo "ERROR: Need flamegraph tools (Brendan Gregg's or inferno)"
echo " cargo install inferno"
echo " or: git clone https://github.com/brendangregg/FlameGraph"
exit 1
fi
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Off-CPU flame graph using BPF offcputime + flamegraph.
# Shows what threads are blocked on (mutexes, I/O, futex, socket waits).
#
# Usage:
# sudo ./offcpu_flamegraph.sh --pid <PID>
# sudo ./offcpu_flamegraph.sh --pid <PID> --duration 30
set -euo pipefail
PID=""
DURATION="${DURATION:-30}"
OUTPUT_DIR="${OUTPUT_DIR:-.}"
OUTPUT_NAME="offcpu_flamegraph_$(date +%Y%m%d_%H%M%S)"
MIN_US="${MIN_US:-1000}" # Minimum off-CPU time to record (1ms)
while [[ $# -gt 0 ]]; do
case $1 in
--pid|-p) PID="$2"; shift 2 ;;
--duration|-d) DURATION="$2"; shift 2 ;;
--output-dir) OUTPUT_DIR="$2"; shift 2 ;;
--output) OUTPUT_NAME="$2"; shift 2 ;;
--min-us) MIN_US="$2"; shift 2 ;;
-h|--help)
echo "Usage: sudo $0 [OPTIONS]"
echo ""
echo "Options:"
echo " --pid PID Target process (required)"
echo " --duration N Capture duration in seconds (default: 30)"
echo " --output-dir DIR Output directory (default: .)"
echo " --output NAME Output file base name (default: offcpu_flamegraph_<timestamp>)"
echo " --min-us N Minimum off-CPU microseconds to record (default: 1000)"
exit 0
;;
*) echo "Unknown option: $1"; exit 1 ;;
esac
done
if [[ -z "$PID" ]]; then
echo "ERROR: --pid is required"
exit 1
fi
mkdir -p "$OUTPUT_DIR"
RAW_STACKS="${OUTPUT_DIR}/${OUTPUT_NAME}.raw"
FOLDED_STACKS="${OUTPUT_DIR}/${OUTPUT_NAME}.stacks"
NEEDS_FOLD=false
# Try bpftrace-based offcputime
if command -v bpftrace &>/dev/null; then
echo "Capturing off-CPU stacks for PID $PID for ${DURATION}s..."
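# bpftrace -p does not filter tracepoints by process, so the first probe
# filters on pid explicitly. The second probe runs as the thread being
# switched out, so the woken thread's name comes from args.next_comm.
# timeout sends SIGINT so bpftrace exits cleanly and prints @stacks once on
# exit; END only clears the helper maps. stderr is discarded so it cannot
# mix with the stack data.
timeout -s INT "$DURATION" bpftrace -p "$PID" -e '
tracepoint:sched:sched_switch /pid == '"$PID"'/ {
if (args.prev_state != 0) {
@off[tid] = nsecs;
@stack[tid] = kstack;
}
}
tracepoint:sched:sched_switch {
$start = @off[args.next_pid];
if ($start) {
$delta = (nsecs - $start) / 1000;
if ($delta > '"$MIN_US"') {
@stacks[@stack[args.next_pid], args.next_comm] = sum($delta);
}
delete(@off[args.next_pid]);
delete(@stack[args.next_pid]);
}
}
END { clear(@off); clear(@stack); }
' > "$RAW_STACKS" 2>/dev/null || true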
NEEDS_FOLD=true
echo "Raw stacks captured: $RAW_STACKS"
# Try bcc offcputime; -f outputs folded format directly. In offcputime, -d
# means "delimited" (not duration), so the duration is passed positionally.
elif command -v offcputime-bpfcc &>/dev/null; then
echo "Using bcc offcputime for PID $PID for ${DURATION}s..."
offcputime-bpfcc -p "$PID" -m "$MIN_US" -f "$DURATION" > "$FOLDED_STACKS"
else
echo "ERROR: No BPF tool found. Install bpftrace or bcc-tools."
exit 1
fi
# Convert bpftrace native format to folded stacks.
# bpftrace @stacks[kstack, comm] format:
# @stacks[
# leaf_func+offset
# ...
# root_func+offset
# , comm_name]: value
# Folded format: comm;root_func;...;leaf_func value
if [[ "$NEEDS_FOLD" == true ]] && [[ -f "$RAW_STACKS" ]]; then
awk '
/^@stacks\[/ { n=0; next }
/^[[:space:]]+[a-zA-Z_]/ {
gsub(/^[[:space:]]+/, "")
sub(/\+[0-9]+$/, "")
frames[n++] = $0
next
}
/^, / {
sub(/^, /, "")
idx = index($0, "]: ")
comm = substr($0, 1, idx-1)
val = substr($0, idx+3) + 0
if (n > 0 && val > 0) {
printf "%s", comm
for (i=n-1; i>=0; i--) printf ";%s", frames[i]
printf " %d\n", val
}
n = 0
next
}
' "$RAW_STACKS" > "$FOLDED_STACKS"
echo "Folded stacks: $FOLDED_STACKS ($(wc -l < "$FOLDED_STACKS") entries)"
fi
# Generate flamegraph SVG from folded stacks
if [[ ! -s "$FOLDED_STACKS" ]]; then
echo "WARNING: No stacks captured — SVG not generated"
exit 0
fi
if command -v flamegraph.pl &>/dev/null; then
flamegraph.pl --color=io --title="Off-CPU Flame Graph (PID $PID)" \
--countname="us" < "$FOLDED_STACKS" > "${OUTPUT_DIR}/${OUTPUT_NAME}.svg"
echo "Flame graph: ${OUTPUT_DIR}/${OUTPUT_NAME}.svg"
elif command -v inferno-flamegraph &>/dev/null; then
inferno-flamegraph --colors io --title "Off-CPU Flame Graph (PID $PID)" \
--countname "us" < "$FOLDED_STACKS" > "${OUTPUT_DIR}/${OUTPUT_NAME}.svg"
echo "Flame graph: ${OUTPUT_DIR}/${OUTPUT_NAME}.svg"
else
echo "Folded stacks: $FOLDED_STACKS"
echo "Install flamegraph tools to generate SVG: cargo install inferno"
fi
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Nsight Systems profiling wrapper for the Dynamo frontend.
# Traces NVTX ranges and OS runtime calls and collects CPU samples. Context
# switch tracing is disabled (--cpuctxsw=none) to reduce overhead.
#
# Prerequisites:
# - nsys (Nsight Systems CLI) installed
# - Binary built with: cargo build --profile profiling --features nvtx
#
# Usage:
# ./nsys_profile.sh <binary> [args...]
# ./nsys_profile.sh --duration 60 <binary> [args...]
# DURATION=30 ./nsys_profile.sh target/profiling/dynamo-frontend
set -euo pipefail
DURATION="${DURATION:-30}"
OUTPUT_PREFIX="dynamo_frontend_$(date +%Y%m%d_%H%M%S)"
OUTPUT_DIR="${OUTPUT_DIR:-.}"
# Parse optional flags
while [[ $# -gt 0 ]]; do
case $1 in
--duration) DURATION="$2"; shift 2 ;;
--output-dir) OUTPUT_DIR="$2"; shift 2 ;;
--output) OUTPUT_PREFIX="$2"; shift 2 ;;
-h|--help)
echo "Usage: $0 [OPTIONS] <binary> [binary-args...]"
echo ""
echo "Options:"
echo " --duration N Profile duration in seconds (default: 30)"
echo " --output-dir DIR Output directory (default: .)"
echo " --output PREFIX Output file prefix (default: dynamo_frontend_<timestamp>)"
echo ""
echo "Environment:"
echo " DYN_ENABLE_NVTX=1 is set automatically"
echo ""
echo "Build the binary first:"
echo " cargo build --profile profiling --features nvtx"
exit 0
;;
*) break ;;
esac
done
if [[ $# -eq 0 ]]; then
echo "ERROR: No binary specified."
echo "Usage: $0 [OPTIONS] <binary> [binary-args...]"
exit 1
fi
BINARY="$1"
shift
if ! command -v nsys &>/dev/null; then
echo "ERROR: nsys not found. Install Nsight Systems."
exit 1
fi
if ! command -v "$BINARY" &>/dev/null && [[ ! -x "$BINARY" ]]; then
echo "ERROR: Binary not found or not executable: $BINARY"
echo "Build with: cargo build --profile profiling --features nvtx"
exit 1
fi
mkdir -p "$OUTPUT_DIR"
export DYN_ENABLE_NVTX=1
echo "Profiling: $BINARY $*"
echo "Duration: ${DURATION}s"
echo "Output: ${OUTPUT_DIR}/${OUTPUT_PREFIX}.nsys-rep"
nsys profile \
--trace=osrt,nvtx \
--sample=cpu \
--cpuctxsw=none \
--output="${OUTPUT_DIR}/${OUTPUT_PREFIX}" \
--duration="$DURATION" \
--force-overwrite=true \
"$BINARY" "$@"
echo ""
echo "Profile saved: ${OUTPUT_DIR}/${OUTPUT_PREFIX}.nsys-rep"
echo "View with: nsys-ui ${OUTPUT_DIR}/${OUTPUT_PREFIX}.nsys-rep"