添加优化限制，带宽查询

2ad344b2 · whlwhlwhl · 067b04c0 · 2ad344b2 · 2ad344b2 · 2ad344b2
Commit 2ad344b2 authored May 21, 2026 by whlwhlwhl
3 changed files
--- a/docs/lightop-skills.zh-CN.md
+++ b/docs/lightop-skills.zh-CN.md
@@ -164,6 +164,38 @@ HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/<family>/benchmark_<op>.py
 要选择 HCU 利用率低、显存占用低的卡。baseline 和 candidate 尽量使用同一张卡。
 如果没有记录设备状态，不应该把性能数字当作最终可采信证据。
+开始优化前，还要在当前选中的卡上实测一次实际读写带宽，写到
+`.humanize/lightop-agent/device-bandwidth.txt`。这个值不是理论峰值，而是当前容器、
+当前卡、当前负载状态下的 sanity baseline，用来判断算子的有效带宽目标是否合理。
+```bash
+mkdir -p .humanize/lightop-agent
+HIP_VISIBLE_DEVICES=<idle-card> python - <<'PY' 2>&1 | tee .humanize/lightop-agent/device-bandwidth.txt
+import time, torch
+torch.cuda.init()
+free, total = torch.cuda.mem_get_info()
+bytes_per_buf = max(16 << 20, min(512 << 20, int(free // 5)))
+n = bytes_per_buf // 4
+a = torch.empty(n, device="cuda", dtype=torch.float32)
+b = torch.empty_like(a)
+c = torch.empty_like(a)
+a.fill_(1.0); b.fill_(2.0); c.zero_(); torch.cuda.synchronize()
+def bench(name, fn, bytes_moved, iters=80, warmup=20):
+    for _ in range(warmup): fn()
+    torch.cuda.synchronize()
+    t0 = time.perf_counter()
+    for _ in range(iters): fn()
+    torch.cuda.synchronize()
+    dt = (time.perf_counter() - t0) / iters
+    print(f"{name}: {bytes_moved / dt / 1e12:.3f} TB/s ({dt * 1e6:.2f} us, bytes={bytes_moved})")
+bench("write_fill", lambda: a.fill_(3.0), n * 4)
+bench("copy_read_write", lambda: c.copy_(a), n * 4 * 2)
+bench("triad_2read_1write", lambda: torch.add(a, b, out=c), n * 4 * 3)
+bench("read_reduce", lambda: torch.sum(a), n * 4)
+print("buffer_bytes:", n * 4, "total_mem:", total, "free_mem_at_start:", free)
+PY
+```
 ### 工作流
 Stage 1：检查和计划。
@@ -180,16 +212,22 @@ Stage 1：检查和计划。
 Stage 2：实现和验证。
 - 第一次优化前先做 baseline matrix：正确性、benchmark、卡状态、p50/p90/mean、
-  有效带宽、噪声估计。
+  当前卡实际读/写/拷贝带宽、有效带宽、噪声估计。
 - 每轮只做一个主要优化假设。
 - build 使用 `python setup.py install`。
 - 同环境做 import smoke、target correctness、benchmark。
 - 每个通过正确性的 candidate 都记录 shape、dtype、配置、带宽或延迟、对比基准、
  keep/reject/inconclusive 原因。
+- 每个正确性通过的优化 candidate，在普通 benchmark 后都必须用同一张卡、同一代表
+  shape 跑 `hipprof --pmc`。digest 要分析 cache 行为、memory/cache traffic、
+  LDS 或 bank conflict、occupancy/resource pressure，并给出一个明确的下一步修改。
+  如果当前 DTK 不支持某个 counter 或 occupancy 信号，要记录实际命令和报错。
 Stage 3：profiling 和 tuning。
 - 当首个正确 candidate 没达到性能目标时，进入 profiling/tuning，不要停止。
+- 每轮优化改动后，只要正确性通过，就先 benchmark，再跑 per-candidate
+  `hipprof --pmc`，用 profiler 证据决定下一轮，不要只凭源码直觉继续改。
 - 两个连续正确 candidate 没达标时，下一次 kernel/dispatch edit 前必须同时有：
  `lightop-kernel-knowledge` 调研结论和 `dcu-profiler-report` profile digest。
 - 第二个正确优化尝试如果相对 parent 或 baseline 提升小于 5%，下一步必须做深度
@@ -208,7 +246,10 @@ Stage 4：收尾。
 性能目标是验收契约，不是“尽量试试”。规则如下：
 - 只有 correctness 通过后，performance candidate 才算有效。
+- 开始优化前必须实测当前卡的实际读写带宽，并把结果作为 baseline matrix 的一部分。
 - 每个正确 candidate 都必须有 benchmark 数据。
+- 每个正确优化 candidate 都必须有配套 `hipprof --pmc` 证据，至少说明 cache、
+  LDS/bank conflict、occupancy/resource 这几类信号，或者记录工具不支持的原因。
 - 提升必须超过计划里定义的噪声带，噪声带内只能记为 inconclusive 或 plateau。
 - 首个正确 candidate 未达标时，必须 profile 和 tune。
 - 至少尝试 3 条有证据支持的优化 lineage，除非 profiler 证明目标不可达。
@@ -270,6 +311,24 @@ python test_<op>.py
 如果没有 benchmark，就添加小 benchmark，必须包含 warmup、固定 shape、固定 seed、
 计时前后显式 `torch.cuda.synchronize()`。
+每个正确性通过的优化 candidate 都要在 benchmark 后跑 PMC：
+```bash
+HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc --pmc-type 3 \
+  -o .humanize/lightop-agent/profile-artifacts/<version>/hipprof-pmc-all/pmc \
+  python test/<family>/benchmark_<op>.py
+HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc-read --pmc-type 3 \
+  -o .humanize/lightop-agent/profile-artifacts/<version>/hipprof-pmc-all/pmc-read \
+  python test/<family>/benchmark_<op>.py
+HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc-write --pmc-type 3 \
+  -o .humanize/lightop-agent/profile-artifacts/<version>/hipprof-pmc-all/pmc-write \
+  python test/<family>/benchmark_<op>.py
+```
+这些结果要写进 `.humanize/lightop-agent/profile-artifacts/<version>/digest.md` 或
+`kernel_opt_readme.md`，用于判断 cache 命中、LDS bank conflict、内核 occupancy、
+VGPR/LDS/resource 压力，并产出下一步具体 kernel edit。
 ### RLCR 启动和失败回退
 写好 refined plan 且确认 `.humanize*` 已忽略后，从 LightOp root 启动：
@@ -409,6 +468,8 @@ python3 scripts/search-index-repos.py <term1> <term2> [<term3>]
 - baseline benchmark 已通过，但还没有 baseline profile digest。
 - 正确 candidate 和 baseline 或 prior best 差距在 +/-2% 内。
+- 每个正确性通过的优化 candidate 已经跑完 benchmark，需要 mandatory
+  `hipprof --pmc` digest 来决定下一步。
 - 第二个正确优化尝试相对 parent 或 baseline 提升小于 5%。
 - candidate 在重要 shape 上回退。
 - candidate 快得异常，需要解释。
@@ -428,7 +489,7 @@ python3 scripts/search-index-repos.py <term1> <term2> [<term3>]
  device-status.txt
  benchmark.log
  hipprof.txt
-  hipprof-pmc-all/
+  hipprof-pmc-all/            # 每个正确优化 candidate 必须有
  sqtt-json/
  rocprof.csv
  rocprof-stats.csv
@@ -448,12 +509,15 @@ python3 scripts/search-index-repos.py <term1> <term2> [<term3>]
 3. 执行设备选择 gate，并用 `HIP_VISIBLE_DEVICES=<idle-card>` 固定卡。
 4. 先跑普通 benchmark，避免 profiler overhead 变成性能结论。
 5. 用 `hipprof` 做第一阶段 API/kernel/memcpy timing。
-6. 深度分析时收集 PMC、SQTT、`dccobjdump`、code-object resource、LDS/register/
+6. 每个正确优化 candidate 都收集 `hipprof --pmc`、`--pmc-read`、`--pmc-write`
+   中当前环境支持的结果，并解释 cache、memory/cache traffic、LDS/bank conflict、
+   occupancy/resource pressure。
+7. 深度分析时收集 SQTT、`dccobjdump`、code-object resource、LDS/register/
   occupancy 证据。
-7. 如果 `hipprof` 只显示热点 kernel 但解释不了原因，再用更深的 ROCm/DTK profiler。
+8. 如果 `hipprof` 只显示热点 kernel 但解释不了原因，再用更深的 ROCm/DTK profiler。
-8. 如果怀疑 codegen，检查 AMDGPU ISA 或 code-object metadata。
+9. 如果怀疑 codegen，检查 AMDGPU ISA 或 code-object metadata。
-9. candidate 要和 baseline 或 parent 比，不只看绝对数。
+10. candidate 要和 baseline 或 parent 比，不只看绝对数。
-10. 写 `digest.md`，最后必须只有一个明确的 next edit。
+11. 写 `digest.md`，最后必须只有一个明确的 next edit。
 ### 常用命令
@@ -560,4 +624,3 @@ hipprof --codeobj-analyze <binary-or-so> \
 - 优先只修改目标算子相关源码、测试、benchmark 或必要 config。
 - 直接在 Docker 内进行 install、correctness test、benchmark、profiling、tuning。
 ```
--- a/humanize/skills/humanize-kernel-agent-loop/SKILL.md
+++ b/humanize/skills/humanize-kernel-agent-loop/SKILL.md
@@ -211,18 +211,52 @@ execution environment and pin the run to an idle card.
  comparison is deliberately measuring cross-device behavior.
 - Record the chosen card, HCU utilization, VRAM use, and exact command in the
  attempt ledger or profile artifact directory.
+- Before the first optimization edit, run a device bandwidth calibration on the
+  selected card in the same execution environment. Record actual read, write,
+  copy/read-write, and simple triad bandwidth, buffer size, dtype, selected
+  card, and command in `.humanize/lightop-agent/device-bandwidth.txt`. Treat
+  this as a sanity baseline for any user-specified effective-bandwidth target.
 Example:
 ```bash
 hy-smi || rocm-smi
+mkdir -p .humanize/lightop-agent
+HIP_VISIBLE_DEVICES=<idle-card> python - <<'PY' 2>&1 | tee .humanize/lightop-agent/device-bandwidth.txt
+import time, torch
+torch.cuda.init()
+free, total = torch.cuda.mem_get_info()
+bytes_per_buf = max(16 << 20, min(512 << 20, int(free // 5)))
+n = bytes_per_buf // 4
+a = torch.empty(n, device="cuda", dtype=torch.float32)
+b = torch.empty_like(a)
+c = torch.empty_like(a)
+a.fill_(1.0); b.fill_(2.0); c.zero_(); torch.cuda.synchronize()
+def bench(name, fn, bytes_moved, iters=80, warmup=20):
+    for _ in range(warmup):
+        fn()
+    torch.cuda.synchronize()
+    t0 = time.perf_counter()
+    for _ in range(iters):
+        fn()
+    torch.cuda.synchronize()
+    dt = (time.perf_counter() - t0) / iters
+    print(f"{name}: {bytes_moved / dt / 1e12:.3f} TB/s ({dt * 1e6:.2f} us, bytes={bytes_moved})")
+bench("write_fill", lambda: a.fill_(3.0), n * 4)
+bench("copy_read_write", lambda: c.copy_(a), n * 4 * 2)
+bench("triad_2read_1write", lambda: torch.add(a, b, out=c), n * 4 * 3)
+bench("read_reduce", lambda: torch.sum(a), n * 4)
+print("buffer_bytes:", n * 4, "total_mem:", total, "free_mem_at_start:", free)
+PY
 HIP_VISIBLE_DEVICES=<idle-card> python test/<family>/benchmark_<op>.py
 HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/<family>/benchmark_<op>.py
 ```
 Do not report a performance number as actionable unless the device-selection
-gate was recorded or the user explicitly accepts the missing card-state
+gate and device bandwidth calibration were recorded, or the user explicitly
-evidence.
+accepts the missing evidence.
 ## Workflow
@@ -252,10 +286,11 @@ evidence.
 ### Stage 2: Implement And Verify
 1. Build a baseline matrix before the first optimization edit: run correctness,
-   then benchmark the target workload on the selected idle card for enough
+   run the selected-card device bandwidth calibration, then benchmark the
-   repeats to report p50/p90 or mean, effective bandwidth, variance/noise, card
+   target workload on the selected idle card for enough repeats to report
-   status, and command line. Store it in the attempt ledger and
+   p50/p90 or mean, effective bandwidth, variance/noise, card status, device
-   `kernel_opt_readme.md`.
+   read/write/copy bandwidth, and command line. Store it in the attempt ledger
+   and `kernel_opt_readme.md`.
 2. Make the smallest LightOp source change that can satisfy the current
   task-acceptance pair.
 3. Build LightOp with the target arch.
@@ -263,9 +298,15 @@ evidence.
   the performance device gate and pin the run with `HIP_VISIBLE_DEVICES`.
 5. Record every candidate result: correctness failure, build failure,
   regression, plateau, and improvement.
-6. Invoke `dcu-profiler-report` when benchmark evidence is not enough to choose
+6. For every correctness-passing optimization candidate, run the normal
-   the next edit.
+   synchronized benchmark first, then collect `hipprof --pmc` evidence for the
-7. If the first correctness-passing candidate misses the required performance
+   same representative target shape before making the next kernel or dispatch
+   edit. Capture cache-related counters, LDS/bank-conflict clues,
+   occupancy/resource signals, and the exact command output or unsupported-tool
+   reason. Use `dcu-profiler-report` to turn this evidence into the next edit.
+7. Invoke deeper `dcu-profiler-report` analysis when the per-candidate PMC
+   evidence and benchmark still do not explain the next edit.
+8. If the first correctness-passing candidate misses the required performance
   threshold, continue into profiling and tuning instead of declaring the task
   done.
@@ -282,15 +323,19 @@ evidence.
   regimes need different choices.
 4. Re-run correctness across all touched dtypes/layouts/modes.
 5. Re-run benchmark cases that define `W` and any nearby regression guards.
-6. Reject or revert a candidate lineage in the final chosen path when it fails
+6. Re-run per-candidate `hipprof --pmc` captures after each correctness-passing
+   optimization edit and summarize cache behavior, LDS/bank conflicts,
+   occupancy/resource pressure, and one profiler-backed next action before
+   starting the next edit.
+7. Reject or revert a candidate lineage in the final chosen path when it fails
   correctness, improves less than the noise/stability threshold, helps only
   non-target shapes, or lacks required profile/resource/ISA evidence after a
   profiling gate. Record the rejected lineage instead of silently overwriting
   it.
-7. After reaching the target, run a final guard validation: targeted
+8. After reaching the target, run a final guard validation: targeted
   correctness, repeated target benchmark on the selected card, and nearby
   shape/dtype regression checks when relevant.
-8. Summarize final code paths, fallback behavior, unsupported regimes, and
+9. Summarize final code paths, fallback behavior, unsupported regimes, and
   remaining risks.
 ## Performance Target Discipline
@@ -304,6 +349,13 @@ that threshold as part of the acceptance contract.
 - For every correctness-passing candidate, record shape, dtype, layout/mode,
  kernel or dispatch configuration, measured bandwidth/latency, comparison
  baseline, and the reason it improved, regressed, or plateaued.
+- For every correctness-passing optimization candidate, record paired
+  `hipprof --pmc` evidence for the representative workload before choosing the
+  next edit. The digest must discuss cache behavior, memory/cache traffic,
+  LDS or bank-conflict evidence, occupancy/resource pressure, and exactly one
+  next LightOp kernel, launcher, dispatcher, config, or benchmark edit. If a
+  PMC counter or occupancy signal is unavailable on the installed DTK, record
+  the exact command and failure instead of guessing.
 - A performance improvement counts as effective only when it exceeds the
  benchmark noise/stability threshold defined in the plan. If the measured
  delta is inside the noise band, record it as inconclusive or plateau, not as
@@ -331,8 +383,7 @@ that threshold as part of the acceptance contract.
 - Do not claim that an optimization is effective from intuition, source
  inspection, or expected hardware behavior. Promotion to the optimization
  ledger requires measured correctness-passing benchmark data, baseline or
-  parent comparison, and, when a profiling gate applies, profiler/resource/ISA
+  parent comparison, and per-candidate PMC/profile/resource evidence.
-  evidence.
 - If the target remains unmet after the required tuning lineages, summarize the
  best candidate, failed lineages, profiler bottleneck class, unsupported
  regimes, and the most likely next engineering investment. Do not present the
@@ -368,6 +419,8 @@ Use this shape:
 - Target workload:
 - Effective-bandwidth formula:
 - Device gate:
+- Device bandwidth calibration: read/write/copy/triad bandwidth, buffer size,
+  selected card, command:
 - Build/test/benchmark commands:
 - Baseline p50/p90/mean, variance/noise:
@@ -380,7 +433,8 @@ Use this shape:
 - Device status and HIP_VISIBLE_DEVICES:
 - Benchmark table: baseline/parent/candidate, p50/p90/mean, effective BW,
  delta, noise threshold:
- Profile/resource/ISA evidence:
+- Per-candidate PMC/profile/resource evidence: cache behavior, LDS/bank
+  conflicts, occupancy/resource pressure, unavailable counters:
 - Decision: keep | reject | inconclusive
 - Reason:
 - Next step:
@@ -431,11 +485,34 @@ capture `hy-smi` or `rocm-smi`, choose a low-utilization/low-VRAM card, and run
 the benchmark with `HIP_VISIBLE_DEVICES=<idle-card>`. Reuse the same selected
 card for baseline and candidate measurements.
+After every correctness-passing optimization candidate, collect PMC evidence
+for the same representative benchmark shape before making the next edit. Use
+the same selected card, store the artifacts under
+`.humanize/lightop-agent/profile-artifacts/<version>/`, and run supported
+variants such as:
+```bash
+HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc --pmc-type 3 \
+  -o .humanize/lightop-agent/profile-artifacts/<version>/hipprof-pmc-all/pmc \
+  python test/<family>/benchmark_<op>.py
+HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc-read --pmc-type 3 \
+  -o .humanize/lightop-agent/profile-artifacts/<version>/hipprof-pmc-all/pmc-read \
+  python test/<family>/benchmark_<op>.py
+HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc-write --pmc-type 3 \
+  -o .humanize/lightop-agent/profile-artifacts/<version>/hipprof-pmc-all/pmc-write \
+  python test/<family>/benchmark_<op>.py
+```
+Use the PMC digest to inspect cache behavior, memory/cache traffic, LDS or bank
+conflicts, occupancy/resource pressure, and the next concrete edit. If a
+counter, `--pmc` mode, or occupancy signal is unsupported in the installed DTK,
+record the attempted command and error output in the artifact directory.
 Do not claim success from a passing build alone. A LightOp operator change is
 complete only after install, import smoke, targeted correctness, benchmark
-comparison, and, when the result is near the threshold or surprising, profiler
+comparison, and the required per-candidate PMC/profile evidence. Do not claim
-evidence. Do not claim speedups from Python wall-clock timing unless
+speedups from Python wall-clock timing unless asynchronous DCU work is
-asynchronous DCU work is synchronized.
+synchronized.
 Use the same benchmark command, selected card, workload shape(s), and effective
 bandwidth formula for baseline and candidate comparisons. If the workload
@@ -443,8 +520,14 @@ contract changes, start a new baseline matrix and explain why.
 ## Profiling
-Invoke `dcu-profiler-report` autonomously when profiler evidence is the best
+Invoke `dcu-profiler-report` autonomously after every correctness-passing
-next source of truth. These are heuristics, not user-facing gates:
+optimization candidate to interpret the mandatory `hipprof --pmc` capture and
+produce the next edit. The digest may be concise, but it must explain cache
+behavior, LDS/bank-conflict clues, occupancy/resource pressure, and exactly one
+profiler-backed next action.
+Escalate from the mandatory per-candidate PMC pass to deeper profiling when any
+of these hold:
 - Baseline benchmark has passed and no profile digest exists.
 - A correct candidate is within +/-2% of baseline or the prior best.
@@ -492,6 +575,10 @@ schema. Include acceptance criteria for:
 - Device gate: `hy-smi` or `rocm-smi` command, idle-card selection criteria,
  required `HIP_VISIBLE_DEVICES=<idle-card>` prefix for benchmark/profile, and
  where the card-state output is stored.
+- Device bandwidth calibration before the first optimization edit: exact
+  command, selected card, buffer size, dtype, measured read/write/copy/triad
+  bandwidth, artifact path, and how the result constrains or contextualizes the
+  target effective-bandwidth threshold.
 - Correctness coverage for `W`, edge cases, dtype/layout/mode boundaries, and
  baseline/reference parity.
 - Build command, ROCm/DTK/PyTorch versions, `PYTORCH_ROCM_ARCH`, and device
@@ -499,8 +586,14 @@ schema. Include acceptance criteria for:
 - Benchmark method with warmup, repeats, synchronization, per-shape timing,
  p50/p90 or mean as appropriate, variance/noise band, minimum effective delta,
  and environment metadata.
+- Per-candidate `hipprof --pmc` capture after every correctness-passing
+  optimization edit, including artifact path, selected card, representative
+  shape, cache counters or unavailable-counter reason, LDS/bank-conflict
+  evidence, occupancy/resource interpretation, and the profiler-backed next
+  edit.
 - Baseline matrix required before the first optimization edit, including card
-  status, repeated timing, effective bandwidth, and noise/stability estimate.
+  status, device bandwidth calibration, repeated timing, effective bandwidth,
+  and noise/stability estimate.
 - Iteration discipline: one primary optimization hypothesis per lineage, plus
  explicit keep/reject/inconclusive decision rules.
 - Research digest covering local LightOp patterns and any upstream/source

--- a/humanize/skills/ncu-report/SKILL.md
+++ b/humanize/skills/ncu-report/SKILL.md
@@ -38,6 +38,9 @@ the next edit.
 Invoke `dcu-profiler-report` when any of these hold:
+- A correctness-passing LightOp optimization candidate has just been benchmarked
+  and the loop needs the mandatory per-candidate `hipprof --pmc` interpretation
+  before the next edit.
 - A LightOp baseline benchmark has passed and no baseline profile digest exists.
 - A correct candidate is within +/-2% of baseline or the prior best.
 - The second correctness-passing optimization attempt improves less than 5%
@@ -66,17 +69,23 @@ same host/container environment that will run the workload.
 - Store the status output and the selected card in the artifact directory and
  digest. If the tools are unavailable, record the failed command and treat the
  resulting performance data as non-final unless the user accepts that caveat.
+- Before the first optimization profile or baseline comparison, include the
+  selected-card device bandwidth calibration from
+  `.humanize/lightop-agent/device-bandwidth.txt`. If it is missing, run the
+  calibration in the same environment and record actual read, write,
+  copy/read-write, and triad bandwidth before interpreting operator bandwidth.
 ## Required Artifacts
 Store artifacts under the LightOp loop state or user-specified evidence path:
 ```text
+.humanize/lightop-agent/device-bandwidth.txt
 .humanize/lightop-agent/profile-artifacts/<version>/
  device-status.txt           # hy-smi or rocm-smi before benchmark/profile
  benchmark.log
  hipprof.txt
-  hipprof-pmc-all/            # mandatory for deep-analysis gates
+  hipprof-pmc-all/            # mandatory for each correctness-passing optimization candidate
  sqtt-json/                  # mandatory when SQTT is available
  rocprof.csv                 # when rocprof/rocprofv3 is used
  rocprof-stats.csv           # when available
@@ -98,21 +107,30 @@ the digest.
   shape, fixed seed, and explicit synchronization around timed regions.
 3. Run the device selection gate and pin the selected idle card with
   `HIP_VISIBLE_DEVICES`.
-4. Capture normal benchmark output before profiling, so profiler overhead does
+4. Before the first optimization edit or baseline profile, run or cite the
+   selected-card device bandwidth calibration and compare the operator's
+   effective-bandwidth target against the measured read/write/copy ceiling.
+5. Capture normal benchmark output before profiling, so profiler overhead does
   not become the performance claim.
-5. Run `hipprof` for first-pass API/kernel/memcpy timing.
+6. Run `hipprof` for first-pass API/kernel/memcpy timing.
-6. For deep-analysis gates, collect `hipprof` PMC all, SQTT JSON when
+7. For every correctness-passing optimization candidate, collect `hipprof`
+   PMC evidence with supported variants such as `--pmc`, `--pmc-read`, and
+   `--pmc-write`. Interpret cache behavior, memory/cache traffic, LDS or
+   bank-conflict evidence, and occupancy/resource pressure before choosing the
+   next edit. If a counter group or occupancy signal is unavailable, record the
+   exact attempted command and error output.
+8. For deep-analysis gates, add SQTT JSON when
   supported, `dccobjdump` disassembly, code-object resource usage, and
   LDS/register/occupancy evidence before choosing the next edit.
-7. If `hipprof` shows one or a few dominant kernels but does not explain why,
+9. If `hipprof` shows one or a few dominant kernels but does not explain why,
   collect deeper counters with the installed ROCm/DTK profiler.
-8. If the issue looks codegen-sensitive, inspect AMDGPU ISA or code-object
+10. If the issue looks codegen-sensitive, inspect AMDGPU ISA or code-object
   metadata before choosing an edit.
-9. Compare candidate against baseline or parent, not only absolute metrics.
+11. Compare candidate against baseline or parent, not only absolute metrics.
-10. Diagnose using the metric groups in [metrics.md](references/metrics.md).
+12. Diagnose using the metric groups in [metrics.md](references/metrics.md).
-11. Write `digest.md` using the template below. The final section must contain
+13. Write `digest.md` using the template below. The final section must contain
   exactly one next edit.
-12. Update the LightOp loop ledger with the digest path, bottleneck class, and
+14. Update the LightOp loop ledger with the digest path, bottleneck class, and
    selected next edit.
 ## Common Commands
@@ -122,6 +140,31 @@ Load [examples.md](references/examples.md) for copyable command variants.
 Minimal first-pass capture from the LightOp root:
 ```bash
+mkdir -p .humanize/lightop-agent
+HIP_VISIBLE_DEVICES=<idle-card> python - <<'PY' 2>&1 | tee .humanize/lightop-agent/device-bandwidth.txt
+import time, torch
+torch.cuda.init()
+free, total = torch.cuda.mem_get_info()
+bytes_per_buf = max(16 << 20, min(512 << 20, int(free // 5)))
+n = bytes_per_buf // 4
+a = torch.empty(n, device="cuda", dtype=torch.float32)
+b = torch.empty_like(a)
+c = torch.empty_like(a)
+a.fill_(1.0); b.fill_(2.0); c.zero_(); torch.cuda.synchronize()
+def bench(name, fn, bytes_moved, iters=80, warmup=20):
+    for _ in range(warmup): fn()
+    torch.cuda.synchronize()
+    t0 = time.perf_counter()
+    for _ in range(iters): fn()
+    torch.cuda.synchronize()
+    dt = (time.perf_counter() - t0) / iters
+    print(f"{name}: {bytes_moved / dt / 1e12:.3f} TB/s ({dt * 1e6:.2f} us, bytes={bytes_moved})")
+bench("write_fill", lambda: a.fill_(3.0), n * 4)
+bench("copy_read_write", lambda: c.copy_(a), n * 4 * 2)
+bench("triad_2read_1write", lambda: torch.add(a, b, out=c), n * 4 * 3)
+bench("read_reduce", lambda: torch.sum(a), n * 4)
+print("buffer_bytes:", n * 4, "total_mem:", total, "free_mem_at_start:", free)
+PY
 mkdir -p .humanize/lightop-agent/profile-artifacts/v000_baseline
 hy-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/device-status.txt || \
  rocm-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/device-status.txt
@@ -137,6 +180,16 @@ hy-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v001_candidate/devic
  rocm-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v001_candidate/device-status.txt
 HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/<family>/benchmark_<op>.py 2>&1 \
  | tee .humanize/lightop-agent/profile-artifacts/v001_candidate/hipprof.txt
+mkdir -p .humanize/lightop-agent/profile-artifacts/v001_candidate/hipprof-pmc-all
+HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc --pmc-type 3 \
+  -o .humanize/lightop-agent/profile-artifacts/v001_candidate/hipprof-pmc-all/pmc \
+  python test/<family>/benchmark_<op>.py
+HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc-read --pmc-type 3 \
+  -o .humanize/lightop-agent/profile-artifacts/v001_candidate/hipprof-pmc-all/pmc-read \
+  python test/<family>/benchmark_<op>.py
+HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc-write --pmc-type 3 \
+  -o .humanize/lightop-agent/profile-artifacts/v001_candidate/hipprof-pmc-all/pmc-write \
+  python test/<family>/benchmark_<op>.py
 ```
 Deep-analysis capture, required after a second correctness-passing