add prof 约束

f18f386a · whlwhlwhl · 91f343be · f18f386a · f18f386a
Commit f18f386a authored May 20, 2026 by whlwhlwhl
Showing with 151 additions and 12 deletions

humanize/skills/humanize-kernel-agent-loop/SKILL.md humanize/skills/humanize-kernel-agent-loop/SKILL.md +64 -1

humanize/skills/ncu-report/SKILL.md humanize/skills/ncu-report/SKILL.md +87 -11

No files found.
--- a/humanize/skills/humanize-kernel-agent-loop/SKILL.md
+++ b/humanize/skills/humanize-kernel-agent-loop/SKILL.md
@@ -82,6 +82,9 @@ Record these before the first build:
 - Container name or image tag, or `direct-host` when not using Docker.
 - LightOp path from the command's point of view, not only the host path.
 - Visible device env such as `HIP_VISIBLE_DEVICES` or `HSA_VISIBLE_DEVICES`.
+- DCU status command and selected card before performance runs: `hy-smi` or
+  `rocm-smi`, the observed HCU utilization, VRAM use, and the
+  `HIP_VISIBLE_DEVICES=<idle-card>` value used for benchmark/profile commands.
 - `PYTORCH_ROCM_ARCH`, DTK/ROCm version, PyTorch version, HIP version, device
  name, and `gcnArchName`.
 - Exact build/install, import-smoke, correctness, benchmark, and profiler
@@ -174,6 +177,33 @@ PY
 hipcc --version
 ```
+## Performance Device Gate
+Before any benchmark or profiling command, check card status in the target
+execution environment and pin the run to an idle card.
+- Run `hy-smi` or `rocm-smi` immediately before the performance command.
+- Choose the card with low HCU utilization and low VRAM occupancy. If no card
+  is idle enough for stable results, delay the performance run or record that
+  the result is noisy and not acceptable as final evidence.
+- Prefix benchmark and profiler commands with `HIP_VISIBLE_DEVICES=<card>`.
+  Keep the same card for the paired baseline/candidate comparison unless the
+  comparison is deliberately measuring cross-device behavior.
+- Record the chosen card, HCU utilization, VRAM use, and exact command in the
+  attempt ledger or profile artifact directory.
+Example:
+```bash
+hy-smi || rocm-smi
+HIP_VISIBLE_DEVICES=<idle-card> python test/<family>/benchmark_<op>.py
+HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/<family>/benchmark_<op>.py
+```
+Do not report a performance number as actionable unless the device-selection
+gate was recorded or the user explicitly accepts the missing card-state
+evidence.
 ## Workflow
 ### Stage 1: Inspect And Plan
@@ -193,7 +223,8 @@ hipcc --version
 1. Make the smallest LightOp source change that can satisfy the current
   task-acceptance pair.
 2. Build LightOp with the target arch.
-3. Run the targeted correctness test and then the benchmark for `W`.
+3. Run the targeted correctness test. Before the benchmark for `W`, execute
+   the performance device gate and pin the run with `HIP_VISIBLE_DEVICES`.
 4. Record every candidate result: correctness failure, build failure,
   regression, plateau, and improvement.
 5. Invoke `dcu-profiler-report` when benchmark evidence is not enough to choose
@@ -234,9 +265,20 @@ that threshold as part of the acceptance contract.
    layernorm/rmsnorm/fused-norm patterns, relevant ROCm/DCU upstream evidence,
    and any portable reduction/vectorization ideas from the bundled corpus.
  - A `dcu-profiler-report` digest for a representative target shape.
+- If the second correctness-passing optimization attempt improves less than 5%
+  over the relevant parent or baseline, run a deep `dcu-profiler-report`
+  analysis before the next optimization edit. That analysis must include
+  `hipprof` PMC all mode, SQTT JSON when available, `dccobjdump`
+  disassembly, code-object resource usage, and explicit LDS/register/occupancy
+  evidence or a recorded reason any item was unavailable.
 - The next edit after that gate must name exactly one concrete LightOp kernel,
  binding, dispatcher, config, or benchmark change and cite the knowledge and
  profiler evidence that motivated it.
+- Do not claim that an optimization is effective from intuition, source
+  inspection, or expected hardware behavior. Promotion to the optimization
+  ledger requires measured correctness-passing benchmark data, baseline or
+  parent comparison, and, when a profiling gate applies, profiler/resource/ISA
+  evidence.
 - If the target remains unmet after the required tuning lineages, summarize the
  best candidate, failed lineages, profiler bottleneck class, unsupported
  regimes, and the most likely next engineering investment. Do not present the
@@ -300,6 +342,11 @@ baseline and threshold. If no benchmark exists, add a small benchmark that uses
 warmup, fixed shapes, fixed seeds, and explicit `torch.cuda.synchronize()`
 around timed regions.
+Before every benchmark, run the performance device gate from this skill:
+capture `hy-smi` or `rocm-smi`, choose a low-utilization/low-VRAM card, and run
+the benchmark with `HIP_VISIBLE_DEVICES=<idle-card>`. Reuse the same selected
+card for baseline and candidate measurements.
 Do not claim success from a passing build alone. A LightOp operator change is
 complete only after install, import smoke, targeted correctness, benchmark
 comparison, and, when the result is near the threshold or surprising, profiler
@@ -320,6 +367,9 @@ next source of truth. These are heuristics, not user-facing gates:
 - Two consecutive correctness-passing candidates miss the target, in which case
  pair this profile with a `lightop-kernel-knowledge` research pass before the
  next kernel or dispatch edit.
+- The second correctness-passing optimization attempt improves less than 5%
+  over its parent or baseline. This is a mandatory deep-analysis gate, not a
+  heuristic.
 - A candidate is much faster than expected and needs explanation.
 - A reviewer asks for profiling evidence.
@@ -327,6 +377,12 @@ Persist profile artifacts under `.humanize/lightop-agent/profile-artifacts/`
 or the user-specified evidence directory. Each digest must end with exactly one
 concrete next kernel edit or a clear reason profiling is not actionable.
+When the <5% second-optimization gate fires, the digest must include
+`hipprof` PMC all, SQTT JSON when supported, `dccobjdump` disassembly,
+code-object resource usage, and LDS/register/occupancy evidence. If any tool is
+missing, record the exact command attempted and do not replace the missing
+evidence with a guess.
 ## Plan Requirements
 Write `.humanize/lightop-agent/refined-plan.md` using the Humanize gen-plan
@@ -339,6 +395,9 @@ schema. Include acceptance criteria for:
  the execution environment, visible device selection, install command, smoke
  command, correctness command, benchmark command, profiler command, and pass
  threshold.
+- Device gate: `hy-smi` or `rocm-smi` command, idle-card selection criteria,
+  required `HIP_VISIBLE_DEVICES=<idle-card>` prefix for benchmark/profile, and
+  where the card-state output is stored.
 - Correctness coverage for `W`, edge cases, dtype/layout/mode boundaries, and
  baseline/reference parity.
 - Build command, ROCm/DTK/PyTorch versions, `PYTORCH_ROCM_ARCH`, and device
@@ -353,6 +412,10 @@ schema. Include acceptance criteria for:
  consecutive correctness-passing misses trigger both `lightop-kernel-knowledge`
  research and `dcu-profiler-report` evidence before the next edit, and unmet
  targets cannot be reported as complete.
+- Low-gain discipline: if the second correctness-passing optimization improves
+  less than 5%, the next edit is blocked on deep profiling evidence:
+  `hipprof` PMC all, SQTT JSON if available, `dccobjdump`, code-object
+  resource usage, and LDS/register/occupancy analysis.
 - Tuning decisions and dispatcher/config updates when `W` has multiple
  regimes.
 - Final correctness matrix, benchmark matrix, fallback paths, unsupported

--- a/humanize/skills/ncu-report/SKILL.md
+++ b/humanize/skills/ncu-report/SKILL.md
@@ -40,6 +40,9 @@ Invoke `dcu-profiler-report` when any of these hold:
 - A LightOp baseline benchmark has passed and no baseline profile digest exists.
 - A correct candidate is within +/-2% of baseline or the prior best.
+- The second correctness-passing optimization attempt improves less than 5%
+  over its parent or baseline. This requires a deep analysis pass before the
+  next optimization edit.
 - A correct candidate regresses on one or more important shapes.
 - A candidate is much faster than expected and needs explanation.
 - The next edit is unclear after benchmark results.
@@ -51,19 +54,36 @@ Invoke `dcu-profiler-report` when any of these hold:
 Do not profile while correctness is failing unless the failure is itself a
 profiler collection problem. Fix correctness first.
+## Device Selection Gate
+Before any benchmark or profiling capture, record current card state in the
+same host/container environment that will run the workload.
+- Run `hy-smi` or `rocm-smi`.
+- Pick a card with low HCU utilization and low VRAM occupancy.
+- Run benchmark and profiling commands with `HIP_VISIBLE_DEVICES=<idle-card>`.
+- Keep the same selected card for paired baseline/candidate captures.
+- Store the status output and the selected card in the artifact directory and
+  digest. If the tools are unavailable, record the failed command and treat the
+  resulting performance data as non-final unless the user accepts that caveat.
 ## Required Artifacts
 Store artifacts under the LightOp loop state or user-specified evidence path:
 ```text
 .humanize/lightop-agent/profile-artifacts/<version>/
+  device-status.txt           # hy-smi or rocm-smi before benchmark/profile
  benchmark.log
  hipprof.txt
+  hipprof-pmc-all/            # mandatory for deep-analysis gates
+  sqtt-json/                  # mandatory when SQTT is available
  rocprof.csv                 # when rocprof/rocprofv3 is used
  rocprof-stats.csv           # when available
  rocprof-compute/            # when available
  code-object-metadata.txt    # when extracted
  amdgpu-isa.txt              # when extracted
+  resource-usage.txt          # VGPR/SGPR/LDS/occupancy/code-object evidence
  digest.md
 ```
@@ -76,18 +96,23 @@ the digest.
   exposes the regression, plateau, launch overhead, or suspected bottleneck.
 2. Build a focused benchmark harness with stable warmup, fixed dtype, fixed
   shape, fixed seed, and explicit synchronization around timed regions.
-3. Capture normal benchmark output before profiling, so profiler overhead does
+3. Run the device selection gate and pin the selected idle card with
+   `HIP_VISIBLE_DEVICES`.
+4. Capture normal benchmark output before profiling, so profiler overhead does
   not become the performance claim.
-4. Run `hipprof` for first-pass API/kernel/memcpy timing.
+5. Run `hipprof` for first-pass API/kernel/memcpy timing.
-5. If `hipprof` shows one or a few dominant kernels but does not explain why,
+6. For deep-analysis gates, collect `hipprof` PMC all, SQTT JSON when
+   supported, `dccobjdump` disassembly, code-object resource usage, and
+   LDS/register/occupancy evidence before choosing the next edit.
+7. If `hipprof` shows one or a few dominant kernels but does not explain why,
   collect deeper counters with the installed ROCm/DTK profiler.
-6. If the issue looks codegen-sensitive, inspect AMDGPU ISA or code-object
+8. If the issue looks codegen-sensitive, inspect AMDGPU ISA or code-object
   metadata before choosing an edit.
-7. Compare candidate against baseline or parent, not only absolute metrics.
+9. Compare candidate against baseline or parent, not only absolute metrics.
-8. Diagnose using the metric groups in [metrics.md](references/metrics.md).
+10. Diagnose using the metric groups in [metrics.md](references/metrics.md).
-9. Write `digest.md` using the template below. The final section must contain
+11. Write `digest.md` using the template below. The final section must contain
   exactly one next edit.
-10. Update the LightOp loop ledger with the digest path, bottleneck class, and
+12. Update the LightOp loop ledger with the digest path, bottleneck class, and
    selected next edit.
 ## Common Commands
@@ -98,18 +123,52 @@ Minimal first-pass capture from the LightOp root:
 ```bash
 mkdir -p .humanize/lightop-agent/profile-artifacts/v000_baseline
-python test/test_<op>.py 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/benchmark.log
+hy-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/device-status.txt || \
-hipprof python test/test_<op>.py 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/hipprof.txt
+  rocm-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/device-status.txt
+HIP_VISIBLE_DEVICES=<idle-card> python test/test_<op>.py 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/benchmark.log
+HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/test_<op>.py 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/hipprof.txt
 ```
 For a benchmark script:
 ```bash
 mkdir -p .humanize/lightop-agent/profile-artifacts/v001_candidate
-hipprof python test/<family>/benchmark_<op>.py 2>&1 \
+hy-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v001_candidate/device-status.txt || \
+  rocm-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v001_candidate/device-status.txt
+HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/<family>/benchmark_<op>.py 2>&1 \
  | tee .humanize/lightop-agent/profile-artifacts/v001_candidate/hipprof.txt
 ```
+Deep-analysis capture, required after a second correctness-passing
+optimization improves less than 5%:
+```bash
+mkdir -p .humanize/lightop-agent/profile-artifacts/v002_deep/{hipprof-pmc-all,sqtt-json}
+hy-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v002_deep/device-status.txt || \
+  rocm-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v002_deep/device-status.txt
+HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc --pmc-type 3 \
+  -o .humanize/lightop-agent/profile-artifacts/v002_deep/hipprof-pmc-all/pmc \
+  python test/<family>/benchmark_<op>.py
+HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc-read --pmc-type 3 \
+  -o .humanize/lightop-agent/profile-artifacts/v002_deep/hipprof-pmc-all/pmc-read \
+  python test/<family>/benchmark_<op>.py
+HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc-write --pmc-type 3 \
+  -o .humanize/lightop-agent/profile-artifacts/v002_deep/hipprof-pmc-all/pmc-write \
+  python test/<family>/benchmark_<op>.py
+HIP_VISIBLE_DEVICES=<idle-card> hipprof --sqtt --sqtt-type 1 --output-type 0 \
+  -d .humanize/lightop-agent/profile-artifacts/v002_deep/sqtt-json/ \
+  python test/<family>/benchmark_<op>.py
+dccobjdump --inputs=<binary-or-so> --show-sass --show-instruction-encoding \
+  --separate-functions > .humanize/lightop-agent/profile-artifacts/v002_deep/amdgpu-isa.txt
+hipprof --codeobj-analyze <binary-or-so> \
+  > .humanize/lightop-agent/profile-artifacts/v002_deep/resource-usage.txt
+```
+Adjust `<binary-or-so>` to the compiled extension, code object, or extracted
+kernel binary produced by the actual build. If a command is unsupported by the
+installed DTK, keep the command output or error in the artifact directory and
+state the missing evidence in the digest.
 Optional helper:
 ```bash
@@ -131,6 +190,8 @@ full list and interpretation rules:
 - Runtime identity: LightOp commit, operator, public API, shape, dtype, gfx
  arch, DTK/ROCm/PyTorch version, build command.
+- Device selection: `hy-smi` or `rocm-smi` output, selected
+  `HIP_VISIBLE_DEVICES`, HCU utilization, and VRAM occupancy.
 - Benchmark timing: warmup, repeats, p50/p90/mean, synchronization points,
  variance, baseline and candidate deltas.
 - `hipprof` timing: total HIP API time, kernel time, memcpy/memset time,
@@ -149,6 +210,11 @@ full list and interpretation rules:
 - ISA/code object: hot instruction window, vector width, MFMA selection,
  excessive scalarization, scratch loads/stores.
+For the mandatory <5% second-optimization gate, the digest must include all of
+these sections: `hipprof` PMC all, SQTT JSON or the unavailable-tool reason,
+`dccobjdump` disassembly, code-object resource usage, and explicit
+LDS/register/occupancy interpretation.
 Validate exact tool availability on the target machine:
 ```bash
@@ -168,6 +234,9 @@ Environment
 - LightOp root:
 - Repo commit:
 - GPU / gfx:
+- Device status command:
+- Selected HIP_VISIBLE_DEVICES:
+- HCU utilization / VRAM before run:
 - DTK / ROCm / PyTorch:
 - Build command:
 - Benchmark command:
@@ -190,6 +259,13 @@ Evidence
 Profiler Hotspots
 - <kernel/API/copy/config branch>: <measured signal> -> <meaning>
+Deep Analysis Artifacts
+- hipprof PMC all:
+- SQTT JSON:
+- dccobjdump:
+- code-object resource usage:
+- LDS/register/occupancy:
 Counter / ISA Analysis
 - Counter source:
 - Hot instruction window: