添加验证约束

bb293875 · whlwhlwhl · b36a10ea · bb293875
Commit bb293875 authored May 20, 2026 by whlwhlwhl
Show whitespace changes
Inline Side-by-side

Showing with 88 additions and 13 deletions

humanize/skills/humanize-kernel-agent-loop/SKILL.md humanize/skills/humanize-kernel-agent-loop/SKILL.md +88 -13

No files found.
--- a/humanize/skills/humanize-kernel-agent-loop/SKILL.md
+++ b/humanize/skills/humanize-kernel-agent-loop/SKILL.md
@@ -23,7 +23,9 @@ K: operator semantics, input/output tensors, dtype/layout/mode constraints
 R: correctness reference, usually PyTorch native, an existing LightOp path, or
   a small literal oracle for edge cases
 W: workload distribution, including target shapes, dtypes, model scenario,
-   gfx arch, and latency/throughput comparison target
+   gfx arch, target shapes, dtype/layout/contiguity, reduction/axis details,
+   epsilon or mode flags, effective-bandwidth formula, and
+   latency/throughput comparison target
 E: execution environment, including Docker container/image or direct host,
   LightOp path inside that environment, visible DCU device, DTK/ROCm/PyTorch
   versions, build command, test command, benchmark command, and pass threshold
@@ -91,6 +93,9 @@ Record these before the first build:
  commands.
 - Pass/fail threshold: numerical tolerance for correctness and the required
  performance target versus baseline.
+- Benchmark stability rule: warmup/repeat counts, summary statistic
+  (p50/p90/mean), acceptable noise band, and the minimum delta needed to count
+  as an effective optimization.

 Do not mix a host build with container tests unless the user explicitly wants
 that split and both paths point at the same compiled extension.
@@ -219,30 +224,53 @@ evidence.
   corpus. Use this whenever it can shape the first implementation route.
 5. Write a concise research digest in the loop state before the first serious
   implementation lineage.
+6. Define the benchmark contract before editing code: exact target shape(s),
+   dtype/layout/contiguity, axis/mode/epsilon, effective-bandwidth formula,
+   warmup/repeat counts, selected summary statistic, noise band, and the
+   benchmark command that will be used for baseline and candidates.

 ### Stage 2: Implement And Verify

-1. Make the smallest LightOp source change that can satisfy the current
+1. Build a baseline matrix before the first optimization edit: run correctness,
+   then benchmark the target workload on the selected idle card for enough
+   repeats to report p50/p90 or mean, effective bandwidth, variance/noise, card
+   status, and command line. Store it in the attempt ledger and
+   `kernel_opt_readme.md`.
+2. Make the smallest LightOp source change that can satisfy the current
   task-acceptance pair.
-2. Build LightOp with the target arch.
-3. Run the targeted correctness test. Before the benchmark for `W`, execute
+3. Build LightOp with the target arch.
+4. Run the targeted correctness test. Before the benchmark for `W`, execute
   the performance device gate and pin the run with `HIP_VISIBLE_DEVICES`.
-4. Record every candidate result: correctness failure, build failure,
+5. Record every candidate result: correctness failure, build failure,
   regression, plateau, and improvement.
-5. Invoke `dcu-profiler-report` when benchmark evidence is not enough to choose
+6. Invoke `dcu-profiler-report` when benchmark evidence is not enough to choose
   the next edit.
-6. If the first correctness-passing candidate misses the required performance
+7. If the first correctness-passing candidate misses the required performance
   threshold, continue into profiling and tuning instead of declaring the task
   done.

 ### Stage 3: Tune And Integrate

 1. Build a performance map over `W`.
-2. Add shape/gfx-aware dispatch or config-table entries only when measured
+2. Keep each optimization lineage focused on one primary hypothesis, such as
+   vectorized memory access, block/thread mapping, register pressure, LDS
+   layout, occupancy, launch configuration, epilogue fusion, or shape
+   specialization. Avoid mixing multiple unrelated techniques in one candidate
+   unless the code structure makes separation impossible, and record the
+   reason.
+3. Add shape/gfx-aware dispatch or config-table entries only when measured
   regimes need different choices.
-3. Re-run correctness across all touched dtypes/layouts/modes.
-4. Re-run benchmark cases that define `W` and any nearby regression guards.
-5. Summarize final code paths, fallback behavior, unsupported regimes, and
+4. Re-run correctness across all touched dtypes/layouts/modes.
+5. Re-run benchmark cases that define `W` and any nearby regression guards.
+6. Reject or revert a candidate lineage in the final chosen path when it fails
+   correctness, improves less than the noise/stability threshold, helps only
+   non-target shapes, or lacks required profile/resource/ISA evidence after a
+   profiling gate. Record the rejected lineage instead of silently overwriting
+   it.
+7. After reaching the target, run a final guard validation: targeted
+   correctness, repeated target benchmark on the selected card, and nearby
+   shape/dtype regression checks when relevant.
+8. Summarize final code paths, fallback behavior, unsupported regimes, and
   remaining risks.

 ## Performance Target Discipline
@@ -256,6 +284,10 @@ that threshold as part of the acceptance contract.
 - For every correctness-passing candidate, record shape, dtype, layout/mode,
  kernel or dispatch configuration, measured bandwidth/latency, comparison
  baseline, and the reason it improved, regressed, or plateaued.
+- A performance improvement counts as effective only when it exceeds the
+  benchmark noise/stability threshold defined in the plan. If the measured
+  delta is inside the noise band, record it as inconclusive or plateau, not as
+  an optimization win.
 - After the first correctness-passing candidate misses the target, run a
  profiling-and-tuning loop before stopping.
 - Try at least three evidence-backed performance optimization lineages unless
@@ -294,6 +326,7 @@ Keep Humanize state local and untracked:
 .humanize/lightop-agent/refined-plan.md
 .humanize/lightop-agent/research-digest.md
 .humanize/lightop-agent/attempt-ledger.md
+.humanize/lightop-agent/kernel_opt_readme.md
 .humanize/lightop-agent/optimization-ledger.md
 .humanize/lightop-agent/lineage.jsonl
 .humanize/lightop-agent/performance-map.json
@@ -304,6 +337,34 @@ Keep Humanize state local and untracked:
 Before starting RLCR, make sure `.humanize*` is ignored. Do not commit loop
 state unless the user explicitly asks for tracked evidence artifacts.

+`kernel_opt_readme.md` must be kept current after every benchmarked candidate.
+Use this shape:
+
+```markdown
+# Kernel Optimization Log
+
+## Baseline Matrix
+- Target workload:
+- Effective-bandwidth formula:
+- Device gate:
+- Build/test/benchmark commands:
+- Baseline p50/p90/mean, variance/noise:
+
+## Iteration <N>: <one primary hypothesis>
+- Hypothesis:
+- Files changed:
+- Code/config change:
+- Build command/result:
+- Correctness command/result:
+- Device status and HIP_VISIBLE_DEVICES:
+- Benchmark table: baseline/parent/candidate, p50/p90/mean, effective BW,
+  delta, noise threshold:
+- Profile/resource/ISA evidence:
+- Decision: keep | reject | inconclusive
+- Reason:
+- Next step:
+```
+
 ## Build, Test, Benchmark

 Build from the LightOp root inside the selected execution environment. Always
@@ -355,6 +416,10 @@ comparison, and, when the result is near the threshold or surprising, profiler
 evidence. Do not claim speedups from Python wall-clock timing unless
 asynchronous DCU work is synchronized.

+Use the same benchmark command, selected card, workload shape(s), and effective
+bandwidth formula for baseline and candidate comparisons. If the workload
+contract changes, start a new baseline matrix and explain why.
+
 ## Profiling

 Invoke `dcu-profiler-report` autonomously when profiler evidence is the best
@@ -393,6 +458,9 @@ schema. Include acceptance criteria for:
 - LightOp root, target operator family, public API, and modified files.
 - Explicit `K`, `R`, `W`, target gfx arch, baseline command, comparison target,
  and hard scope exclusions.
+- Workload contract: target shape(s), dtype, layout/contiguity, axis/mode,
+  epsilon or other math flags, effective-bandwidth formula, and the exact
+  success metric.
 - Explicit `E`: Docker container/image or direct-host mode, LightOp path inside
  the execution environment, visible device selection, install command, smoke
  command, correctness command, benchmark command, profiler command, and pass
@@ -405,10 +473,17 @@ schema. Include acceptance criteria for:
 - Build command, ROCm/DTK/PyTorch versions, `PYTORCH_ROCM_ARCH`, and device
  metadata.
 - Benchmark method with warmup, repeats, synchronization, per-shape timing,
-  p50/p90 or mean as appropriate, and environment metadata.
+  p50/p90 or mean as appropriate, variance/noise band, minimum effective delta,
+  and environment metadata.
+- Baseline matrix required before the first optimization edit, including card
+  status, repeated timing, effective bandwidth, and noise/stability estimate.
+- Iteration discipline: one primary optimization hypothesis per lineage, plus
+  explicit keep/reject/inconclusive decision rules.
 - Research digest covering local LightOp patterns and any upstream/source
  evidence that materially changes the route.
 - Attempt ledger for every candidate.
+- `kernel_opt_readme.md` update after every benchmarked candidate with the
+  fixed template.
 - Optimization ledger only for correct candidates with measured improvement.
 - Performance-target discipline: the first miss triggers tuning, two
  consecutive correctness-passing misses trigger both `lightop-kernel-knowledge`
@@ -421,7 +496,7 @@ schema. Include acceptance criteria for:
 - Tuning decisions and dispatcher/config updates when `W` has multiple
  regimes.
 - Final correctness matrix, benchmark matrix, fallback paths, unsupported
-  regimes, and residual risk.
+  regimes, final target-hit guard validation, and residual risk.

 ## RLCR Startup