完善测试脚本的编写

73900188 · whlwhlwhl · 0fec721c · 73900188 · 73900188
Commit 73900188 authored May 21, 2026 by whlwhlwhl
Show whitespace changes
Inline Side-by-side

Showing with 51 additions and 9 deletions

docs/lightop-skills.zh-CN.md docs/lightop-skills.zh-CN.md +15 -3

humanize/skills/humanize-kernel-agent-loop/SKILL.md humanize/skills/humanize-kernel-agent-loop/SKILL.md +36 -6

No files found.
--- a/docs/lightop-skills.zh-CN.md
+++ b/docs/lightop-skills.zh-CN.md
@@ -118,7 +118,9 @@ lightop/config*.py
 - 只有算子需要对用户公开时，才改 `lightop/__init__.py`。
 - 只有新增 `csrc/<family>` 且 `setup.py` glob 覆盖不到时，才改 `setup.py`。
 - 需要 shape/gfx 特化时，才改 `lightop/config*.py` 或 dispatch 表。
- 添加聚焦的正确性测试和目标 workload 的 benchmark。
+- 最终测试脚本必须放在 `test/` 下，命名为 `test_<算子名>.py`。不要只把
+  `final_test.py`、`bench_baseline.py` 等脚本放在 `.humanize/lightop-agent/` 里；
+  `.humanize` 只用于过程记录和临时证据，不是 LightOp 最终测试入口。
 ### 已有算子优化 Checklist
@@ -132,6 +134,8 @@ lightop/config*.py
 - 不修改无关 operator family。
 - 优先只改目标算子的 kernel、launcher、必要 config、聚焦 test、benchmark。
 - 直接在指定环境里 install、test、benchmark、profile、tune。
+- 已有算子优化任务里，用户只需要指定目标算子和测试文件；如果用户给了测试文件，就以
+  该文件作为最终验证文件。如果没有给，就推断或创建 `test/test_<算子名>.py`。
 - 如果某条优化线回退或只对非目标 shape 有帮助，要记录 reject 原因。
 ### DCU/ROCm 默认规则
@@ -306,8 +310,16 @@ cd test
 python test_<op>.py
 ```
-如果没有 benchmark，就添加小 benchmark，必须包含 warmup、固定 shape、固定 seed、
+最终验证文件必须在 `test/` 下。新增算子必须新增 `test/test_<算子名>.py`；优化已有算子
-计时前后显式 `torch.cuda.synchronize()`。
+时使用用户指定的测试文件，没有指定时再推断或创建 `test/test_<算子名>.py`。
+测试脚本必须先做精度验证，再做性能测试。精度基准可以是 PyTorch、Triton、已有
+LightOp 组合路径，或者用户提示里指定的 oracle。性能测试必须包含 10 轮 warmup 和
+100 轮 timed iterations，计时前后显式 `torch.cuda.synchronize()`，最终报告平均时间，
+单位是 us，同时包含有效带宽。
+最终答复和 `kernel_opt_readme.md` 里要用简短 Markdown 表格呈现最终验证信息，至少包含：
+测试文件、reference、shape、dtype、精度结果、平均耗时 us、有效带宽、选卡、实际命令。
 每个正确性通过的优化 candidate 都要在 benchmark 后跑 PMC：

--- a/humanize/skills/humanize-kernel-agent-loop/SKILL.md
+++ b/humanize/skills/humanize-kernel-agent-loop/SKILL.md
@@ -154,7 +154,11 @@ Typical add-operator checklist:
  explicitly asks for legacy metadata maintenance.
 - If performance depends on shape/gfx-specific choices, update or add the
  relevant config/dispatcher table under `lightop/config*.py`.
- Add focused correctness tests under `test/` and benchmark coverage for `W`.
+- Add the final focused correctness/performance test script under `test/`,
+  named `test_<operator_name>.py` after the public operator or target kernel
+  name. Do not leave final validation scripts only under
+  `.humanize/lightop-agent/`; that directory is for records and temporary
+  evidence, not the LightOp test surface.
 Optimization-only checklist:
@@ -173,6 +177,10 @@ Optimization-only checklist:
  target.
 - Keep edits scoped to the operator family, binding, tests, benchmark, and
  tuning table needed for the target task.
+- For an existing-operator optimization task, the user only needs to specify
+  the target operator and test file. Use that test file as the required final
+  validation file. If the user does not specify one, infer or create
+  `test/test_<operator_name>.py`.
 - Record rejected lineages when an optimization regresses or only helps a
  non-target shape.
@@ -472,10 +480,24 @@ cd test
 python test_<op>.py
 ```
-Then run the relevant benchmark script and compare it against the named
+Final validation for a new operator must live in `test/test_<operator_name>.py`.
-baseline and threshold. If no benchmark exists, add a small benchmark that uses
+For an optimization task, use the user-specified test file, or infer/create
-warmup, fixed shapes, fixed seeds, and explicit `torch.cuda.synchronize()`
+`test/test_<operator_name>.py` when none is given. The final test file must
-around timed regions.
+first run numerical correctness against the chosen reference, then run a
+performance section with 10 warmup iterations and 100 timed iterations, using
+explicit `torch.cuda.synchronize()` around timed regions. Report mean time in
+microseconds and effective bandwidth for the target workload.
+The correctness reference must be PyTorch, Triton, an existing LightOp
+composition, or the exact user-provided oracle. The script may contain its own
+small benchmark harness; do not treat `.humanize/lightop-agent/final_test.py`,
+`.humanize/lightop-agent/bench_baseline.py`, or other record-only scripts as
+the final LightOp validation surface.
+The final answer and `kernel_opt_readme.md` must show the final validation in a
+concise Markdown table with at least: test file, reference, shape, dtype,
+correctness result, mean time in us, effective bandwidth, selected card, and
+command.
 Before every benchmark, run the performance device gate from this skill:
 capture `hy-smi` or `rocm-smi`, choose a low-utilization/low-VRAM card, and run
@@ -583,6 +605,13 @@ schema. Include acceptance criteria for:
 - Benchmark method with warmup, repeats, synchronization, per-shape timing,
  p50/p90 or mean as appropriate, variance/noise band, minimum effective delta,
  and environment metadata.
+- Required final test file policy: new operators must add
+  `test/test_<operator_name>.py`; optimization tasks must use the
+  user-specified test file or infer/create `test/test_<operator_name>.py`.
+  The file must run correctness first, then performance with 10 warmup and 100
+  timed iterations, report average time in us and effective bandwidth, and use
+  PyTorch, Triton, existing LightOp composition, or the user-provided oracle as
+  baseline/reference.
 - Per-candidate `hipprof --pmc` capture after every correctness-passing
  optimization edit, including artifact path, selected card, representative
  shape, cache counters or unavailable-counter reason, LDS/bank-conflict
@@ -610,7 +639,8 @@ schema. Include acceptance criteria for:
 - Tuning decisions and dispatcher/config updates when `W` has multiple
  regimes.
 - Final correctness matrix, benchmark matrix, fallback paths, unsupported
-  regimes, final target-hit guard validation, and residual risk.
+  regimes, final target-hit guard validation from the required `test/` file,
+  concise final result table, and residual risk.
 ## RLCR Startup