Skip to content
GitLab
Menu
Projects
Groups
Snippets
Loading...
Help
Help
Support
Community forum
Keyboard shortcuts
?
Submit feedback
Contribute to GitLab
Sign in
Toggle navigation
Menu
Open sidebar
whlwhlwhl
Lightop-SKIILS
Commits
bb293875
Commit
bb293875
authored
May 20, 2026
by
whlwhlwhl
Browse files
添加验证约束
parent
b36a10ea
Changes
1
Show whitespace changes
Inline
Side-by-side
Showing
1 changed file
with
88 additions
and
13 deletions
+88
-13
humanize/skills/humanize-kernel-agent-loop/SKILL.md
humanize/skills/humanize-kernel-agent-loop/SKILL.md
+88
-13
No files found.
humanize/skills/humanize-kernel-agent-loop/SKILL.md
View file @
bb293875
...
...
@@ -23,7 +23,9 @@ K: operator semantics, input/output tensors, dtype/layout/mode constraints
R: correctness reference, usually PyTorch native, an existing LightOp path, or
a small literal oracle for edge cases
W: workload distribution, including target shapes, dtypes, model scenario,
gfx arch, and latency/throughput comparison target
gfx arch, target shapes, dtype/layout/contiguity, reduction/axis details,
epsilon or mode flags, effective-bandwidth formula, and
latency/throughput comparison target
E: execution environment, including Docker container/image or direct host,
LightOp path inside that environment, visible DCU device, DTK/ROCm/PyTorch
versions, build command, test command, benchmark command, and pass threshold
...
...
@@ -91,6 +93,9 @@ Record these before the first build:
commands.
-
Pass/fail threshold: numerical tolerance for correctness and the required
performance target versus baseline.
-
Benchmark stability rule: warmup/repeat counts, summary statistic
(p50/p90/mean), acceptable noise band, and the minimum delta needed to count
as an effective optimization.
Do not mix a host build with container tests unless the user explicitly wants
that split and both paths point at the same compiled extension.
...
...
@@ -219,30 +224,53 @@ evidence.
corpus. Use this whenever it can shape the first implementation route.
5.
Write a concise research digest in the loop state before the first serious
implementation lineage.
6.
Define the benchmark contract before editing code: exact target shape(s),
dtype/layout/contiguity, axis/mode/epsilon, effective-bandwidth formula,
warmup/repeat counts, selected summary statistic, noise band, and the
benchmark command that will be used for baseline and candidates.
### Stage 2: Implement And Verify
1.
Make the smallest LightOp source change that can satisfy the current
1.
Build a baseline matrix before the first optimization edit: run correctness,
then benchmark the target workload on the selected idle card for enough
repeats to report p50/p90 or mean, effective bandwidth, variance/noise, card
status, and command line. Store it in the attempt ledger and
`kernel_opt_readme.md`
.
2.
Make the smallest LightOp source change that can satisfy the current
task-acceptance pair.
2
.
Build LightOp with the target arch.
3
.
Run the targeted correctness test. Before the benchmark for
`W`
, execute
3
.
Build LightOp with the target arch.
4
.
Run the targeted correctness test. Before the benchmark for
`W`
, execute
the performance device gate and pin the run with
`HIP_VISIBLE_DEVICES`
.
4
.
Record every candidate result: correctness failure, build failure,
5
.
Record every candidate result: correctness failure, build failure,
regression, plateau, and improvement.
5
.
Invoke
`dcu-profiler-report`
when benchmark evidence is not enough to choose
6
.
Invoke
`dcu-profiler-report`
when benchmark evidence is not enough to choose
the next edit.
6
.
If the first correctness-passing candidate misses the required performance
7
.
If the first correctness-passing candidate misses the required performance
threshold, continue into profiling and tuning instead of declaring the task
done.
### Stage 3: Tune And Integrate
1.
Build a performance map over
`W`
.
2.
Add shape/gfx-aware dispatch or config-table entries only when measured
2.
Keep each optimization lineage focused on one primary hypothesis, such as
vectorized memory access, block/thread mapping, register pressure, LDS
layout, occupancy, launch configuration, epilogue fusion, or shape
specialization. Avoid mixing multiple unrelated techniques in one candidate
unless the code structure makes separation impossible, and record the
reason.
3.
Add shape/gfx-aware dispatch or config-table entries only when measured
regimes need different choices.
3.
Re-run correctness across all touched dtypes/layouts/modes.
4.
Re-run benchmark cases that define
`W`
and any nearby regression guards.
5.
Summarize final code paths, fallback behavior, unsupported regimes, and
4.
Re-run correctness across all touched dtypes/layouts/modes.
5.
Re-run benchmark cases that define
`W`
and any nearby regression guards.
6.
Reject or revert a candidate lineage in the final chosen path when it fails
correctness, improves less than the noise/stability threshold, helps only
non-target shapes, or lacks required profile/resource/ISA evidence after a
profiling gate. Record the rejected lineage instead of silently overwriting
it.
7.
After reaching the target, run a final guard validation: targeted
correctness, repeated target benchmark on the selected card, and nearby
shape/dtype regression checks when relevant.
8.
Summarize final code paths, fallback behavior, unsupported regimes, and
remaining risks.
## Performance Target Discipline
...
...
@@ -256,6 +284,10 @@ that threshold as part of the acceptance contract.
-
For every correctness-passing candidate, record shape, dtype, layout/mode,
kernel or dispatch configuration, measured bandwidth/latency, comparison
baseline, and the reason it improved, regressed, or plateaued.
-
A performance improvement counts as effective only when it exceeds the
benchmark noise/stability threshold defined in the plan. If the measured
delta is inside the noise band, record it as inconclusive or plateau, not as
an optimization win.
-
After the first correctness-passing candidate misses the target, run a
profiling-and-tuning loop before stopping.
-
Try at least three evidence-backed performance optimization lineages unless
...
...
@@ -294,6 +326,7 @@ Keep Humanize state local and untracked:
.humanize/lightop-agent/refined-plan.md
.humanize/lightop-agent/research-digest.md
.humanize/lightop-agent/attempt-ledger.md
.humanize/lightop-agent/kernel_opt_readme.md
.humanize/lightop-agent/optimization-ledger.md
.humanize/lightop-agent/lineage.jsonl
.humanize/lightop-agent/performance-map.json
...
...
@@ -304,6 +337,34 @@ Keep Humanize state local and untracked:
Before starting RLCR, make sure
`.humanize*`
is ignored. Do not commit loop
state unless the user explicitly asks for tracked evidence artifacts.
`kernel_opt_readme.md`
must be kept current after every benchmarked candidate.
Use this shape:
```
markdown
# Kernel Optimization Log
## Baseline Matrix
-
Target workload:
-
Effective-bandwidth formula:
-
Device gate:
-
Build/test/benchmark commands:
-
Baseline p50/p90/mean, variance/noise:
## Iteration <N>: <one primary hypothesis>
-
Hypothesis:
-
Files changed:
-
Code/config change:
-
Build command/result:
-
Correctness command/result:
-
Device status and HIP_VISIBLE_DEVICES:
-
Benchmark table: baseline/parent/candidate, p50/p90/mean, effective BW,
delta, noise threshold:
-
Profile/resource/ISA evidence:
-
Decision: keep | reject | inconclusive
-
Reason:
-
Next step:
```
## Build, Test, Benchmark
Build from the LightOp root inside the selected execution environment. Always
...
...
@@ -355,6 +416,10 @@ comparison, and, when the result is near the threshold or surprising, profiler
evidence. Do not claim speedups from Python wall-clock timing unless
asynchronous DCU work is synchronized.
Use the same benchmark command, selected card, workload shape(s), and effective
bandwidth formula for baseline and candidate comparisons. If the workload
contract changes, start a new baseline matrix and explain why.
## Profiling
Invoke
`dcu-profiler-report`
autonomously when profiler evidence is the best
...
...
@@ -393,6 +458,9 @@ schema. Include acceptance criteria for:
-
LightOp root, target operator family, public API, and modified files.
-
Explicit
`K`
,
`R`
,
`W`
, target gfx arch, baseline command, comparison target,
and hard scope exclusions.
-
Workload contract: target shape(s), dtype, layout/contiguity, axis/mode,
epsilon or other math flags, effective-bandwidth formula, and the exact
success metric.
-
Explicit
`E`
: Docker container/image or direct-host mode, LightOp path inside
the execution environment, visible device selection, install command, smoke
command, correctness command, benchmark command, profiler command, and pass
...
...
@@ -405,10 +473,17 @@ schema. Include acceptance criteria for:
-
Build command, ROCm/DTK/PyTorch versions,
`PYTORCH_ROCM_ARCH`
, and device
metadata.
-
Benchmark method with warmup, repeats, synchronization, per-shape timing,
p50/p90 or mean as appropriate, and environment metadata.
p50/p90 or mean as appropriate, variance/noise band, minimum effective delta,
and environment metadata.
-
Baseline matrix required before the first optimization edit, including card
status, repeated timing, effective bandwidth, and noise/stability estimate.
-
Iteration discipline: one primary optimization hypothesis per lineage, plus
explicit keep/reject/inconclusive decision rules.
-
Research digest covering local LightOp patterns and any upstream/source
evidence that materially changes the route.
-
Attempt ledger for every candidate.
-
`kernel_opt_readme.md`
update after every benchmarked candidate with the
fixed template.
-
Optimization ledger only for correct candidates with measured improvement.
-
Performance-target discipline: the first miss triggers tuning, two
consecutive correctness-passing misses trigger both
`lightop-kernel-knowledge`
...
...
@@ -421,7 +496,7 @@ schema. Include acceptance criteria for:
-
Tuning decisions and dispatcher/config updates when
`W`
has multiple
regimes.
-
Final correctness matrix, benchmark matrix, fallback paths, unsupported
regimes, and residual risk.
regimes,
final target-hit guard validation,
and residual risk.
## RLCR Startup
...
...
Write
Preview
Markdown
is supported
0%
Try again
or
attach a new file
.
Attach a file
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Cancel
Please
register
or
sign in
to comment