SKILL.md 29.9 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
---
name: lightop-kernel-agent-loop
description: "Run an autonomous Humanize loop for adding or optimizing LightOp fused operators on DCU/ROCm: recover K/R/W, inspect LightOp bindings and tests, implement HIP/ROCm kernels or dispatch changes, build, test, benchmark, profile with dcu-profiler-report, record evidence, and start RLCR."
---

# LightOp Kernel Agent Loop

Use this flow when the user wants to add a new LightOp operator or optimize an
existing LightOp operator for DCU/ROCm. This is the LightOp-specific wrapper
around Humanize RLCR.

Do not use this flow for generic CUDA/NVIDIA kernel tuning. Prefer HIP C++,
ROCm PyTorch extensions, existing LightOp patterns, MIOpen/rocBLAS/hipBLASLt,
Composable Kernel, or Triton AMD only when the local repo already supports that
route or the user explicitly asks for it.

## Input Contract

Recover or define these before implementation:

```text
K: operator semantics, input/output tensors, dtype/layout/mode constraints
R: correctness reference, usually PyTorch native, an existing LightOp path, or
   a small literal oracle for edge cases
W: workload distribution, including target shapes, dtypes, model scenario,
whlwhlwhl's avatar
whlwhlwhl committed
26
27
28
   gfx arch, target shapes, dtype/layout/contiguity, reduction/axis details,
   epsilon or mode flags, effective-bandwidth formula, and
   latency/throughput comparison target
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
E: execution environment, including Docker container/image or direct host,
   LightOp path inside that environment, visible DCU device, DTK/ROCm/PyTorch
   versions, build command, test command, benchmark command, and pass threshold
```

Ask the user only when `K`, `R`, `W`, target DCU/gfx arch, comparison baseline,
execution environment, pass threshold, or a hard scope constraint is missing
and cannot be inferred safely from the LightOp repo. After those inputs exist,
the loop owns tactical choices: inspection, research, implementation route,
profiling depth, benchmark expansion, and tuning lineage.

## Locate LightOp

Operate on a LightOp checkout, not on a standalone CUDA experiment by default.
The LightOp root is the directory containing:

```text
setup.py
lightop/__init__.py
lightop/csrc/export.cpp
test/
```

Search in this order:

1. User-provided path.
2. Current working directory.
3. Sibling directories named `lightop` near the current workspace.

If no LightOp checkout exists, ask for its path. If the user explicitly wants a
scratch experiment, keep it outside LightOp but record how the final candidate
will land back in the LightOp source tree.

## Execution Environment

Run build, correctness, benchmark, and profiling in one consistent environment.
If the user names a Docker container or image, treat Docker as part of the
acceptance contract. Prefer non-interactive commands, so the loop can capture
logs and repeat the same command:

```bash
docker exec <container> bash -lc 'cd /path/in/container/lightop && <command>'
```

If the user provides an image rather than a running container, ask for or infer
the host LightOp path and run with DCU devices exposed:

```bash
docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --network=host \
  -v <host-lightop>:/workspace/lightop <image> \
  bash -lc 'cd /workspace/lightop && <command>'
```

Record these before the first build:

- Container name or image tag, or `direct-host` when not using Docker.
- LightOp path from the command's point of view, not only the host path.
- Visible device env such as `HIP_VISIBLE_DEVICES` or `HSA_VISIBLE_DEVICES`.
whlwhlwhl's avatar
whlwhlwhl committed
87
88
89
- DCU status command and selected card before performance runs: `hy-smi` or
  `rocm-smi`, the observed HCU utilization, VRAM use, and the
  `HIP_VISIBLE_DEVICES=<idle-card>` value used for benchmark/profile commands.
90
91
92
93
94
95
- `PYTORCH_ROCM_ARCH`, DTK/ROCm version, PyTorch version, HIP version, device
  name, and `gcnArchName`.
- Exact build/install, import-smoke, correctness, benchmark, and profiler
  commands.
- Pass/fail threshold: numerical tolerance for correctness and the required
  performance target versus baseline.
whlwhlwhl's avatar
whlwhlwhl committed
96
97
98
- Benchmark stability rule: warmup/repeat counts, summary statistic
  (p50/p90/mean), acceptable noise band, and the minimum delta needed to count
  as an effective optimization.
99
100
101
102
103
104
105
106
107

Do not mix a host build with container tests unless the user explicitly wants
that split and both paths point at the same compiled extension.

## LightOp Integration Map

For a new or modified operator, inspect the nearest existing operator family
first and follow its style.

whlwhlwhl's avatar
whlwhlwhl committed
108
109
110
111
112
113
114
115
116
For a new fused operator, first recover the fusion ingredients from the new
operator name, requested semantics, or provided implementation sketch. Search
LightOp for the pre-fusion single operators and related fused implementations
before designing a new kernel. Use those local implementations as the primary
baseline for API shape, tensor validation, dispatch/config style, correctness
reference, benchmark comparison, and performance expectations. If LightOp has
no matching local baseline, record the search terms and absence, then fall back
to a PyTorch or literal oracle reference.

117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
Core source locations:

```text
lightop/csrc/<family>/*.cu|*.cuh|*.h|*.cpp
lightop/csrc/export.cpp
lightop/<python_wrapper>.py
lightop/__init__.py
setup.py
test/test_<op>.py
test/<family>/*benchmark*.py
lightop/config*.py
```

Typical add-operator checklist:

whlwhlwhl's avatar
whlwhlwhl committed
132
133
134
135
136
137
- For fused operators, inspect each component single-op wrapper, binding,
  kernel, test, benchmark, and config path, plus any neighboring fused kernels
  with similar data movement or epilogue structure.
- Build the first correctness and benchmark baseline from the unfused LightOp
  composition when those component operators exist; otherwise use the nearest
  LightOp implementation plus a PyTorch reference.
138
139
140
141
142
143
144
- Add or modify HIP/C++ implementation under the closest `lightop/csrc/`
  family. Create a new family only when no existing family fits.
- Expose the C++ symbol in `lightop/csrc/export.cpp` with `m.def(...)`.
- Add a Python wrapper that imports `from . import op` and preserves LightOp's
  tensor validation style.
- Export public APIs from `lightop/__init__.py` only when the operator should
  be user-facing.
145
146
147
- If a new `csrc/<family>` directory is created, update `setup.py` source
  globs. Do not add a parallel `setup_torch29.py` build path unless the user
  explicitly asks for legacy metadata maintenance.
148
149
150
151
152
153
154
155
156
157
- If performance depends on shape/gfx-specific choices, update or add the
  relevant config/dispatcher table under `lightop/config*.py`.
- Add focused correctness tests under `test/` and benchmark coverage for `W`.

Optimization-only checklist:

- Identify the current Python wrapper, exported C++ binding, kernel launch
  path, config/dispatcher branch, and existing test/benchmark.
- Preserve the public API unless the user explicitly asks for a breaking
  change.
whlwhlwhl's avatar
whlwhlwhl committed
158
159
160
161
162
163
164
165
166
- Do not add a new public API for an existing-operator optimization task.
- Do not create unrelated operator files or new operator families.
- Do not modify unrelated operator families.
- Prefer edits scoped to the target operator's kernel source, launcher,
  necessary dispatcher/config entries, focused tests, and benchmark coverage.
- Run install, correctness, benchmark, profiling, and tuning directly in the
  selected execution environment; do not detour through a new-operator
  integration flow unless the existing test/benchmark surface cannot cover the
  target.
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
- Keep edits scoped to the operator family, binding, tests, benchmark, and
  tuning table needed for the target task.
- Record rejected lineages when an optimization regresses or only helps a
  non-target shape.

## DCU/ROCm Defaults

- Treat PyTorch's `torch.cuda` namespace as the ROCm runtime facade when used
  by LightOp tests.
- Prefer `hipcc`/ROCm extension builds over NVIDIA-only compile paths.
- Respect `PYTORCH_ROCM_ARCH`; if unset, derive gfx from
  `torch.cuda.get_device_properties(0).gcnArchName`.
- Prefer `gfx928;gfx936;gfx938` when the local LightOp setup already uses that
  default, unless the user provides a narrower target.
- Do not introduce CUDA-only headers, PTX/SASS assumptions, CUTLASS/CuTe,
  Nsight Compute, TMA/WGMMA/tcgen05, or NVIDIA architecture rules into LightOp
  code paths.
- When borrowing ideas from CUDA sources, translate them into ROCm concepts
  and record that the source is only cross-platform inspiration.

Useful environment probe:

```bash
python - <<'PY'
import torch
print("torch:", torch.__version__)
print("hip:", torch.version.hip)
print("device:", torch.cuda.get_device_name(0))
print("gcn:", torch.cuda.get_device_properties(0).gcnArchName)
PY
hipcc --version
```

whlwhlwhl's avatar
whlwhlwhl committed
200
201
202
203
204
205
206
207
208
209
210
211
212
213
## Performance Device Gate

Before any benchmark or profiling command, check card status in the target
execution environment and pin the run to an idle card.

- Run `hy-smi` or `rocm-smi` immediately before the performance command.
- Choose the card with low HCU utilization and low VRAM occupancy. If no card
  is idle enough for stable results, delay the performance run or record that
  the result is noisy and not acceptable as final evidence.
- Prefix benchmark and profiler commands with `HIP_VISIBLE_DEVICES=<card>`.
  Keep the same card for the paired baseline/candidate comparison unless the
  comparison is deliberately measuring cross-device behavior.
- Record the chosen card, HCU utilization, VRAM use, and exact command in the
  attempt ledger or profile artifact directory.
whlwhlwhl's avatar
whlwhlwhl committed
214
215
216
217
218
- Before the first optimization edit, run a device bandwidth calibration on the
  selected card in the same execution environment. Record actual read, write,
  copy/read-write, and simple triad bandwidth, buffer size, dtype, selected
  card, and command in `.humanize/lightop-agent/device-bandwidth.txt`. Treat
  this as a sanity baseline for any user-specified effective-bandwidth target.
whlwhlwhl's avatar
whlwhlwhl committed
219
220
221
222
223

Example:

```bash
hy-smi || rocm-smi
whlwhlwhl's avatar
whlwhlwhl committed
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
mkdir -p .humanize/lightop-agent
HIP_VISIBLE_DEVICES=<idle-card> python - <<'PY' 2>&1 | tee .humanize/lightop-agent/device-bandwidth.txt
import time, torch
torch.cuda.init()
free, total = torch.cuda.mem_get_info()
bytes_per_buf = max(16 << 20, min(512 << 20, int(free // 5)))
n = bytes_per_buf // 4
a = torch.empty(n, device="cuda", dtype=torch.float32)
b = torch.empty_like(a)
c = torch.empty_like(a)
a.fill_(1.0); b.fill_(2.0); c.zero_(); torch.cuda.synchronize()

def bench(name, fn, bytes_moved, iters=80, warmup=20):
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    dt = (time.perf_counter() - t0) / iters
    print(f"{name}: {bytes_moved / dt / 1e12:.3f} TB/s ({dt * 1e6:.2f} us, bytes={bytes_moved})")

bench("write_fill", lambda: a.fill_(3.0), n * 4)
bench("copy_read_write", lambda: c.copy_(a), n * 4 * 2)
bench("triad_2read_1write", lambda: torch.add(a, b, out=c), n * 4 * 3)
bench("read_reduce", lambda: torch.sum(a), n * 4)
print("buffer_bytes:", n * 4, "total_mem:", total, "free_mem_at_start:", free)
PY
whlwhlwhl's avatar
whlwhlwhl committed
253
254
255
256
257
HIP_VISIBLE_DEVICES=<idle-card> python test/<family>/benchmark_<op>.py
HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/<family>/benchmark_<op>.py
```

Do not report a performance number as actionable unless the device-selection
whlwhlwhl's avatar
whlwhlwhl committed
258
259
gate and device bandwidth calibration were recorded, or the user explicitly
accepts the missing evidence.
whlwhlwhl's avatar
whlwhlwhl committed
260

261
262
263
264
265
266
267
268
269
## Workflow

### Stage 1: Inspect And Plan

1. Locate the LightOp root and target operator family.
2. Recover `K`, `R`, `W`, target gfx arch, baseline command, and success
   threshold.
3. Inspect the existing wrapper, binding, kernel, config table, tests, and
   benchmarks.
whlwhlwhl's avatar
whlwhlwhl committed
270
271
272
273
274
275
4. For a new fused operator, search by the requested operator name, name
   tokens, component op names, and semantics to find LightOp's pre-fusion
   single operators and related fused kernels. Record the chosen baseline:
   unfused LightOp composition, nearest fused LightOp implementation, PyTorch
   reference, or explicit "no local baseline found".
5. Before the first optimization edit, it is recommended to query
whlwhlwhl's avatar
whlwhlwhl committed
276
277
278
   `lightop-kernel-knowledge` for local LightOp patterns, ROCm/DCU upstream
   evidence, Hygon/DCU source references, and portable ideas from the bundled
   corpus. Use this whenever it can shape the first implementation route.
whlwhlwhl's avatar
whlwhlwhl committed
279
6. Write a concise research digest in the loop state before the first serious
280
   implementation lineage.
whlwhlwhl's avatar
whlwhlwhl committed
281
7. Define the benchmark contract before editing code: exact target shape(s),
whlwhlwhl's avatar
whlwhlwhl committed
282
283
284
   dtype/layout/contiguity, axis/mode/epsilon, effective-bandwidth formula,
   warmup/repeat counts, selected summary statistic, noise band, and the
   benchmark command that will be used for baseline and candidates.
285
286
287

### Stage 2: Implement And Verify

whlwhlwhl's avatar
whlwhlwhl committed
288
1. Build a baseline matrix before the first optimization edit: run correctness,
whlwhlwhl's avatar
whlwhlwhl committed
289
290
291
292
293
   run the selected-card device bandwidth calibration, then benchmark the
   target workload on the selected idle card for enough repeats to report
   p50/p90 or mean, effective bandwidth, variance/noise, card status, device
   read/write/copy bandwidth, and command line. Store it in the attempt ledger
   and `kernel_opt_readme.md`.
whlwhlwhl's avatar
whlwhlwhl committed
294
2. Make the smallest LightOp source change that can satisfy the current
295
   task-acceptance pair.
whlwhlwhl's avatar
whlwhlwhl committed
296
297
3. Build LightOp with the target arch.
4. Run the targeted correctness test. Before the benchmark for `W`, execute
whlwhlwhl's avatar
whlwhlwhl committed
298
   the performance device gate and pin the run with `HIP_VISIBLE_DEVICES`.
whlwhlwhl's avatar
whlwhlwhl committed
299
5. Record every candidate result: correctness failure, build failure,
300
   regression, plateau, and improvement.
whlwhlwhl's avatar
whlwhlwhl committed
301
302
303
304
305
306
307
308
309
6. For every correctness-passing optimization candidate, run the normal
   synchronized benchmark first, then collect `hipprof --pmc` evidence for the
   same representative target shape before making the next kernel or dispatch
   edit. Capture cache-related counters, LDS/bank-conflict clues,
   occupancy/resource signals, and the exact command output or unsupported-tool
   reason. Use `dcu-profiler-report` to turn this evidence into the next edit.
7. Invoke deeper `dcu-profiler-report` analysis when the per-candidate PMC
   evidence and benchmark still do not explain the next edit.
8. If the first correctness-passing candidate misses the required performance
whlwhlwhl's avatar
whlwhlwhl committed
310
311
   threshold, continue into profiling and tuning instead of declaring the task
   done.
312
313
314
315

### Stage 3: Tune And Integrate

1. Build a performance map over `W`.
whlwhlwhl's avatar
whlwhlwhl committed
316
317
318
319
320
321
322
2. Keep each optimization lineage focused on one primary hypothesis, such as
   vectorized memory access, block/thread mapping, register pressure, LDS
   layout, occupancy, launch configuration, epilogue fusion, or shape
   specialization. Avoid mixing multiple unrelated techniques in one candidate
   unless the code structure makes separation impossible, and record the
   reason.
3. Add shape/gfx-aware dispatch or config-table entries only when measured
323
   regimes need different choices.
whlwhlwhl's avatar
whlwhlwhl committed
324
325
4. Re-run correctness across all touched dtypes/layouts/modes.
5. Re-run benchmark cases that define `W` and any nearby regression guards.
whlwhlwhl's avatar
whlwhlwhl committed
326
327
328
329
330
6. Re-run per-candidate `hipprof --pmc` captures after each correctness-passing
   optimization edit and summarize cache behavior, LDS/bank conflicts,
   occupancy/resource pressure, and one profiler-backed next action before
   starting the next edit.
7. Reject or revert a candidate lineage in the final chosen path when it fails
whlwhlwhl's avatar
whlwhlwhl committed
331
332
333
334
   correctness, improves less than the noise/stability threshold, helps only
   non-target shapes, or lacks required profile/resource/ISA evidence after a
   profiling gate. Record the rejected lineage instead of silently overwriting
   it.
whlwhlwhl's avatar
whlwhlwhl committed
335
8. After reaching the target, run a final guard validation: targeted
whlwhlwhl's avatar
whlwhlwhl committed
336
337
   correctness, repeated target benchmark on the selected card, and nearby
   shape/dtype regression checks when relevant.
whlwhlwhl's avatar
whlwhlwhl committed
338
9. Summarize final code paths, fallback behavior, unsupported regimes, and
339
340
   remaining risks.

whlwhlwhl's avatar
whlwhlwhl committed
341
342
343
344
345
346
347
348
349
350
351
## Performance Target Discipline

When the user gives a bandwidth, latency, throughput, or speedup target, treat
that threshold as part of the acceptance contract.

- Do not claim completion while the best correctness-passing candidate misses
  the target. Report it only as the current best result with bottleneck
  evidence and next steps.
- For every correctness-passing candidate, record shape, dtype, layout/mode,
  kernel or dispatch configuration, measured bandwidth/latency, comparison
  baseline, and the reason it improved, regressed, or plateaued.
whlwhlwhl's avatar
whlwhlwhl committed
352
353
354
355
356
357
358
- For every correctness-passing optimization candidate, record paired
  `hipprof --pmc` evidence for the representative workload before choosing the
  next edit. The digest must discuss cache behavior, memory/cache traffic,
  LDS or bank-conflict evidence, occupancy/resource pressure, and exactly one
  next LightOp kernel, launcher, dispatcher, config, or benchmark edit. If a
  PMC counter or occupancy signal is unavailable on the installed DTK, record
  the exact command and failure instead of guessing.
whlwhlwhl's avatar
whlwhlwhl committed
359
360
361
362
- A performance improvement counts as effective only when it exceeds the
  benchmark noise/stability threshold defined in the plan. If the measured
  delta is inside the noise band, record it as inconclusive or plateau, not as
  an optimization win.
whlwhlwhl's avatar
whlwhlwhl committed
363
364
365
366
367
368
369
370
371
372
373
- After the first correctness-passing candidate misses the target, run a
  profiling-and-tuning loop before stopping.
- Try at least three evidence-backed performance optimization lineages unless
  profiler evidence shows the target is not reachable under the current `K/R/W`
  and environment constraints.
- If two consecutive correctness-passing candidates miss the target, the next
  kernel or dispatch edit must be preceded by both:
  - A `lightop-kernel-knowledge` research pass covering local LightOp
    layernorm/rmsnorm/fused-norm patterns, relevant ROCm/DCU upstream evidence,
    and any portable reduction/vectorization ideas from the bundled corpus.
  - A `dcu-profiler-report` digest for a representative target shape.
whlwhlwhl's avatar
whlwhlwhl committed
374
375
376
377
378
379
- If the second correctness-passing optimization attempt improves less than 5%
  over the relevant parent or baseline, run a deep `dcu-profiler-report`
  analysis before the next optimization edit. That analysis must include
  `hipprof` PMC all mode, SQTT JSON when available, `dccobjdump`
  disassembly, code-object resource usage, and explicit LDS/register/occupancy
  evidence or a recorded reason any item was unavailable.
whlwhlwhl's avatar
whlwhlwhl committed
380
381
382
- The next edit after that gate must name exactly one concrete LightOp kernel,
  binding, dispatcher, config, or benchmark change and cite the knowledge and
  profiler evidence that motivated it.
whlwhlwhl's avatar
whlwhlwhl committed
383
384
385
- Do not claim that an optimization is effective from intuition, source
  inspection, or expected hardware behavior. Promotion to the optimization
  ledger requires measured correctness-passing benchmark data, baseline or
whlwhlwhl's avatar
whlwhlwhl committed
386
  parent comparison, and per-candidate PMC/profile/resource evidence.
whlwhlwhl's avatar
whlwhlwhl committed
387
388
389
390
391
- If the target remains unmet after the required tuning lineages, summarize the
  best candidate, failed lineages, profiler bottleneck class, unsupported
  regimes, and the most likely next engineering investment. Do not present the
  operator as performance-complete.

392
393
394
395
396
397
398
399
## Required Loop State

Keep Humanize state local and untracked:

```text
.humanize/lightop-agent/refined-plan.md
.humanize/lightop-agent/research-digest.md
.humanize/lightop-agent/attempt-ledger.md
whlwhlwhl's avatar
whlwhlwhl committed
400
.humanize/lightop-agent/kernel_opt_readme.md
whlwhlwhl's avatar
whlwhlwhl committed
401
.humanize/lightop-agent/rlcr-fallback.md
402
403
404
405
406
407
408
409
410
411
.humanize/lightop-agent/optimization-ledger.md
.humanize/lightop-agent/lineage.jsonl
.humanize/lightop-agent/performance-map.json
.humanize/lightop-agent/tuning-decisions.md
.humanize/lightop-agent/profile-artifacts/
```

Before starting RLCR, make sure `.humanize*` is ignored. Do not commit loop
state unless the user explicitly asks for tracked evidence artifacts.

whlwhlwhl's avatar
whlwhlwhl committed
412
413
414
415
416
417
418
419
420
421
`kernel_opt_readme.md` must be kept current after every benchmarked candidate.
Use this shape:

```markdown
# Kernel Optimization Log

## Baseline Matrix
- Target workload:
- Effective-bandwidth formula:
- Device gate:
whlwhlwhl's avatar
whlwhlwhl committed
422
423
- Device bandwidth calibration: read/write/copy/triad bandwidth, buffer size,
  selected card, command:
whlwhlwhl's avatar
whlwhlwhl committed
424
425
426
427
428
429
430
431
432
433
434
435
- Build/test/benchmark commands:
- Baseline p50/p90/mean, variance/noise:

## Iteration <N>: <one primary hypothesis>
- Hypothesis:
- Files changed:
- Code/config change:
- Build command/result:
- Correctness command/result:
- Device status and HIP_VISIBLE_DEVICES:
- Benchmark table: baseline/parent/candidate, p50/p90/mean, effective BW,
  delta, noise threshold:
whlwhlwhl's avatar
whlwhlwhl committed
436
437
- Per-candidate PMC/profile/resource evidence: cache behavior, LDS/bank
  conflicts, occupancy/resource pressure, unavailable counters:
whlwhlwhl's avatar
whlwhlwhl committed
438
439
440
441
442
- Decision: keep | reject | inconclusive
- Reason:
- Next step:
```

443
444
## Build, Test, Benchmark

445
446
447
Build from the LightOp root inside the selected execution environment. Always
use `python setup.py install` for LightOp builds, regardless of the installed
PyTorch version. Do not switch to `setup_torch29.py`.
448
449
450
451
452

```bash
PYTORCH_ROCM_ARCH='gfx928;gfx936;gfx938' python setup.py install
```

453
454
455
456
Keep the existing `build/` directory between attempts so incremental extension
builds can reuse prior compilation output. Do not delete `build/` as part of
the normal build/test/tune loop unless the user explicitly requests a clean
build or the build cache is proven to be stale or corrupt.
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482

After install, run an import smoke test in the same environment:

```bash
python - <<'PY'
import torch, lightop
print("torch:", torch.__version__)
print("hip:", torch.version.hip)
print("lightop:", getattr(lightop, "__file__", "unknown"))
print("device:", torch.cuda.get_device_name(0))
print("gcn:", torch.cuda.get_device_properties(0).gcnArchName)
PY
```

Run the narrowest relevant test first:

```bash
cd test
python test_<op>.py
```

Then run the relevant benchmark script and compare it against the named
baseline and threshold. If no benchmark exists, add a small benchmark that uses
warmup, fixed shapes, fixed seeds, and explicit `torch.cuda.synchronize()`
around timed regions.

whlwhlwhl's avatar
whlwhlwhl committed
483
484
485
486
487
Before every benchmark, run the performance device gate from this skill:
capture `hy-smi` or `rocm-smi`, choose a low-utilization/low-VRAM card, and run
the benchmark with `HIP_VISIBLE_DEVICES=<idle-card>`. Reuse the same selected
card for baseline and candidate measurements.

whlwhlwhl's avatar
whlwhlwhl committed
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
After every correctness-passing optimization candidate, collect PMC evidence
for the same representative benchmark shape before making the next edit. Use
the same selected card, store the artifacts under
`.humanize/lightop-agent/profile-artifacts/<version>/`, and run supported
variants such as:

```bash
HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc --pmc-type 3 \
  -o .humanize/lightop-agent/profile-artifacts/<version>/hipprof-pmc-all/pmc \
  python test/<family>/benchmark_<op>.py
HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc-read --pmc-type 3 \
  -o .humanize/lightop-agent/profile-artifacts/<version>/hipprof-pmc-all/pmc-read \
  python test/<family>/benchmark_<op>.py
HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc-write --pmc-type 3 \
  -o .humanize/lightop-agent/profile-artifacts/<version>/hipprof-pmc-all/pmc-write \
  python test/<family>/benchmark_<op>.py
```

Use the PMC digest to inspect cache behavior, memory/cache traffic, LDS or bank
conflicts, occupancy/resource pressure, and the next concrete edit. If a
counter, `--pmc` mode, or occupancy signal is unsupported in the installed DTK,
record the attempted command and error output in the artifact directory.

511
512
Do not claim success from a passing build alone. A LightOp operator change is
complete only after install, import smoke, targeted correctness, benchmark
whlwhlwhl's avatar
whlwhlwhl committed
513
514
515
comparison, and the required per-candidate PMC/profile evidence. Do not claim
speedups from Python wall-clock timing unless asynchronous DCU work is
synchronized.
516

whlwhlwhl's avatar
whlwhlwhl committed
517
518
519
520
Use the same benchmark command, selected card, workload shape(s), and effective
bandwidth formula for baseline and candidate comparisons. If the workload
contract changes, start a new baseline matrix and explain why.

521
522
## Profiling

whlwhlwhl's avatar
whlwhlwhl committed
523
524
525
526
527
528
529
530
Invoke `dcu-profiler-report` autonomously after every correctness-passing
optimization candidate to interpret the mandatory `hipprof --pmc` capture and
produce the next edit. The digest may be concise, but it must explain cache
behavior, LDS/bank-conflict clues, occupancy/resource pressure, and exactly one
profiler-backed next action.

Escalate from the mandatory per-candidate PMC pass to deeper profiling when any
of these hold:
531
532
533
534
535

- Baseline benchmark has passed and no profile digest exists.
- A correct candidate is within +/-2% of baseline or the prior best.
- A correct candidate regresses on an important shape.
- The benchmark plateaued and the next edit is unclear.
whlwhlwhl's avatar
whlwhlwhl committed
536
537
538
539
540
- The first correctness-passing candidate misses the user's required
  performance target.
- Two consecutive correctness-passing candidates miss the target, in which case
  pair this profile with a `lightop-kernel-knowledge` research pass before the
  next kernel or dispatch edit.
whlwhlwhl's avatar
whlwhlwhl committed
541
542
543
- The second correctness-passing optimization attempt improves less than 5%
  over its parent or baseline. This is a mandatory deep-analysis gate, not a
  heuristic.
544
545
546
547
548
549
550
- A candidate is much faster than expected and needs explanation.
- A reviewer asks for profiling evidence.

Persist profile artifacts under `.humanize/lightop-agent/profile-artifacts/`
or the user-specified evidence directory. Each digest must end with exactly one
concrete next kernel edit or a clear reason profiling is not actionable.

whlwhlwhl's avatar
whlwhlwhl committed
551
552
553
554
555
556
When the <5% second-optimization gate fires, the digest must include
`hipprof` PMC all, SQTT JSON when supported, `dccobjdump` disassembly,
code-object resource usage, and LDS/register/occupancy evidence. If any tool is
missing, record the exact command attempted and do not replace the missing
evidence with a guess.

557
558
559
560
561
562
## Plan Requirements

Write `.humanize/lightop-agent/refined-plan.md` using the Humanize gen-plan
schema. Include acceptance criteria for:

- LightOp root, target operator family, public API, and modified files.
whlwhlwhl's avatar
whlwhlwhl committed
563
564
565
- For new fused operators, the LightOp pre-fusion single-op search result,
  related fused implementation search result, and chosen baseline/reference
  path.
566
567
- Explicit `K`, `R`, `W`, target gfx arch, baseline command, comparison target,
  and hard scope exclusions.
whlwhlwhl's avatar
whlwhlwhl committed
568
569
570
- Workload contract: target shape(s), dtype, layout/contiguity, axis/mode,
  epsilon or other math flags, effective-bandwidth formula, and the exact
  success metric.
571
572
573
574
- Explicit `E`: Docker container/image or direct-host mode, LightOp path inside
  the execution environment, visible device selection, install command, smoke
  command, correctness command, benchmark command, profiler command, and pass
  threshold.
whlwhlwhl's avatar
whlwhlwhl committed
575
576
577
- Device gate: `hy-smi` or `rocm-smi` command, idle-card selection criteria,
  required `HIP_VISIBLE_DEVICES=<idle-card>` prefix for benchmark/profile, and
  where the card-state output is stored.
whlwhlwhl's avatar
whlwhlwhl committed
578
579
580
581
- Device bandwidth calibration before the first optimization edit: exact
  command, selected card, buffer size, dtype, measured read/write/copy/triad
  bandwidth, artifact path, and how the result constrains or contextualizes the
  target effective-bandwidth threshold.
582
583
584
585
586
- Correctness coverage for `W`, edge cases, dtype/layout/mode boundaries, and
  baseline/reference parity.
- Build command, ROCm/DTK/PyTorch versions, `PYTORCH_ROCM_ARCH`, and device
  metadata.
- Benchmark method with warmup, repeats, synchronization, per-shape timing,
whlwhlwhl's avatar
whlwhlwhl committed
587
588
  p50/p90 or mean as appropriate, variance/noise band, minimum effective delta,
  and environment metadata.
whlwhlwhl's avatar
whlwhlwhl committed
589
590
591
592
593
- Per-candidate `hipprof --pmc` capture after every correctness-passing
  optimization edit, including artifact path, selected card, representative
  shape, cache counters or unavailable-counter reason, LDS/bank-conflict
  evidence, occupancy/resource interpretation, and the profiler-backed next
  edit.
whlwhlwhl's avatar
whlwhlwhl committed
594
- Baseline matrix required before the first optimization edit, including card
whlwhlwhl's avatar
whlwhlwhl committed
595
596
  status, device bandwidth calibration, repeated timing, effective bandwidth,
  and noise/stability estimate.
whlwhlwhl's avatar
whlwhlwhl committed
597
598
- Iteration discipline: one primary optimization hypothesis per lineage, plus
  explicit keep/reject/inconclusive decision rules.
599
600
601
- Research digest covering local LightOp patterns and any upstream/source
  evidence that materially changes the route.
- Attempt ledger for every candidate.
whlwhlwhl's avatar
whlwhlwhl committed
602
603
- `kernel_opt_readme.md` update after every benchmarked candidate with the
  fixed template.
604
- Optimization ledger only for correct candidates with measured improvement.
whlwhlwhl's avatar
whlwhlwhl committed
605
606
607
608
- Performance-target discipline: the first miss triggers tuning, two
  consecutive correctness-passing misses trigger both `lightop-kernel-knowledge`
  research and `dcu-profiler-report` evidence before the next edit, and unmet
  targets cannot be reported as complete.
whlwhlwhl's avatar
whlwhlwhl committed
609
610
611
612
- Low-gain discipline: if the second correctness-passing optimization improves
  less than 5%, the next edit is blocked on deep profiling evidence:
  `hipprof` PMC all, SQTT JSON if available, `dccobjdump`, code-object
  resource usage, and LDS/register/occupancy analysis.
613
614
615
- Tuning decisions and dispatcher/config updates when `W` has multiple
  regimes.
- Final correctness matrix, benchmark matrix, fallback paths, unsupported
whlwhlwhl's avatar
whlwhlwhl committed
616
  regimes, final target-hit guard validation, and residual risk.
617
618
619
620
621
622
623
624
625
626
627

## RLCR Startup

After writing the refined plan and making sure `.humanize*` is ignored, start
the loop from the LightOp root:

```bash
"{{HUMANIZE_RUNTIME_ROOT}}/scripts/setup-rlcr-loop.sh" .humanize/lightop-agent/refined-plan.md --yolo
```

If setup exits non-zero, stop and report the error. Do not bypass the gate.
whlwhlwhl's avatar
whlwhlwhl committed
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
Exception: if setup fails only because the `codex` CLI is unavailable, manual
fallback mode is allowed. Before continuing, write:

```text
.humanize/lightop-agent/rlcr-fallback.md
.humanize/lightop-agent/refined-plan.md
.humanize/lightop-agent/research-digest.md
.humanize/lightop-agent/attempt-ledger.md
.humanize/lightop-agent/kernel_opt_readme.md
```

`rlcr-fallback.md` must state that Codex review gate is unavailable, include
the exact setup command and error output, name the missing dependency, and
declare that all build/test/benchmark/profile, device-selection, evidence,
performance-target, low-gain, and logging constraints from this skill still
apply. In fallback mode, proceed manually with the same optimization loop, but
do not claim that Humanize/Codex review was active.
645
646
647
648
649
650
651
652
653
654
655

After setup succeeds:

1. Read `.humanize/rlcr/<timestamp>/round-0-prompt.md`.
2. Execute the current round.
3. Commit or stage only the requested LightOp changes if the user asked for
   version-control actions.
4. Write the required round summary.
5. Stop normally so the Humanize Stop hook can review.

If the hook blocks exit, follow the generated next-round prompt exactly.