SKILL.md

---
name: lightop-kernel-agent-loop
description: "Run an autonomous Humanize loop for adding or optimizing LightOp fused operators on DCU/ROCm: recover K/R/W, inspect LightOp bindings and tests, implement HIP/ROCm kernels or dispatch changes, build, test, benchmark, profile with dcu-profiler-report, record evidence, and start RLCR."
---

# LightOp Kernel Agent Loop

Use this flow when the user wants to add a new LightOp operator or optimize an
existing LightOp operator for DCU/ROCm. This is the LightOp-specific wrapper
around Humanize RLCR.

Do not use this flow for generic CUDA/NVIDIA kernel tuning. Prefer HIP C++,
ROCm PyTorch extensions, existing LightOp patterns, MIOpen/rocBLAS/hipBLASLt,
Composable Kernel, or Triton AMD only when the local repo already supports that
route or the user explicitly asks for it.

## Input Contract

Recover or define these before implementation:

```text
K: operator semantics, input/output tensors, dtype/layout/mode constraints
R: correctness reference, usually PyTorch native, an existing LightOp path, or
   a small literal oracle for edge cases
W: workload distribution, including target shapes, dtypes, model scenario,
   gfx arch, target shapes, dtype/layout/contiguity, reduction/axis details,
   epsilon or mode flags, effective-bandwidth formula, and
   latency/throughput comparison target
E: execution environment, including Docker container/image or direct host,
   LightOp path inside that environment, visible DCU device, DTK/ROCm/PyTorch
   versions, build command, test command, benchmark command, and pass threshold
```

Ask the user only when `K`, `R`, `W`, target DCU/gfx arch, comparison baseline,
execution environment, pass threshold, or a hard scope constraint is missing
and cannot be inferred safely from the LightOp repo. After those inputs exist,
the loop owns tactical choices: inspection, research, implementation route,
profiling depth, benchmark expansion, and tuning lineage.

## Locate LightOp

Operate on a LightOp checkout, not on a standalone CUDA experiment by default.
The LightOp root is the directory containing:

```text
setup.py
lightop/__init__.py
lightop/csrc/export.cpp
test/
```

Search in this order:

1. User-provided path.
2. Current working directory.
3. Sibling directories named `lightop` near the current workspace.

If no LightOp checkout exists, ask for its path. If the user explicitly wants a
scratch experiment, keep it outside LightOp but record how the final candidate
will land back in the LightOp source tree.

## Execution Environment

Run build, correctness, benchmark, and profiling in one consistent environment.
If the user names a Docker container or image, treat Docker as part of the
acceptance contract. Prefer non-interactive commands, so the loop can capture
logs and repeat the same command:

```bash
docker exec <container> bash -lc 'cd /path/in/container/lightop && <command>'
```

If the user provides an image rather than a running container, ask for or infer
the host LightOp path and run with DCU devices exposed:

```bash
docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --network=host \
  -v <host-lightop>:/workspace/lightop <image> \
  bash -lc 'cd /workspace/lightop && <command>'
```

Record these before the first build:

- Container name or image tag, or `direct-host` when not using Docker.
- LightOp path from the command's point of view, not only the host path.
- Visible device env such as `HIP_VISIBLE_DEVICES` or `HSA_VISIBLE_DEVICES`.
- DCU status command and selected card before performance runs: `hy-smi` or
  `rocm-smi`, the observed HCU utilization, VRAM use, and the
  `HIP_VISIBLE_DEVICES=<idle-card>` value used for benchmark/profile commands.
- `PYTORCH_ROCM_ARCH`, DTK/ROCm version, PyTorch version, HIP version, device
  name, and `gcnArchName`.
- Exact build/install, import-smoke, correctness, benchmark, and profiler
  commands.
- Pass/fail threshold: numerical tolerance for correctness and the required
  performance target versus baseline.
- Benchmark stability rule: warmup/repeat counts, summary statistic
  (p50/p90/mean), acceptable noise band, and the minimum delta needed to count
  as an effective optimization.

Do not mix a host build with container tests unless the user explicitly wants
that split and both paths point at the same compiled extension.

## LightOp Integration Map

For a new or modified operator, inspect the nearest existing operator family
first and follow its style.

For a new fused operator, first recover the fusion ingredients from the new
operator name, requested semantics, or provided implementation sketch. Search
LightOp for the pre-fusion single operators and related fused implementations
before designing a new kernel. Use those local implementations as the primary
baseline for API shape, tensor validation, dispatch/config style, correctness
reference, benchmark comparison, and performance expectations. If LightOp has
no matching local baseline, record the search terms and absence, then fall back
to a PyTorch or literal oracle reference.

Core source locations:

```text
lightop/csrc/<family>/*.cu|*.cuh|*.h|*.cpp
lightop/csrc/export.cpp
lightop/<python_wrapper>.py
lightop/__init__.py
setup.py
test/test_<op>.py
test/<family>/*benchmark*.py
lightop/config*.py
```

Typical add-operator checklist:

- For fused operators, inspect each component single-op wrapper, binding,
  kernel, test, benchmark, and config path, plus any neighboring fused kernels
  with similar data movement or epilogue structure.
- Build the first correctness and benchmark baseline from the unfused LightOp
  composition when those component operators exist; otherwise use the nearest
  LightOp implementation plus a PyTorch reference.
- Add or modify HIP/C++ implementation under the closest `lightop/csrc/`
  family. Create a new family only when no existing family fits.
- Expose the C++ symbol in `lightop/csrc/export.cpp` with `m.def(...)`.
- Add a Python wrapper that imports `from . import op` and preserves LightOp's
  tensor validation style.
- Export public APIs from `lightop/__init__.py` only when the operator should
  be user-facing.
- If a new `csrc/<family>` directory is created, update `setup.py` source
  globs. Do not add a parallel `setup_torch29.py` build path unless the user
  explicitly asks for legacy metadata maintenance.
- If performance depends on shape/gfx-specific choices, update or add the
  relevant config/dispatcher table under `lightop/config*.py`.
- Add focused correctness tests under `test/` and benchmark coverage for `W`.

Optimization-only checklist:

- Identify the current Python wrapper, exported C++ binding, kernel launch
  path, config/dispatcher branch, and existing test/benchmark.
- Preserve the public API unless the user explicitly asks for a breaking
  change.
- Do not add a new public API for an existing-operator optimization task.
- Do not create unrelated operator files or new operator families.
- Do not modify unrelated operator families.
- Prefer edits scoped to the target operator's kernel source, launcher,
  necessary dispatcher/config entries, focused tests, and benchmark coverage.
- Run install, correctness, benchmark, profiling, and tuning directly in the
  selected execution environment; do not detour through a new-operator
  integration flow unless the existing test/benchmark surface cannot cover the
  target.
- Keep edits scoped to the operator family, binding, tests, benchmark, and
  tuning table needed for the target task.
- Record rejected lineages when an optimization regresses or only helps a
  non-target shape.

## DCU/ROCm Defaults

- Treat PyTorch's `torch.cuda` namespace as the ROCm runtime facade when used
  by LightOp tests.
- Prefer `hipcc`/ROCm extension builds over NVIDIA-only compile paths.
- Respect `PYTORCH_ROCM_ARCH`; if unset, derive gfx from
  `torch.cuda.get_device_properties(0).gcnArchName`.
- Prefer `gfx928;gfx936;gfx938` when the local LightOp setup already uses that
  default, unless the user provides a narrower target.
- Do not introduce CUDA-only headers, PTX/SASS assumptions, CUTLASS/CuTe,
  Nsight Compute, TMA/WGMMA/tcgen05, or NVIDIA architecture rules into LightOp
  code paths.
- When borrowing ideas from CUDA sources, translate them into ROCm concepts
  and record that the source is only cross-platform inspiration.

Useful environment probe:

```bash
python - <<'PY'
import torch
print("torch:", torch.__version__)
print("hip:", torch.version.hip)
print("device:", torch.cuda.get_device_name(0))
print("gcn:", torch.cuda.get_device_properties(0).gcnArchName)
PY
hipcc --version
```

## Performance Device Gate

Before any benchmark or profiling command, check card status in the target
execution environment and pin the run to an idle card.

- Run `hy-smi` or `rocm-smi` immediately before the performance command.
- Choose the card with low HCU utilization and low VRAM occupancy. If no card
  is idle enough for stable results, delay the performance run or record that
  the result is noisy and not acceptable as final evidence.
- Prefix benchmark and profiler commands with `HIP_VISIBLE_DEVICES=<card>`.
  Keep the same card for the paired baseline/candidate comparison unless the
  comparison is deliberately measuring cross-device behavior.
- Record the chosen card, HCU utilization, VRAM use, and exact command in the
  attempt ledger or profile artifact directory.
- Before the first optimization edit, run a device bandwidth calibration on the
  selected card in the same execution environment. Record actual read, write,
  copy/read-write, and simple triad bandwidth, buffer size, dtype, selected
  card, and command in `.humanize/lightop-agent/device-bandwidth.txt`. Treat
  this as a sanity baseline for any user-specified effective-bandwidth target.

Example:

```bash
hy-smi || rocm-smi
mkdir -p .humanize/lightop-agent
HIP_VISIBLE_DEVICES=<idle-card> python - <<'PY' 2>&1 | tee .humanize/lightop-agent/device-bandwidth.txt
import time, torch
torch.cuda.init()
free, total = torch.cuda.mem_get_info()
bytes_per_buf = max(16 << 20, min(512 << 20, int(free // 5)))
n = bytes_per_buf // 4
a = torch.empty(n, device="cuda", dtype=torch.float32)
b = torch.empty_like(a)
c = torch.empty_like(a)
a.fill_(1.0); b.fill_(2.0); c.zero_(); torch.cuda.synchronize()

def bench(name, fn, bytes_moved, iters=80, warmup=20):
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    dt = (time.perf_counter() - t0) / iters
    print(f"{name}: {bytes_moved / dt / 1e12:.3f} TB/s ({dt * 1e6:.2f} us, bytes={bytes_moved})")

bench("write_fill", lambda: a.fill_(3.0), n * 4)
bench("copy_read_write", lambda: c.copy_(a), n * 4 * 2)
bench("triad_2read_1write", lambda: torch.add(a, b, out=c), n * 4 * 3)
bench("read_reduce", lambda: torch.sum(a), n * 4)
print("buffer_bytes:", n * 4, "total_mem:", total, "free_mem_at_start:", free)
PY
HIP_VISIBLE_DEVICES=<idle-card> python test/<family>/benchmark_<op>.py
HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/<family>/benchmark_<op>.py
```

Do not report a performance number as actionable unless the device-selection
gate and device bandwidth calibration were recorded, or the user explicitly
accepts the missing evidence.

## Workflow

### Stage 1: Inspect And Plan

1. Locate the LightOp root and target operator family.
2. Recover `K`, `R`, `W`, target gfx arch, baseline command, and success
   threshold.
3. Inspect the existing wrapper, binding, kernel, config table, tests, and
   benchmarks.
4. For a new fused operator, search by the requested operator name, name
   tokens, component op names, and semantics to find LightOp's pre-fusion
   single operators and related fused kernels. Record the chosen baseline:
   unfused LightOp composition, nearest fused LightOp implementation, PyTorch
   reference, or explicit "no local baseline found".
5. Before the first optimization edit, it is recommended to query
   `lightop-kernel-knowledge` for local LightOp patterns, ROCm/DCU upstream
   evidence, Hygon/DCU source references, and portable ideas from the bundled
   corpus. Use this whenever it can shape the first implementation route.
6. Write a concise research digest in the loop state before the first serious
   implementation lineage.
7. Define the benchmark contract before editing code: exact target shape(s),
   dtype/layout/contiguity, axis/mode/epsilon, effective-bandwidth formula,
   warmup/repeat counts, selected summary statistic, noise band, and the
   benchmark command that will be used for baseline and candidates.

### Stage 2: Implement And Verify

1. Build a baseline matrix before the first optimization edit: run correctness,
   run the selected-card device bandwidth calibration, then benchmark the
   target workload on the selected idle card for enough repeats to report
   p50/p90 or mean, effective bandwidth, variance/noise, card status, device
   read/write/copy bandwidth, and command line. Store it in the attempt ledger
   and `kernel_opt_readme.md`.
2. Make the smallest LightOp source change that can satisfy the current
   task-acceptance pair.
3. Build LightOp with the target arch.
4. Run the targeted correctness test. Before the benchmark for `W`, execute
   the performance device gate and pin the run with `HIP_VISIBLE_DEVICES`.
5. Record every candidate result: correctness failure, build failure,
   regression, plateau, and improvement.
6. For every correctness-passing optimization candidate, run the normal
   synchronized benchmark first, then collect `hipprof --pmc` evidence for the
   same representative target shape before making the next kernel or dispatch
   edit. Capture cache-related counters, LDS/bank-conflict clues,
   occupancy/resource signals, and the exact command output or unsupported-tool
   reason. Use `dcu-profiler-report` to turn this evidence into the next edit.
7. Invoke deeper `dcu-profiler-report` analysis when the per-candidate PMC
   evidence and benchmark still do not explain the next edit.
8. If the first correctness-passing candidate misses the required performance
   threshold, continue into profiling and tuning instead of declaring the task
   done.

### Stage 3: Tune And Integrate

1. Build a performance map over `W`.
2. Keep each optimization lineage focused on one primary hypothesis, such as
   vectorized memory access, block/thread mapping, register pressure, LDS
   layout, occupancy, launch configuration, epilogue fusion, or shape
   specialization. Avoid mixing multiple unrelated techniques in one candidate
   unless the code structure makes separation impossible, and record the
   reason.
3. Add shape/gfx-aware dispatch or config-table entries only when measured
   regimes need different choices.
4. Re-run correctness across all touched dtypes/layouts/modes.
5. Re-run benchmark cases that define `W` and any nearby regression guards.
6. Re-run per-candidate `hipprof --pmc` captures after each correctness-passing
   optimization edit and summarize cache behavior, LDS/bank conflicts,
   occupancy/resource pressure, and one profiler-backed next action before
   starting the next edit.
7. Reject or revert a candidate lineage in the final chosen path when it fails
   correctness, improves less than the noise/stability threshold, helps only
   non-target shapes, or lacks required profile/resource/ISA evidence after a
   profiling gate. Record the rejected lineage instead of silently overwriting
   it.
8. After reaching the target, run a final guard validation: targeted
   correctness, repeated target benchmark on the selected card, and nearby
   shape/dtype regression checks when relevant.
9. Summarize final code paths, fallback behavior, unsupported regimes, and
   remaining risks.

## Performance Target Discipline

When the user gives a bandwidth, latency, throughput, or speedup target, treat
that threshold as part of the acceptance contract.

- Do not claim completion while the best correctness-passing candidate misses
  the target. Report it only as the current best result with bottleneck
  evidence and next steps.
- For every correctness-passing candidate, record shape, dtype, layout/mode,
  kernel or dispatch configuration, measured bandwidth/latency, comparison
  baseline, and the reason it improved, regressed, or plateaued.
- For every correctness-passing optimization candidate, record paired
  `hipprof --pmc` evidence for the representative workload before choosing the
  next edit. The digest must discuss cache behavior, memory/cache traffic,
  LDS or bank-conflict evidence, occupancy/resource pressure, and exactly one
  next LightOp kernel, launcher, dispatcher, config, or benchmark edit. If a
  PMC counter or occupancy signal is unavailable on the installed DTK, record
  the exact command and failure instead of guessing.
- A performance improvement counts as effective only when it exceeds the
  benchmark noise/stability threshold defined in the plan. If the measured
  delta is inside the noise band, record it as inconclusive or plateau, not as
  an optimization win.
- After the first correctness-passing candidate misses the target, run a
  profiling-and-tuning loop before stopping.
- Try at least three evidence-backed performance optimization lineages unless
  profiler evidence shows the target is not reachable under the current `K/R/W`
  and environment constraints.
- If two consecutive correctness-passing candidates miss the target, the next
  kernel or dispatch edit must be preceded by both:
  - A `lightop-kernel-knowledge` research pass covering local LightOp
    layernorm/rmsnorm/fused-norm patterns, relevant ROCm/DCU upstream evidence,
    and any portable reduction/vectorization ideas from the bundled corpus.
  - A `dcu-profiler-report` digest for a representative target shape.
- If the second correctness-passing optimization attempt improves less than 5%
  over the relevant parent or baseline, run a deep `dcu-profiler-report`
  analysis before the next optimization edit. That analysis must include
  `hipprof` PMC all mode, SQTT JSON when available, `dccobjdump`
  disassembly, code-object resource usage, and explicit LDS/register/occupancy
  evidence or a recorded reason any item was unavailable.
- The next edit after that gate must name exactly one concrete LightOp kernel,
  binding, dispatcher, config, or benchmark change and cite the knowledge and
  profiler evidence that motivated it.
- Do not claim that an optimization is effective from intuition, source
  inspection, or expected hardware behavior. Promotion to the optimization
  ledger requires measured correctness-passing benchmark data, baseline or
  parent comparison, and per-candidate PMC/profile/resource evidence.
- If the target remains unmet after the required tuning lineages, summarize the
  best candidate, failed lineages, profiler bottleneck class, unsupported
  regimes, and the most likely next engineering investment. Do not present the
  operator as performance-complete.

## Required Loop State

Keep Humanize state local and untracked:

```text
.humanize/lightop-agent/refined-plan.md
.humanize/lightop-agent/research-digest.md
.humanize/lightop-agent/attempt-ledger.md
.humanize/lightop-agent/kernel_opt_readme.md
.humanize/lightop-agent/rlcr-fallback.md
.humanize/lightop-agent/optimization-ledger.md
.humanize/lightop-agent/lineage.jsonl
.humanize/lightop-agent/performance-map.json
.humanize/lightop-agent/tuning-decisions.md
.humanize/lightop-agent/profile-artifacts/
```

Before starting RLCR, make sure `.humanize*` is ignored. Do not commit loop
state unless the user explicitly asks for tracked evidence artifacts.

`kernel_opt_readme.md` must be kept current after every benchmarked candidate.
Use this shape:

```markdown
# Kernel Optimization Log

## Baseline Matrix
- Target workload:
- Effective-bandwidth formula:
- Device gate:
- Device bandwidth calibration: read/write/copy/triad bandwidth, buffer size,
  selected card, command:
- Build/test/benchmark commands:
- Baseline p50/p90/mean, variance/noise:

## Iteration <N>: <one primary hypothesis>
- Hypothesis:
- Files changed:
- Code/config change:
- Build command/result:
- Correctness command/result:
- Device status and HIP_VISIBLE_DEVICES:
- Benchmark table: baseline/parent/candidate, p50/p90/mean, effective BW,
  delta, noise threshold:
- Per-candidate PMC/profile/resource evidence: cache behavior, LDS/bank
  conflicts, occupancy/resource pressure, unavailable counters:
- Decision: keep | reject | inconclusive
- Reason:
- Next step:
```

## Build, Test, Benchmark

Build from the LightOp root inside the selected execution environment. Always
use `python setup.py install` for LightOp builds, regardless of the installed
PyTorch version. Do not switch to `setup_torch29.py`.

```bash
PYTORCH_ROCM_ARCH='gfx928;gfx936;gfx938' python setup.py install
```

Keep the existing `build/` directory between attempts so incremental extension
builds can reuse prior compilation output. Do not delete `build/` as part of
the normal build/test/tune loop unless the user explicitly requests a clean
build or the build cache is proven to be stale or corrupt.

After install, run an import smoke test in the same environment:

```bash
python - <<'PY'
import torch, lightop
print("torch:", torch.__version__)
print("hip:", torch.version.hip)
print("lightop:", getattr(lightop, "__file__", "unknown"))
print("device:", torch.cuda.get_device_name(0))
print("gcn:", torch.cuda.get_device_properties(0).gcnArchName)
PY
```

Run the narrowest relevant test first:

```bash
cd test
python test_<op>.py
```

Then run the relevant benchmark script and compare it against the named
baseline and threshold. If no benchmark exists, add a small benchmark that uses
warmup, fixed shapes, fixed seeds, and explicit `torch.cuda.synchronize()`
around timed regions.

Before every benchmark, run the performance device gate from this skill:
capture `hy-smi` or `rocm-smi`, choose a low-utilization/low-VRAM card, and run
the benchmark with `HIP_VISIBLE_DEVICES=<idle-card>`. Reuse the same selected
card for baseline and candidate measurements.

After every correctness-passing optimization candidate, collect PMC evidence
for the same representative benchmark shape before making the next edit. Use
the same selected card, store the artifacts under
`.humanize/lightop-agent/profile-artifacts/<version>/`, and run supported
variants such as:

```bash
HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc --pmc-type 3 \
  -o .humanize/lightop-agent/profile-artifacts/<version>/hipprof-pmc-all/pmc \
  python test/<family>/benchmark_<op>.py
HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc-read --pmc-type 3 \
  -o .humanize/lightop-agent/profile-artifacts/<version>/hipprof-pmc-all/pmc-read \
  python test/<family>/benchmark_<op>.py
HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc-write --pmc-type 3 \
  -o .humanize/lightop-agent/profile-artifacts/<version>/hipprof-pmc-all/pmc-write \
  python test/<family>/benchmark_<op>.py
```

Use the PMC digest to inspect cache behavior, memory/cache traffic, LDS or bank
conflicts, occupancy/resource pressure, and the next concrete edit. If a
counter, `--pmc` mode, or occupancy signal is unsupported in the installed DTK,
record the attempted command and error output in the artifact directory.

Do not claim success from a passing build alone. A LightOp operator change is
complete only after install, import smoke, targeted correctness, benchmark
comparison, and the required per-candidate PMC/profile evidence. Do not claim
speedups from Python wall-clock timing unless asynchronous DCU work is
synchronized.

Use the same benchmark command, selected card, workload shape(s), and effective
bandwidth formula for baseline and candidate comparisons. If the workload
contract changes, start a new baseline matrix and explain why.

## Profiling

Invoke `dcu-profiler-report` autonomously after every correctness-passing
optimization candidate to interpret the mandatory `hipprof --pmc` capture and
produce the next edit. The digest may be concise, but it must explain cache
behavior, LDS/bank-conflict clues, occupancy/resource pressure, and exactly one
profiler-backed next action.

Escalate from the mandatory per-candidate PMC pass to deeper profiling when any
of these hold:

- Baseline benchmark has passed and no profile digest exists.
- A correct candidate is within +/-2% of baseline or the prior best.
- A correct candidate regresses on an important shape.
- The benchmark plateaued and the next edit is unclear.
- The first correctness-passing candidate misses the user's required
  performance target.
- Two consecutive correctness-passing candidates miss the target, in which case
  pair this profile with a `lightop-kernel-knowledge` research pass before the
  next kernel or dispatch edit.
- The second correctness-passing optimization attempt improves less than 5%
  over its parent or baseline. This is a mandatory deep-analysis gate, not a
  heuristic.
- A candidate is much faster than expected and needs explanation.
- A reviewer asks for profiling evidence.

Persist profile artifacts under `.humanize/lightop-agent/profile-artifacts/`
or the user-specified evidence directory. Each digest must end with exactly one
concrete next kernel edit or a clear reason profiling is not actionable.

When the <5% second-optimization gate fires, the digest must include
`hipprof` PMC all, SQTT JSON when supported, `dccobjdump` disassembly,
code-object resource usage, and LDS/register/occupancy evidence. If any tool is
missing, record the exact command attempted and do not replace the missing
evidence with a guess.

## Plan Requirements

Write `.humanize/lightop-agent/refined-plan.md` using the Humanize gen-plan
schema. Include acceptance criteria for:

- LightOp root, target operator family, public API, and modified files.
- For new fused operators, the LightOp pre-fusion single-op search result,
  related fused implementation search result, and chosen baseline/reference
  path.
- Explicit `K`, `R`, `W`, target gfx arch, baseline command, comparison target,
  and hard scope exclusions.
- Workload contract: target shape(s), dtype, layout/contiguity, axis/mode,
  epsilon or other math flags, effective-bandwidth formula, and the exact
  success metric.
- Explicit `E`: Docker container/image or direct-host mode, LightOp path inside
  the execution environment, visible device selection, install command, smoke
  command, correctness command, benchmark command, profiler command, and pass
  threshold.
- Device gate: `hy-smi` or `rocm-smi` command, idle-card selection criteria,
  required `HIP_VISIBLE_DEVICES=<idle-card>` prefix for benchmark/profile, and
  where the card-state output is stored.
- Device bandwidth calibration before the first optimization edit: exact
  command, selected card, buffer size, dtype, measured read/write/copy/triad
  bandwidth, artifact path, and how the result constrains or contextualizes the
  target effective-bandwidth threshold.
- Correctness coverage for `W`, edge cases, dtype/layout/mode boundaries, and
  baseline/reference parity.
- Build command, ROCm/DTK/PyTorch versions, `PYTORCH_ROCM_ARCH`, and device
  metadata.
- Benchmark method with warmup, repeats, synchronization, per-shape timing,
  p50/p90 or mean as appropriate, variance/noise band, minimum effective delta,
  and environment metadata.
- Per-candidate `hipprof --pmc` capture after every correctness-passing
  optimization edit, including artifact path, selected card, representative
  shape, cache counters or unavailable-counter reason, LDS/bank-conflict
  evidence, occupancy/resource interpretation, and the profiler-backed next
  edit.
- Baseline matrix required before the first optimization edit, including card
  status, device bandwidth calibration, repeated timing, effective bandwidth,
  and noise/stability estimate.
- Iteration discipline: one primary optimization hypothesis per lineage, plus
  explicit keep/reject/inconclusive decision rules.
- Research digest covering local LightOp patterns and any upstream/source
  evidence that materially changes the route.
- Attempt ledger for every candidate.
- `kernel_opt_readme.md` update after every benchmarked candidate with the
  fixed template.
- Optimization ledger only for correct candidates with measured improvement.
- Performance-target discipline: the first miss triggers tuning, two
  consecutive correctness-passing misses trigger both `lightop-kernel-knowledge`
  research and `dcu-profiler-report` evidence before the next edit, and unmet
  targets cannot be reported as complete.
- Low-gain discipline: if the second correctness-passing optimization improves
  less than 5%, the next edit is blocked on deep profiling evidence:
  `hipprof` PMC all, SQTT JSON if available, `dccobjdump`, code-object
  resource usage, and LDS/register/occupancy analysis.
- Tuning decisions and dispatcher/config updates when `W` has multiple
  regimes.
- Final correctness matrix, benchmark matrix, fallback paths, unsupported
  regimes, final target-hit guard validation, and residual risk.

## RLCR Startup

After writing the refined plan and making sure `.humanize*` is ignored, start
the loop from the LightOp root:

```bash
"{{HUMANIZE_RUNTIME_ROOT}}/scripts/setup-rlcr-loop.sh" .humanize/lightop-agent/refined-plan.md --yolo
```

If setup exits non-zero, stop and report the error. Do not bypass the gate.
Exception: if setup fails only because the `codex` CLI is unavailable, manual
fallback mode is allowed. Before continuing, write:

```text
.humanize/lightop-agent/rlcr-fallback.md
.humanize/lightop-agent/refined-plan.md
.humanize/lightop-agent/research-digest.md
.humanize/lightop-agent/attempt-ledger.md
.humanize/lightop-agent/kernel_opt_readme.md
```

`rlcr-fallback.md` must state that Codex review gate is unavailable, include
the exact setup command and error output, name the missing dependency, and
declare that all build/test/benchmark/profile, device-selection, evidence,
performance-target, low-gain, and logging constraints from this skill still
apply. In fallback mode, proceed manually with the same optimization loop, but
do not claim that Humanize/Codex review was active.

After setup succeeds:

1. Read `.humanize/rlcr/<timestamp>/round-0-prompt.md`.
2. Execute the current round.
3. Commit or stage only the requested LightOp changes if the user asked for
   version-control actions.
4. Write the required round summary.
5. Stop normally so the Humanize Stop hook can review.

If the hook blocks exit, follow the generated next-round prompt exactly.