SKILL.md

---
name: lightop-kernel-agent-loop
description: "Run an autonomous Humanize loop for adding or optimizing LightOp fused operators on DCU/ROCm: recover K/R/W, inspect LightOp bindings and tests, implement HIP/ROCm kernels or dispatch changes, build, test, benchmark, profile with dcu-profiler-report, record evidence, and start RLCR."
---

# LightOp Kernel Agent Loop

Use this flow when the user wants to add a new LightOp operator or optimize an
existing LightOp operator for DCU/ROCm. This is the LightOp-specific wrapper
around Humanize RLCR.

Do not use this flow for generic CUDA/NVIDIA kernel tuning. Prefer HIP C++,
ROCm PyTorch extensions, existing LightOp patterns, MIOpen/rocBLAS/hipBLASLt,
Composable Kernel, or Triton AMD only when the local repo already supports that
route or the user explicitly asks for it.

## Input Contract

Recover or define these before implementation:

```text
K: operator semantics, input/output tensors, dtype/layout/mode constraints
R: correctness reference, usually PyTorch native, an existing LightOp path, or
   a small literal oracle for edge cases
W: workload distribution, including target shapes, dtypes, model scenario,
   gfx arch, and latency/throughput comparison target
E: execution environment, including Docker container/image or direct host,
   LightOp path inside that environment, visible DCU device, DTK/ROCm/PyTorch
   versions, build command, test command, benchmark command, and pass threshold
```

Ask the user only when `K`, `R`, `W`, target DCU/gfx arch, comparison baseline,
execution environment, pass threshold, or a hard scope constraint is missing
and cannot be inferred safely from the LightOp repo. After those inputs exist,
the loop owns tactical choices: inspection, research, implementation route,
profiling depth, benchmark expansion, and tuning lineage.

## Locate LightOp

Operate on a LightOp checkout, not on a standalone CUDA experiment by default.
The LightOp root is the directory containing:

```text
setup.py
lightop/__init__.py
lightop/csrc/export.cpp
test/
```

Search in this order:

1. User-provided path.
2. Current working directory.
3. Sibling directories named `lightop` near the current workspace.

If no LightOp checkout exists, ask for its path. If the user explicitly wants a
scratch experiment, keep it outside LightOp but record how the final candidate
will land back in the LightOp source tree.

## Execution Environment

Run build, correctness, benchmark, and profiling in one consistent environment.
If the user names a Docker container or image, treat Docker as part of the
acceptance contract. Prefer non-interactive commands, so the loop can capture
logs and repeat the same command:

```bash
docker exec <container> bash -lc 'cd /path/in/container/lightop && <command>'
```

If the user provides an image rather than a running container, ask for or infer
the host LightOp path and run with DCU devices exposed:

```bash
docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --network=host \
  -v <host-lightop>:/workspace/lightop <image> \
  bash -lc 'cd /workspace/lightop && <command>'
```

Record these before the first build:

- Container name or image tag, or `direct-host` when not using Docker.
- LightOp path from the command's point of view, not only the host path.
- Visible device env such as `HIP_VISIBLE_DEVICES` or `HSA_VISIBLE_DEVICES`.
- `PYTORCH_ROCM_ARCH`, DTK/ROCm version, PyTorch version, HIP version, device
  name, and `gcnArchName`.
- Exact build/install, import-smoke, correctness, benchmark, and profiler
  commands.
- Pass/fail threshold: numerical tolerance for correctness and the required
  performance target versus baseline.

Do not mix a host build with container tests unless the user explicitly wants
that split and both paths point at the same compiled extension.

## LightOp Integration Map

For a new or modified operator, inspect the nearest existing operator family
first and follow its style.

Core source locations:

```text
lightop/csrc/<family>/*.cu|*.cuh|*.h|*.cpp
lightop/csrc/export.cpp
lightop/<python_wrapper>.py
lightop/__init__.py
setup.py
test/test_<op>.py
test/<family>/*benchmark*.py
lightop/config*.py
```

Typical add-operator checklist:

- Add or modify HIP/C++ implementation under the closest `lightop/csrc/`
  family. Create a new family only when no existing family fits.
- Expose the C++ symbol in `lightop/csrc/export.cpp` with `m.def(...)`.
- Add a Python wrapper that imports `from . import op` and preserves LightOp's
  tensor validation style.
- Export public APIs from `lightop/__init__.py` only when the operator should
  be user-facing.
- If a new `csrc/<family>` directory is created, update `setup.py` source
  globs. Do not add a parallel `setup_torch29.py` build path unless the user
  explicitly asks for legacy metadata maintenance.
- If performance depends on shape/gfx-specific choices, update or add the
  relevant config/dispatcher table under `lightop/config*.py`.
- Add focused correctness tests under `test/` and benchmark coverage for `W`.

Optimization-only checklist:

- Identify the current Python wrapper, exported C++ binding, kernel launch
  path, config/dispatcher branch, and existing test/benchmark.
- Preserve the public API unless the user explicitly asks for a breaking
  change.
- Do not add a new public API for an existing-operator optimization task.
- Do not create unrelated operator files or new operator families.
- Do not modify unrelated operator families.
- Prefer edits scoped to the target operator's kernel source, launcher,
  necessary dispatcher/config entries, focused tests, and benchmark coverage.
- Run install, correctness, benchmark, profiling, and tuning directly in the
  selected execution environment; do not detour through a new-operator
  integration flow unless the existing test/benchmark surface cannot cover the
  target.
- Keep edits scoped to the operator family, binding, tests, benchmark, and
  tuning table needed for the target task.
- Record rejected lineages when an optimization regresses or only helps a
  non-target shape.

## DCU/ROCm Defaults

- Treat PyTorch's `torch.cuda` namespace as the ROCm runtime facade when used
  by LightOp tests.
- Prefer `hipcc`/ROCm extension builds over NVIDIA-only compile paths.
- Respect `PYTORCH_ROCM_ARCH`; if unset, derive gfx from
  `torch.cuda.get_device_properties(0).gcnArchName`.
- Prefer `gfx928;gfx936;gfx938` when the local LightOp setup already uses that
  default, unless the user provides a narrower target.
- Do not introduce CUDA-only headers, PTX/SASS assumptions, CUTLASS/CuTe,
  Nsight Compute, TMA/WGMMA/tcgen05, or NVIDIA architecture rules into LightOp
  code paths.
- When borrowing ideas from CUDA sources, translate them into ROCm concepts
  and record that the source is only cross-platform inspiration.

Useful environment probe:

```bash
python - <<'PY'
import torch
print("torch:", torch.__version__)
print("hip:", torch.version.hip)
print("device:", torch.cuda.get_device_name(0))
print("gcn:", torch.cuda.get_device_properties(0).gcnArchName)
PY
hipcc --version
```

## Workflow

### Stage 1: Inspect And Plan

1. Locate the LightOp root and target operator family.
2. Recover `K`, `R`, `W`, target gfx arch, baseline command, and success
   threshold.
3. Inspect the existing wrapper, binding, kernel, config table, tests, and
   benchmarks.
4. Query `lightop-kernel-knowledge` when local source or upstream ROCm evidence
   can shape the implementation route.
5. Write a concise research digest in the loop state before the first serious
   implementation lineage.

### Stage 2: Implement And Verify

1. Make the smallest LightOp source change that can satisfy the current
   task-acceptance pair.
2. Build LightOp with the target arch.
3. Run the targeted correctness test and then the benchmark for `W`.
4. Record every candidate result: correctness failure, build failure,
   regression, plateau, and improvement.
5. Invoke `dcu-profiler-report` when benchmark evidence is not enough to choose
   the next edit.
6. If the first correctness-passing candidate misses the required performance
   threshold, continue into profiling and tuning instead of declaring the task
   done.

### Stage 3: Tune And Integrate

1. Build a performance map over `W`.
2. Add shape/gfx-aware dispatch or config-table entries only when measured
   regimes need different choices.
3. Re-run correctness across all touched dtypes/layouts/modes.
4. Re-run benchmark cases that define `W` and any nearby regression guards.
5. Summarize final code paths, fallback behavior, unsupported regimes, and
   remaining risks.

## Performance Target Discipline

When the user gives a bandwidth, latency, throughput, or speedup target, treat
that threshold as part of the acceptance contract.

- Do not claim completion while the best correctness-passing candidate misses
  the target. Report it only as the current best result with bottleneck
  evidence and next steps.
- For every correctness-passing candidate, record shape, dtype, layout/mode,
  kernel or dispatch configuration, measured bandwidth/latency, comparison
  baseline, and the reason it improved, regressed, or plateaued.
- After the first correctness-passing candidate misses the target, run a
  profiling-and-tuning loop before stopping.
- Try at least three evidence-backed performance optimization lineages unless
  profiler evidence shows the target is not reachable under the current `K/R/W`
  and environment constraints.
- If two consecutive correctness-passing candidates miss the target, the next
  kernel or dispatch edit must be preceded by both:
  - A `lightop-kernel-knowledge` research pass covering local LightOp
    layernorm/rmsnorm/fused-norm patterns, relevant ROCm/DCU upstream evidence,
    and any portable reduction/vectorization ideas from the bundled corpus.
  - A `dcu-profiler-report` digest for a representative target shape.
- The next edit after that gate must name exactly one concrete LightOp kernel,
  binding, dispatcher, config, or benchmark change and cite the knowledge and
  profiler evidence that motivated it.
- If the target remains unmet after the required tuning lineages, summarize the
  best candidate, failed lineages, profiler bottleneck class, unsupported
  regimes, and the most likely next engineering investment. Do not present the
  operator as performance-complete.

## Required Loop State

Keep Humanize state local and untracked:

```text
.humanize/lightop-agent/refined-plan.md
.humanize/lightop-agent/research-digest.md
.humanize/lightop-agent/attempt-ledger.md
.humanize/lightop-agent/optimization-ledger.md
.humanize/lightop-agent/lineage.jsonl
.humanize/lightop-agent/performance-map.json
.humanize/lightop-agent/tuning-decisions.md
.humanize/lightop-agent/profile-artifacts/
```

Before starting RLCR, make sure `.humanize*` is ignored. Do not commit loop
state unless the user explicitly asks for tracked evidence artifacts.

## Build, Test, Benchmark

Build from the LightOp root inside the selected execution environment. Always
use `python setup.py install` for LightOp builds, regardless of the installed
PyTorch version. Do not switch to `setup_torch29.py`.

```bash
PYTORCH_ROCM_ARCH='gfx928;gfx936;gfx938' python setup.py install
```

Keep the existing `build/` directory between attempts so incremental extension
builds can reuse prior compilation output. Do not delete `build/` as part of
the normal build/test/tune loop unless the user explicitly requests a clean
build or the build cache is proven to be stale or corrupt.

After install, run an import smoke test in the same environment:

```bash
python - <<'PY'
import torch, lightop
print("torch:", torch.__version__)
print("hip:", torch.version.hip)
print("lightop:", getattr(lightop, "__file__", "unknown"))
print("device:", torch.cuda.get_device_name(0))
print("gcn:", torch.cuda.get_device_properties(0).gcnArchName)
PY
```

Run the narrowest relevant test first:

```bash
cd test
python test_<op>.py
```

Then run the relevant benchmark script and compare it against the named
baseline and threshold. If no benchmark exists, add a small benchmark that uses
warmup, fixed shapes, fixed seeds, and explicit `torch.cuda.synchronize()`
around timed regions.

Do not claim success from a passing build alone. A LightOp operator change is
complete only after install, import smoke, targeted correctness, benchmark
comparison, and, when the result is near the threshold or surprising, profiler
evidence. Do not claim speedups from Python wall-clock timing unless
asynchronous DCU work is synchronized.

## Profiling

Invoke `dcu-profiler-report` autonomously when profiler evidence is the best
next source of truth. These are heuristics, not user-facing gates:

- Baseline benchmark has passed and no profile digest exists.
- A correct candidate is within +/-2% of baseline or the prior best.
- A correct candidate regresses on an important shape.
- The benchmark plateaued and the next edit is unclear.
- The first correctness-passing candidate misses the user's required
  performance target.
- Two consecutive correctness-passing candidates miss the target, in which case
  pair this profile with a `lightop-kernel-knowledge` research pass before the
  next kernel or dispatch edit.
- A candidate is much faster than expected and needs explanation.
- A reviewer asks for profiling evidence.

Persist profile artifacts under `.humanize/lightop-agent/profile-artifacts/`
or the user-specified evidence directory. Each digest must end with exactly one
concrete next kernel edit or a clear reason profiling is not actionable.

## Plan Requirements

Write `.humanize/lightop-agent/refined-plan.md` using the Humanize gen-plan
schema. Include acceptance criteria for:

- LightOp root, target operator family, public API, and modified files.
- Explicit `K`, `R`, `W`, target gfx arch, baseline command, comparison target,
  and hard scope exclusions.
- Explicit `E`: Docker container/image or direct-host mode, LightOp path inside
  the execution environment, visible device selection, install command, smoke
  command, correctness command, benchmark command, profiler command, and pass
  threshold.
- Correctness coverage for `W`, edge cases, dtype/layout/mode boundaries, and
  baseline/reference parity.
- Build command, ROCm/DTK/PyTorch versions, `PYTORCH_ROCM_ARCH`, and device
  metadata.
- Benchmark method with warmup, repeats, synchronization, per-shape timing,
  p50/p90 or mean as appropriate, and environment metadata.
- Research digest covering local LightOp patterns and any upstream/source
  evidence that materially changes the route.
- Attempt ledger for every candidate.
- Optimization ledger only for correct candidates with measured improvement.
- Performance-target discipline: the first miss triggers tuning, two
  consecutive correctness-passing misses trigger both `lightop-kernel-knowledge`
  research and `dcu-profiler-report` evidence before the next edit, and unmet
  targets cannot be reported as complete.
- Tuning decisions and dispatcher/config updates when `W` has multiple
  regimes.
- Final correctness matrix, benchmark matrix, fallback paths, unsupported
  regimes, and residual risk.

## RLCR Startup

After writing the refined plan and making sure `.humanize*` is ignored, start
the loop from the LightOp root:

```bash
"{{HUMANIZE_RUNTIME_ROOT}}/scripts/setup-rlcr-loop.sh" .humanize/lightop-agent/refined-plan.md --yolo
```

If setup exits non-zero, stop and report the error. Do not bypass the gate.

After setup succeeds:

1. Read `.humanize/rlcr/<timestamp>/round-0-prompt.md`.
2. Execute the current round.
3. Commit or stage only the requested LightOp changes if the user asked for
   version-control actions.
4. Write the required round summary.
5. Stop normally so the Humanize Stop hook can review.

If the hook blocks exit, follow the generated next-round prompt exactly.