--- name: lightop-kernel-agent-loop description: "Run an autonomous Humanize loop for adding or optimizing LightOp fused operators on DCU/ROCm: recover K/R/W, inspect LightOp bindings and tests, implement HIP/ROCm kernels or dispatch changes, build, test, benchmark, profile with dcu-profiler-report, record evidence, and start RLCR." --- # LightOp Kernel Agent Loop Use this flow when the user wants to add a new LightOp operator or optimize an existing LightOp operator for DCU/ROCm. This is the LightOp-specific wrapper around Humanize RLCR. Do not use this flow for generic CUDA/NVIDIA kernel tuning. Prefer HIP C++, ROCm PyTorch extensions, existing LightOp patterns, MIOpen/rocBLAS/hipBLASLt, Composable Kernel, or Triton AMD only when the local repo already supports that route or the user explicitly asks for it. ## Input Contract Recover or define these before implementation: ```text K: operator semantics, input/output tensors, dtype/layout/mode constraints R: correctness reference, usually PyTorch native, an existing LightOp path, or a small literal oracle for edge cases W: workload distribution, including target shapes, dtypes, model scenario, gfx arch, and latency/throughput comparison target E: execution environment, including Docker container/image or direct host, LightOp path inside that environment, visible DCU device, DTK/ROCm/PyTorch versions, build command, test command, benchmark command, and pass threshold ``` Ask the user only when `K`, `R`, `W`, target DCU/gfx arch, comparison baseline, execution environment, pass threshold, or a hard scope constraint is missing and cannot be inferred safely from the LightOp repo. After those inputs exist, the loop owns tactical choices: inspection, research, implementation route, profiling depth, benchmark expansion, and tuning lineage. ## Locate LightOp Operate on a LightOp checkout, not on a standalone CUDA experiment by default. The LightOp root is the directory containing: ```text setup.py lightop/__init__.py lightop/csrc/export.cpp test/ ``` Search in this order: 1. User-provided path. 2. Current working directory. 3. Sibling directories named `lightop` near the current workspace. If no LightOp checkout exists, ask for its path. If the user explicitly wants a scratch experiment, keep it outside LightOp but record how the final candidate will land back in the LightOp source tree. ## Execution Environment Run build, correctness, benchmark, and profiling in one consistent environment. If the user names a Docker container or image, treat Docker as part of the acceptance contract. Prefer non-interactive commands, so the loop can capture logs and repeat the same command: ```bash docker exec bash -lc 'cd /path/in/container/lightop && ' ``` If the user provides an image rather than a running container, ask for or infer the host LightOp path and run with DCU devices exposed: ```bash docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --network=host \ -v :/workspace/lightop \ bash -lc 'cd /workspace/lightop && ' ``` Record these before the first build: - Container name or image tag, or `direct-host` when not using Docker. - LightOp path from the command's point of view, not only the host path. - Visible device env such as `HIP_VISIBLE_DEVICES` or `HSA_VISIBLE_DEVICES`. - DCU status command and selected card before performance runs: `hy-smi` or `rocm-smi`, the observed HCU utilization, VRAM use, and the `HIP_VISIBLE_DEVICES=` value used for benchmark/profile commands. - `PYTORCH_ROCM_ARCH`, DTK/ROCm version, PyTorch version, HIP version, device name, and `gcnArchName`. - Exact build/install, import-smoke, correctness, benchmark, and profiler commands. - Pass/fail threshold: numerical tolerance for correctness and the required performance target versus baseline. Do not mix a host build with container tests unless the user explicitly wants that split and both paths point at the same compiled extension. ## LightOp Integration Map For a new or modified operator, inspect the nearest existing operator family first and follow its style. Core source locations: ```text lightop/csrc//*.cu|*.cuh|*.h|*.cpp lightop/csrc/export.cpp lightop/.py lightop/__init__.py setup.py test/test_.py test//*benchmark*.py lightop/config*.py ``` Typical add-operator checklist: - Add or modify HIP/C++ implementation under the closest `lightop/csrc/` family. Create a new family only when no existing family fits. - Expose the C++ symbol in `lightop/csrc/export.cpp` with `m.def(...)`. - Add a Python wrapper that imports `from . import op` and preserves LightOp's tensor validation style. - Export public APIs from `lightop/__init__.py` only when the operator should be user-facing. - If a new `csrc/` directory is created, update `setup.py` source globs. Do not add a parallel `setup_torch29.py` build path unless the user explicitly asks for legacy metadata maintenance. - If performance depends on shape/gfx-specific choices, update or add the relevant config/dispatcher table under `lightop/config*.py`. - Add focused correctness tests under `test/` and benchmark coverage for `W`. Optimization-only checklist: - Identify the current Python wrapper, exported C++ binding, kernel launch path, config/dispatcher branch, and existing test/benchmark. - Preserve the public API unless the user explicitly asks for a breaking change. - Do not add a new public API for an existing-operator optimization task. - Do not create unrelated operator files or new operator families. - Do not modify unrelated operator families. - Prefer edits scoped to the target operator's kernel source, launcher, necessary dispatcher/config entries, focused tests, and benchmark coverage. - Run install, correctness, benchmark, profiling, and tuning directly in the selected execution environment; do not detour through a new-operator integration flow unless the existing test/benchmark surface cannot cover the target. - Keep edits scoped to the operator family, binding, tests, benchmark, and tuning table needed for the target task. - Record rejected lineages when an optimization regresses or only helps a non-target shape. ## DCU/ROCm Defaults - Treat PyTorch's `torch.cuda` namespace as the ROCm runtime facade when used by LightOp tests. - Prefer `hipcc`/ROCm extension builds over NVIDIA-only compile paths. - Respect `PYTORCH_ROCM_ARCH`; if unset, derive gfx from `torch.cuda.get_device_properties(0).gcnArchName`. - Prefer `gfx928;gfx936;gfx938` when the local LightOp setup already uses that default, unless the user provides a narrower target. - Do not introduce CUDA-only headers, PTX/SASS assumptions, CUTLASS/CuTe, Nsight Compute, TMA/WGMMA/tcgen05, or NVIDIA architecture rules into LightOp code paths. - When borrowing ideas from CUDA sources, translate them into ROCm concepts and record that the source is only cross-platform inspiration. Useful environment probe: ```bash python - <<'PY' import torch print("torch:", torch.__version__) print("hip:", torch.version.hip) print("device:", torch.cuda.get_device_name(0)) print("gcn:", torch.cuda.get_device_properties(0).gcnArchName) PY hipcc --version ``` ## Performance Device Gate Before any benchmark or profiling command, check card status in the target execution environment and pin the run to an idle card. - Run `hy-smi` or `rocm-smi` immediately before the performance command. - Choose the card with low HCU utilization and low VRAM occupancy. If no card is idle enough for stable results, delay the performance run or record that the result is noisy and not acceptable as final evidence. - Prefix benchmark and profiler commands with `HIP_VISIBLE_DEVICES=`. Keep the same card for the paired baseline/candidate comparison unless the comparison is deliberately measuring cross-device behavior. - Record the chosen card, HCU utilization, VRAM use, and exact command in the attempt ledger or profile artifact directory. Example: ```bash hy-smi || rocm-smi HIP_VISIBLE_DEVICES= python test//benchmark_.py HIP_VISIBLE_DEVICES= hipprof python test//benchmark_.py ``` Do not report a performance number as actionable unless the device-selection gate was recorded or the user explicitly accepts the missing card-state evidence. ## Workflow ### Stage 1: Inspect And Plan 1. Locate the LightOp root and target operator family. 2. Recover `K`, `R`, `W`, target gfx arch, baseline command, and success threshold. 3. Inspect the existing wrapper, binding, kernel, config table, tests, and benchmarks. 4. Before the first optimization edit, it is recommended to query `lightop-kernel-knowledge` for local LightOp patterns, ROCm/DCU upstream evidence, Hygon/DCU source references, and portable ideas from the bundled corpus. Use this whenever it can shape the first implementation route. 5. Write a concise research digest in the loop state before the first serious implementation lineage. ### Stage 2: Implement And Verify 1. Make the smallest LightOp source change that can satisfy the current task-acceptance pair. 2. Build LightOp with the target arch. 3. Run the targeted correctness test. Before the benchmark for `W`, execute the performance device gate and pin the run with `HIP_VISIBLE_DEVICES`. 4. Record every candidate result: correctness failure, build failure, regression, plateau, and improvement. 5. Invoke `dcu-profiler-report` when benchmark evidence is not enough to choose the next edit. 6. If the first correctness-passing candidate misses the required performance threshold, continue into profiling and tuning instead of declaring the task done. ### Stage 3: Tune And Integrate 1. Build a performance map over `W`. 2. Add shape/gfx-aware dispatch or config-table entries only when measured regimes need different choices. 3. Re-run correctness across all touched dtypes/layouts/modes. 4. Re-run benchmark cases that define `W` and any nearby regression guards. 5. Summarize final code paths, fallback behavior, unsupported regimes, and remaining risks. ## Performance Target Discipline When the user gives a bandwidth, latency, throughput, or speedup target, treat that threshold as part of the acceptance contract. - Do not claim completion while the best correctness-passing candidate misses the target. Report it only as the current best result with bottleneck evidence and next steps. - For every correctness-passing candidate, record shape, dtype, layout/mode, kernel or dispatch configuration, measured bandwidth/latency, comparison baseline, and the reason it improved, regressed, or plateaued. - After the first correctness-passing candidate misses the target, run a profiling-and-tuning loop before stopping. - Try at least three evidence-backed performance optimization lineages unless profiler evidence shows the target is not reachable under the current `K/R/W` and environment constraints. - If two consecutive correctness-passing candidates miss the target, the next kernel or dispatch edit must be preceded by both: - A `lightop-kernel-knowledge` research pass covering local LightOp layernorm/rmsnorm/fused-norm patterns, relevant ROCm/DCU upstream evidence, and any portable reduction/vectorization ideas from the bundled corpus. - A `dcu-profiler-report` digest for a representative target shape. - If the second correctness-passing optimization attempt improves less than 5% over the relevant parent or baseline, run a deep `dcu-profiler-report` analysis before the next optimization edit. That analysis must include `hipprof` PMC all mode, SQTT JSON when available, `dccobjdump` disassembly, code-object resource usage, and explicit LDS/register/occupancy evidence or a recorded reason any item was unavailable. - The next edit after that gate must name exactly one concrete LightOp kernel, binding, dispatcher, config, or benchmark change and cite the knowledge and profiler evidence that motivated it. - Do not claim that an optimization is effective from intuition, source inspection, or expected hardware behavior. Promotion to the optimization ledger requires measured correctness-passing benchmark data, baseline or parent comparison, and, when a profiling gate applies, profiler/resource/ISA evidence. - If the target remains unmet after the required tuning lineages, summarize the best candidate, failed lineages, profiler bottleneck class, unsupported regimes, and the most likely next engineering investment. Do not present the operator as performance-complete. ## Required Loop State Keep Humanize state local and untracked: ```text .humanize/lightop-agent/refined-plan.md .humanize/lightop-agent/research-digest.md .humanize/lightop-agent/attempt-ledger.md .humanize/lightop-agent/optimization-ledger.md .humanize/lightop-agent/lineage.jsonl .humanize/lightop-agent/performance-map.json .humanize/lightop-agent/tuning-decisions.md .humanize/lightop-agent/profile-artifacts/ ``` Before starting RLCR, make sure `.humanize*` is ignored. Do not commit loop state unless the user explicitly asks for tracked evidence artifacts. ## Build, Test, Benchmark Build from the LightOp root inside the selected execution environment. Always use `python setup.py install` for LightOp builds, regardless of the installed PyTorch version. Do not switch to `setup_torch29.py`. ```bash PYTORCH_ROCM_ARCH='gfx928;gfx936;gfx938' python setup.py install ``` Keep the existing `build/` directory between attempts so incremental extension builds can reuse prior compilation output. Do not delete `build/` as part of the normal build/test/tune loop unless the user explicitly requests a clean build or the build cache is proven to be stale or corrupt. After install, run an import smoke test in the same environment: ```bash python - <<'PY' import torch, lightop print("torch:", torch.__version__) print("hip:", torch.version.hip) print("lightop:", getattr(lightop, "__file__", "unknown")) print("device:", torch.cuda.get_device_name(0)) print("gcn:", torch.cuda.get_device_properties(0).gcnArchName) PY ``` Run the narrowest relevant test first: ```bash cd test python test_.py ``` Then run the relevant benchmark script and compare it against the named baseline and threshold. If no benchmark exists, add a small benchmark that uses warmup, fixed shapes, fixed seeds, and explicit `torch.cuda.synchronize()` around timed regions. Before every benchmark, run the performance device gate from this skill: capture `hy-smi` or `rocm-smi`, choose a low-utilization/low-VRAM card, and run the benchmark with `HIP_VISIBLE_DEVICES=`. Reuse the same selected card for baseline and candidate measurements. Do not claim success from a passing build alone. A LightOp operator change is complete only after install, import smoke, targeted correctness, benchmark comparison, and, when the result is near the threshold or surprising, profiler evidence. Do not claim speedups from Python wall-clock timing unless asynchronous DCU work is synchronized. ## Profiling Invoke `dcu-profiler-report` autonomously when profiler evidence is the best next source of truth. These are heuristics, not user-facing gates: - Baseline benchmark has passed and no profile digest exists. - A correct candidate is within +/-2% of baseline or the prior best. - A correct candidate regresses on an important shape. - The benchmark plateaued and the next edit is unclear. - The first correctness-passing candidate misses the user's required performance target. - Two consecutive correctness-passing candidates miss the target, in which case pair this profile with a `lightop-kernel-knowledge` research pass before the next kernel or dispatch edit. - The second correctness-passing optimization attempt improves less than 5% over its parent or baseline. This is a mandatory deep-analysis gate, not a heuristic. - A candidate is much faster than expected and needs explanation. - A reviewer asks for profiling evidence. Persist profile artifacts under `.humanize/lightop-agent/profile-artifacts/` or the user-specified evidence directory. Each digest must end with exactly one concrete next kernel edit or a clear reason profiling is not actionable. When the <5% second-optimization gate fires, the digest must include `hipprof` PMC all, SQTT JSON when supported, `dccobjdump` disassembly, code-object resource usage, and LDS/register/occupancy evidence. If any tool is missing, record the exact command attempted and do not replace the missing evidence with a guess. ## Plan Requirements Write `.humanize/lightop-agent/refined-plan.md` using the Humanize gen-plan schema. Include acceptance criteria for: - LightOp root, target operator family, public API, and modified files. - Explicit `K`, `R`, `W`, target gfx arch, baseline command, comparison target, and hard scope exclusions. - Explicit `E`: Docker container/image or direct-host mode, LightOp path inside the execution environment, visible device selection, install command, smoke command, correctness command, benchmark command, profiler command, and pass threshold. - Device gate: `hy-smi` or `rocm-smi` command, idle-card selection criteria, required `HIP_VISIBLE_DEVICES=` prefix for benchmark/profile, and where the card-state output is stored. - Correctness coverage for `W`, edge cases, dtype/layout/mode boundaries, and baseline/reference parity. - Build command, ROCm/DTK/PyTorch versions, `PYTORCH_ROCM_ARCH`, and device metadata. - Benchmark method with warmup, repeats, synchronization, per-shape timing, p50/p90 or mean as appropriate, and environment metadata. - Research digest covering local LightOp patterns and any upstream/source evidence that materially changes the route. - Attempt ledger for every candidate. - Optimization ledger only for correct candidates with measured improvement. - Performance-target discipline: the first miss triggers tuning, two consecutive correctness-passing misses trigger both `lightop-kernel-knowledge` research and `dcu-profiler-report` evidence before the next edit, and unmet targets cannot be reported as complete. - Low-gain discipline: if the second correctness-passing optimization improves less than 5%, the next edit is blocked on deep profiling evidence: `hipprof` PMC all, SQTT JSON if available, `dccobjdump`, code-object resource usage, and LDS/register/occupancy analysis. - Tuning decisions and dispatcher/config updates when `W` has multiple regimes. - Final correctness matrix, benchmark matrix, fallback paths, unsupported regimes, and residual risk. ## RLCR Startup After writing the refined plan and making sure `.humanize*` is ignored, start the loop from the LightOp root: ```bash "{{HUMANIZE_RUNTIME_ROOT}}/scripts/setup-rlcr-loop.sh" .humanize/lightop-agent/refined-plan.md --yolo ``` If setup exits non-zero, stop and report the error. Do not bypass the gate. After setup succeeds: 1. Read `.humanize/rlcr//round-0-prompt.md`. 2. Execute the current round. 3. Commit or stage only the requested LightOp changes if the user asked for version-control actions. 4. Write the required round summary. 5. Stop normally so the Humanize Stop hook can review. If the hook blocks exit, follow the generated next-round prompt exactly.