--- name: lightop-kernel-agent-loop description: "Run an autonomous Humanize loop for adding or optimizing LightOp fused operators on DCU/ROCm: recover K/R/W, inspect LightOp bindings and tests, implement HIP/ROCm kernels or dispatch changes, build, test, benchmark, profile with dcu-profiler-report, record evidence, and start RLCR." --- # LightOp Kernel Agent Loop Use this flow when the user wants to add a new LightOp operator or optimize an existing LightOp operator for DCU/ROCm. This is the LightOp-specific wrapper around Humanize RLCR. Do not use this flow for generic CUDA/NVIDIA kernel tuning. Prefer HIP C++, ROCm PyTorch extensions, existing LightOp patterns, MIOpen/rocBLAS/hipBLASLt, Composable Kernel, or Triton AMD only when the local repo already supports that route or the user explicitly asks for it. ## Input Contract Recover or define these before implementation: ```text K: operator semantics, input/output tensors, dtype/layout/mode constraints R: correctness reference, usually PyTorch native, an existing LightOp path, or a small literal oracle for edge cases W: workload distribution, including target shapes, dtypes, model scenario, gfx arch, and latency/throughput comparison target E: execution environment, including Docker container/image or direct host, LightOp path inside that environment, visible DCU device, DTK/ROCm/PyTorch versions, build command, test command, benchmark command, and pass threshold ``` Ask the user only when `K`, `R`, `W`, target DCU/gfx arch, comparison baseline, execution environment, pass threshold, or a hard scope constraint is missing and cannot be inferred safely from the LightOp repo. After those inputs exist, the loop owns tactical choices: inspection, research, implementation route, profiling depth, benchmark expansion, and tuning lineage. ## Locate LightOp Operate on a LightOp checkout, not on a standalone CUDA experiment by default. The LightOp root is the directory containing: ```text setup.py lightop/__init__.py lightop/csrc/export.cpp test/ ``` Search in this order: 1. User-provided path. 2. Current working directory. 3. Sibling directories named `lightop` near the current workspace. If no LightOp checkout exists, ask for its path. If the user explicitly wants a scratch experiment, keep it outside LightOp but record how the final candidate will land back in the LightOp source tree. ## Execution Environment Run build, correctness, benchmark, and profiling in one consistent environment. If the user names a Docker container or image, treat Docker as part of the acceptance contract. Prefer non-interactive commands, so the loop can capture logs and repeat the same command: ```bash docker exec bash -lc 'cd /path/in/container/lightop && ' ``` If the user provides an image rather than a running container, ask for or infer the host LightOp path and run with DCU devices exposed: ```bash docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --network=host \ -v :/workspace/lightop \ bash -lc 'cd /workspace/lightop && ' ``` Record these before the first build: - Container name or image tag, or `direct-host` when not using Docker. - LightOp path from the command's point of view, not only the host path. - Visible device env such as `HIP_VISIBLE_DEVICES` or `HSA_VISIBLE_DEVICES`. - `PYTORCH_ROCM_ARCH`, DTK/ROCm version, PyTorch version, HIP version, device name, and `gcnArchName`. - Exact build/install, import-smoke, correctness, benchmark, and profiler commands. - Pass/fail threshold: numerical tolerance for correctness and the required performance target versus baseline. Do not mix a host build with container tests unless the user explicitly wants that split and both paths point at the same compiled extension. ## LightOp Integration Map For a new or modified operator, inspect the nearest existing operator family first and follow its style. Core source locations: ```text lightop/csrc//*.cu|*.cuh|*.h|*.cpp lightop/csrc/export.cpp lightop/.py lightop/__init__.py setup.py test/test_.py test//*benchmark*.py lightop/config*.py ``` Typical add-operator checklist: - Add or modify HIP/C++ implementation under the closest `lightop/csrc/` family. Create a new family only when no existing family fits. - Expose the C++ symbol in `lightop/csrc/export.cpp` with `m.def(...)`. - Add a Python wrapper that imports `from . import op` and preserves LightOp's tensor validation style. - Export public APIs from `lightop/__init__.py` only when the operator should be user-facing. - If a new `csrc/` directory is created, update `setup.py` source globs. Do not add a parallel `setup_torch29.py` build path unless the user explicitly asks for legacy metadata maintenance. - If performance depends on shape/gfx-specific choices, update or add the relevant config/dispatcher table under `lightop/config*.py`. - Add focused correctness tests under `test/` and benchmark coverage for `W`. Optimization-only checklist: - Identify the current Python wrapper, exported C++ binding, kernel launch path, config/dispatcher branch, and existing test/benchmark. - Preserve the public API unless the user explicitly asks for a breaking change. - Keep edits scoped to the operator family, binding, tests, benchmark, and tuning table needed for the target task. - Record rejected lineages when an optimization regresses or only helps a non-target shape. ## DCU/ROCm Defaults - Treat PyTorch's `torch.cuda` namespace as the ROCm runtime facade when used by LightOp tests. - Prefer `hipcc`/ROCm extension builds over NVIDIA-only compile paths. - Respect `PYTORCH_ROCM_ARCH`; if unset, derive gfx from `torch.cuda.get_device_properties(0).gcnArchName`. - Prefer `gfx928;gfx936;gfx938` when the local LightOp setup already uses that default, unless the user provides a narrower target. - Do not introduce CUDA-only headers, PTX/SASS assumptions, CUTLASS/CuTe, Nsight Compute, TMA/WGMMA/tcgen05, or NVIDIA architecture rules into LightOp code paths. - When borrowing ideas from CUDA sources, translate them into ROCm concepts and record that the source is only cross-platform inspiration. Useful environment probe: ```bash python - <<'PY' import torch print("torch:", torch.__version__) print("hip:", torch.version.hip) print("device:", torch.cuda.get_device_name(0)) print("gcn:", torch.cuda.get_device_properties(0).gcnArchName) PY hipcc --version ``` ## Workflow ### Stage 1: Inspect And Plan 1. Locate the LightOp root and target operator family. 2. Recover `K`, `R`, `W`, target gfx arch, baseline command, and success threshold. 3. Inspect the existing wrapper, binding, kernel, config table, tests, and benchmarks. 4. Query `lightop-kernel-knowledge` when local source or upstream ROCm evidence can shape the implementation route. 5. Write a concise research digest in the loop state before the first serious implementation lineage. ### Stage 2: Implement And Verify 1. Make the smallest LightOp source change that can satisfy the current task-acceptance pair. 2. Build LightOp with the target arch. 3. Run the targeted correctness test and then the benchmark for `W`. 4. Record every candidate result: correctness failure, build failure, regression, plateau, and improvement. 5. Invoke `dcu-profiler-report` when benchmark evidence is not enough to choose the next edit. ### Stage 3: Tune And Integrate 1. Build a performance map over `W`. 2. Add shape/gfx-aware dispatch or config-table entries only when measured regimes need different choices. 3. Re-run correctness across all touched dtypes/layouts/modes. 4. Re-run benchmark cases that define `W` and any nearby regression guards. 5. Summarize final code paths, fallback behavior, unsupported regimes, and remaining risks. ## Required Loop State Keep Humanize state local and untracked: ```text .humanize/lightop-agent/refined-plan.md .humanize/lightop-agent/research-digest.md .humanize/lightop-agent/attempt-ledger.md .humanize/lightop-agent/optimization-ledger.md .humanize/lightop-agent/lineage.jsonl .humanize/lightop-agent/performance-map.json .humanize/lightop-agent/tuning-decisions.md .humanize/lightop-agent/profile-artifacts/ ``` Before starting RLCR, make sure `.humanize*` is ignored. Do not commit loop state unless the user explicitly asks for tracked evidence artifacts. ## Build, Test, Benchmark Build from the LightOp root inside the selected execution environment. Always use `python setup.py install` for LightOp builds, regardless of the installed PyTorch version. Do not switch to `setup_torch29.py`. ```bash PYTORCH_ROCM_ARCH='gfx928;gfx936;gfx938' python setup.py install ``` Keep the existing `build/` directory between attempts so incremental extension builds can reuse prior compilation output. Do not delete `build/` as part of the normal build/test/tune loop unless the user explicitly requests a clean build or the build cache is proven to be stale or corrupt. After install, run an import smoke test in the same environment: ```bash python - <<'PY' import torch, lightop print("torch:", torch.__version__) print("hip:", torch.version.hip) print("lightop:", getattr(lightop, "__file__", "unknown")) print("device:", torch.cuda.get_device_name(0)) print("gcn:", torch.cuda.get_device_properties(0).gcnArchName) PY ``` Run the narrowest relevant test first: ```bash cd test python test_.py ``` Then run the relevant benchmark script and compare it against the named baseline and threshold. If no benchmark exists, add a small benchmark that uses warmup, fixed shapes, fixed seeds, and explicit `torch.cuda.synchronize()` around timed regions. Do not claim success from a passing build alone. A LightOp operator change is complete only after install, import smoke, targeted correctness, benchmark comparison, and, when the result is near the threshold or surprising, profiler evidence. Do not claim speedups from Python wall-clock timing unless asynchronous DCU work is synchronized. ## Profiling Invoke `dcu-profiler-report` autonomously when profiler evidence is the best next source of truth. These are heuristics, not user-facing gates: - Baseline benchmark has passed and no profile digest exists. - A correct candidate is within +/-2% of baseline or the prior best. - A correct candidate regresses on an important shape. - The benchmark plateaued and the next edit is unclear. - A candidate is much faster than expected and needs explanation. - A reviewer asks for profiling evidence. Persist profile artifacts under `.humanize/lightop-agent/profile-artifacts/` or the user-specified evidence directory. Each digest must end with exactly one concrete next kernel edit or a clear reason profiling is not actionable. ## Plan Requirements Write `.humanize/lightop-agent/refined-plan.md` using the Humanize gen-plan schema. Include acceptance criteria for: - LightOp root, target operator family, public API, and modified files. - Explicit `K`, `R`, `W`, target gfx arch, baseline command, comparison target, and hard scope exclusions. - Explicit `E`: Docker container/image or direct-host mode, LightOp path inside the execution environment, visible device selection, install command, smoke command, correctness command, benchmark command, profiler command, and pass threshold. - Correctness coverage for `W`, edge cases, dtype/layout/mode boundaries, and baseline/reference parity. - Build command, ROCm/DTK/PyTorch versions, `PYTORCH_ROCM_ARCH`, and device metadata. - Benchmark method with warmup, repeats, synchronization, per-shape timing, p50/p90 or mean as appropriate, and environment metadata. - Research digest covering local LightOp patterns and any upstream/source evidence that materially changes the route. - Attempt ledger for every candidate. - Optimization ledger only for correct candidates with measured improvement. - Tuning decisions and dispatcher/config updates when `W` has multiple regimes. - Final correctness matrix, benchmark matrix, fallback paths, unsupported regimes, and residual risk. ## RLCR Startup After writing the refined plan and making sure `.humanize*` is ignored, start the loop from the LightOp root: ```bash "{{HUMANIZE_RUNTIME_ROOT}}/scripts/setup-rlcr-loop.sh" .humanize/lightop-agent/refined-plan.md --yolo ``` If setup exits non-zero, stop and report the error. Do not bypass the gate. After setup succeeds: 1. Read `.humanize/rlcr//round-0-prompt.md`. 2. Execute the current round. 3. Commit or stage only the requested LightOp changes if the user asked for version-control actions. 4. Write the required round summary. 5. Stop normally so the Humanize Stop hook can review. If the hook blocks exit, follow the generated next-round prompt exactly.