# LightOp KernelPilot
**A Humanize-powered skill pack for adding and optimizing LightOp fused
operators on DCU/ROCm.**
LightOp KernelPilot adapts the original KernelPilot idea to the LightOp DCU
operator library. It keeps the useful parts of the autonomous loop: explicit
operator semantics, correctness references, workload distributions, benchmark
evidence, profiling digests, ledgers, and review-gated iteration. It removes
the NVIDIA-first assumptions around Nsight Compute, CUTLASS/CuTe, PTX/SASS,
Blackwell/Hopper, TMA, WGMMA, and tcgen05.
## Skills
| Skill | Role |
| --- | --- |
| [`lightop-kernel-agent-loop`](humanize/skills/humanize-kernel-agent-loop/) | Adds or optimizes LightOp operators by recovering `K/R/W`, inspecting LightOp wrappers/bindings/kernels/tests/configs, implementing HIP/ROCm changes, building, testing, benchmarking, profiling, tuning, and starting Humanize RLCR. |
| [`lightop-kernel-knowledge`](knowledge/SKILL.md) | Gathers evidence from local LightOp source first, then ROCm/DCU upstream and official docs, then the bundled CUDA PR corpus only as cross-platform inspiration. |
| [`dcu-profiler-report`](humanize/skills/ncu-report/) | Turns `hipprof`, ROCm/DTK profiler output, benchmark logs, and optional AMDGPU ISA evidence into a reproducible digest with exactly one next LightOp edit. |
The on-disk folder names `humanize-kernel-agent-loop` and `ncu-report` are
kept for compatibility with the upstream Humanize installer, but the skill
frontmatter exposes the LightOp/DCU names above.
## Request Shape
A useful request names the operator, correctness reference, workload
distribution, execution environment, target DCU/gfx arch, baseline, benchmark
method, and success threshold:
```text
[$lightop-kernel-agent-loop] Add a LightOp fused rmsnorm + rope + fp8 kv-cache
store operator for gfx936. Use PyTorch/native LightOp composition as the
correctness reference, cover batch/token/head_dim shapes from Qwen decode, and
run in Docker container lightop-dtk with /workspace/lightop as the repo path.
Beat the existing unfused path by 15% p50 latency.
```
For optimization:
```text
[$lightop-kernel-agent-loop] Optimize lightop.moe_gemm_w8a8 on gfx938 for the
DeepSeek EP8 decode workload. Keep the public Python API unchanged, compare
against the current LightOp baseline, and use hipprof evidence when benchmark
results plateau.
```
## LightOp Integration
The loop operates on a LightOp checkout containing:
```text
setup.py
setup_torch29.py
lightop/__init__.py
lightop/csrc/export.cpp
lightop/csrc//
lightop/config*.py
test/
```
When adding an operator, the agent normally touches:
- `lightop/csrc//` for HIP/C++ implementation.
- `lightop/csrc/export.cpp` for `m.def(...)` binding.
- `lightop/.py` for the Python wrapper.
- `lightop/__init__.py` for public API exports.
- `setup.py` and `setup_torch29.py` only when a new `csrc` family needs glob
coverage.
- `test/test_.py` and benchmark scripts for correctness/performance.
- `lightop/config*.py` when shape/gfx-aware dispatch is needed.
## DCU Profiling
`dcu-profiler-report` uses the SourceFind DCU performance analysis guide as
the official first-pass reference:
```text
https://developer.sourcefind.cn/gitbook//dcu_developer/DeveloperGuide/dcu_programming/DCU_programming_chapter3_7.html
```
Default first-pass profile:
```bash
cd /path/to/lightop
mkdir -p .humanize/lightop-agent/profile-artifacts/v000_baseline
python test/test_.py 2>&1 \
| tee .humanize/lightop-agent/profile-artifacts/v000_baseline/benchmark.log
hipprof python test/test_.py 2>&1 \
| tee .humanize/lightop-agent/profile-artifacts/v000_baseline/hipprof.txt
```
Deeper profiling can use `rocprof`, `rocprofv3`, `rocprof-compute`, or
AMDGPU ISA/code-object inspection when `hipprof` and benchmark logs are not
enough.
## Build And Test Defaults
Run build, install, tests, benchmark, and profiling in one consistent
environment. If Docker is the target environment, prefer repeatable
non-interactive commands:
```bash
docker exec bash -lc 'cd /workspace/lightop && '
```
From the LightOp root inside that environment:
```bash
python - <<'PY'
import torch
print("torch:", torch.__version__)
print("hip:", torch.version.hip)
print("device:", torch.cuda.get_device_name(0))
print("gcn:", torch.cuda.get_device_properties(0).gcnArchName)
PY
hipcc --version
PYTORCH_ROCM_ARCH='gfx928;gfx936;gfx938' python setup.py install
python - <<'PY'
import torch, lightop
print("lightop:", getattr(lightop, "__file__", "unknown"))
print("gcn:", torch.cuda.get_device_properties(0).gcnArchName)
PY
```
Run the narrowest targeted test first:
```bash
cd test
python test_.py
```
Benchmark scripts must use warmup and explicit `torch.cuda.synchronize()` or
HIP synchronization around timed regions before claiming speedups. A change is
not done at build success: it needs install, import smoke, correctness,
benchmark comparison against the baseline, and profiler evidence when the
result is close to the threshold or surprising.
## Install
Install the Humanize skill pack from this checkout:
```bash
cd /path/to/kernel-pilot
./humanize/scripts/install-skill.sh --target codex --kernelpilot-root "$PWD"
```
Claude Code users can use:
```bash
./humanize/scripts/install-skills-claude.sh --kernelpilot-root "$PWD"
```
After installation, the relevant skills are:
```text
lightop-kernel-agent-loop
lightop-kernel-knowledge
dcu-profiler-report
```
## Evidence Rules
- Local LightOp source, tests, configs, and benchmarks are the primary
evidence.
- ROCm/DCU official docs and upstream source are the next evidence tier.
- The bundled CUDA PR corpus is allowed only as cross-platform inspiration
unless the idea is translated and validated on DCU.
- Any copied or adapted external source must record source path/URL, commit or
version, license/notice, and the optimization delta.
- A profile digest must end with exactly one next edit or explain why profiling
is not actionable.