# LightOp KernelPilot **A Humanize-powered skill pack for adding and optimizing LightOp fused operators on DCU/ROCm.**
LightOp KernelPilot adapts the original KernelPilot idea to the LightOp DCU operator library. It keeps the useful parts of the autonomous loop: explicit operator semantics, correctness references, workload distributions, benchmark evidence, profiling digests, ledgers, and review-gated iteration. It removes the NVIDIA-first assumptions around Nsight Compute, CUTLASS/CuTe, PTX/SASS, Blackwell/Hopper, TMA, WGMMA, and tcgen05. ## Skills | Skill | Role | | --- | --- | | [`lightop-kernel-agent-loop`](humanize/skills/humanize-kernel-agent-loop/) | Adds or optimizes LightOp operators by recovering `K/R/W`, inspecting LightOp wrappers/bindings/kernels/tests/configs, implementing HIP/ROCm changes, building, testing, benchmarking, profiling, tuning, and starting Humanize RLCR. | | [`lightop-kernel-knowledge`](knowledge/SKILL.md) | Gathers evidence from local LightOp source first, then ROCm/DCU upstream and official docs, then the bundled CUDA PR corpus only as cross-platform inspiration. | | [`dcu-profiler-report`](humanize/skills/ncu-report/) | Turns `hipprof`, ROCm/DTK profiler output, benchmark logs, and optional AMDGPU ISA evidence into a reproducible digest with exactly one next LightOp edit. | The on-disk folder names `humanize-kernel-agent-loop` and `ncu-report` are kept for compatibility with the upstream Humanize installer, but the skill frontmatter exposes the LightOp/DCU names above. ## Request Shape A useful request names the operator, correctness reference, workload distribution, execution environment, target DCU/gfx arch, baseline, benchmark method, and success threshold: ```text [$lightop-kernel-agent-loop] Add a LightOp fused rmsnorm + rope + fp8 kv-cache store operator for gfx936. Use PyTorch/native LightOp composition as the correctness reference, cover batch/token/head_dim shapes from Qwen decode, and run in Docker container lightop-dtk with /workspace/lightop as the repo path. Beat the existing unfused path by 15% p50 latency. ``` For optimization: ```text [$lightop-kernel-agent-loop] Optimize lightop.moe_gemm_w8a8 on gfx938 for the DeepSeek EP8 decode workload. Keep the public Python API unchanged, compare against the current LightOp baseline, and use hipprof evidence when benchmark results plateau. ``` ## LightOp Integration The loop operates on a LightOp checkout containing: ```text setup.py setup_torch29.py lightop/__init__.py lightop/csrc/export.cpp lightop/csrc// lightop/config*.py test/ ``` When adding an operator, the agent normally touches: - `lightop/csrc//` for HIP/C++ implementation. - `lightop/csrc/export.cpp` for `m.def(...)` binding. - `lightop/.py` for the Python wrapper. - `lightop/__init__.py` for public API exports. - `setup.py` and `setup_torch29.py` only when a new `csrc` family needs glob coverage. - `test/test_.py` and benchmark scripts for correctness/performance. - `lightop/config*.py` when shape/gfx-aware dispatch is needed. ## DCU Profiling `dcu-profiler-report` uses the SourceFind DCU performance analysis guide as the official first-pass reference: ```text https://developer.sourcefind.cn/gitbook//dcu_developer/DeveloperGuide/dcu_programming/DCU_programming_chapter3_7.html ``` Default first-pass profile: ```bash cd /path/to/lightop mkdir -p .humanize/lightop-agent/profile-artifacts/v000_baseline python test/test_.py 2>&1 \ | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/benchmark.log hipprof python test/test_.py 2>&1 \ | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/hipprof.txt ``` Deeper profiling can use `rocprof`, `rocprofv3`, `rocprof-compute`, or AMDGPU ISA/code-object inspection when `hipprof` and benchmark logs are not enough. ## Build And Test Defaults Run build, install, tests, benchmark, and profiling in one consistent environment. If Docker is the target environment, prefer repeatable non-interactive commands: ```bash docker exec bash -lc 'cd /workspace/lightop && ' ``` From the LightOp root inside that environment: ```bash python - <<'PY' import torch print("torch:", torch.__version__) print("hip:", torch.version.hip) print("device:", torch.cuda.get_device_name(0)) print("gcn:", torch.cuda.get_device_properties(0).gcnArchName) PY hipcc --version PYTORCH_ROCM_ARCH='gfx928;gfx936;gfx938' python setup.py install python - <<'PY' import torch, lightop print("lightop:", getattr(lightop, "__file__", "unknown")) print("gcn:", torch.cuda.get_device_properties(0).gcnArchName) PY ``` Run the narrowest targeted test first: ```bash cd test python test_.py ``` Benchmark scripts must use warmup and explicit `torch.cuda.synchronize()` or HIP synchronization around timed regions before claiming speedups. A change is not done at build success: it needs install, import smoke, correctness, benchmark comparison against the baseline, and profiler evidence when the result is close to the threshold or surprising. ## Install Install the Humanize skill pack from this checkout: ```bash cd /path/to/kernel-pilot ./humanize/scripts/install-skill.sh --target codex --kernelpilot-root "$PWD" ``` Claude Code users can use: ```bash ./humanize/scripts/install-skills-claude.sh --kernelpilot-root "$PWD" ``` After installation, the relevant skills are: ```text lightop-kernel-agent-loop lightop-kernel-knowledge dcu-profiler-report ``` ## Evidence Rules - Local LightOp source, tests, configs, and benchmarks are the primary evidence. - ROCm/DCU official docs and upstream source are the next evidence tier. - The bundled CUDA PR corpus is allowed only as cross-platform inspiration unless the idea is translated and validated on DCU. - Any copied or adapted external source must record source path/URL, commit or version, license/notice, and the optimization delta. - A profile digest must end with exactly one next edit or explain why profiling is not actionable.