README.md

<div align="center">

# LightOp KernelPilot

**A Humanize-powered skill pack for adding and optimizing LightOp fused
operators on DCU/ROCm.**

</div>

LightOp KernelPilot adapts the original KernelPilot idea to the LightOp DCU
operator library. It keeps the useful parts of the autonomous loop: explicit
operator semantics, correctness references, workload distributions, benchmark
evidence, profiling digests, ledgers, and review-gated iteration. It removes
the NVIDIA-first assumptions around Nsight Compute, CUTLASS/CuTe, PTX/SASS,
Blackwell/Hopper, TMA, WGMMA, and tcgen05.

## Skills

| Skill | Role |
| --- | --- |
| [`lightop-kernel-agent-loop`](humanize/skills/humanize-kernel-agent-loop/) | Adds or optimizes LightOp operators by recovering `K/R/W`, inspecting LightOp wrappers/bindings/kernels/tests/configs, implementing HIP/ROCm changes, building, testing, benchmarking, profiling, tuning, and starting Humanize RLCR. |
| [`lightop-kernel-knowledge`](knowledge/SKILL.md) | Gathers evidence from local LightOp source first, then ROCm/DCU upstream and official docs, then the bundled CUDA PR corpus only as cross-platform inspiration. |
| [`dcu-profiler-report`](humanize/skills/ncu-report/) | Turns `hipprof`, ROCm/DTK profiler output, benchmark logs, and optional AMDGPU ISA evidence into a reproducible digest with exactly one next LightOp edit. |

The on-disk folder names `humanize-kernel-agent-loop` and `ncu-report` are
kept for compatibility with the upstream Humanize installer, but the skill
frontmatter exposes the LightOp/DCU names above.

## Request Shape

A useful request names the operator, correctness reference, workload
distribution, execution environment, target DCU/gfx arch, baseline, benchmark
method, and success threshold:

```text
[$lightop-kernel-agent-loop] Add a LightOp fused rmsnorm + rope + fp8 kv-cache
store operator for gfx936. Use PyTorch/native LightOp composition as the
correctness reference, cover batch/token/head_dim shapes from Qwen decode, and
run in Docker container lightop-dtk with /workspace/lightop as the repo path.
Beat the existing unfused path by 15% p50 latency.
```

For optimization:

```text
[$lightop-kernel-agent-loop] Optimize lightop.moe_gemm_w8a8 on gfx938 for the
DeepSeek EP8 decode workload. Keep the public Python API unchanged, compare
against the current LightOp baseline, and use hipprof evidence when benchmark
results plateau.
```

## LightOp Integration

The loop operates on a LightOp checkout containing:

```text
setup.py
setup_torch29.py
lightop/__init__.py
lightop/csrc/export.cpp
lightop/csrc/<family>/
lightop/config*.py
test/
```

When adding an operator, the agent normally touches:

- `lightop/csrc/<family>/` for HIP/C++ implementation.
- `lightop/csrc/export.cpp` for `m.def(...)` binding.
- `lightop/<op>.py` for the Python wrapper.
- `lightop/__init__.py` for public API exports.
- `setup.py` and `setup_torch29.py` only when a new `csrc` family needs glob
  coverage.
- `test/test_<op>.py` and benchmark scripts for correctness/performance.
- `lightop/config*.py` when shape/gfx-aware dispatch is needed.

## DCU Profiling

`dcu-profiler-report` uses the SourceFind DCU performance analysis guide as
the official first-pass reference:

```text
https://developer.sourcefind.cn/gitbook//dcu_developer/DeveloperGuide/dcu_programming/DCU_programming_chapter3_7.html
```

Default first-pass profile:

```bash
cd /path/to/lightop
mkdir -p .humanize/lightop-agent/profile-artifacts/v000_baseline
python test/test_<op>.py 2>&1 \
  | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/benchmark.log
hipprof python test/test_<op>.py 2>&1 \
  | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/hipprof.txt
```

Deeper profiling can use `rocprof`, `rocprofv3`, `rocprof-compute`, or
AMDGPU ISA/code-object inspection when `hipprof` and benchmark logs are not
enough.

## Build And Test Defaults

Run build, install, tests, benchmark, and profiling in one consistent
environment. If Docker is the target environment, prefer repeatable
non-interactive commands:

```bash
docker exec <container> bash -lc 'cd /workspace/lightop && <command>'
```

From the LightOp root inside that environment:

```bash
python - <<'PY'
import torch
print("torch:", torch.__version__)
print("hip:", torch.version.hip)
print("device:", torch.cuda.get_device_name(0))
print("gcn:", torch.cuda.get_device_properties(0).gcnArchName)
PY
hipcc --version
PYTORCH_ROCM_ARCH='gfx928;gfx936;gfx938' python setup.py install
python - <<'PY'
import torch, lightop
print("lightop:", getattr(lightop, "__file__", "unknown"))
print("gcn:", torch.cuda.get_device_properties(0).gcnArchName)
PY
```

Run the narrowest targeted test first:

```bash
cd test
python test_<op>.py
```

Benchmark scripts must use warmup and explicit `torch.cuda.synchronize()` or
HIP synchronization around timed regions before claiming speedups. A change is
not done at build success: it needs install, import smoke, correctness,
benchmark comparison against the baseline, and profiler evidence when the
result is close to the threshold or surprising.

## Install

Install the Humanize skill pack from this checkout:

```bash
cd /path/to/kernel-pilot
./humanize/scripts/install-skill.sh --target codex --kernelpilot-root "$PWD"
```

Claude Code users can use:

```bash
./humanize/scripts/install-skills-claude.sh --kernelpilot-root "$PWD"
```

After installation, the relevant skills are:

```text
lightop-kernel-agent-loop
lightop-kernel-knowledge
dcu-profiler-report
```

## Evidence Rules

- Local LightOp source, tests, configs, and benchmarks are the primary
  evidence.
- ROCm/DCU official docs and upstream source are the next evidence tier.
- The bundled CUDA PR corpus is allowed only as cross-platform inspiration
  unless the idea is translated and validated on DCU.
- Any copied or adapted external source must record source path/URL, commit or
  version, license/notice, and the optimization delta.
- A profile digest must end with exactly one next edit or explain why profiling
  is not actionable.