README.md 5.92 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
<div align="center">

# LightOp KernelPilot

**A Humanize-powered skill pack for adding and optimizing LightOp fused
operators on DCU/ROCm.**

</div>

LightOp KernelPilot adapts the original KernelPilot idea to the LightOp DCU
operator library. It keeps the useful parts of the autonomous loop: explicit
operator semantics, correctness references, workload distributions, benchmark
evidence, profiling digests, ledgers, and review-gated iteration. It removes
the NVIDIA-first assumptions around Nsight Compute, CUTLASS/CuTe, PTX/SASS,
Blackwell/Hopper, TMA, WGMMA, and tcgen05.

## Skills

| Skill | Role |
| --- | --- |
| [`lightop-kernel-agent-loop`](humanize/skills/humanize-kernel-agent-loop/) | Adds or optimizes LightOp operators by recovering `K/R/W`, inspecting LightOp wrappers/bindings/kernels/tests/configs, implementing HIP/ROCm changes, building, testing, benchmarking, profiling, tuning, and starting Humanize RLCR. |
| [`lightop-kernel-knowledge`](knowledge/SKILL.md) | Gathers evidence from local LightOp source first, then ROCm/DCU upstream and official docs, then the bundled CUDA PR corpus only as cross-platform inspiration. |
| [`dcu-profiler-report`](humanize/skills/ncu-report/) | Turns `hipprof`, ROCm/DTK profiler output, benchmark logs, and optional AMDGPU ISA evidence into a reproducible digest with exactly one next LightOp edit. |

The on-disk folder names `humanize-kernel-agent-loop` and `ncu-report` are
kept for compatibility with the upstream Humanize installer, but the skill
frontmatter exposes the LightOp/DCU names above.

## Request Shape

A useful request names the operator, correctness reference, workload
distribution, execution environment, target DCU/gfx arch, baseline, benchmark
method, and success threshold:

```text
[$lightop-kernel-agent-loop] Add a LightOp fused rmsnorm + rope + fp8 kv-cache
store operator for gfx936. Use PyTorch/native LightOp composition as the
correctness reference, cover batch/token/head_dim shapes from Qwen decode, and
run in Docker container lightop-dtk with /workspace/lightop as the repo path.
Beat the existing unfused path by 15% p50 latency.
```

For optimization:

```text
[$lightop-kernel-agent-loop] Optimize lightop.moe_gemm_w8a8 on gfx938 for the
DeepSeek EP8 decode workload. Keep the public Python API unchanged, compare
against the current LightOp baseline, and use hipprof evidence when benchmark
results plateau.
```

## LightOp Integration

The loop operates on a LightOp checkout containing:

```text
setup.py
setup_torch29.py
lightop/__init__.py
lightop/csrc/export.cpp
lightop/csrc/<family>/
lightop/config*.py
test/
```

When adding an operator, the agent normally touches:

- `lightop/csrc/<family>/` for HIP/C++ implementation.
- `lightop/csrc/export.cpp` for `m.def(...)` binding.
- `lightop/<op>.py` for the Python wrapper.
- `lightop/__init__.py` for public API exports.
- `setup.py` and `setup_torch29.py` only when a new `csrc` family needs glob
  coverage.
- `test/test_<op>.py` and benchmark scripts for correctness/performance.
- `lightop/config*.py` when shape/gfx-aware dispatch is needed.

## DCU Profiling

`dcu-profiler-report` uses the SourceFind DCU performance analysis guide as
the official first-pass reference:

```text
https://developer.sourcefind.cn/gitbook//dcu_developer/DeveloperGuide/dcu_programming/DCU_programming_chapter3_7.html
```

Default first-pass profile:

```bash
cd /path/to/lightop
mkdir -p .humanize/lightop-agent/profile-artifacts/v000_baseline
python test/test_<op>.py 2>&1 \
  | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/benchmark.log
hipprof python test/test_<op>.py 2>&1 \
  | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/hipprof.txt
```

Deeper profiling can use `rocprof`, `rocprofv3`, `rocprof-compute`, or
AMDGPU ISA/code-object inspection when `hipprof` and benchmark logs are not
enough.

## Build And Test Defaults

Run build, install, tests, benchmark, and profiling in one consistent
environment. If Docker is the target environment, prefer repeatable
non-interactive commands:

```bash
docker exec <container> bash -lc 'cd /workspace/lightop && <command>'
```

From the LightOp root inside that environment:

```bash
python - <<'PY'
import torch
print("torch:", torch.__version__)
print("hip:", torch.version.hip)
print("device:", torch.cuda.get_device_name(0))
print("gcn:", torch.cuda.get_device_properties(0).gcnArchName)
PY
hipcc --version
PYTORCH_ROCM_ARCH='gfx928;gfx936;gfx938' python setup.py install
python - <<'PY'
import torch, lightop
print("lightop:", getattr(lightop, "__file__", "unknown"))
print("gcn:", torch.cuda.get_device_properties(0).gcnArchName)
PY
```

Run the narrowest targeted test first:

```bash
cd test
python test_<op>.py
```

Benchmark scripts must use warmup and explicit `torch.cuda.synchronize()` or
HIP synchronization around timed regions before claiming speedups. A change is
not done at build success: it needs install, import smoke, correctness,
benchmark comparison against the baseline, and profiler evidence when the
result is close to the threshold or surprising.

## Install

Install the Humanize skill pack from this checkout:

```bash
cd /path/to/kernel-pilot
./humanize/scripts/install-skill.sh --target codex --kernelpilot-root "$PWD"
```

Claude Code users can use:

```bash
./humanize/scripts/install-skills-claude.sh --kernelpilot-root "$PWD"
```

After installation, the relevant skills are:

```text
lightop-kernel-agent-loop
lightop-kernel-knowledge
dcu-profiler-report
```

## Evidence Rules

- Local LightOp source, tests, configs, and benchmarks are the primary
  evidence.
- ROCm/DCU official docs and upstream source are the next evidence tier.
- The bundled CUDA PR corpus is allowed only as cross-platform inspiration
  unless the idea is translated and validated on DCU.
- Any copied or adapted external source must record source path/URL, commit or
  version, license/notice, and the optimization delta.
- A profile digest must end with exactly one next edit or explain why profiling
  is not actionable.