Commit c842f8f1 authored by whlwhlwhl

Add manual LightOp skill install docs

parent f6fe8355
Pipeline #3633 canceled with stages
@@ -2,6 +2,7 @@ __pycache__/
.pytest_cache/
.kernel-knowledge/
knowledge/external-repos/
knowledge/evidence/pull-bundles/
.humanize/
.humanize-*/
*.py[cod]
......
@@ -2,56 +2,101 @@
# LightOp KernelPilot
**A Humanize-powered skill pack for adding and optimizing LightOp fused
operators on DCU/ROCm.**
**A Humanize-based skill pack for developing and optimizing LightOp/DCU operators.**
</div>
LightOp KernelPilot adapts the original KernelPilot idea to the LightOp DCU
operator library. It keeps the useful parts of the autonomous loop: explicit
operator semantics, correctness references, workload distributions, benchmark
evidence, profiling digests, ledgers, and review-gated iteration. It removes
the NVIDIA-first assumptions around Nsight Compute, CUTLASS/CuTe, PTX/SASS,
Blackwell/Hopper, TMA, WGMMA, and tcgen05.
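LightOp KernelPilot is the KernelPilot workflow adapted for the LightOp DCU
operator library. It keeps the parts of the autonomous operator-development
loop that are actually useful: explicit operator semantics, correctness
references, workload distributions, benchmark evidence, profile digests,
attempt ledgers, optimization ledgers, and review-gated iteration.
It removes NVIDIA-first assumptions such as Nsight Compute, CUTLASS/CuTe,
PTX/SASS, Blackwell/Hopper, TMA, WGMMA, and tcgen05, and targets
DCU/ROCm/HIP/DTK environments by default.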
## Skills
| Skill | Role |
| Skill | Role |
| --- | --- |
| [`lightop-kernel-agent-loop`](humanize/skills/humanize-kernel-agent-loop/) | Adds or optimizes LightOp operators by recovering `K/R/W`, inspecting LightOp wrappers/bindings/kernels/tests/configs, implementing HIP/ROCm changes, building, testing, benchmarking, profiling, tuning, and starting Humanize RLCR. |
| [`lightop-kernel-knowledge`](knowledge/SKILL.md) | Gathers evidence from local LightOp source first, then ROCm/DCU upstream and official docs, then the bundled CUDA PR corpus only as cross-platform inspiration. |
| [`dcu-profiler-report`](humanize/skills/ncu-report/) | Turns `hipprof`, ROCm/DTK profiler output, benchmark logs, and optional AMDGPU ISA evidence into a reproducible digest with exactly one next LightOp edit. |
| [`lightop-kernel-agent-loop`](humanize/skills/humanize-kernel-agent-loop/) | Main loop. Adds or optimizes LightOp operators: recovers `K/R/W/E`, inspects wrappers/bindings/kernels/tests/configs/benchmarks, implements HIP/ROCm changes, builds, tests, benchmarks, profiles, tunes, and starts Humanize RLCR. |
| [`lightop-kernel-knowledge`](knowledge/SKILL.md) | Evidence retrieval. Searches local LightOp source first, then ROCm/DCU upstream and official docs, and only as a last step treats the bundled CUDA PR corpus as cross-platform inspiration. |
| [`dcu-profiler-report`](humanize/skills/ncu-report/) | Profiling. Turns `hipprof`, ROCm/DTK profiler output, benchmark logs, and optional AMDGPU ISA evidence into a reproducible digest with one concrete next LightOp edit. |
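The on-disk directory names `humanize-kernel-agent-loop` and `ncu-report` are
kept for compatibility with the upstream Humanize installer; the skill names
actually exposed to the agent are the frontmatter names listed in the table above.
## Manual Install
If the Claude or Codex CLI installer is not available, install the three
LightOp/DCU skills directly: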
```bash
./install-lightop-skills-manual.sh --target both
```
Install for Claude only:
```bash
./install-lightop-skills-manual.sh --target claude
```
The on-disk folder names `humanize-kernel-agent-loop` and `ncu-report` are
kept for compatibility with the upstream Humanize installer, but the skill
frontmatter exposes the LightOp/DCU names above.
Install for Codex only:
## Request Shape
```bash
./install-lightop-skills-manual.sh --target codex
```
A useful request names the operator, correctness reference, workload
distribution, execution environment, target DCU/gfx arch, baseline, benchmark
method, and success threshold:
Default paths:
```text
[$lightop-kernel-agent-loop] Add a LightOp fused rmsnorm + rope + fp8 kv-cache
store operator for gfx936. Use PyTorch/native LightOp composition as the
correctness reference, cover batch/token/head_dim shapes from Qwen decode, and
run in Docker container lightop-dtk with /workspace/lightop as the repo path.
Beat the existing unfused path by 15% p50 latency.
Claude: ~/.claude/skills
Codex: ${CODEX_HOME:-~/.codex}/skills
```
For optimization:
The script will:
- symlink `lightop-kernel-knowledge`
- symlink `dcu-profiler-report`
- hydrate `lightop-kernel-agent-loop` by substituting the
`{{HUMANIZE_RUNTIME_ROOT}}` and `{{KERNELPILOT_ROOT}}` placeholders in its `SKILL.md`
- install `knowledge/requirements.txt` unless `--skip-pip` is given
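As a rough illustration of the hydrate step, the sketch below replaces the two placeholders with literal paths. The placeholder names come from the README; the helper names and example paths are hypothetical, and the real script performs the substitution with `sed`.

```python
# Placeholder names are taken from the README; everything else is illustrative.
PLACEHOLDERS = ("{{HUMANIZE_RUNTIME_ROOT}}", "{{KERNELPILOT_ROOT}}")


def hydrate_skill_text(text: str, runtime_root: str, kernelpilot_root: str) -> str:
    """Substitute both placeholders with literal paths (plain string replace)."""
    values = dict(zip(PLACEHOLDERS, (runtime_root, kernelpilot_root)))
    for placeholder, value in values.items():
        text = text.replace(placeholder, value)
    return text


def unresolved_placeholders(text: str) -> list:
    """Return any known placeholders that are still present in the text."""
    return [p for p in PLACEHOLDERS if p in text]


# Hypothetical template and paths, only to show the substitution.
template = "runtime: {{HUMANIZE_RUNTIME_ROOT}}\nroot: {{KERNELPILOT_ROOT}}\n"
hydrated = hydrate_skill_text(template, "/opt/kp/humanize", "/opt/kp")
print(hydrated)
```

A literal string replace sidesteps the escaping questions that arise when a path contains characters that are special in a `sed` replacement, such as `&` or the chosen delimiter.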
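## Request Format
A clear request should include: the operator name, correctness reference,
workload, execution environment, target DCU/gfx arch, baseline, benchmark
method, and success threshold.
Example: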
```text
[$lightop-kernel-agent-loop] Optimize lightop.moe_gemm_w8a8 on gfx938 for the
DeepSeek EP8 decode workload. Keep the public Python API unchanged, compare
against the current LightOp baseline, and use hipprof evidence when benchmark
results plateau.
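[$lightop-kernel-agent-loop] Add a fused rmsnorm + rope + fp8 kv-cache store
operator to LightOp, targeting gfx936. Use the PyTorch/native LightOp
composition path as the correctness reference and cover the
batch/token/head_dim shapes from Qwen decode. Validate in Docker container
lightop-dtk; the repo path inside the container is /workspace/lightop.
Performance requirement: p50 latency 15% faster than the existing unfused path.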
```
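A success threshold such as the 15% p50 requirement in the example request can be checked mechanically. The sketch below assumes "15% faster" means the candidate's median latency is at least 15% lower than the baseline's; that reading, the function names, and the sample values are assumptions for illustration, not part of LightOp KernelPilot.

```python
import statistics


def p50(samples_ms):
    """Median latency of a list of per-iteration timings in milliseconds."""
    return statistics.median(samples_ms)


def meets_latency_target(baseline_ms, candidate_ms, required_reduction=0.15):
    """Return (passed, reduction) where reduction = (base - cand) / base at p50."""
    base = p50(baseline_ms)
    cand = p50(candidate_ms)
    reduction = (base - cand) / base
    return reduction >= required_reduction, reduction


# Made-up timings, only to exercise the check.
baseline = [1.00, 1.02, 0.98, 1.01, 0.99]
candidate = [0.84, 0.86, 0.83, 0.85, 0.82]
passed, reduction = meets_latency_target(baseline, candidate)
print(f"passed={passed} reduction={reduction:.1%}")
```

If the threshold is instead meant as a 1.15x speedup, the equivalent condition is `cand <= base / 1.15`, which is slightly easier to meet, so the request should say which definition applies.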
## LightOp Integration
Example for optimizing an existing operator:
```text
[$lightop-kernel-agent-loop] Optimize lightop.moe_gemm_w8a8 on gfx938 for the
DeepSeek EP8 decode workload. Keep the existing Python API unchanged and
compare against the current LightOp baseline; when the benchmark plateaus,
use hipprof evidence to continue the analysis.
```
The loop operates on a LightOp checkout containing:
If the host path and the container path differ, state both explicitly:
```text
Host LightOp path: /public/wanghl6/lightop
Validation container: wanghl_lightop209
LightOp path inside the container: /home/lightop
All build, correctness test, benchmark, and profiling steps must run inside the container:
docker exec wanghl_lightop209 bash -lc 'cd /home/lightop && <command>'
```
## LightOp Integration Points
The main loop works inside a LightOp checkout. That directory usually contains:
```text
setup.py
@@ -63,27 +108,36 @@ lightop/config*.py
test/
```
When adding an operator, the agent normally touches:
When adding an operator, the agent usually inspects or modifies:
- `lightop/csrc/<family>/`: HIP/C++ kernels and launchers
- `lightop/csrc/export.cpp`: `m.def(...)` bindings
- `lightop/<op>.py`: Python wrapper
- `lightop/__init__.py`: public API exports
- `setup.py`: only when a new `csrc` family needs glob coverage
- `test/test_<op>.py`: correctness tests
- benchmark scripts: performance tests
- `lightop/config*.py`: only when shape/gfx-aware dispatch is needed
The LightOp KernelPilot build rule is fixed:
```bash
python setup.py install
```
- `lightop/csrc/<family>/` for HIP/C++ implementation.
- `lightop/csrc/export.cpp` for `m.def(...)` binding.
- `lightop/<op>.py` for the Python wrapper.
- `lightop/__init__.py` for public API exports.
- `setup.py` and `setup_torch29.py` only when a new `csrc` family needs glob
coverage.
- `test/test_<op>.py` and benchmark scripts for correctness/performance.
- `lightop/config*.py` when shape/gfx-aware dispatch is needed.
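Do not switch to `setup_torch29.py`, whatever the PyTorch version. During the
normal tuning loop, do not delete `build/` either, so that incremental
compilation results can be reused; clean it only when the user explicitly
requests a clean build or the cache is proven to be corrupt.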
## DCU Profiling
`dcu-profiler-report` uses the SourceFind DCU performance analysis guide as
the official first-pass reference:
`dcu-profiler-report` uses the SourceFind DCU performance analysis tool guide
as the official first-pass reference:
```text
https://developer.sourcefind.cn/gitbook//dcu_developer/DeveloperGuide/dcu_programming/DCU_programming_chapter3_7.html
```
Default first-pass profile:
Default first-pass profiling:
```bash
cd /path/to/lightop
@@ -94,21 +148,31 @@ hipprof python test/test_<op>.py 2>&1 \
| tee .humanize/lightop-agent/profile-artifacts/v000_baseline/hipprof.txt
```
Deeper profiling can use `rocprof`, `rocprofv3`, `rocprof-compute`, or
AMDGPU ISA/code-object inspection when `hipprof` and benchmark logs are not
enough.
When `hipprof` and benchmark logs are not enough to explain the problem, go further with:
## Build And Test Defaults
```text
rocprof
rocprofv3
rocprof-compute
AMDGPU ISA / code-object inspection
```
A profile digest is more than a one-line "memory-bound" verdict. It must state:
```text
measured signal -> likely mechanism -> why other hypotheses are weaker -> exactly what to change next
```
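One way to keep a digest from collapsing into a single verdict is to treat the four-part chain as a record whose fields must all be filled in. The sketch below is only an illustration; the class, its field names, and the example strings are hypothetical and not taken from `dcu-profiler-report`.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ProfileDigest:
    """Hypothetical record mirroring the four-part chain of a digest."""
    signal: str                 # what was measured
    mechanism: str              # the likely cause
    rejected_alternatives: str  # why other hypotheses are weaker
    next_edit: str              # the concrete change to make next

    def __post_init__(self):
        # Reject digests that leave any link of the chain empty.
        missing = [f.name for f in fields(self) if not getattr(self, f.name).strip()]
        if missing:
            raise ValueError("incomplete digest, missing: " + ", ".join(missing))

    def render(self) -> str:
        """Render the digest in the same arrow form used in the README."""
        return " -> ".join(getattr(self, f.name) for f in fields(self))


# Placeholder strings, not real profiling results.
digest = ProfileDigest(
    signal="<measured signal>",
    mechanism="<likely mechanism>",
    rejected_alternatives="<why other hypotheses are weaker>",
    next_edit="<file and change to make next>",
)
print(digest.render())
```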
Run build, install, tests, benchmark, and profiling in one consistent
environment. If Docker is the target environment, prefer repeatable
non-interactive commands:
## Build And Test Default Rules
Build, install, test, benchmark, and profiling must all run in the same
environment. If the target environment is Docker, prefer repeatable
non-interactive commands:
```bash
docker exec <container> bash -lc 'cd /workspace/lightop && <command>'
```
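When the loop is driven from a script, the same non-interactive `docker exec` form can be assembled as an argument list instead of a hand-quoted string. This is a minimal sketch; the helper name is hypothetical, and the container name and repo path are the ones used in the README's example request.

```python
import shlex


def docker_exec_argv(container, repo_path, command):
    """Build argv for: docker exec <container> bash -lc 'cd <repo_path> && <command>'."""
    # Quote the path so spaces or shell metacharacters in it stay literal.
    inner = f"cd {shlex.quote(repo_path)} && {command}"
    return ["docker", "exec", container, "bash", "-lc", inner]


argv = docker_exec_argv("lightop-dtk", "/workspace/lightop", "python setup.py install")
print(shlex.join(argv))
```

The list can be passed to `subprocess.run(argv, check=True)`. Note that `command` is inserted unquoted on purpose so that it can contain pipes or redirections; it should therefore come from a trusted source.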
From the LightOp root inside that environment:
From the LightOp root in that environment, first probe the versions and the device:
```bash
python - <<'PY'
@@ -119,7 +183,17 @@ print("device:", torch.cuda.get_device_name(0))
print("gcn:", torch.cuda.get_device_properties(0).gcnArchName)
PY
hipcc --version
```
Then build:
```bash
PYTORCH_ROCM_ARCH='gfx928;gfx936;gfx938' python setup.py install
```
Then run an import smoke test:
```bash
python - <<'PY'
import torch, lightop
print("lightop:", getattr(lightop, "__file__", "unknown"))
@@ -127,35 +201,41 @@ print("gcn:", torch.cuda.get_device_properties(0).gcnArchName)
PY
```
Run the narrowest targeted test first:
Run the narrowest targeted test first:
```bash
cd test
python test_<op>.py
```
Benchmark scripts must use warmup and explicit `torch.cuda.synchronize()` or
HIP synchronization around timed regions before claiming speedups. A change is
not done at build success: it needs install, import smoke, correctness,
benchmark comparison against the baseline, and profiler evidence when the
result is close to the threshold or surprising.
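Benchmarks must include warmup and must call `torch.cuda.synchronize()` or an
equivalent HIP synchronization explicitly before and after the timed region.
Do not claim a speedup from asynchronous Python wall-clock timing alone.
A LightOp operator change is not done just because the build passes. The
completion criteria include at least:
- install succeeds
- import smoke test succeeds
- targeted correctness test succeeds
- benchmark comparison against the baseline is complete
- profiler evidence exists when the result is close to the threshold or anomalous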
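The warmup and synchronization rule can be sketched as a small timing harness. This is an illustration under stated assumptions rather than LightOp's benchmark code: the function name and defaults are hypothetical, and the `sync` argument defaults to a no-op so the sketch runs without a device.

```python
import statistics
import time


def benchmark_p50_ms(fn, sync=lambda: None, warmup=10, iters=50):
    """Return the median per-call latency of fn in milliseconds."""
    # Warmup: let caches, JIT compilation, and clocks settle before timing.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        sync()  # drain pending device work before starting the clock
        start = time.perf_counter()
        fn()
        sync()  # wait for the launched work to finish before stopping the clock
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


# CPU-only demo; on a DCU the sync callable would be torch.cuda.synchronize.
print(benchmark_p50_ms(lambda: sum(range(1000))))
```

On a device, a call would look like `benchmark_p50_ms(lambda: op(x), sync=torch.cuda.synchronize)`, where `op` and `x` stand for the operator and inputs under test. Without the second `sync()`, the timer would mostly measure kernel launch overhead, because launches return before the work completes.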
## Install
## Standard Install
Install the Humanize skill pack from this checkout:
Install the Humanize skill pack from the current checkout:
```bash
cd /path/to/kernel-pilot
./humanize/scripts/install-skill.sh --target codex --kernelpilot-root "$PWD"
```
Claude Code users can use:
Claude Code users can use:
```bash
./humanize/scripts/install-skills-claude.sh --kernelpilot-root "$PWD"
```
After installation, the relevant skills are:
After installation, the relevant skill names are:
```text
lightop-kernel-agent-loop
@@ -163,14 +243,12 @@ lightop-kernel-knowledge
dcu-profiler-report
```
## Evidence Rules
## Evidence Rules
- Local LightOp source, tests, configs, and benchmarks are the primary
evidence.
- ROCm/DCU official docs and upstream source are the next evidence tier.
- The bundled CUDA PR corpus is allowed only as cross-platform inspiration
unless the idea is translated and validated on DCU.
- Any copied or adapted external source must record source path/URL, commit or
version, license/notice, and the optimization delta.
- A profile digest must end with exactly one next edit or explain why profiling
is not actionable.
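- Local LightOp source, tests, configs, and benchmarks are the first-priority
evidence.
- ROCm/DCU official docs and upstream source are the second-priority evidence.
- The bundled CUDA PR corpus may be used only as cross-platform inspiration
unless the idea has been explicitly translated and validated on DCU.
- When copying or adapting external source, record the source path/URL, commit
or version, license/notice, and the optimization gain.
- A profile digest must end with one concrete next edit; if the profiling is
not actionable, explain why.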
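The evidence tiers above amount to an ordering plus one gate on the CUDA corpus. The sketch below shows one possible encoding; the enum, the record type, and the example sources are hypothetical and are not part of `lightop-kernel-knowledge`.

```python
from enum import IntEnum
from typing import NamedTuple


class Tier(IntEnum):
    """Lower value means higher priority, following the evidence rules."""
    LOCAL_LIGHTOP = 1
    ROCM_DCU_UPSTREAM = 2
    CUDA_PR_CORPUS = 3


class Evidence(NamedTuple):
    tier: Tier
    source: str
    validated_on_dcu: bool = False


def rank(evidence):
    """Order evidence so that local LightOp findings come first."""
    return sorted(evidence, key=lambda e: e.tier)


def usable_as_justification(e):
    """CUDA corpus items count only after being translated and validated on DCU."""
    return e.tier != Tier.CUDA_PR_CORPUS or e.validated_on_dcu


# Illustrative entries; the source strings are placeholders.
items = [
    Evidence(Tier.CUDA_PR_CORPUS, "<bundled CUDA PR>"),
    Evidence(Tier.LOCAL_LIGHTOP, "<local test or config>"),
    Evidence(Tier.ROCM_DCU_UPSTREAM, "<official ROCm/DCU doc>"),
]
for e in rank(items):
    print(e.tier.name, e.source, usable_as_justification(e))
```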
@@ -105,7 +105,6 @@ lightop/csrc/export.cpp
lightop/<python_wrapper>.py
lightop/__init__.py
setup.py
setup_torch29.py
test/test_<op>.py
test/<family>/*benchmark*.py
lightop/config*.py
@@ -120,8 +119,9 @@ Typical add-operator checklist:
tensor validation style.
- Export public APIs from `lightop/__init__.py` only when the operator should
be user-facing.
- If a new `csrc/<family>` directory is created, update both `setup.py` and
`setup_torch29.py` source globs.
- If a new `csrc/<family>` directory is created, update `setup.py` source
globs. Do not add a parallel `setup_torch29.py` build path unless the user
explicitly asks for legacy metadata maintenance.
- If performance depends on shape/gfx-specific choices, update or add the
relevant config/dispatcher table under `lightop/config*.py`.
- Add focused correctness tests under `test/` and benchmark coverage for `W`.
@@ -220,18 +220,18 @@ state unless the user explicitly asks for tracked evidence artifacts.
## Build, Test, Benchmark
Build from the LightOp root inside the selected execution environment. Prefer
the repo's existing command; otherwise use:
Build from the LightOp root inside the selected execution environment. Always
use `python setup.py install` for LightOp builds, regardless of the installed
PyTorch version. Do not switch to `setup_torch29.py`.
```bash
PYTORCH_ROCM_ARCH='gfx928;gfx936;gfx938' python setup.py install
```
For PyTorch 2.9-specific work, inspect whether the repo expects:
```bash
python setup_torch29.py install
```
Keep the existing `build/` directory between attempts so incremental extension
builds can reuse prior compilation output. Do not delete `build/` as part of
the normal build/test/tune loop unless the user explicitly requests a clean
build or the build cache is proven to be stale or corrupt.
After install, run an import smoke test in the same environment:
......
#!/usr/bin/env bash
#
# Manual install script for LightOp KernelPilot skills.
# Installs the three LightOp/DCU skills directly into Claude and/or Codex
# skill directories without requiring the Claude or Codex CLI tools.
#
# Usage:
#   ./install-lightop-skills-manual.sh
#   ./install-lightop-skills-manual.sh --target claude
#   ./install-lightop-skills-manual.sh --target codex
#   ./install-lightop-skills-manual.sh --target both
#
# Optional:
#   KERNELPILOT_ROOT=/path/to/lightop-skills ./install-lightop-skills-manual.sh
#   CLAUDE_SKILLS_DIR=/path/to/skills ./install-lightop-skills-manual.sh --target claude
#   CODEX_SKILLS_DIR=/path/to/skills ./install-lightop-skills-manual.sh --target codex
#
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]:-$0}")" && pwd)"
KERNELPILOT_ROOT="${KERNELPILOT_ROOT:-$SCRIPT_DIR}"
TARGET="both"
INSTALL_PIP="true"
DRY_RUN="false"
CLAUDE_SKILLS_DIR="${CLAUDE_SKILLS_DIR:-${HOME}/.claude/skills}"
CODEX_SKILLS_DIR="${CODEX_SKILLS_DIR:-${CODEX_HOME:-${HOME}/.codex}/skills}"
# ---- Paths within the KernelPilot checkout ----
KNOWLEDGE_SRC="${KERNELPILOT_ROOT}/knowledge"
NCUREPORT_SRC="${KERNELPILOT_ROOT}/humanize/skills/ncu-report"
AGENTLOOP_SRC="${KERNELPILOT_ROOT}/humanize/skills/humanize-kernel-agent-loop"
HUMANIZE_RUNTIME="${KERNELPILOT_ROOT}/humanize"
# ---- Skill definitions ----
# Each entry: "name|source_type|source_path|extra"
# source_type: "symlink" or "hydrate"
SKILLS=(
"lightop-kernel-knowledge|symlink|${KNOWLEDGE_SRC}|"
"dcu-profiler-report|symlink|${NCUREPORT_SRC}|"
"lightop-kernel-agent-loop|hydrate|${AGENTLOOP_SRC}/SKILL.md|${HUMANIZE_RUNTIME}"
)
usage() {
cat <<'EOF'
Install LightOp KernelPilot skills manually.
Usage:
install-lightop-skills-manual.sh [options]
Options:
--target MODE claude|codex|both (default: both)
--kernelpilot-root PATH KernelPilot checkout root (default: script dir)
--claude-skills-dir PATH Claude skills dir (default: ~/.claude/skills)
--codex-skills-dir PATH Codex skills dir (default: ${CODEX_HOME:-~/.codex}/skills)
--skip-pip Do not install knowledge/requirements.txt
--dry-run Print actions without writing
-h, --help Show this help
EOF
}
log() { printf '[install-lightop-skills] %s\n' "$*"; }
die() { printf '[install-lightop-skills] Error: %s\n' "$*" >&2; exit 1; }
resolve_kernelpilot_root() {
  KERNELPILOT_ROOT="$(cd "$KERNELPILOT_ROOT" 2>/dev/null && pwd || true)"
  [[ -n "$KERNELPILOT_ROOT" ]] || die "could not resolve KernelPilot root"
  KNOWLEDGE_SRC="${KERNELPILOT_ROOT}/knowledge"
  NCUREPORT_SRC="${KERNELPILOT_ROOT}/humanize/skills/ncu-report"
  AGENTLOOP_SRC="${KERNELPILOT_ROOT}/humanize/skills/humanize-kernel-agent-loop"
  HUMANIZE_RUNTIME="${KERNELPILOT_ROOT}/humanize"
  SKILLS=(
    "lightop-kernel-knowledge|symlink|${KNOWLEDGE_SRC}|"
    "dcu-profiler-report|symlink|${NCUREPORT_SRC}|"
    "lightop-kernel-agent-loop|hydrate|${AGENTLOOP_SRC}/SKILL.md|${HUMANIZE_RUNTIME}"
  )
}
preflight() {
  local path
  for path in "$KNOWLEDGE_SRC/SKILL.md" "$NCUREPORT_SRC/SKILL.md" "$AGENTLOOP_SRC/SKILL.md"; do
    [[ -e "$path" ]] || die "not found: $path"
  done
}
install_skill_dir() {
  local skills_dir="$1"
  local label="$2"
  local entry name kind src runtime target
  log "target: $label"
  log "skills dir: $skills_dir"
  if [[ "$DRY_RUN" != "true" ]]; then
    mkdir -p "$skills_dir"
  fi
  for entry in "${SKILLS[@]}"; do
    IFS='|' read -r name kind src runtime <<< "$entry"
    target="${skills_dir}/${name}"
    if [[ "$DRY_RUN" == "true" ]]; then
      log "DRY-RUN remove existing: ${target}"
    elif [[ -L "$target" ]] || [[ -d "$target" ]]; then
      log "removing existing: ${target}"
      rm -rf "$target"
    elif [[ -e "$target" ]]; then
      die "$target exists and is not a symlink or directory"
    fi
    case "$kind" in
      symlink)
        if [[ "$DRY_RUN" == "true" ]]; then
          log "DRY-RUN link ${name} -> ${src}"
        else
          log "linking ${name} -> ${src}"
          ln -sf "$src" "$target"
        fi
        ;;
      hydrate)
        if [[ "$DRY_RUN" == "true" ]]; then
          log "DRY-RUN create ${name} with hydrated paths"
        else
          log "creating ${name} with hydrated paths"
          mkdir -p "$target"
          sed \
            "s|{{HUMANIZE_RUNTIME_ROOT}}|${runtime}|g; s|{{KERNELPILOT_ROOT}}|${KERNELPILOT_ROOT}|g" \
            "$src" > "${target}/SKILL.md"
        fi
        ;;
      *)
        die "unknown install kind: ${kind}"
        ;;
    esac
  done
}
install_python_deps() {
  if [[ "$INSTALL_PIP" != "true" ]]; then
    return
  fi
  if [[ ! -f "${KNOWLEDGE_SRC}/requirements.txt" ]]; then
    return
  fi
  if [[ "$DRY_RUN" == "true" ]]; then
    log "DRY-RUN python3 -m pip install -r ${KNOWLEDGE_SRC}/requirements.txt"
  else
    log "installing Python dependencies..."
    python3 -m pip install -r "${KNOWLEDGE_SRC}/requirements.txt"
  fi
}
while [[ $# -gt 0 ]]; do
  case "$1" in
    --target)
      [[ -n "${2:-}" ]] || die "--target requires a value"
      case "$2" in
        claude|codex|both) TARGET="$2" ;;
        *) die "--target must be one of: claude, codex, both" ;;
      esac
      shift 2
      ;;
    --kernelpilot-root)
      [[ -n "${2:-}" ]] || die "--kernelpilot-root requires a value"
      KERNELPILOT_ROOT="$2"
      shift 2
      ;;
    --claude-skills-dir)
      [[ -n "${2:-}" ]] || die "--claude-skills-dir requires a value"
      CLAUDE_SKILLS_DIR="$2"
      shift 2
      ;;
    --codex-skills-dir)
      [[ -n "${2:-}" ]] || die "--codex-skills-dir requires a value"
      CODEX_SKILLS_DIR="$2"
      shift 2
      ;;
    --skip-pip)
      INSTALL_PIP="false"
      shift
      ;;
    --dry-run)
      DRY_RUN="true"
      shift
      ;;
    -h|--help)
      usage
      exit 0
      ;;
    *)
      die "unknown option: $1"
      ;;
  esac
done
resolve_kernelpilot_root
preflight
log "kernelpilot root: $KERNELPILOT_ROOT"
log "target mode: $TARGET"
case "$TARGET" in
  claude)
    install_skill_dir "$CLAUDE_SKILLS_DIR" "claude"
    ;;
  codex)
    install_skill_dir "$CODEX_SKILLS_DIR" "codex"
    ;;
  both)
    install_skill_dir "$CLAUDE_SKILLS_DIR" "claude"
    install_skill_dir "$CODEX_SKILLS_DIR" "codex"
    ;;
esac
install_python_deps
cat <<EOF
Done. Installed LightOp skills:
$(for entry in "${SKILLS[@]}"; do
IFS='|' read -r name kind _ <<< "$entry"
case "$kind" in
symlink)
printf " - %-30s (symlink)\n" "$name"
;;
hydrate)
printf " - %-30s (hydrated SKILL.md)\n" "$name"
;;
esac
done)
Installed targets:
EOF
if [[ "$TARGET" == "claude" || "$TARGET" == "both" ]]; then
  printf ' - claude: %s\n' "$CLAUDE_SKILLS_DIR"
fi
if [[ "$TARGET" == "codex" || "$TARGET" == "both" ]]; then
  printf ' - codex: %s\n' "$CODEX_SKILLS_DIR"
fi