Add build constraints and refine the prompt

0fec721c · whlwhlwhl · 2ad344b2
Commit 0fec721c authored May 21, 2026 by whlwhlwhl
5 changed files
--- a/README.md
+++ b/README.md
@@ -11,9 +11,6 @@ LightOp KernelPilot 是面向 LightOp DCU 算子库的 KernelPilot 工作流改
 workload 分布、benchmark 证据、profile digest、尝试记录、优化记录，以及
 带 review gate 的迭代。

-它移除了 NVIDIA 优先的假设，例如 Nsight Compute、CUTLASS/CuTe、PTX/SASS、
-Blackwell/Hopper、TMA、WGMMA、tcgen05 等。默认面向 DCU/ROCm/HIP/DTK 环境。
-
 ## Skills

 | Skill | 作用 |
@@ -25,6 +22,9 @@ Blackwell/Hopper、TMA、WGMMA、tcgen05 等。默认面向 DCU/ROCm/HIP/DTK 环
 磁盘上的目录名 `humanize-kernel-agent-loop` 和 `ncu-report` 是为了兼容上游
 Humanize installer；真正暴露给 agent 的 skill 名称以上表 frontmatter 为准。

+如果只是想人工阅读这三个 skill 的中文说明，可以看
+[`docs/lightop-skills.zh-CN.md`](docs/lightop-skills.zh-CN.md)。
+
 ## 手动安装

 如果没有 Claude 或 Codex CLI installer，可以直接安装这三个 LightOp/DCU
@@ -66,32 +66,37 @@ Codex:  ${CODEX_HOME:-~/.codex}/skills
 一个清晰的请求最好包含：算子名、正确性参考、workload、执行环境、目标
 DCU/gfx arch、baseline、benchmark 方法、成功阈值。

-示例：
-
-```text
-[$lightop-kernel-agent-loop] 给 LightOp 添加 fused rmsnorm + rope + fp8
-kv-cache store 算子，目标 gfx936。正确性参考使用 PyTorch/native LightOp
-组合路径，覆盖 Qwen decode 的 batch/token/head_dim shape。验证环境使用
-Docker 容器 lightop-dtk，容器内 repo 路径是 /workspace/lightop。
-性能要求：p50 latency 比现有 unfused 路径快 15%。
-```
-
-优化已有算子的示例：
-
-```text
-[$lightop-kernel-agent-loop] 优化 gfx938 上的 lightop.moe_gemm_w8a8，
-目标 workload 是 DeepSeek EP8 decode。保持现有 Python API 不变，
-和当前 LightOp baseline 对比；benchmark plateau 时使用 hipprof 证据继续分析。
-```
-
-如果宿主机路径和容器路径不同，建议直接写清楚：
+Prompt 示例：

 ```text
-宿主机 LightOp 路径：/public/wanghl6/lightop
-验证容器：wanghl_lightop209
-容器内 LightOp 路径：/home/lightop
-所有 build、correctness test、benchmark、profiling 都必须在容器内执行：
-docker exec wanghl_lightop209 bash -lc 'cd /home/lightop && <command>'
+@lightop-kernel-agent-loop
+- 宿主机 LightOp 路径：/path/to/lightop
+- 验证 Docker：<lightop-container>
+- 容器内 LightOp 路径：/workspace/lightop
+
+任务：添加 1-pass layer norm 算子。
+- 正确性参考：PyTorch layer_norm。
+- 性能目标：最高有效带宽达到 1.1 TB/s。
+- 所有 build/test/benchmark/profile 必须在容器内 /workspace/lightop 执行。
+- 使用命令格式：
+  docker exec <lightop-container> bash -lc 'cd /workspace/lightop && <command>'
+- 缺少 shape、dtype、API、是否返回 mean/rstd 时，先从现有 LightOp layernorm/rmsnorm
+  测试和 benchmark 推断，无法推断再问。
+
+必须先创建：
+- .humanize/lightop-agent/refined-plan.md
+- .humanize/lightop-agent/research-digest.md
+- .humanize/lightop-agent/attempt-ledger.md
+并启动或记录 Humanize loop state。
+
+授权范围：
+- 允许修改该 LightOp 仓库内与本任务相关的源码、测试、benchmark、必要 config。
+- 允许创建/写入 .humanize/lightop-agent/ 记录文件。
+- 允许执行上述 docker exec 命令进行 build、test、benchmark、profiling。
+- 允许执行 python setup.py install、hipprof/rocprof。
+- 不允许删除 build/。
+- 不允许 git reset、清理仓库、force push、删除大目录。
+- 不需要 git commit/push。
 ```

 ## LightOp 接入位置
@@ -125,6 +130,12 @@ LightOp KernelPilot 的 build 规则固定为：
 python setup.py install
 ```

+如果用户指定 Docker 容器，编译也必须进容器执行：
+
+```bash
+docker exec <container> bash -lc 'cd <container-lightop> && python setup.py install'
+```
+
 无论 PyTorch 版本是什么，都不切到 `setup_torch29.py`。正常调优循环中也不删除
 `build/`，以便复用增量编译结果；只有用户明确要求 clean build，或证明 cache
 损坏时才清理。
@@ -141,6 +152,11 @@ https://developer.sourcefind.cn/gitbook//dcu_developer/DeveloperGuide/dcu_progra

 ```bash
 cd /path/to/lightop
+bash /path/to/lightop-skills/humanize/scripts/measure-device-bandwidth.sh \
+  --docker <container> \
+  --workdir <container-lightop> \
+  --hip-visible-devices <idle-card> \
+  --output .humanize/lightop-agent/device-bandwidth.txt
 mkdir -p .humanize/lightop-agent/profile-artifacts/v000_baseline
 python test/test_<op>.py 2>&1 \
  | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/benchmark.log
@@ -148,6 +164,13 @@ hipprof python test/test_<op>.py 2>&1 \
  | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/hipprof.txt
 ```

+开始优化前必须先测当前选中卡的实际读写/拷贝带宽，作为后续算子有效带宽目标的参照。
+
+优化循环里，每个正确性通过的 candidate 在普通 benchmark 后都要补一轮
+`hipprof --pmc`，用于查看 cache 行为、LDS/bank conflict、occupancy/resource
+压力，并把结果转成下一步明确的 kernel edit；如果当前 DTK 不支持某个 counter，
+要记录实际命令和报错，不能用猜测替代。
+
 当 `hipprof` 和 benchmark log 不够解释问题时，可以进一步使用：

 ```text

--- a/docs/lightop-skills.zh-CN.md
+++ b/docs/lightop-skills.zh-CN.md
@@ -70,6 +70,11 @@ build、correctness test、benchmark、profiling 必须在同一个环境里做
 docker exec <container> bash -lc 'cd /path/in/container/lightop && <command>'
 ```

+如果用户指定了 Docker 容器，build/install 也必须使用这个形式，不能在宿主机直接编译。
+容器里的安装命令固定是 `python setup.py install`，可以按需加
+`PYTORCH_ROCM_ARCH=...` 前缀；不要使用 `setup_torch29.py`，正常调优循环里也不要删除
+`build/`。
+
 开工前需要记录：

 - 容器名或镜像名；不用 Docker 时记录 `direct-host`。
@@ -169,33 +174,16 @@ HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/<family>/benchmark_<op>.py
 当前卡、当前负载状态下的 sanity baseline，用来判断算子的有效带宽目标是否合理。

 ```bash
-mkdir -p .humanize/lightop-agent
-HIP_VISIBLE_DEVICES=<idle-card> python - <<'PY' 2>&1 | tee .humanize/lightop-agent/device-bandwidth.txt
-import time, torch
-torch.cuda.init()
-free, total = torch.cuda.mem_get_info()
-bytes_per_buf = max(16 << 20, min(512 << 20, int(free // 5)))
-n = bytes_per_buf // 4
-a = torch.empty(n, device="cuda", dtype=torch.float32)
-b = torch.empty_like(a)
-c = torch.empty_like(a)
-a.fill_(1.0); b.fill_(2.0); c.zero_(); torch.cuda.synchronize()
-def bench(name, fn, bytes_moved, iters=80, warmup=20):
-    for _ in range(warmup): fn()
-    torch.cuda.synchronize()
-    t0 = time.perf_counter()
-    for _ in range(iters): fn()
-    torch.cuda.synchronize()
-    dt = (time.perf_counter() - t0) / iters
-    print(f"{name}: {bytes_moved / dt / 1e12:.3f} TB/s ({dt * 1e6:.2f} us, bytes={bytes_moved})")
-bench("write_fill", lambda: a.fill_(3.0), n * 4)
-bench("copy_read_write", lambda: c.copy_(a), n * 4 * 2)
-bench("triad_2read_1write", lambda: torch.add(a, b, out=c), n * 4 * 3)
-bench("read_reduce", lambda: torch.sum(a), n * 4)
-print("buffer_bytes:", n * 4, "total_mem:", total, "free_mem_at_start:", free)
-PY
+bash /path/to/lightop-skills/humanize/scripts/measure-device-bandwidth.sh \
+  --docker <container> \
+  --workdir <container-lightop> \
+  --hip-visible-devices <idle-card> \
+  --output .humanize/lightop-agent/device-bandwidth.txt
 ```

+如果不使用 Docker，就去掉 `--docker`，并用 `--workdir <lightop-root>` 指定宿主机
+LightOp 路径，或者在 LightOp root 里直接运行脚本。
+
 ### 工作流

 Stage 1：检查和计划。
@@ -284,10 +272,20 @@ LightOp build 固定使用：
 PYTORCH_ROCM_ARCH='gfx928;gfx936;gfx938' python setup.py install
 ```

+如果执行环境是 Docker，必须包在用户指定的容器路径里：
+
+```bash
+docker exec <container> bash -lc 'cd <container-lightop> && PYTORCH_ROCM_ARCH="gfx928;gfx936;gfx938" python setup.py install'
+```
+
 无论 PyTorch 版本是什么，都不切到 `setup_torch29.py`。正常调优循环里不删除
 `build/`，这样能复用增量编译结果。只有用户明确要求 clean build，或者证明 build
 cache 损坏时，才清理。

+也就是说，普通 LightOp 调优任务里不要执行 `rm -rf build`，不要执行
+`python setup_torch29.py install`，Docker 任务也不要在宿主机直接
+`python setup.py install`。
+
 安装后做 import smoke：

 ```bash
@@ -579,17 +577,17 @@ hipprof --codeobj-analyze <binary-or-so> \

 ```text
 @lightop-kernel-agent-loop
-宿主机 LightOp 路径：/public/home/wanghl6/pr/lightop
-验证 Docker：wanghl_lightop209
-容器内 LightOp 路径：/home/pr/lightop
+宿主机 LightOp 路径：/path/to/lightop
+验证 Docker：<lightop-container>
+容器内 LightOp 路径：/workspace/lightop

 任务：在 LightOp 中添加 <operator> 算子。
 正确性参考：<PyTorch/native LightOp reference>
 性能目标：<目标 shape/dtype 的 latency 或有效带宽>

 要求：
- 所有 build/test/benchmark/profile 都必须在容器内 /home/pr/lightop 执行。
- 使用命令：docker exec wanghl_lightop209 bash -lc 'cd /home/pr/lightop && <command>'
+- 所有 build/test/benchmark/profile 都必须在容器内 /workspace/lightop 执行。
+- 使用命令：docker exec <lightop-container> bash -lc 'cd /workspace/lightop && <command>'
 - 遵循现有 LightOp wrapper、export.cpp、csrc、test、benchmark 风格。
 - 如果首版正确但未达性能目标，进入 profiling/tuning，不要停止。
 - 完成前给出 build、smoke test、correctness test、benchmark 的实际命令和结果。
@@ -599,9 +597,9 @@ hipprof --codeobj-analyze <binary-or-so> \

 ```text
 @lightop-kernel-agent-loop
-宿主机 LightOp 路径：/public/home/wanghl6/pr/lightop
-验证 Docker：wanghl_lightop209
-容器内 LightOp 路径：/home/pr/lightop
+宿主机 LightOp 路径：/path/to/lightop
+验证 Docker：<lightop-container>
+容器内 LightOp 路径：/workspace/lightop

 任务：优化已有算子 <operator>。
 目标源码：lightop/csrc/<family>/<file>.cu

--- a/humanize/scripts/measure-device-bandwidth.sh
+++ b/humanize/scripts/measure-device-bandwidth.sh
+#!/usr/bin/env bash
+#
+# Measure selected-device memory bandwidth before LightOp tuning.
+#
+# Direct mode:
+#   bash measure-device-bandwidth.sh --output .humanize/lightop-agent/device-bandwidth.txt --hip-visible-devices 0
+#
+# Docker mode:
+#   bash measure-device-bandwidth.sh --docker wanghl_lightop209 --workdir /home/lightop \
+#     --output .humanize/lightop-agent/device-bandwidth.txt --hip-visible-devices 0
+
+set -euo pipefail
+
+CONTAINER=""
+WORKDIR=""
+OUTPUT=".humanize/lightop-agent/device-bandwidth.txt"
+HIP_VISIBLE=""
+MAX_MIB="512"
+MIN_MIB="16"
+ITERS="80"
+WARMUP="20"
+DTYPE="float32"
+PYTHON_BIN="${PYTHON:-python}"
+
+usage() {
+    sed -n '2,10p' "$0" >&2
+    cat >&2 <<'EOF'
+
+Options:
+  --docker <container>          Run the measurement inside this Docker container.
+  --workdir <path>              Container or local LightOp root. Required with --docker.
+  --output <path>               Output path relative to workdir/cwd unless absolute.
+  --hip-visible-devices <id>    Value for HIP_VISIBLE_DEVICES during measurement.
+  --max-mib <n>                 Maximum bytes per buffer in MiB. Default: 512.
+  --min-mib <n>                 Minimum bytes per buffer in MiB. Default: 16.
+  --iters <n>                   Timed iterations. Default: 80.
+  --warmup <n>                  Warmup iterations. Default: 20.
+  --dtype <torch dtype>         float32, float16, bfloat16. Default: float32.
+EOF
+}
+
+shell_quote() {
+    printf '%q' "$1"
+}
+
+emit_python() {
+    cat <<'PY'
+import argparse
+import datetime as _dt
+import os
+import platform
+import sys
+import time
+
+import torch
+
+
+def parse_args():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--output", default=".humanize/lightop-agent/device-bandwidth.txt")
+    parser.add_argument("--max-mib", type=int, default=512)
+    parser.add_argument("--min-mib", type=int, default=16)
+    parser.add_argument("--iters", type=int, default=80)
+    parser.add_argument("--warmup", type=int, default=20)
+    parser.add_argument("--dtype", default="float32")
+    return parser.parse_args()
+
+
+def dtype_from_name(name):
+    mapping = {
+        "float32": torch.float32,
+        "fp32": torch.float32,
+        "float": torch.float32,
+        "float16": torch.float16,
+        "fp16": torch.float16,
+        "half": torch.float16,
+        "bfloat16": torch.bfloat16,
+        "bf16": torch.bfloat16,
+    }
+    if name not in mapping:
+        raise SystemExit(f"unsupported dtype: {name}")
+    return mapping[name]
+
+
+def main():
+    args = parse_args()
+    if not torch.cuda.is_available():
+        raise SystemExit("torch.cuda is not available in this environment")
+
+    dtype = dtype_from_name(args.dtype)
+    torch.cuda.init()
+    device_index = torch.cuda.current_device()
+    props = torch.cuda.get_device_properties(device_index)
+    free, total = torch.cuda.mem_get_info()
+
+    min_bytes = max(1, args.min_mib) << 20
+    max_bytes = max(args.min_mib, args.max_mib) << 20
+    bytes_per_buf = max(min_bytes, min(max_bytes, int(free // 5)))
+    elem_size = torch.empty((), device="cuda", dtype=dtype).element_size()
+    n = max(1, bytes_per_buf // elem_size)
+    bytes_per_buf = n * elem_size
+
+    a = torch.empty(n, device="cuda", dtype=dtype)
+    b = torch.empty_like(a)
+    c = torch.empty_like(a)
+    a.fill_(1.0)
+    b.fill_(2.0)
+    c.zero_()
+    torch.cuda.synchronize()
+
+    def bench(name, fn, bytes_moved):
+        for _ in range(args.warmup):
+            fn()
+        torch.cuda.synchronize()
+        t0 = time.perf_counter()
+        for _ in range(args.iters):
+            fn()
+        torch.cuda.synchronize()
+        seconds = (time.perf_counter() - t0) / args.iters
+        tbps = bytes_moved / seconds / 1e12
+        return name, tbps, seconds, bytes_moved
+
+    rows = [
+        bench("write_fill", lambda: a.fill_(3.0), bytes_per_buf),
+        bench("copy_read_write", lambda: c.copy_(a), bytes_per_buf * 2),
+        bench("triad_2read_1write", lambda: torch.add(a, b, out=c), bytes_per_buf * 3),
+        bench("read_reduce", lambda: torch.sum(a), bytes_per_buf),
+    ]
+
+    lines = [
+        "device_bandwidth_calibration:",
+        f"  timestamp_utc: {_dt.datetime.now(_dt.timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
+        f"  host: {platform.node()}",
+        f"  cwd: {os.getcwd()}",
+        f"  python: {sys.version.split()[0]}",
+        f"  torch: {torch.__version__}",
+        f"  hip: {getattr(torch.version, 'hip', None)}",
+        f"  hip_visible_devices: {os.environ.get('HIP_VISIBLE_DEVICES', '')}",
+        f"  device_index: {device_index}",
+        f"  device_name: {torch.cuda.get_device_name(device_index)}",
+        f"  gcn_arch: {getattr(props, 'gcnArchName', '')}",
+        f"  dtype: {args.dtype}",
+        f"  buffer_bytes: {bytes_per_buf}",
+        f"  total_mem_bytes: {total}",
+        f"  free_mem_bytes_at_start: {free}",
+        f"  warmup: {args.warmup}",
+        f"  iters: {args.iters}",
+        "  results:",
+    ]
+    for name, tbps, seconds, bytes_moved in rows:
+        lines.extend([
+            f"    {name}:",
+            f"      tbps: {tbps:.6f}",
+            f"      us_per_iter: {seconds * 1e6:.3f}",
+            f"      bytes_moved: {bytes_moved}",
+        ])
+
+    text = "\n".join(lines) + "\n"
+    print(text, end="")
+    if args.output:
+        output_dir = os.path.dirname(os.path.abspath(args.output))
+        os.makedirs(output_dir, exist_ok=True)
+        with open(args.output, "w", encoding="utf-8") as fh:
+            fh.write(text)
+
+
+if __name__ == "__main__":
+    main()
+PY
+}
+
+while [[ $# -gt 0 ]]; do
+    case "$1" in
+        --docker) CONTAINER="$2"; shift 2 ;;
+        --workdir) WORKDIR="$2"; shift 2 ;;
+        --output) OUTPUT="$2"; shift 2 ;;
+        --hip-visible-devices) HIP_VISIBLE="$2"; shift 2 ;;
+        --max-mib) MAX_MIB="$2"; shift 2 ;;
+        --min-mib) MIN_MIB="$2"; shift 2 ;;
+        --iters) ITERS="$2"; shift 2 ;;
+        --warmup) WARMUP="$2"; shift 2 ;;
+        --dtype) DTYPE="$2"; shift 2 ;;
+        -h|--help) usage; exit 0 ;;
+        *) echo "Error: unknown argument: $1" >&2; usage; exit 1 ;;
+    esac
+done
+
+if [[ -n "$CONTAINER" ]]; then
+    if [[ -z "$WORKDIR" ]]; then
+        echo "Error: --workdir is required with --docker" >&2
+        exit 1
+    fi
+    q_workdir="$(shell_quote "$WORKDIR")"
+    q_output="$(shell_quote "$OUTPUT")"
+    q_max="$(shell_quote "$MAX_MIB")"
+    q_min="$(shell_quote "$MIN_MIB")"
+    q_iters="$(shell_quote "$ITERS")"
+    q_warmup="$(shell_quote "$WARMUP")"
+    q_dtype="$(shell_quote "$DTYPE")"
+    hip_prefix=""
+    if [[ -n "$HIP_VISIBLE" ]]; then
+        hip_prefix="HIP_VISIBLE_DEVICES=$(shell_quote "$HIP_VISIBLE") "
+    fi
+    inner="cd $q_workdir && mkdir -p .humanize/lightop-agent && ${hip_prefix}$PYTHON_BIN - --output $q_output --max-mib $q_max --min-mib $q_min --iters $q_iters --warmup $q_warmup --dtype $q_dtype"
+    emit_python | docker exec -i "$CONTAINER" bash -lc "$inner"
+else
+    if [[ -n "$WORKDIR" ]]; then
+        cd "$WORKDIR"
+    fi
+    mkdir -p "$(dirname "$OUTPUT")"
+    if [[ -n "$HIP_VISIBLE" ]]; then
+        export HIP_VISIBLE_DEVICES="$HIP_VISIBLE"
+    fi
+    emit_python | "$PYTHON_BIN" - --output "$OUTPUT" --max-mib "$MAX_MIB" --min-mib "$MIN_MIB" --iters "$ITERS" --warmup "$WARMUP" --dtype "$DTYPE"
+fi
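
Reviewer sketch, not part of this commit: the calibration file written by `measure-device-bandwidth.sh` is YAML-like plain text, so a later step could read it back and check whether an operator's effective-bandwidth target (for example the 1.1 TB/s layer norm goal in the README prompt) is plausible on the selected card. The helper names `parse_calibration` and `target_ratio` are hypothetical, and the sample numbers are invented for illustration, not measurements.

```python
import re

# Illustrative sample in the layout emitted by measure-device-bandwidth.sh.
# The numbers are made up for the example; they are not measured results.
SAMPLE = """\
device_bandwidth_calibration:
  dtype: float32
  results:
    write_fill:
      tbps: 1.000000
      us_per_iter: 536.871
      bytes_moved: 536870912
    copy_read_write:
      tbps: 1.250000
      us_per_iter: 858.993
      bytes_moved: 1073741824
"""


def parse_calibration(text):
    """Return {benchmark_name: tbps} from the 'results:' section."""
    results = {}
    current = None
    in_results = False
    for line in text.splitlines():
        if line.strip() == "results:":
            in_results = True
            continue
        if not in_results:
            continue
        # Benchmark names are indented by four spaces, their fields by six.
        name = re.fullmatch(r" {4}(\w+):", line)
        if name:
            current = name.group(1)
            continue
        value = re.fullmatch(r" {6}tbps: ([0-9.]+)", line)
        if value and current is not None:
            results[current] = float(value.group(1))
    return results


def target_ratio(target_tbps, measured_tbps):
    """Fraction of the measured bandwidth that the target asks for."""
    if measured_tbps <= 0:
        raise ValueError("measured bandwidth must be positive")
    return target_tbps / measured_tbps


if __name__ == "__main__":
    measured = parse_calibration(SAMPLE)
    ratio = target_ratio(1.1, measured["copy_read_write"])
    print(f"target is {ratio:.0%} of measured copy bandwidth")
```

A ratio above 1.0 would mean the target exceeds what the card sustained for a plain copy in the calibration run, which is a signal to revisit the target before tuning.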
--- a/humanize/skills/humanize-kernel-agent-loop/SKILL.md
+++ b/humanize/skills/humanize-kernel-agent-loop/SKILL.md
@@ -70,6 +70,13 @@ logs and repeat the same command:
 docker exec <container> bash -lc 'cd /path/in/container/lightop && <command>'
 ```

+When Docker is the named execution environment, the LightOp install/build step
+must also be run through that exact `docker exec` shape. Do not build on the
+host for that task. The install command inside the container is always
+`python setup.py install` (optionally prefixed by `PYTORCH_ROCM_ARCH=...`);
+never use `setup_torch29.py`, and never remove `build/` as part of the normal
+iteration loop.
+
 If the user provides an image rather than a running container, ask for or infer
 the host LightOp path and run with DCU devices exposed:

@@ -221,39 +228,18 @@ Example:

 ```bash
 hy-smi || rocm-smi
-mkdir -p .humanize/lightop-agent
-HIP_VISIBLE_DEVICES=<idle-card> python - <<'PY' 2>&1 | tee .humanize/lightop-agent/device-bandwidth.txt
-import time, torch
-torch.cuda.init()
-free, total = torch.cuda.mem_get_info()
-bytes_per_buf = max(16 << 20, min(512 << 20, int(free // 5)))
-n = bytes_per_buf // 4
-a = torch.empty(n, device="cuda", dtype=torch.float32)
-b = torch.empty_like(a)
-c = torch.empty_like(a)
-a.fill_(1.0); b.fill_(2.0); c.zero_(); torch.cuda.synchronize()
-
-def bench(name, fn, bytes_moved, iters=80, warmup=20):
-    for _ in range(warmup):
-        fn()
-    torch.cuda.synchronize()
-    t0 = time.perf_counter()
-    for _ in range(iters):
-        fn()
-    torch.cuda.synchronize()
-    dt = (time.perf_counter() - t0) / iters
-    print(f"{name}: {bytes_moved / dt / 1e12:.3f} TB/s ({dt * 1e6:.2f} us, bytes={bytes_moved})")
-
-bench("write_fill", lambda: a.fill_(3.0), n * 4)
-bench("copy_read_write", lambda: c.copy_(a), n * 4 * 2)
-bench("triad_2read_1write", lambda: torch.add(a, b, out=c), n * 4 * 3)
-bench("read_reduce", lambda: torch.sum(a), n * 4)
-print("buffer_bytes:", n * 4, "total_mem:", total, "free_mem_at_start:", free)
-PY
+bash "{{HUMANIZE_RUNTIME_ROOT}}/scripts/measure-device-bandwidth.sh" \
+  --docker <container> \
+  --workdir <container-lightop> \
+  --hip-visible-devices <idle-card> \
+  --output .humanize/lightop-agent/device-bandwidth.txt
 HIP_VISIBLE_DEVICES=<idle-card> python test/<family>/benchmark_<op>.py
 HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/<family>/benchmark_<op>.py
 ```

+For direct-host execution, omit `--docker` and pass `--workdir <lightop-root>`
+or run the script from the LightOp root.
+
 Do not report a performance number as actionable unless the device-selection
 gate and device bandwidth calibration were recorded, or the user explicitly
 accepts the missing evidence.
@@ -450,11 +436,22 @@ PyTorch version. Do not switch to `setup_torch29.py`.
 PYTORCH_ROCM_ARCH='gfx928;gfx936;gfx938' python setup.py install
 ```

+If the selected execution environment is Docker, wrap the same command in the
+user-provided container path:
+
+```bash
+docker exec <container> bash -lc 'cd <container-lightop> && PYTORCH_ROCM_ARCH="gfx928;gfx936;gfx938" python setup.py install'
+```
+
 Keep the existing `build/` directory between attempts so incremental extension
 builds can reuse prior compilation output. Do not delete `build/` as part of
 the normal build/test/tune loop unless the user explicitly requests a clean
 build or the build cache is proven to be stale or corrupt.

+Commands such as `rm -rf build`, `python setup_torch29.py install`, or a host
+side `python setup.py install` for a Docker-bound task violate this skill unless
+the user explicitly overrides the rule.
+
 After install, run an import smoke test in the same environment:

 ```bash

--- a/humanize/skills/ncu-report/SKILL.md
+++ b/humanize/skills/ncu-report/SKILL.md
@@ -140,31 +140,11 @@ Load [examples.md](references/examples.md) for copyable command variants.
 Minimal first-pass capture from the LightOp root:

 ```bash
-mkdir -p .humanize/lightop-agent
-HIP_VISIBLE_DEVICES=<idle-card> python - <<'PY' 2>&1 | tee .humanize/lightop-agent/device-bandwidth.txt
-import time, torch
-torch.cuda.init()
-free, total = torch.cuda.mem_get_info()
-bytes_per_buf = max(16 << 20, min(512 << 20, int(free // 5)))
-n = bytes_per_buf // 4
-a = torch.empty(n, device="cuda", dtype=torch.float32)
-b = torch.empty_like(a)
-c = torch.empty_like(a)
-a.fill_(1.0); b.fill_(2.0); c.zero_(); torch.cuda.synchronize()
-def bench(name, fn, bytes_moved, iters=80, warmup=20):
-    for _ in range(warmup): fn()
-    torch.cuda.synchronize()
-    t0 = time.perf_counter()
-    for _ in range(iters): fn()
-    torch.cuda.synchronize()
-    dt = (time.perf_counter() - t0) / iters
-    print(f"{name}: {bytes_moved / dt / 1e12:.3f} TB/s ({dt * 1e6:.2f} us, bytes={bytes_moved})")
-bench("write_fill", lambda: a.fill_(3.0), n * 4)
-bench("copy_read_write", lambda: c.copy_(a), n * 4 * 2)
-bench("triad_2read_1write", lambda: torch.add(a, b, out=c), n * 4 * 3)
-bench("read_reduce", lambda: torch.sum(a), n * 4)
-print("buffer_bytes:", n * 4, "total_mem:", total, "free_mem_at_start:", free)
-PY
+bash <humanize-root>/scripts/measure-device-bandwidth.sh \
+  --docker <container> \
+  --workdir <container-lightop> \
+  --hip-visible-devices <idle-card> \
+  --output .humanize/lightop-agent/device-bandwidth.txt
 mkdir -p .humanize/lightop-agent/profile-artifacts/v000_baseline
 { hy-smi || rocm-smi; } 2>&1 \
   | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/device-status.txt