Commit 0fec721c authored by whlwhlwhl

Add build constraints and improve the prompt

parent 2ad344b2
@@ -11,9 +11,6 @@ LightOp KernelPilot 是面向 LightOp DCU 算子库的 KernelPilot 工作流改
 workload distribution, benchmark evidence, profile digests, attempt records, optimization records, and
 iteration with a review gate.
-It removes NVIDIA-first assumptions such as Nsight Compute, CUTLASS/CuTe, PTX/SASS,
-Blackwell/Hopper, TMA, WGMMA, and tcgen05. It targets DCU/ROCm/HIP/DTK environments by default.
 ## Skills
 | Skill | Purpose |
@@ -25,6 +22,9 @@ Blackwell/Hopper、TMA、WGMMA、tcgen05 等。默认面向 DCU/ROCm/HIP/DTK 环
 The on-disk directory names `humanize-kernel-agent-loop` and `ncu-report` are kept for compatibility with the upstream
 Humanize installer; the skill names actually exposed to the agent are the ones in the frontmatter listed in the table above.
+To read the Chinese descriptions of these three skills yourself, see
+[`docs/lightop-skills.zh-CN.md`](docs/lightop-skills.zh-CN.md)
 ## Manual installation
 If no Claude or Codex CLI installer is available, you can directly install these three LightOp/DCU
@@ -66,32 +66,37 @@ Codex: ${CODEX_HOME:-~/.codex}/skills
 A clear request should include: the operator name, correctness reference, workload, execution environment, target
 DCU/gfx arch, baseline, benchmark method, and success threshold.
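A minimal sketch of how such a request could be checked for completeness before any work starts. The `KernelRequest` record and `missing_fields` helper are hypothetical names introduced for illustration; they mirror the field list above and are not part of LightOp or Humanize.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class KernelRequest:
    """Hypothetical record of what a clear request should state."""
    operator: Optional[str] = None
    correctness_reference: Optional[str] = None
    workload: Optional[str] = None
    execution_environment: Optional[str] = None
    target_arch: Optional[str] = None
    baseline: Optional[str] = None
    benchmark_method: Optional[str] = None
    success_threshold: Optional[str] = None


def missing_fields(request: KernelRequest) -> List[str]:
    """Return the names of fields that are unset or empty, in declaration order."""
    return [f.name for f in fields(request) if not getattr(request, f.name)]


# Illustrative request: only some of the items are filled in.
request = KernelRequest(
    operator="layer_norm",
    correctness_reference="PyTorch layer_norm",
    execution_environment="docker:<lightop-container>",
    success_threshold="effective bandwidth >= 1.1 TB/s",
)
print("still missing:", missing_fields(request))
```

Anything reported as missing can first be inferred from existing tests and benchmarks, and asked about only if inference fails.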
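-Example:
-```text
-[$lightop-kernel-agent-loop] Add a fused rmsnorm + rope + fp8
-kv-cache store operator to LightOp, targeting gfx936. For the correctness reference, use the PyTorch/native LightOp
-composite path, covering the batch/token/head_dim shapes of Qwen decode. For validation, use
-the Docker container lightop-dtk; the repo path inside the container is /workspace/lightop.
-Performance requirement: p50 latency 15% faster than the existing unfused path.
-```
-Example for optimizing an existing operator:
-```text
-[$lightop-kernel-agent-loop] Optimize lightop.moe_gemm_w8a8 on gfx938;
-the target workload is DeepSeek EP8 decode. Keep the existing Python API unchanged and
-compare against the current LightOp baseline; when the benchmark plateaus, continue the analysis with hipprof evidence.
-```
-If the host path and the container path differ, state both explicitly:
+Prompt example:
 ```text
-Host LightOp path: /public/wanghl6/lightop
-Validation container: wanghl_lightop209
-LightOp path inside the container: /home/lightop
-All build, correctness test, benchmark, and profiling steps must run inside the container:
-docker exec wanghl_lightop209 bash -lc 'cd /home/lightop && <command>'
+@lightop-kernel-agent-loop
+- Host LightOp path: /path/to/lightop
+- Validation Docker: <lightop-container>
+- LightOp path inside the container: /workspace/lightop
+Task: add a 1-pass layer norm operator.
+- Correctness reference: PyTorch layer_norm.
+- Performance target: reach a peak effective bandwidth of 1.1 TB/s.
+- All build/test/benchmark/profile steps must run inside the container at /workspace/lightop.
+- Use this command form:
+  docker exec <lightop-container> bash -lc 'cd /workspace/lightop && <command>'
+- If the shape, dtype, API, or whether mean/rstd is returned is not specified, first infer it from the existing LightOp layernorm/rmsnorm
+  tests and benchmarks; ask only if it cannot be inferred.
+The following must be created first:
+- .humanize/lightop-agent/refined-plan.md
+- .humanize/lightop-agent/research-digest.md
+- .humanize/lightop-agent/attempt-ledger.md
+Also start or record the Humanize loop state.
+Authorized scope:
+- May modify source, tests, benchmarks, and necessary config related to this task inside this LightOp repository.
+- May create and write record files under .humanize/lightop-agent/.
+- May run the docker exec commands above for build, test, benchmark, and profiling.
+- May run python setup.py install and hipprof/rocprof.
+- Must not delete build/.
+- Must not git reset, clean the repository, force push, or delete large directories.
+- No git commit/push is needed.
 ```
 ## LightOp integration points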
@@ -125,6 +130,12 @@ The LightOp KernelPilot build rule is fixed:
 python setup.py install
 ```
+If the user specifies a Docker container, the build must also run inside the container:
+```bash
+docker exec <container> bash -lc 'cd <container-lightop> && python setup.py install'
+```
 Regardless of the PyTorch version, never switch to `setup_torch29.py`. During the normal tuning loop, do not delete
 `build/` either, so that incremental build results can be reused; clean it only when the user explicitly asks for a clean build or the cache
 is proven to be corrupt.
@@ -141,6 +152,11 @@ https://developer.sourcefind.cn/gitbook//dcu_developer/DeveloperGuide/dcu_progra
 ```bash
 cd /path/to/lightop
+bash /path/to/lightop-skiils/humanize/scripts/measure-device-bandwidth.sh \
+  --docker <container> \
+  --workdir <container-lightop> \
+  --hip-visible-devices <idle-card> \
+  --output .humanize/lightop-agent/device-bandwidth.txt
 mkdir -p .humanize/lightop-agent/profile-artifacts/v000_baseline
 python test/test_<op>.py 2>&1 \
   | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/benchmark.log
@@ -148,6 +164,13 @@ hipprof python test/test_<op>.py 2>&1 \
   | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/hipprof.txt
 ```
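+Before optimizing, measure the actual read/write/copy bandwidth of the currently selected card; it serves as the reference for the operator's effective-bandwidth targets later on.
+In the optimization loop, every candidate that passes correctness gets an additional
+`hipprof --pmc` run after the regular benchmark, to inspect cache behavior, LDS/bank conflicts, and occupancy/resource
+pressure, and the results must be turned into a concrete next kernel edit; if the current DTK does not support a counter,
+record the actual command and error message instead of guessing.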
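One way to make "record the actual command and error" mechanical is a small wrapper that logs every profiler attempt verbatim. This is a sketch under assumptions: `run_and_record` and the log file name are hypothetical, and the demo uses a stand-in Python command that fails on purpose rather than a real `hipprof --pmc` invocation.

```python
import shlex
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List


def run_and_record(cmd: List[str], log_path: Path) -> int:
    """Run cmd and append the exact command, exit code, and output to log_path."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(f"command: {shlex.join(cmd)}\n")
        fh.write(f"exit_code: {proc.returncode}\n")
        fh.write(f"stdout:\n{proc.stdout}")
        fh.write(f"stderr:\n{proc.stderr}")
    return proc.returncode


# Stand-in for a profiler run that rejects a counter (not a real hipprof call).
demo_cmd = [
    sys.executable,
    "-c",
    "import sys; print('unsupported counter', file=sys.stderr); sys.exit(3)",
]
log_file = Path(tempfile.mkdtemp()) / "pmc-attempt.log"
exit_code = run_and_record(demo_cmd, log_file)
print("exit code:", exit_code)
print(log_file.read_text(encoding="utf-8"))
```

In real use the command list would be the `hipprof --pmc` invocation that was actually run, so a failed attempt leaves evidence rather than a guess.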
 When `hipprof` and the benchmark log are not enough to explain a problem, you can go further with:
 ```text
......
@@ -70,6 +70,11 @@ build, correctness test, benchmark, and profiling must be done in the same environment
 docker exec <container> bash -lc 'cd /path/in/container/lightop && <command>'
 ```
+If the user specifies a Docker container, build/install must also use this form; do not build directly on the host.
+The install command inside the container is always `python setup.py install`, optionally prefixed with
+`PYTORCH_ROCM_ARCH=...`; do not use `setup_torch29.py`, and do not delete
+`build/` during the normal tuning loop.
 Record before starting work:
 - Container name or image name; record `direct-host` when Docker is not used
@@ -169,33 +174,16 @@ HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/<family>/benchmark_<op>.py
 a sanity baseline for the current card under its current load, used to judge whether the operator's effective-bandwidth target is reasonable.
 ```bash
-mkdir -p .humanize/lightop-agent
-HIP_VISIBLE_DEVICES=<idle-card> python - <<'PY' 2>&1 | tee .humanize/lightop-agent/device-bandwidth.txt
-import time, torch
-torch.cuda.init()
-free, total = torch.cuda.mem_get_info()
-bytes_per_buf = max(16 << 20, min(512 << 20, int(free // 5)))
-n = bytes_per_buf // 4
-a = torch.empty(n, device="cuda", dtype=torch.float32)
-b = torch.empty_like(a)
-c = torch.empty_like(a)
-a.fill_(1.0); b.fill_(2.0); c.zero_(); torch.cuda.synchronize()
-def bench(name, fn, bytes_moved, iters=80, warmup=20):
-    for _ in range(warmup): fn()
-    torch.cuda.synchronize()
-    t0 = time.perf_counter()
-    for _ in range(iters): fn()
-    torch.cuda.synchronize()
-    dt = (time.perf_counter() - t0) / iters
-    print(f"{name}: {bytes_moved / dt / 1e12:.3f} TB/s ({dt * 1e6:.2f} us, bytes={bytes_moved})")
-bench("write_fill", lambda: a.fill_(3.0), n * 4)
-bench("copy_read_write", lambda: c.copy_(a), n * 4 * 2)
-bench("triad_2read_1write", lambda: torch.add(a, b, out=c), n * 4 * 3)
-bench("read_reduce", lambda: torch.sum(a), n * 4)
-print("buffer_bytes:", n * 4, "total_mem:", total, "free_mem_at_start:", free)
-PY
+bash /path/to/lightop-skiils/humanize/scripts/measure-device-bandwidth.sh \
+  --docker <container> \
+  --workdir <container-lightop> \
+  --hip-visible-devices <idle-card> \
+  --output .humanize/lightop-agent/device-bandwidth.txt
 ```
+If Docker is not used, drop `--docker` and pass `--workdir <lightop-root>` to point at the host
+LightOp path, or run the script directly from the LightOp root.
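As a worked example of how the calibration can be used to sanity-check a target, the sketch below converts an effective-bandwidth goal into a time budget and compares it with a measured copy bandwidth. The shape, dtype, and the measured value are made-up assumptions, and the traffic model (read the input once, write the output once, ignoring the small weight/bias vectors) is a simplification for a 1-pass layer norm.

```python
def min_bytes_one_pass(rows: int, cols: int, elem_size: int) -> int:
    """Minimum traffic for a 1-pass normalization: one read plus one write of the tensor."""
    return 2 * rows * cols * elem_size


def effective_tbps(bytes_moved: int, seconds: float) -> float:
    """Effective bandwidth in TB/s, using the same 1e12 convention as the calibration."""
    return bytes_moved / seconds / 1e12


# Assumed workload: 8192 x 8192 fp16 tensor (illustrative, not from the source).
rows, cols, elem_size = 8192, 8192, 2
bytes_moved = min_bytes_one_pass(rows, cols, elem_size)

target_tbps = 1.1          # effective-bandwidth goal from the request
measured_copy_tbps = 1.3   # hypothetical copy_read_write result from the calibration

# Time the kernel may take while still meeting the target.
time_budget_us = bytes_moved / (target_tbps * 1e12) * 1e6
# How much of the measured copy bandwidth the target demands.
fraction_of_measured = target_tbps / measured_copy_tbps

print(f"bytes moved: {bytes_moved}")
print(f"time budget: {time_budget_us:.2f} us")
print(f"target is {fraction_of_measured:.0%} of measured copy bandwidth")
```

If the target exceeds what the card sustains for a plain copy, the target rather than the kernel is the first thing to revisit.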
 ### Workflow
 Stage 1: inspect and plan.
@@ -284,10 +272,20 @@ LightOp builds always use:
 PYTORCH_ROCM_ARCH='gfx928;gfx936;gfx938' python setup.py install
 ```
+If the execution environment is Docker, the command must be wrapped in the user-specified container path:
+```bash
+docker exec <container> bash -lc 'cd <container-lightop> && PYTORCH_ROCM_ARCH="gfx928;gfx936;gfx938" python setup.py install'
+```
 Regardless of the PyTorch version, never switch to `setup_torch29.py`. During the normal tuning loop, do not delete
 `build/`, so that incremental build results can be reused. Clean it only when the user explicitly asks for a clean build or the build
 cache is proven to be corrupt.
+In other words, in an ordinary LightOp tuning task do not run `rm -rf build`, do not run
+`python setup_torch29.py install`, and for a Docker task do not run
+`python setup.py install` directly on the host.
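The rules above are simple enough to encode as a pre-flight check. The following is a minimal sketch, assuming hypothetical helper names (`docker_install_command`, `violates_build_rules`) and a plain substring match; it is not part of the skill itself.

```python
import shlex
from typing import List, Optional

# Commands the build rules exclude from the normal tuning loop.
FORBIDDEN_SNIPPETS = ("rm -rf build", "setup_torch29.py")


def docker_install_command(container: str, workdir: str,
                           rocm_arch: Optional[str] = None) -> List[str]:
    """Build the argv for running the fixed install command inside a container."""
    install = "python setup.py install"
    if rocm_arch:
        install = f"PYTORCH_ROCM_ARCH={shlex.quote(rocm_arch)} {install}"
    inner = f"cd {shlex.quote(workdir)} && {install}"
    return ["docker", "exec", container, "bash", "-lc", inner]


def violates_build_rules(command: str, docker_bound: bool) -> Optional[str]:
    """Return a reason if the command breaks the build rules, else None."""
    for snippet in FORBIDDEN_SNIPPETS:
        if snippet in command:
            return f"forbidden in the normal tuning loop: {snippet}"
    if docker_bound and "setup.py install" in command \
            and not command.startswith("docker exec"):
        return "Docker-bound task: build inside the container, not on the host"
    return None


argv = docker_install_command("my-container", "/workspace/lightop",
                              "gfx928;gfx936;gfx938")
print(shlex.join(argv))
print(violates_build_rules("python setup.py install", docker_bound=True))
```

A user who explicitly asks for a clean build overrides the check; the sketch only covers the default loop.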
 After installation, run an import smoke test:
 ```bash
@@ -579,17 +577,17 @@ hipprof --codeobj-analyze <binary-or-so> \
 ```text
 @lightop-kernel-agent-loop
-Host LightOp path: /public/home/wanghl6/pr/lightop
-Validation Docker: wanghl_lightop209
-LightOp path inside the container: /home/pr/lightop
+Host LightOp path: /path/to/lightop
+Validation Docker: <lightop-container>
+LightOp path inside the container: /workspace/lightop
 Task: add the <operator> operator to LightOp.
 Correctness reference: <PyTorch/native LightOp reference>
 Performance target: <latency or effective bandwidth for the target shape/dtype>
 Requirements:
-- All build/test/benchmark/profile steps must run inside the container at /home/pr/lightop.
-- Use the command: docker exec wanghl_lightop209 bash -lc 'cd /home/pr/lightop && <command>'
+- All build/test/benchmark/profile steps must run inside the container at /workspace/lightop.
+- Use the command: docker exec <lightop-container> bash -lc 'cd /workspace/lightop && <command>'
 - Follow the existing LightOp wrapper, export.cpp, csrc, test, and benchmark style.
 - If the first version is correct but misses the performance target, move on to profiling/tuning; do not stop.
 - Before finishing, report the actual commands and results for build, smoke test, correctness test, and benchmark.
@@ -599,9 +597,9 @@ hipprof --codeobj-analyze <binary-or-so> \
 ```text
 @lightop-kernel-agent-loop
-Host LightOp path: /public/home/wanghl6/pr/lightop
-Validation Docker: wanghl_lightop209
-LightOp path inside the container: /home/pr/lightop
+Host LightOp path: /path/to/lightop
+Validation Docker: <lightop-container>
+LightOp path inside the container: /workspace/lightop
 Task: optimize the existing operator <operator>.
 Target source: lightop/csrc/<family>/<file>.cu
......
#!/usr/bin/env bash
#
# Measure selected-device memory bandwidth before LightOp tuning.
#
# Direct mode:
#   bash measure-device-bandwidth.sh --output .humanize/lightop-agent/device-bandwidth.txt --hip-visible-devices 0
#
# Docker mode:
#   bash measure-device-bandwidth.sh --docker wanghl_lightop209 --workdir /home/lightop \
#     --output .humanize/lightop-agent/device-bandwidth.txt --hip-visible-devices 0

set -euo pipefail

CONTAINER=""
WORKDIR=""
OUTPUT=".humanize/lightop-agent/device-bandwidth.txt"
HIP_VISIBLE=""
MAX_MIB="512"
MIN_MIB="16"
ITERS="80"
WARMUP="20"
DTYPE="float32"
PYTHON_BIN="${PYTHON:-python}"

usage() {
  sed -n '2,10p' "$0" >&2
  cat >&2 <<'EOF'
Options:
  --docker <container>         Run the measurement inside this Docker container.
  --workdir <path>             Container or local LightOp root. Required with --docker.
  --output <path>              Output path relative to workdir/cwd unless absolute.
  --hip-visible-devices <id>   Value for HIP_VISIBLE_DEVICES during measurement.
  --max-mib <n>                Maximum bytes per buffer in MiB. Default: 512.
  --min-mib <n>                Minimum bytes per buffer in MiB. Default: 16.
  --iters <n>                  Timed iterations. Default: 80.
  --warmup <n>                 Warmup iterations. Default: 20.
  --dtype <torch dtype>        float32, float16, bfloat16. Default: float32.
EOF
}

shell_quote() {
  printf '%q' "$1"
}
emit_python() {
  cat <<'PY'
import argparse
import datetime as _dt
import os
import platform
import sys
import time

import torch


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--output", default=".humanize/lightop-agent/device-bandwidth.txt")
    parser.add_argument("--max-mib", type=int, default=512)
    parser.add_argument("--min-mib", type=int, default=16)
    parser.add_argument("--iters", type=int, default=80)
    parser.add_argument("--warmup", type=int, default=20)
    parser.add_argument("--dtype", default="float32")
    return parser.parse_args()


def dtype_from_name(name):
    mapping = {
        "float32": torch.float32,
        "fp32": torch.float32,
        "float": torch.float32,
        "float16": torch.float16,
        "fp16": torch.float16,
        "half": torch.float16,
        "bfloat16": torch.bfloat16,
        "bf16": torch.bfloat16,
    }
    if name not in mapping:
        raise SystemExit(f"unsupported dtype: {name}")
    return mapping[name]


def main():
    args = parse_args()
    if not torch.cuda.is_available():
        raise SystemExit("torch.cuda is not available in this environment")
    dtype = dtype_from_name(args.dtype)
    torch.cuda.init()
    device_index = torch.cuda.current_device()
    props = torch.cuda.get_device_properties(device_index)
    free, total = torch.cuda.mem_get_info()
    min_bytes = max(1, args.min_mib) << 20
    max_bytes = max(args.min_mib, args.max_mib) << 20
    bytes_per_buf = max(min_bytes, min(max_bytes, int(free // 5)))
    elem_size = torch.empty((), device="cuda", dtype=dtype).element_size()
    n = max(1, bytes_per_buf // elem_size)
    bytes_per_buf = n * elem_size
    a = torch.empty(n, device="cuda", dtype=dtype)
    b = torch.empty_like(a)
    c = torch.empty_like(a)
    a.fill_(1.0)
    b.fill_(2.0)
    c.zero_()
    torch.cuda.synchronize()

    def bench(name, fn, bytes_moved):
        for _ in range(args.warmup):
            fn()
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(args.iters):
            fn()
        torch.cuda.synchronize()
        seconds = (time.perf_counter() - t0) / args.iters
        tbps = bytes_moved / seconds / 1e12
        return name, tbps, seconds, bytes_moved

    rows = [
        bench("write_fill", lambda: a.fill_(3.0), bytes_per_buf),
        bench("copy_read_write", lambda: c.copy_(a), bytes_per_buf * 2),
        bench("triad_2read_1write", lambda: torch.add(a, b, out=c), bytes_per_buf * 3),
        bench("read_reduce", lambda: torch.sum(a), bytes_per_buf),
    ]
    lines = [
        "device_bandwidth_calibration:",
        f" timestamp_utc: {_dt.datetime.utcnow().isoformat(timespec='seconds')}Z",
        f" host: {platform.node()}",
        f" cwd: {os.getcwd()}",
        f" python: {sys.version.split()[0]}",
        f" torch: {torch.__version__}",
        f" hip: {getattr(torch.version, 'hip', None)}",
        f" hip_visible_devices: {os.environ.get('HIP_VISIBLE_DEVICES', '')}",
        f" device_index: {device_index}",
        f" device_name: {torch.cuda.get_device_name(device_index)}",
        f" gcn_arch: {getattr(props, 'gcnArchName', '')}",
        f" dtype: {args.dtype}",
        f" buffer_bytes: {bytes_per_buf}",
        f" total_mem_bytes: {total}",
        f" free_mem_bytes_at_start: {free}",
        f" warmup: {args.warmup}",
        f" iters: {args.iters}",
        " results:",
    ]
    for name, tbps, seconds, bytes_moved in rows:
        lines.extend([
            f" {name}:",
            f" tbps: {tbps:.6f}",
            f" us_per_iter: {seconds * 1e6:.3f}",
            f" bytes_moved: {bytes_moved}",
        ])
    text = "\n".join(lines) + "\n"
    print(text, end="")
    if args.output:
        output_dir = os.path.dirname(os.path.abspath(args.output))
        os.makedirs(output_dir, exist_ok=True)
        with open(args.output, "w", encoding="utf-8") as fh:
            fh.write(text)


if __name__ == "__main__":
    main()
PY
}
while [[ $# -gt 0 ]]; do
  case "$1" in
    --docker) CONTAINER="$2"; shift 2 ;;
    --workdir) WORKDIR="$2"; shift 2 ;;
    --output) OUTPUT="$2"; shift 2 ;;
    --hip-visible-devices) HIP_VISIBLE="$2"; shift 2 ;;
    --max-mib) MAX_MIB="$2"; shift 2 ;;
    --min-mib) MIN_MIB="$2"; shift 2 ;;
    --iters) ITERS="$2"; shift 2 ;;
    --warmup) WARMUP="$2"; shift 2 ;;
    --dtype) DTYPE="$2"; shift 2 ;;
    -h|--help) usage; exit 0 ;;
    *) echo "Error: unknown argument: $1" >&2; usage; exit 1 ;;
  esac
done

if [[ -n "$CONTAINER" ]]; then
  if [[ -z "$WORKDIR" ]]; then
    echo "Error: --workdir is required with --docker" >&2
    exit 1
  fi
  q_workdir="$(shell_quote "$WORKDIR")"
  q_output="$(shell_quote "$OUTPUT")"
  q_max="$(shell_quote "$MAX_MIB")"
  q_min="$(shell_quote "$MIN_MIB")"
  q_iters="$(shell_quote "$ITERS")"
  q_warmup="$(shell_quote "$WARMUP")"
  q_dtype="$(shell_quote "$DTYPE")"
  hip_prefix=""
  if [[ -n "$HIP_VISIBLE" ]]; then
    hip_prefix="HIP_VISIBLE_DEVICES=$(shell_quote "$HIP_VISIBLE") "
  fi
  inner="cd $q_workdir && mkdir -p .humanize/lightop-agent && ${hip_prefix}$PYTHON_BIN - --output $q_output --max-mib $q_max --min-mib $q_min --iters $q_iters --warmup $q_warmup --dtype $q_dtype"
  emit_python | docker exec -i "$CONTAINER" bash -lc "$inner"
else
  if [[ -n "$WORKDIR" ]]; then
    cd "$WORKDIR"
  fi
  mkdir -p "$(dirname "$OUTPUT")"
  if [[ -n "$HIP_VISIBLE" ]]; then
    export HIP_VISIBLE_DEVICES="$HIP_VISIBLE"
  fi
  emit_python | "$PYTHON_BIN" - --output "$OUTPUT" --max-mib "$MAX_MIB" --min-mib "$MIN_MIB" --iters "$ITERS" --warmup "$WARMUP" --dtype "$DTYPE"
fi
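The report the script writes is plain `key: value` text with a `results:` section, so later steps can read the numbers back without extra dependencies. The sketch below is one possible reader; `parse_tbps` is a hypothetical helper, it deliberately ignores indentation, and the sample numbers are invented for illustration.

```python
from typing import Dict


def parse_tbps(text: str) -> Dict[str, float]:
    """Extract {benchmark name: TB/s} from a device-bandwidth report."""
    results: Dict[str, float] = {}
    in_results = False
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        if key == "results" and not value:
            in_results = True          # everything after this is per-benchmark data
        elif in_results and not value:
            current = key              # a bare "name:" line starts a new benchmark
        elif in_results and current and key == "tbps":
            results[current] = float(value)
    return results


# Invented sample in the same layout the script emits.
sample = """device_bandwidth_calibration:
 dtype: float32
 hip_visible_devices:
 results:
 write_fill:
 tbps: 0.900000
 us_per_iter: 596.523
 bytes_moved: 536870912
 copy_read_write:
 tbps: 1.200000
 us_per_iter: 894.785
 bytes_moved: 1073741824
"""
print(parse_tbps(sample))
```

Because only lines after `results:` are considered, header fields with empty values (such as an unset `hip_visible_devices`) are not mistaken for benchmark names.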
@@ -70,6 +70,13 @@ logs and repeat the same command:
 docker exec <container> bash -lc 'cd /path/in/container/lightop && <command>'
 ```
+When Docker is the named execution environment, the LightOp install/build step
+must also be run through that exact `docker exec` shape. Do not build on the
+host for that task. The install command inside the container is always
+`python setup.py install` (optionally prefixed by `PYTORCH_ROCM_ARCH=...`);
+never use `setup_torch29.py`, and never remove `build/` as part of the normal
+iteration loop.
 If the user provides an image rather than a running container, ask for or infer
 the host LightOp path and run with DCU devices exposed:
@@ -221,39 +228,18 @@ Example:
 ```bash
 hy-smi || rocm-smi
-mkdir -p .humanize/lightop-agent
-HIP_VISIBLE_DEVICES=<idle-card> python - <<'PY' 2>&1 | tee .humanize/lightop-agent/device-bandwidth.txt
-import time, torch
-torch.cuda.init()
-free, total = torch.cuda.mem_get_info()
-bytes_per_buf = max(16 << 20, min(512 << 20, int(free // 5)))
-n = bytes_per_buf // 4
-a = torch.empty(n, device="cuda", dtype=torch.float32)
-b = torch.empty_like(a)
-c = torch.empty_like(a)
-a.fill_(1.0); b.fill_(2.0); c.zero_(); torch.cuda.synchronize()
-def bench(name, fn, bytes_moved, iters=80, warmup=20):
-    for _ in range(warmup):
-        fn()
-    torch.cuda.synchronize()
-    t0 = time.perf_counter()
-    for _ in range(iters):
-        fn()
-    torch.cuda.synchronize()
-    dt = (time.perf_counter() - t0) / iters
-    print(f"{name}: {bytes_moved / dt / 1e12:.3f} TB/s ({dt * 1e6:.2f} us, bytes={bytes_moved})")
-bench("write_fill", lambda: a.fill_(3.0), n * 4)
-bench("copy_read_write", lambda: c.copy_(a), n * 4 * 2)
-bench("triad_2read_1write", lambda: torch.add(a, b, out=c), n * 4 * 3)
-bench("read_reduce", lambda: torch.sum(a), n * 4)
-print("buffer_bytes:", n * 4, "total_mem:", total, "free_mem_at_start:", free)
-PY
+bash "{{HUMANIZE_RUNTIME_ROOT}}/scripts/measure-device-bandwidth.sh" \
+  --docker <container> \
+  --workdir <container-lightop> \
+  --hip-visible-devices <idle-card> \
+  --output .humanize/lightop-agent/device-bandwidth.txt
 HIP_VISIBLE_DEVICES=<idle-card> python test/<family>/benchmark_<op>.py
 HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/<family>/benchmark_<op>.py
 ```
+For direct-host execution, omit `--docker` and pass `--workdir <lightop-root>`
+or run the script from the LightOp root.
 Do not report a performance number as actionable unless the device-selection
 gate and device bandwidth calibration were recorded, or the user explicitly
 accepts the missing evidence.
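The reporting rule can be expressed as a small evidence check. This is a sketch under assumptions: the helper names are hypothetical, and treating the `hy-smi`/`rocm-smi` capture (`device-status.txt`) as the device-selection record is an assumption made for illustration; only the calibration path is named directly by the workflow.

```python
import tempfile
from pathlib import Path
from typing import List

CALIBRATION = Path(".humanize/lightop-agent/device-bandwidth.txt")
# Assumption: the device-status capture stands in for the device-selection record.
DEVICE_STATUS = Path(
    ".humanize/lightop-agent/profile-artifacts/v000_baseline/device-status.txt"
)


def missing_evidence(root: Path) -> List[str]:
    """List required evidence files that are absent or empty under root."""
    missing = []
    for rel in (DEVICE_STATUS, CALIBRATION):
        path = root / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel.as_posix())
    return missing


def is_actionable(root: Path, user_accepts_missing: bool = False) -> bool:
    """A number is actionable only with full evidence or explicit user acceptance."""
    return user_accepts_missing or not missing_evidence(root)


# Demo in a throwaway directory: only the calibration file exists.
demo_root = Path(tempfile.mkdtemp())
(demo_root / CALIBRATION).parent.mkdir(parents=True)
(demo_root / CALIBRATION).write_text("device_bandwidth_calibration:\n", encoding="utf-8")
print(missing_evidence(demo_root))
print(is_actionable(demo_root))
```

The explicit `user_accepts_missing` flag keeps the override visible in the record instead of silently skipping the gate.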
@@ -450,11 +436,22 @@ PyTorch version. Do not switch to `setup_torch29.py`.
 PYTORCH_ROCM_ARCH='gfx928;gfx936;gfx938' python setup.py install
 ```
+If the selected execution environment is Docker, wrap the same command in the
+user-provided container path:
+```bash
+docker exec <container> bash -lc 'cd <container-lightop> && PYTORCH_ROCM_ARCH="gfx928;gfx936;gfx938" python setup.py install'
+```
 Keep the existing `build/` directory between attempts so incremental extension
 builds can reuse prior compilation output. Do not delete `build/` as part of
 the normal build/test/tune loop unless the user explicitly requests a clean
 build or the build cache is proven to be stale or corrupt.
+Commands such as `rm -rf build`, `python setup_torch29.py install`, or a host-side
+`python setup.py install` for a Docker-bound task violate this skill unless
+the user explicitly overrides the rule.
 After install, run an import smoke test in the same environment:
 ```bash
......
@@ -140,31 +140,11 @@ Load [examples.md](references/examples.md) for copyable command variants.
 Minimal first-pass capture from the LightOp root:
 ```bash
-mkdir -p .humanize/lightop-agent
-HIP_VISIBLE_DEVICES=<idle-card> python - <<'PY' 2>&1 | tee .humanize/lightop-agent/device-bandwidth.txt
-import time, torch
-torch.cuda.init()
-free, total = torch.cuda.mem_get_info()
-bytes_per_buf = max(16 << 20, min(512 << 20, int(free // 5)))
-n = bytes_per_buf // 4
-a = torch.empty(n, device="cuda", dtype=torch.float32)
-b = torch.empty_like(a)
-c = torch.empty_like(a)
-a.fill_(1.0); b.fill_(2.0); c.zero_(); torch.cuda.synchronize()
-def bench(name, fn, bytes_moved, iters=80, warmup=20):
-    for _ in range(warmup): fn()
-    torch.cuda.synchronize()
-    t0 = time.perf_counter()
-    for _ in range(iters): fn()
-    torch.cuda.synchronize()
-    dt = (time.perf_counter() - t0) / iters
-    print(f"{name}: {bytes_moved / dt / 1e12:.3f} TB/s ({dt * 1e6:.2f} us, bytes={bytes_moved})")
-bench("write_fill", lambda: a.fill_(3.0), n * 4)
-bench("copy_read_write", lambda: c.copy_(a), n * 4 * 2)
-bench("triad_2read_1write", lambda: torch.add(a, b, out=c), n * 4 * 3)
-bench("read_reduce", lambda: torch.sum(a), n * 4)
-print("buffer_bytes:", n * 4, "total_mem:", total, "free_mem_at_start:", free)
-PY
+bash <humanize-root>/scripts/measure-device-bandwidth.sh \
+  --docker <container> \
+  --workdir <container-lightop> \
+  --hip-visible-devices <idle-card> \
+  --output .humanize/lightop-agent/device-bandwidth.txt
 mkdir -p .humanize/lightop-agent/profile-artifacts/v000_baseline
 hy-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/device-status.txt || \
   rocm-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/device-status.txt
......