whlwhlwhl / Lightop-SKIILS
Commit 0fec721c
authored May 21, 2026 by whlwhlwhl
Add build constraints and refine the prompts
parent 2ad344b2
Changes: 5
Showing 5 changed files with 327 additions and 114 deletions (+327 -114)
README.md +50 -27
docs/lightop-skills.zh-CN.md +31 -33
humanize/scripts/measure-device-bandwidth.sh +215 -0
humanize/skills/humanize-kernel-agent-loop/SKILL.md +26 -29
humanize/skills/ncu-report/SKILL.md +5 -25
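README.md
View file @ 0fec721c
...
@@ -11,9 +11,6 @@ LightOp KernelPilot is a KernelPilot workflow adaptation for the LightOp DCU operator library
workload distribution, benchmark evidence, profile digest, attempt records, optimization records, and
iteration behind a review gate.
It removes NVIDIA-first assumptions such as Nsight Compute, CUTLASS/CuTe, PTX/SASS,
Blackwell/Hopper, TMA, WGMMA, and tcgen05. By default it targets DCU/ROCm/HIP/DTK environments.
## Skills
| Skill | Purpose |
...
@@ -25,6 +22,9 @@ Blackwell/Hopper, TMA, WGMMA, tcgen05, etc. By default it targets DCU/ROCm/HIP/DTK environments
The on-disk directory names `humanize-kernel-agent-loop` and `ncu-report` are kept for compatibility with the upstream
Humanize installer; the skill names actually exposed to the agent are the ones given by the frontmatter in the table above.
If you only want to read the Chinese descriptions of these three skills, see [`docs/lightop-skills.zh-CN.md`](docs/lightop-skills.zh-CN.md).
## Manual installation
If there is no Claude or Codex CLI installer, you can directly install these three LightOp/DCU
...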
@@ -66,32 +66,37 @@ Codex: ${CODEX_HOME:-~/.codex}/skills
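A clear request should ideally include: the operator name, a correctness reference, the workload, the execution environment, the target
DCU/gfx arch, the baseline, the benchmark method, and the success threshold.
Example:
```text
[$lightop-kernel-agent-loop] Add a fused rmsnorm + rope + fp8
kv-cache store operator to LightOp, targeting gfx936. For the correctness reference, use the PyTorch/native LightOp
composed path, covering the batch/token/head_dim shapes of Qwen decode. For the validation environment, use
the Docker container lightop-dtk; the repo path inside the container is /workspace/lightop.
Performance requirement: p50 latency 15% faster than the existing unfused path.
```
Example for optimizing an existing operator:
```text
[$lightop-kernel-agent-loop] Optimize lightop.moe_gemm_w8a8 on gfx938;
the target workload is DeepSeek EP8 decode. Keep the existing Python API unchanged and
compare against the current LightOp baseline; when the benchmark plateaus, continue the analysis with hipprof evidence.
```
If the host path and the container path differ, state both explicitly:
Prompt example:
```text
Host LightOp path: /public/wanghl6/lightop
Validation container: wanghl_lightop209
LightOp path inside the container: /home/lightop
All build, correctness test, benchmark, and profiling steps must run inside the container:
docker exec wanghl_lightop209 bash -lc 'cd /home/lightop && <command>'
@lightop-kernel-agent-loop
- Host LightOp path: /path/to/lightop
- Validation Docker: <lightop-container>
- LightOp path inside the container: /workspace/lightop
Task: add a 1-pass layer norm operator.
- Correctness reference: PyTorch layer_norm.
- Performance target: peak effective bandwidth of 1.1 TB/s.
- All build/test/benchmark/profile steps must run inside the container at /workspace/lightop.
- Use the command form:
docker exec <lightop-container> bash -lc 'cd /workspace/lightop && <command>'
- If shape, dtype, API, or whether to return mean/rstd is missing, first infer it from the existing LightOp layernorm/rmsnorm
tests and benchmarks; ask only if it cannot be inferred.
You must first create:
- .humanize/lightop-agent/refined-plan.md
- .humanize/lightop-agent/research-digest.md
- .humanize/lightop-agent/attempt-ledger.md
and start or record the Humanize loop state.
Authorization scope:
- Allowed: modify source, tests, benchmarks, and necessary config related to this task inside this LightOp repository.
- Allowed: create/write record files under .humanize/lightop-agent/.
- Allowed: run the docker exec commands above for build, test, benchmark, and profiling.
- Allowed: run python setup.py install and hipprof/rocprof.
- Not allowed: delete build/.
- Not allowed: git reset, cleaning the repository, force push, deleting large directories.
- Not required: git commit/push.
```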
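The request fields the README lists (operator name, correctness reference, workload, execution environment, target arch, baseline, benchmark method, success threshold) can be checked mechanically before a loop is started. The following is a minimal sketch, not part of the repository: the `KernelRequest` class, its field names, and `missing_fields` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class KernelRequest:
    """Hypothetical container for the fields a clear request should carry."""
    operator: Optional[str] = None
    correctness_reference: Optional[str] = None
    workload: Optional[str] = None
    execution_environment: Optional[str] = None
    target_arch: Optional[str] = None
    baseline: Optional[str] = None
    benchmark_method: Optional[str] = None
    success_threshold: Optional[str] = None


def missing_fields(request: KernelRequest) -> List[str]:
    """Return the names of fields that are unset or blank, in declaration order."""
    return [f.name for f in fields(request)
            if not (getattr(request, f.name) or "").strip()]


# Values taken from the layer norm prompt example; the rest are left unset.
request = KernelRequest(
    operator="1-pass layer norm",
    correctness_reference="PyTorch layer_norm",
    execution_environment="docker exec <lightop-container>",
    success_threshold="peak effective bandwidth 1.1 TB/s",
)
print(missing_fields(request))
```

Fields reported as missing are the ones the prompt asks the agent to infer from existing tests and benchmarks before asking the user.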
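## LightOp integration points
...
@@ -125,6 +130,12 @@ LightOp KernelPilot's build rule is fixed as:
python setup.py install
```
If the user specifies a Docker container, the build must also run inside the container:
```bash
docker exec <container> bash -lc 'cd <container-lightop> && python setup.py install'
```
Regardless of the PyTorch version, do not switch to `setup_torch29.py`. During the normal tuning loop, do not delete `build/` either, so that incremental build results can be reused; clean it only when the user explicitly asks for a clean build or the cache is proven
corrupt.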
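The build rule above can be captured in a small helper that always emits `python setup.py install` and wraps it in `docker exec ... bash -lc` when a container is named. This is a sketch under assumptions: `install_command` and `wrap_for_environment` are hypothetical helpers, and running the direct-host case through `bash -lc` is a choice made here for symmetry, not something the README prescribes.

```python
import shlex
from typing import List, Optional, Sequence


def install_command(arches: Sequence[str] = ()) -> str:
    """Return the fixed install command, optionally prefixed with PYTORCH_ROCM_ARCH."""
    cmd = "python setup.py install"
    if arches:
        cmd = f"PYTORCH_ROCM_ARCH={shlex.quote(';'.join(arches))} {cmd}"
    return cmd


def wrap_for_environment(command: str, container: Optional[str] = None,
                         workdir: Optional[str] = None) -> List[str]:
    """Return an argv list; with a container, wrap the command in docker exec."""
    if container is None:
        return ["bash", "-lc", command]  # direct-host form (assumption)
    if not workdir:
        raise ValueError("a container workdir is required when a container is named")
    inner = f"cd {shlex.quote(workdir)} && {command}"
    return ["docker", "exec", container, "bash", "-lc", inner]


argv = wrap_for_environment(
    install_command(["gfx928", "gfx936", "gfx938"]),
    container="lightop-dtk",
    workdir="/workspace/lightop",
)
print(argv[-1])
```

Returning an argv list rather than one shell string keeps the outer quoting out of the picture; only the inner `bash -lc` string needs `shlex.quote`.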
...
@@ -141,6 +152,11 @@ https://developer.sourcefind.cn/gitbook//dcu_developer/DeveloperGuide/dcu_progra
```bash
cd /path/to/lightop
bash /path/to/lightop-skiils/humanize/scripts/measure-device-bandwidth.sh \
  --docker <container> \
  --workdir <container-lightop> \
  --hip-visible-devices <idle-card> \
  --output .humanize/lightop-agent/device-bandwidth.txt
mkdir -p .humanize/lightop-agent/profile-artifacts/v000_baseline
python test/test_<op>.py 2>&1 \
  | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/benchmark.log
...
@@ -148,6 +164,13 @@ hipprof python test/test_<op>.py 2>&1 \
  | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/hipprof.txt
```
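Before starting any optimization, first measure the actual read/write/copy bandwidth of the currently selected card; it is the reference for the operator's later effective-bandwidth targets.
Inside the optimization loop, every candidate that passes correctness must, after the regular benchmark, get an additional `hipprof --pmc` run to inspect cache behavior, LDS/bank conflicts, and occupancy/resource
pressure, and the result must be turned into a concrete next kernel edit. If the current DTK does not support a counter,
record the actual command and its error message; do not substitute a guess.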
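The "record the actual command and error" requirement can be met by a thin wrapper that logs every invocation verbatim. The sketch below is an illustration only: `run_and_record` and the log file name are hypothetical, and the demo command is a Python stand-in that fails on purpose, not a real `hipprof` call.

```python
import datetime
import pathlib
import shlex
import subprocess
import sys
import tempfile


def run_and_record(argv, log_path):
    """Run argv and append the exact command, exit code, and output to log_path."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True)
        code, out, err = proc.returncode, proc.stdout, proc.stderr
    except FileNotFoundError as exc:
        # The tool itself is missing: record that fact instead of guessing a result.
        code, out, err = 127, "", str(exc)
    stamp = datetime.datetime.now(datetime.timezone.utc).isoformat(timespec="seconds")
    entry = "\n".join([
        f"timestamp_utc: {stamp}",
        f"command: {shlex.join(argv)}",
        f"exit_code: {code}",
        "stdout:", out.rstrip(),
        "stderr:", err.rstrip(),
    ]) + "\n\n"
    path = pathlib.Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(entry)
    return code


# Stand-in for a profiler run that rejects a counter (not a real hipprof call).
fake_failure = [sys.executable, "-c",
                "import sys; sys.stderr.write('counter not supported\\n'); sys.exit(3)"]
with tempfile.TemporaryDirectory() as tmp:
    log_file = pathlib.Path(tmp) / "v001" / "pmc-attempts.log"
    rc = run_and_record(fake_failure, log_file)
    log_text = log_file.read_text(encoding="utf-8")
print(rc)
```

In a real loop the argv would be the profiler command actually used for that candidate, and the log would live under the candidate's directory in `.humanize/lightop-agent/profile-artifacts/`.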
When `hipprof` and the benchmark log are not enough to explain the problem, you can additionally use:
```text
...
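docs/lightop-skills.zh-CN.md
View file @ 0fec721c
...
@@ -70,6 +70,11 @@ build, correctness test, benchmark, and profiling must be done in the same environment
docker exec <container> bash -lc 'cd /path/in/container/lightop && <command>'
```
If the user specifies a Docker container, build/install must also use this form; do not compile directly on the host.
The install command inside the container is always `python setup.py install`, optionally prefixed with `PYTORCH_ROCM_ARCH=...`; do not use `setup_torch29.py`, and do not delete `build/` during the normal tuning loop.
Before starting work, record:
- The container name or image name; when Docker is not used, record `direct-host`.
...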
@@ -169,33 +174,16 @@ HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/<family>/benchmark_<op>.py
a sanity baseline for the current card under its current load, used to judge whether the operator's effective-bandwidth target is reasonable.
```bash
mkdir -p .humanize/lightop-agent
HIP_VISIBLE_DEVICES=<idle-card> python - <<'PY' 2>&1 | tee .humanize/lightop-agent/device-bandwidth.txt
import time, torch
torch.cuda.init()
free, total = torch.cuda.mem_get_info()
bytes_per_buf = max(16 << 20, min(512 << 20, int(free // 5)))
n = bytes_per_buf // 4
a = torch.empty(n, device="cuda", dtype=torch.float32)
b = torch.empty_like(a)
c = torch.empty_like(a)
a.fill_(1.0); b.fill_(2.0); c.zero_(); torch.cuda.synchronize()
def bench(name, fn, bytes_moved, iters=80, warmup=20):
    for _ in range(warmup): fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters): fn()
    torch.cuda.synchronize()
    dt = (time.perf_counter() - t0) / iters
    print(f"{name}: {bytes_moved / dt / 1e12:.3f} TB/s ({dt * 1e6:.2f} us, bytes={bytes_moved})")
bench("write_fill", lambda: a.fill_(3.0), n * 4)
bench("copy_read_write", lambda: c.copy_(a), n * 4 * 2)
bench("triad_2read_1write", lambda: torch.add(a, b, out=c), n * 4 * 3)
bench("read_reduce", lambda: torch.sum(a), n * 4)
print("buffer_bytes:", n * 4, "total_mem:", total, "free_mem_at_start:", free)
PY
bash /path/to/lightop-skiils/humanize/scripts/measure-device-bandwidth.sh \
  --docker <container> \
  --workdir <container-lightop> \
  --hip-visible-devices <idle-card> \
  --output .humanize/lightop-agent/device-bandwidth.txt
```
If Docker is not used, drop `--docker` and pass `--workdir <lightop-root>` to point at the host
LightOp path, or run the script directly from the LightOp root.
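To make the role of the calibration concrete, the arithmetic behind "effective bandwidth" and the plausibility check against the measured card can be sketched as follows. All numbers are hypothetical, and the traffic model (one read of the input plus one write of the output, ignoring weight/bias and mean/rstd) is an assumption for a 1-pass layer norm, not a figure from the repository.

```python
def effective_bandwidth_tbps(bytes_moved: int, seconds: float) -> float:
    """Bytes the kernel must move divided by measured time, in TB/s (1e12 bytes/s)."""
    return bytes_moved / seconds / 1e12


def layer_norm_min_bytes(rows: int, cols: int, elem_size: int) -> int:
    """Assumed traffic model: read the input once and write the output once."""
    return 2 * rows * cols * elem_size


def target_is_plausible(target_tbps: float, measured_tbps: float) -> bool:
    """A target above what the card sustained during calibration is not plausible."""
    return target_tbps <= measured_tbps


# Hypothetical numbers, for illustration only.
measured_copy_tbps = 1.25  # e.g. the copy_read_write result in device-bandwidth.txt
bytes_moved = layer_norm_min_bytes(rows=4096, cols=8192, elem_size=2)
achieved = effective_bandwidth_tbps(bytes_moved, seconds=150e-6)
print(f"achieved {achieved:.3f} TB/s, "
      f"{achieved / measured_copy_tbps:.0%} of measured copy bandwidth")
print("1.1 TB/s target plausible:", target_is_plausible(1.1, measured_copy_tbps))
```

If the calibration on the selected card came in below the requested target, the target itself should be questioned before more tuning effort is spent.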
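### Workflow
Stage 1: inspect and plan.
...
@@ -284,10 +272,20 @@ LightOp build always uses:
PYTORCH_ROCM_ARCH='gfx928;gfx936;gfx938' python setup.py install
```
If the execution environment is Docker, the command must be wrapped in the user-specified container path:
```bash
docker exec <container> bash -lc 'cd <container-lightop> && PYTORCH_ROCM_ARCH="gfx928;gfx936;gfx938" python setup.py install'
```
Regardless of the PyTorch version, do not switch to `setup_torch29.py`. During the normal tuning loop, do not delete `build/`, so that incremental build results can be reused. Clean it only when the user explicitly asks for a clean build or the build
cache is proven corrupt.
In other words, in an ordinary LightOp tuning task do not run `rm -rf build`, do not run `python setup_torch29.py install`, and for a Docker task do not run `python setup.py install` directly on the host.
After installing, run an import smoke test:
```bash
...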
@@ -579,17 +577,17 @@ hipprof --codeobj-analyze <binary-or-so> \
```text
@lightop-kernel-agent-loop
Host LightOp path: /public/home/wanghl6/pr/lightop
Validation Docker: wanghl_lightop209
LightOp path inside the container: /home/pr/lightop
Host LightOp path: /path/to/lightop
Validation Docker: <lightop-container>
LightOp path inside the container: /workspace/lightop
Task: add the <operator> operator to LightOp.
Correctness reference: <PyTorch/native LightOp reference>
Performance target: <latency or effective bandwidth for the target shape/dtype>
Requirements:
- All build/test/benchmark/profile steps must run inside the container at /home/pr/lightop.
- Use the command: docker exec wanghl_lightop209 bash -lc 'cd /home/pr/lightop && <command>'
- All build/test/benchmark/profile steps must run inside the container at /workspace/lightop.
- Use the command: docker exec <lightop-container> bash -lc 'cd /workspace/lightop && <command>'
- Follow the existing LightOp wrapper, export.cpp, csrc, test, and benchmark style.
- If the first version is correct but misses the performance target, move on to profiling/tuning; do not stop.
- Before finishing, report the actual commands and results for build, smoke test, correctness test, and benchmark.
...
@@ -599,9 +597,9 @@ hipprof --codeobj-analyze <binary-or-so> \
```text
@lightop-kernel-agent-loop
Host LightOp path: /public/home/wanghl6/pr/lightop
Validation Docker: wanghl_lightop209
LightOp path inside the container: /home/pr/lightop
Host LightOp path: /path/to/lightop
Validation Docker: <lightop-container>
LightOp path inside the container: /workspace/lightop
Task: optimize the existing operator <operator>.
Target source: lightop/csrc/<family>/<file>.cu
...
humanize/scripts/measure-device-bandwidth.sh
new file, mode 0 → 100644
View file @ 0fec721c
#!/usr/bin/env bash
#
# Measure selected-device memory bandwidth before LightOp tuning.
#
# Direct mode:
#   bash measure-device-bandwidth.sh --output .humanize/lightop-agent/device-bandwidth.txt --hip-visible-devices 0
#
# Docker mode:
#   bash measure-device-bandwidth.sh --docker wanghl_lightop209 --workdir /home/lightop \
#     --output .humanize/lightop-agent/device-bandwidth.txt --hip-visible-devices 0

set -euo pipefail

CONTAINER=""
WORKDIR=""
OUTPUT=".humanize/lightop-agent/device-bandwidth.txt"
HIP_VISIBLE=""
MAX_MIB="512"
MIN_MIB="16"
ITERS="80"
WARMUP="20"
DTYPE="float32"
PYTHON_BIN="${PYTHON:-python}"

usage() {
  sed -n '2,10p' "$0" >&2
  cat >&2 <<'EOF'
Options:
  --docker <container>         Run the measurement inside this Docker container.
  --workdir <path>             Container or local LightOp root. Required with --docker.
  --output <path>              Output path relative to workdir/cwd unless absolute.
  --hip-visible-devices <id>   Value for HIP_VISIBLE_DEVICES during measurement.
  --max-mib <n>                Maximum bytes per buffer in MiB. Default: 512.
  --min-mib <n>                Minimum bytes per buffer in MiB. Default: 16.
  --iters <n>                  Timed iterations. Default: 80.
  --warmup <n>                 Warmup iterations. Default: 20.
  --dtype <torch dtype>        float32, float16, bfloat16. Default: float32.
EOF
}

shell_quote() {
  printf '%q' "$1"
}

emit_python() {
  cat <<'PY'
import argparse
import datetime as _dt
import os
import platform
import sys
import time

import torch


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--output", default=".humanize/lightop-agent/device-bandwidth.txt")
    parser.add_argument("--max-mib", type=int, default=512)
    parser.add_argument("--min-mib", type=int, default=16)
    parser.add_argument("--iters", type=int, default=80)
    parser.add_argument("--warmup", type=int, default=20)
    parser.add_argument("--dtype", default="float32")
    return parser.parse_args()


def dtype_from_name(name):
    mapping = {
        "float32": torch.float32,
        "fp32": torch.float32,
        "float": torch.float32,
        "float16": torch.float16,
        "fp16": torch.float16,
        "half": torch.float16,
        "bfloat16": torch.bfloat16,
        "bf16": torch.bfloat16,
    }
    if name not in mapping:
        raise SystemExit(f"unsupported dtype: {name}")
    return mapping[name]


def main():
    args = parse_args()
    if not torch.cuda.is_available():
        raise SystemExit("torch.cuda is not available in this environment")

    dtype = dtype_from_name(args.dtype)
    torch.cuda.init()
    device_index = torch.cuda.current_device()
    props = torch.cuda.get_device_properties(device_index)
    free, total = torch.cuda.mem_get_info()

    min_bytes = max(1, args.min_mib) << 20
    max_bytes = max(args.min_mib, args.max_mib) << 20
    bytes_per_buf = max(min_bytes, min(max_bytes, int(free // 5)))
    elem_size = torch.empty((), device="cuda", dtype=dtype).element_size()
    n = max(1, bytes_per_buf // elem_size)
    bytes_per_buf = n * elem_size

    a = torch.empty(n, device="cuda", dtype=dtype)
    b = torch.empty_like(a)
    c = torch.empty_like(a)
    a.fill_(1.0)
    b.fill_(2.0)
    c.zero_()
    torch.cuda.synchronize()

    def bench(name, fn, bytes_moved):
        for _ in range(args.warmup):
            fn()
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(args.iters):
            fn()
        torch.cuda.synchronize()
        seconds = (time.perf_counter() - t0) / args.iters
        tbps = bytes_moved / seconds / 1e12
        return name, tbps, seconds, bytes_moved

    rows = [
        bench("write_fill", lambda: a.fill_(3.0), bytes_per_buf),
        bench("copy_read_write", lambda: c.copy_(a), bytes_per_buf * 2),
        bench("triad_2read_1write", lambda: torch.add(a, b, out=c), bytes_per_buf * 3),
        bench("read_reduce", lambda: torch.sum(a), bytes_per_buf),
    ]

    lines = [
        "device_bandwidth_calibration:",
        f"  timestamp_utc: {_dt.datetime.utcnow().isoformat(timespec='seconds')}Z",
        f"  host: {platform.node()}",
        f"  cwd: {os.getcwd()}",
        f"  python: {sys.version.split()[0]}",
        f"  torch: {torch.__version__}",
        f"  hip: {getattr(torch.version, 'hip', None)}",
        f"  hip_visible_devices: {os.environ.get('HIP_VISIBLE_DEVICES', '')}",
        f"  device_index: {device_index}",
        f"  device_name: {torch.cuda.get_device_name(device_index)}",
        f"  gcn_arch: {getattr(props, 'gcnArchName', '')}",
        f"  dtype: {args.dtype}",
        f"  buffer_bytes: {bytes_per_buf}",
        f"  total_mem_bytes: {total}",
        f"  free_mem_bytes_at_start: {free}",
        f"  warmup: {args.warmup}",
        f"  iters: {args.iters}",
        "  results:",
    ]
    for name, tbps, seconds, bytes_moved in rows:
        lines.extend([
            f"    {name}:",
            f"      tbps: {tbps:.6f}",
            f"      us_per_iter: {seconds * 1e6:.3f}",
            f"      bytes_moved: {bytes_moved}",
        ])

    text = "\n".join(lines) + "\n"
    print(text, end="")
    if args.output:
        output_dir = os.path.dirname(os.path.abspath(args.output))
        os.makedirs(output_dir, exist_ok=True)
        with open(args.output, "w", encoding="utf-8") as fh:
            fh.write(text)


if __name__ == "__main__":
    main()
PY
}

while [[ $# -gt 0 ]]; do
  case "$1" in
    --docker) CONTAINER="$2"; shift 2 ;;
    --workdir) WORKDIR="$2"; shift 2 ;;
    --output) OUTPUT="$2"; shift 2 ;;
    --hip-visible-devices) HIP_VISIBLE="$2"; shift 2 ;;
    --max-mib) MAX_MIB="$2"; shift 2 ;;
    --min-mib) MIN_MIB="$2"; shift 2 ;;
    --iters) ITERS="$2"; shift 2 ;;
    --warmup) WARMUP="$2"; shift 2 ;;
    --dtype) DTYPE="$2"; shift 2 ;;
    -h|--help) usage; exit 0 ;;
    *) echo "Error: unknown argument: $1" >&2; usage; exit 1 ;;
  esac
done

if [[ -n "$CONTAINER" ]]; then
  if [[ -z "$WORKDIR" ]]; then
    echo "Error: --workdir is required with --docker" >&2
    exit 1
  fi
  q_workdir="$(shell_quote "$WORKDIR")"
  q_output="$(shell_quote "$OUTPUT")"
  q_max="$(shell_quote "$MAX_MIB")"
  q_min="$(shell_quote "$MIN_MIB")"
  q_iters="$(shell_quote "$ITERS")"
  q_warmup="$(shell_quote "$WARMUP")"
  q_dtype="$(shell_quote "$DTYPE")"
  hip_prefix=""
  if [[ -n "$HIP_VISIBLE" ]]; then
    hip_prefix="HIP_VISIBLE_DEVICES=$(shell_quote "$HIP_VISIBLE") "
  fi
  inner="cd $q_workdir && mkdir -p .humanize/lightop-agent && ${hip_prefix}$PYTHON_BIN - --output $q_output --max-mib $q_max --min-mib $q_min --iters $q_iters --warmup $q_warmup --dtype $q_dtype"
  emit_python | docker exec -i "$CONTAINER" bash -lc "$inner"
else
  if [[ -n "$WORKDIR" ]]; then
    cd "$WORKDIR"
  fi
  mkdir -p "$(dirname "$OUTPUT")"
  if [[ -n "$HIP_VISIBLE" ]]; then
    export HIP_VISIBLE_DEVICES="$HIP_VISIBLE"
  fi
  emit_python | "$PYTHON_BIN" - \
    --output "$OUTPUT" \
    --max-mib "$MAX_MIB" \
    --min-mib "$MIN_MIB" \
    --iters "$ITERS" \
    --warmup "$WARMUP" \
    --dtype "$DTYPE"
fi
humanize/skills/humanize-kernel-agent-loop/SKILL.md
View file @ 0fec721c
...
@@ -70,6 +70,13 @@ logs and repeat the same command:
docker exec <container> bash -lc 'cd /path/in/container/lightop && <command>'
```
When Docker is the named execution environment, the LightOp install/build step
must also be run through that exact `docker exec` shape. Do not build on the
host for that task. The install command inside the container is always
`python setup.py install` (optionally prefixed by `PYTORCH_ROCM_ARCH=...`);
never use `setup_torch29.py`, and never remove `build/` as part of the normal
iteration loop.
If the user provides an image rather than a running container, ask for or infer
the host LightOp path and run with DCU devices exposed:
...
...
@@ -221,39 +228,18 @@ Example:
```bash
hy-smi || rocm-smi
mkdir -p .humanize/lightop-agent
HIP_VISIBLE_DEVICES=<idle-card> python - <<'PY' 2>&1 | tee .humanize/lightop-agent/device-bandwidth.txt
import time, torch
torch.cuda.init()
free, total = torch.cuda.mem_get_info()
bytes_per_buf = max(16 << 20, min(512 << 20, int(free // 5)))
n = bytes_per_buf // 4
a = torch.empty(n, device="cuda", dtype=torch.float32)
b = torch.empty_like(a)
c = torch.empty_like(a)
a.fill_(1.0); b.fill_(2.0); c.zero_(); torch.cuda.synchronize()
def bench(name, fn, bytes_moved, iters=80, warmup=20):
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    dt = (time.perf_counter() - t0) / iters
    print(f"{name}: {bytes_moved / dt / 1e12:.3f} TB/s ({dt * 1e6:.2f} us, bytes={bytes_moved})")
bench("write_fill", lambda: a.fill_(3.0), n * 4)
bench("copy_read_write", lambda: c.copy_(a), n * 4 * 2)
bench("triad_2read_1write", lambda: torch.add(a, b, out=c), n * 4 * 3)
bench("read_reduce", lambda: torch.sum(a), n * 4)
print("buffer_bytes:", n * 4, "total_mem:", total, "free_mem_at_start:", free)
PY
bash "{{HUMANIZE_RUNTIME_ROOT}}/scripts/measure-device-bandwidth.sh" \
  --docker <container> \
  --workdir <container-lightop> \
  --hip-visible-devices <idle-card> \
  --output .humanize/lightop-agent/device-bandwidth.txt
HIP_VISIBLE_DEVICES=<idle-card> python test/<family>/benchmark_<op>.py
HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/<family>/benchmark_<op>.py
```
For direct-host execution, omit `--docker` and pass `--workdir <lightop-root>`
or run the script from the LightOp root.
Do not report a performance number as actionable unless the device-selection
gate and device bandwidth calibration were recorded, or the user explicitly
accepts the missing evidence.
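One way to enforce the calibration half of this gate is to parse the file written by `measure-device-bandwidth.sh` and confirm that all four measurements are present. The parser below is a sketch that assumes the `results:` / `<name>:` / `tbps:` layout emitted by the script; it ignores indentation on purpose, and `parse_calibration`, `calibration_recorded`, and the sample numbers are hypothetical.

```python
REQUIRED = {"write_fill", "copy_read_write", "triad_2read_1write", "read_reduce"}


def parse_calibration(text):
    """Return {benchmark name: TB/s} from the text of device-bandwidth.txt."""
    results = {}
    in_results = False
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "results:":
            in_results = True
            continue
        if not in_results:
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        if not value:
            current = key  # a bare "name:" line starts a new result entry
        elif key == "tbps" and current is not None:
            results[current] = float(value)
    return results


def calibration_recorded(text):
    """True only if every expected measurement has a tbps value."""
    return REQUIRED.issubset(parse_calibration(text))


# Made-up sample in the script's output layout; the numbers are not measurements.
sample = """device_bandwidth_calibration:
  dtype: float32
  results:
    write_fill:
      tbps: 0.900000
    copy_read_write:
      tbps: 1.000000
    triad_2read_1write:
      tbps: 1.050000
    read_reduce:
      tbps: 0.950000
"""
print(parse_calibration(sample))
print(calibration_recorded(sample))
```

The device-selection half of the gate (which card was idle and chosen) still has to be recorded separately; this check covers only the bandwidth file.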
...
...
@@ -450,11 +436,22 @@ PyTorch version. Do not switch to `setup_torch29.py`.
PYTORCH_ROCM_ARCH='gfx928;gfx936;gfx938' python setup.py install
```
If the selected execution environment is Docker, wrap the same command in the
user-provided container path:
```bash
docker exec <container> bash -lc 'cd <container-lightop> && PYTORCH_ROCM_ARCH="gfx928;gfx936;gfx938" python setup.py install'
```
Keep the existing `build/` directory between attempts so incremental extension
builds can reuse prior compilation output. Do not delete `build/` as part of
the normal build/test/tune loop unless the user explicitly requests a clean
build or the build cache is proven to be stale or corrupt.
Commands such as `rm -rf build`, `python setup_torch29.py install`, or a host-side
`python setup.py install` for a Docker-bound task violate this skill unless
the user explicitly overrides the rule.
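The three violations named above lend themselves to a simple pre-flight check on a command string before it is run. This is a heuristic sketch, not part of the skill: `violations` and its regular expressions are hypothetical, they only recognize the literal forms quoted in the text, and an explicit user override would have to be handled by the caller.

```python
import re
from typing import List

RM_BUILD = re.compile(r"\brm\s+-[a-zA-Z]*r[a-zA-Z]*\s+(\./)?build/?(\s|$)")
TORCH29 = re.compile(r"setup_torch29\.py")
INSTALL = re.compile(r"\bsetup\.py\s+install\b")


def violations(command: str, docker_bound: bool) -> List[str]:
    """Return the build rules a command string appears to break."""
    found = []
    if RM_BUILD.search(command):
        found.append("deletes build/")
    if TORCH29.search(command):
        found.append("uses setup_torch29.py")
    on_host = not command.lstrip().startswith("docker exec")
    if docker_bound and on_host and INSTALL.search(command):
        found.append("host-side install for a Docker-bound task")
    return found


ok = "docker exec c bash -lc 'cd /workspace/lightop && python setup.py install'"
print(violations(ok, docker_bound=True))
print(violations("rm -rf build && python setup_torch29.py install", docker_bound=False))
print(violations("python setup.py install", docker_bound=True))
```

A string check like this cannot see what a script does internally, so it complements the written rule rather than replacing it.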
After install, run an import smoke test in the same environment:
```
bash
...
...
humanize/skills/ncu-report/SKILL.md
View file @ 0fec721c
...
@@ -140,31 +140,11 @@ Load [examples.md](references/examples.md) for copyable command variants.
Minimal first-pass capture from the LightOp root:
```bash
mkdir -p .humanize/lightop-agent
HIP_VISIBLE_DEVICES=<idle-card> python - <<'PY' 2>&1 | tee .humanize/lightop-agent/device-bandwidth.txt
import time, torch
torch.cuda.init()
free, total = torch.cuda.mem_get_info()
bytes_per_buf = max(16 << 20, min(512 << 20, int(free // 5)))
n = bytes_per_buf // 4
a = torch.empty(n, device="cuda", dtype=torch.float32)
b = torch.empty_like(a)
c = torch.empty_like(a)
a.fill_(1.0); b.fill_(2.0); c.zero_(); torch.cuda.synchronize()
def bench(name, fn, bytes_moved, iters=80, warmup=20):
    for _ in range(warmup): fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters): fn()
    torch.cuda.synchronize()
    dt = (time.perf_counter() - t0) / iters
    print(f"{name}: {bytes_moved / dt / 1e12:.3f} TB/s ({dt * 1e6:.2f} us, bytes={bytes_moved})")
bench("write_fill", lambda: a.fill_(3.0), n * 4)
bench("copy_read_write", lambda: c.copy_(a), n * 4 * 2)
bench("triad_2read_1write", lambda: torch.add(a, b, out=c), n * 4 * 3)
bench("read_reduce", lambda: torch.sum(a), n * 4)
print("buffer_bytes:", n * 4, "total_mem:", total, "free_mem_at_start:", free)
PY
bash <humanize-root>/scripts/measure-device-bandwidth.sh \
  --docker <container> \
  --workdir <container-lightop> \
  --hip-visible-devices <idle-card> \
  --output .humanize/lightop-agent/device-bandwidth.txt
mkdir -p .humanize/lightop-agent/profile-artifacts/v000_baseline
hy-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/device-status.txt || \
  rocm-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/device-status.txt
...