Commit 0fec721c authored by whlwhlwhl

Add build constraints and refine the prompt

parent 2ad344b2
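......@@ -11,9 +11,6 @@ LightOp KernelPilot is a KernelPilot workflow adaptation for the LightOp DCU operator library
workload distribution, benchmark evidence, profile digest, attempt records, optimization records, and
iteration with a review gate.
It removes NVIDIA-first assumptions such as Nsight Compute, CUTLASS/CuTe, PTX/SASS,
Blackwell/Hopper, TMA, WGMMA, and tcgen05. By default it targets DCU/ROCm/HIP/DTK environments.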
## Skills
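| Skill | Purpose |
......@@ -25,6 +22,9 @@ Blackwell/Hopper, TMA, WGMMA, tcgen05, and so on. By default it targets DCU/ROCm/HIP/DTK environments
The on-disk directory names `humanize-kernel-agent-loop` and `ncu-report` are kept for compatibility with the upstream
Humanize installer; the skill names actually exposed to the agent are the ones in the frontmatter listed in the table above.
If you only want to read the Chinese descriptions of these three skills, see
[`docs/lightop-skills.zh-CN.md`](docs/lightop-skills.zh-CN.md).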
## Manual installation
If you do not have the Claude or Codex CLI installer, you can directly install these three LightOp/DCU
......@@ -66,32 +66,37 @@ Codex: ${CODEX_HOME:-~/.codex}/skills
A clear request should include: operator name, correctness reference, workload, execution environment, target
DCU/gfx arch, baseline, benchmark method, and success threshold.
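The checklist above can be turned into a quick pre-flight check before a prompt is sent. The following is a minimal sketch, not part of the repository: the field names in `REQUIRED_FIELDS` and the helper `missing_fields` are hypothetical labels for the items listed in the prose.

```python
# Hypothetical field names: the README lists these items only in prose.
REQUIRED_FIELDS = (
    "operator",
    "correctness_reference",
    "workload",
    "execution_environment",
    "target_arch",
    "baseline",
    "benchmark_method",
    "success_threshold",
)


def missing_fields(request: dict) -> list:
    """Return the checklist items that are absent or blank, in checklist order."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = request.get(name) or ""
        if not str(value).strip():
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Illustrative request; the values echo the README's first example prompt.
    request = {
        "operator": "fused rmsnorm + rope + fp8 kv-cache store",
        "correctness_reference": "PyTorch/native LightOp composed path",
        "workload": "Qwen decode batch/token/head_dim shapes",
        "execution_environment": "docker: lightop-dtk, /workspace/lightop",
        "target_arch": "gfx936",
        "success_threshold": "p50 latency 15% faster than the unfused path",
    }
    print("missing:", missing_fields(request))
```

A request that leaves `baseline` or `benchmark_method` blank would be flagged here, which is the point at which the agent would otherwise have to infer or ask.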
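Example:
```text
[$lightop-kernel-agent-loop] Add a fused rmsnorm + rope + fp8
kv-cache store operator to LightOp, targeting gfx936. Use the PyTorch/native LightOp
composed path as the correctness reference, covering the batch/token/head_dim shapes of Qwen decode.
Use the Docker container lightop-dtk as the validation environment; the repo path inside the container is /workspace/lightop.
Performance requirement: p50 latency 15% faster than the existing unfused path.
```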
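Example of optimizing an existing operator:
```text
[$lightop-kernel-agent-loop] Optimize lightop.moe_gemm_w8a8 on gfx938;
the target workload is DeepSeek EP8 decode. Keep the existing Python API unchanged and
compare against the current LightOp baseline; when the benchmark plateaus, continue the analysis with hipprof evidence.
```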
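If the host path and the container path differ, state them explicitly:
Prompt example:
```text
Host LightOp path: /public/wanghl6/lightop
Validation container: wanghl_lightop209
LightOp path inside the container: /home/lightop
All build, correctness test, benchmark, and profiling steps must run inside the container:
docker exec wanghl_lightop209 bash -lc 'cd /home/lightop && <command>'
@lightop-kernel-agent-loop
- Host LightOp path: /path/to/lightop
- Validation Docker: <lightop-container>
- LightOp path inside the container: /workspace/lightop
Task: add a 1-pass layer norm operator.
- Correctness reference: PyTorch layer_norm.
- Performance target: peak effective bandwidth of 1.1 TB/s.
- All build/test/benchmark/profile steps must run inside the container at /workspace/lightop.
- Use this command format:
docker exec <lightop-container> bash -lc 'cd /workspace/lightop && <command>'
- If shape, dtype, API, or whether to return mean/rstd is missing, first infer it from the existing LightOp layernorm/rmsnorm
tests and benchmarks; ask only if it cannot be inferred.
You must first create:
- .humanize/lightop-agent/refined-plan.md
- .humanize/lightop-agent/research-digest.md
- .humanize/lightop-agent/attempt-ledger.md
and start or record the Humanize loop state.
Authorized scope:
- May modify source, tests, benchmarks, and necessary config in this LightOp repository that relate to this task.
- May create and write record files under .humanize/lightop-agent/.
- May run the docker exec commands above for build, test, benchmark, and profiling.
- May run python setup.py install and hipprof/rocprof.
- Must not delete build/.
- Must not git reset, clean the repository, force push, or delete large directories.
- No git commit/push is needed.
```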
## LightOp integration points
......@@ -125,6 +130,12 @@ The LightOp KernelPilot build rule is fixed as:
python setup.py install
```
If the user specifies a Docker container, the build must also run inside that container:
```bash
docker exec <container> bash -lc 'cd <container-lightop> && python setup.py install'
```
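Regardless of the PyTorch version, never switch to `setup_torch29.py`. Do not delete
`build/` during the normal tuning loop either, so that incremental build results can be reused; clean it only when
the user explicitly asks for a clean build or the cache is proven corrupt.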
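For tooling that needs to assemble the container build command programmatically, a minimal sketch follows. The helper name `docker_build_command` and its parameters are hypothetical; only the `docker exec ... bash -lc 'cd ... && python setup.py install'` shape and the optional `PYTORCH_ROCM_ARCH` prefix come from this document.

```python
import shlex


def docker_build_command(container, container_lightop, rocm_arch=None):
    """Build the argv list for the fixed LightOp install step inside a container.

    Hypothetical helper: it only reproduces the command shape shown above and
    never emits setup_torch29.py or any build/ cleanup.
    """
    build = "python setup.py install"
    if rocm_arch:
        # Quote the arch list because it contains ';', which the shell would split on.
        build = f"PYTORCH_ROCM_ARCH={shlex.quote(rocm_arch)} {build}"
    inner = f"cd {shlex.quote(container_lightop)} && {build}"
    return ["docker", "exec", container, "bash", "-lc", inner]


if __name__ == "__main__":
    argv = docker_build_command(
        "<lightop-container>", "/workspace/lightop", "gfx928;gfx936;gfx938"
    )
    # shlex.join renders the list as a copy-pasteable shell line for the ledger.
    print(shlex.join(argv))
```

Returning an argv list rather than a single string keeps the outer `docker exec` call free of a second layer of shell quoting when it is passed to `subprocess.run`.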
......@@ -141,6 +152,11 @@ https://developer.sourcefind.cn/gitbook//dcu_developer/DeveloperGuide/dcu_progra
```bash
cd /path/to/lightop
bash /path/to/lightop-skiils/humanize/scripts/measure-device-bandwidth.sh \
--docker <container> \
--workdir <container-lightop> \
--hip-visible-devices <idle-card> \
--output .humanize/lightop-agent/device-bandwidth.txt
mkdir -p .humanize/lightop-agent/profile-artifacts/v000_baseline
python test/test_<op>.py 2>&1 \
| tee .humanize/lightop-agent/profile-artifacts/v000_baseline/benchmark.log
......@@ -148,6 +164,13 @@ hipprof python test/test_<op>.py 2>&1 \
| tee .humanize/lightop-agent/profile-artifacts/v000_baseline/hipprof.txt
```
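Before optimizing, you must first measure the actual read/write/copy bandwidth of the currently selected card, as the reference for later effective-bandwidth targets.
In the optimization loop, every candidate that passes correctness gets an additional
`hipprof --pmc` run after the normal benchmark, to inspect cache behavior, LDS/bank conflicts, and occupancy/resource
pressure, and to turn the results into a concrete next kernel edit; if the current DTK does not support a counter,
record the actual command and the error message instead of guessing.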
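The "record the actual command and the error" rule can be enforced with a thin wrapper around the profiler call. This is a minimal sketch under stated assumptions: `run_and_record` and the log layout are hypothetical, and the demo runs a stand-in Python subprocess rather than `hipprof`, so no profiler flags are assumed.

```python
import shlex
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_record(cmd, log_path):
    """Run cmd and append the exact command, exit code, and output to log_path.

    Hypothetical helper: the entry format is illustrative, not a LightOp convention.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    entry = "\n".join([
        "command: " + shlex.join(cmd),
        f"exit_code: {proc.returncode}",
        "stdout:",
        proc.stdout.rstrip(),
        "stderr:",
        proc.stderr.rstrip(),
        "",
    ])
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(entry + "\n")
    return proc.returncode


if __name__ == "__main__":
    # Stand-in for a failing profiler run: prints an error and exits non-zero.
    fake = [sys.executable, "-c",
            "import sys; sys.stderr.write('counter not supported'); sys.exit(3)"]
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "attempt-ledger.txt"
        code = run_and_record(fake, log)
        print("exit code:", code)
        print(log.read_text(encoding="utf-8"))
```

In a real loop the `cmd` list would be the actual profiler invocation, and the appended entry is what gets cited in the attempt ledger when a counter is unavailable.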
When `hipprof` and the benchmark log are not enough to explain a problem, you can additionally use:
```text
......
......@@ -70,6 +70,11 @@ build, correctness test, benchmark, and profiling must all be done in the same environment
docker exec <container> bash -lc 'cd /path/in/container/lightop && <command>'
```
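If the user specifies a Docker container, build/install must also use this form; do not compile directly on the host.
The install command inside the container is always `python setup.py install`, optionally prefixed with
`PYTORCH_ROCM_ARCH=...`; do not use `setup_torch29.py`, and do not delete
`build/` during the normal tuning loop.
Record before starting work:
- Container name or image name; record `direct-host` when not using Docker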
......@@ -169,33 +174,16 @@ HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/<family>/benchmark_<op>.py
a sanity baseline for the current card under its current load, used to judge whether the operator's effective-bandwidth target is reasonable.
```bash
mkdir -p .humanize/lightop-agent
HIP_VISIBLE_DEVICES=<idle-card> python - <<'PY' 2>&1 | tee .humanize/lightop-agent/device-bandwidth.txt
import time, torch
torch.cuda.init()
free, total = torch.cuda.mem_get_info()
bytes_per_buf = max(16 << 20, min(512 << 20, int(free // 5)))
n = bytes_per_buf // 4
a = torch.empty(n, device="cuda", dtype=torch.float32)
b = torch.empty_like(a)
c = torch.empty_like(a)
a.fill_(1.0); b.fill_(2.0); c.zero_(); torch.cuda.synchronize()
def bench(name, fn, bytes_moved, iters=80, warmup=20):
    for _ in range(warmup): fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters): fn()
    torch.cuda.synchronize()
    dt = (time.perf_counter() - t0) / iters
    print(f"{name}: {bytes_moved / dt / 1e12:.3f} TB/s ({dt * 1e6:.2f} us, bytes={bytes_moved})")
bench("write_fill", lambda: a.fill_(3.0), n * 4)
bench("copy_read_write", lambda: c.copy_(a), n * 4 * 2)
bench("triad_2read_1write", lambda: torch.add(a, b, out=c), n * 4 * 3)
bench("read_reduce", lambda: torch.sum(a), n * 4)
print("buffer_bytes:", n * 4, "total_mem:", total, "free_mem_at_start:", free)
PY
bash /path/to/lightop-skiils/humanize/scripts/measure-device-bandwidth.sh \
--docker <container> \
--workdir <container-lightop> \
--hip-visible-devices <idle-card> \
--output .humanize/lightop-agent/device-bandwidth.txt
```
If you are not using Docker, drop `--docker` and use `--workdir <lightop-root>` to point at the host
LightOp path, or run the script directly from the LightOp root.
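To judge an operator target against the measured baseline, the same convention as the calibration script (bytes moved divided by seconds, in TB/s) can be applied to the operator itself. The sketch below is illustrative only: the helper names, the layer norm shape, the timing, and the baseline value are all hypothetical, and the byte count ignores the small gamma/beta reads.

```python
def effective_bandwidth_tbps(bytes_read, bytes_written, seconds):
    """Effective bandwidth in TB/s, counting every byte read and written once."""
    return (bytes_read + bytes_written) / seconds / 1e12


def fraction_of_device(effective_tbps, measured_tbps):
    """How much of the measured device bandwidth the operator achieves."""
    return effective_tbps / measured_tbps


if __name__ == "__main__":
    # Hypothetical fp16 layer norm over a (rows, hidden) tensor: read x, write y.
    rows, hidden, elem_size = 8192, 4096, 2
    tensor_bytes = rows * hidden * elem_size
    seconds = 150e-6          # hypothetical p50 latency
    measured_copy_tbps = 1.25  # hypothetical copy_read_write result from calibration

    eff = effective_bandwidth_tbps(tensor_bytes, tensor_bytes, seconds)
    print(f"effective: {eff:.3f} TB/s")
    print(f"fraction of measured copy bandwidth: {fraction_of_device(eff, measured_copy_tbps):.2%}")
```

If a requested target exceeds what the copy or triad measurement shows the card can sustain, the target rather than the kernel is the first thing to revisit.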
### Workflow
Stage 1: inspect and plan.
......@@ -284,10 +272,20 @@ The LightOp build always uses:
PYTORCH_ROCM_ARCH='gfx928;gfx936;gfx938' python setup.py install
```
If the execution environment is Docker, the command must be wrapped in the user-specified container and path:
```bash
docker exec <container> bash -lc 'cd <container-lightop> && PYTORCH_ROCM_ARCH="gfx928;gfx936;gfx938" python setup.py install'
```
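Regardless of the PyTorch version, never switch to `setup_torch29.py`. Do not delete
`build/` during the normal tuning loop, so that incremental build results can be reused. Clean it only when the user explicitly asks for a clean build or the build
cache is proven corrupt.
In other words, in an ordinary LightOp tuning task do not run `rm -rf build`, do not run
`python setup_torch29.py install`, and for Docker tasks do not run
`python setup.py install` directly on the host.
After installation, run an import smoke test: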
```bash
......@@ -579,17 +577,17 @@ hipprof --codeobj-analyze <binary-or-so> \
```text
@lightop-kernel-agent-loop
Host LightOp path: /public/home/wanghl6/pr/lightop
Validation Docker: wanghl_lightop209
LightOp path inside the container: /home/pr/lightop
Host LightOp path: /path/to/lightop
Validation Docker: <lightop-container>
LightOp path inside the container: /workspace/lightop
Task: add the <operator> operator to LightOp.
Correctness reference: <PyTorch/native LightOp reference>
Performance target: <latency or effective bandwidth for the target shape/dtype>
Requirements:
- All build/test/benchmark/profile steps must run inside the container at /home/pr/lightop.
- Use the command: docker exec wanghl_lightop209 bash -lc 'cd /home/pr/lightop && <command>'
- All build/test/benchmark/profile steps must run inside the container at /workspace/lightop.
- Use the command: docker exec <lightop-container> bash -lc 'cd /workspace/lightop && <command>'
- Follow the existing LightOp wrapper, export.cpp, csrc, test, and benchmark style.
- If the first version is correct but misses the performance target, move on to profiling/tuning; do not stop.
- Before finishing, give the actual commands and results for build, smoke test, correctness test, and benchmark.
......@@ -599,9 +597,9 @@ hipprof --codeobj-analyze <binary-or-so> \
```text
@lightop-kernel-agent-loop
Host LightOp path: /public/home/wanghl6/pr/lightop
Validation Docker: wanghl_lightop209
LightOp path inside the container: /home/pr/lightop
Host LightOp path: /path/to/lightop
Validation Docker: <lightop-container>
LightOp path inside the container: /workspace/lightop
Task: optimize the existing operator <operator>.
Target source: lightop/csrc/<family>/<file>.cu
......
#!/usr/bin/env bash
#
# Measure selected-device memory bandwidth before LightOp tuning.
#
# Direct mode:
#   bash measure-device-bandwidth.sh --output .humanize/lightop-agent/device-bandwidth.txt --hip-visible-devices 0
#
# Docker mode:
#   bash measure-device-bandwidth.sh --docker wanghl_lightop209 --workdir /home/lightop \
#     --output .humanize/lightop-agent/device-bandwidth.txt --hip-visible-devices 0
set -euo pipefail

CONTAINER=""
WORKDIR=""
OUTPUT=".humanize/lightop-agent/device-bandwidth.txt"
HIP_VISIBLE=""
MAX_MIB="512"
MIN_MIB="16"
ITERS="80"
WARMUP="20"
DTYPE="float32"
PYTHON_BIN="${PYTHON:-python}"

usage() {
  sed -n '2,10p' "$0" >&2
  cat >&2 <<'EOF'
Options:
  --docker <container>         Run the measurement inside this Docker container.
  --workdir <path>             Container or local LightOp root. Required with --docker.
  --output <path>              Output path relative to workdir/cwd unless absolute.
  --hip-visible-devices <id>   Value for HIP_VISIBLE_DEVICES during measurement.
  --max-mib <n>                Maximum bytes per buffer in MiB. Default: 512.
  --min-mib <n>                Minimum bytes per buffer in MiB. Default: 16.
  --iters <n>                  Timed iterations. Default: 80.
  --warmup <n>                 Warmup iterations. Default: 20.
  --dtype <torch dtype>        float32, float16, bfloat16. Default: float32.
EOF
}

shell_quote() {
  printf '%q' "$1"
}
emit_python() {
  cat <<'PY'
import argparse
import datetime as _dt
import os
import platform
import sys
import time

import torch


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--output", default=".humanize/lightop-agent/device-bandwidth.txt")
    parser.add_argument("--max-mib", type=int, default=512)
    parser.add_argument("--min-mib", type=int, default=16)
    parser.add_argument("--iters", type=int, default=80)
    parser.add_argument("--warmup", type=int, default=20)
    parser.add_argument("--dtype", default="float32")
    return parser.parse_args()


def dtype_from_name(name):
    mapping = {
        "float32": torch.float32,
        "fp32": torch.float32,
        "float": torch.float32,
        "float16": torch.float16,
        "fp16": torch.float16,
        "half": torch.float16,
        "bfloat16": torch.bfloat16,
        "bf16": torch.bfloat16,
    }
    if name not in mapping:
        raise SystemExit(f"unsupported dtype: {name}")
    return mapping[name]


def main():
    args = parse_args()
    if not torch.cuda.is_available():
        raise SystemExit("torch.cuda is not available in this environment")
    dtype = dtype_from_name(args.dtype)
    torch.cuda.init()
    device_index = torch.cuda.current_device()
    props = torch.cuda.get_device_properties(device_index)
    free, total = torch.cuda.mem_get_info()
    # Clamp the per-buffer size between min and max MiB, using at most 1/5 of free memory.
    min_bytes = max(1, args.min_mib) << 20
    max_bytes = max(args.min_mib, args.max_mib) << 20
    bytes_per_buf = max(min_bytes, min(max_bytes, int(free // 5)))
    elem_size = torch.empty((), device="cuda", dtype=dtype).element_size()
    n = max(1, bytes_per_buf // elem_size)
    bytes_per_buf = n * elem_size
    a = torch.empty(n, device="cuda", dtype=dtype)
    b = torch.empty_like(a)
    c = torch.empty_like(a)
    a.fill_(1.0)
    b.fill_(2.0)
    c.zero_()
    torch.cuda.synchronize()

    def bench(name, fn, bytes_moved):
        # Warm up, then time args.iters launches between two device synchronizations.
        for _ in range(args.warmup):
            fn()
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(args.iters):
            fn()
        torch.cuda.synchronize()
        seconds = (time.perf_counter() - t0) / args.iters
        tbps = bytes_moved / seconds / 1e12
        return name, tbps, seconds, bytes_moved

    rows = [
        bench("write_fill", lambda: a.fill_(3.0), bytes_per_buf),
        bench("copy_read_write", lambda: c.copy_(a), bytes_per_buf * 2),
        bench("triad_2read_1write", lambda: torch.add(a, b, out=c), bytes_per_buf * 3),
        bench("read_reduce", lambda: torch.sum(a), bytes_per_buf),
    ]
    lines = [
        "device_bandwidth_calibration:",
        f"  timestamp_utc: {_dt.datetime.utcnow().isoformat(timespec='seconds')}Z",
        f"  host: {platform.node()}",
        f"  cwd: {os.getcwd()}",
        f"  python: {sys.version.split()[0]}",
        f"  torch: {torch.__version__}",
        f"  hip: {getattr(torch.version, 'hip', None)}",
        f"  hip_visible_devices: {os.environ.get('HIP_VISIBLE_DEVICES', '')}",
        f"  device_index: {device_index}",
        f"  device_name: {torch.cuda.get_device_name(device_index)}",
        f"  gcn_arch: {getattr(props, 'gcnArchName', '')}",
        f"  dtype: {args.dtype}",
        f"  buffer_bytes: {bytes_per_buf}",
        f"  total_mem_bytes: {total}",
        f"  free_mem_bytes_at_start: {free}",
        f"  warmup: {args.warmup}",
        f"  iters: {args.iters}",
        "  results:",
    ]
    for name, tbps, seconds, bytes_moved in rows:
        lines.extend([
            f"    {name}:",
            f"      tbps: {tbps:.6f}",
            f"      us_per_iter: {seconds * 1e6:.3f}",
            f"      bytes_moved: {bytes_moved}",
        ])
    text = "\n".join(lines) + "\n"
    print(text, end="")
    if args.output:
        output_dir = os.path.dirname(os.path.abspath(args.output))
        os.makedirs(output_dir, exist_ok=True)
        with open(args.output, "w", encoding="utf-8") as fh:
            fh.write(text)


if __name__ == "__main__":
    main()
PY
}
while [[ $# -gt 0 ]]; do
  case "$1" in
    --docker) CONTAINER="$2"; shift 2 ;;
    --workdir) WORKDIR="$2"; shift 2 ;;
    --output) OUTPUT="$2"; shift 2 ;;
    --hip-visible-devices) HIP_VISIBLE="$2"; shift 2 ;;
    --max-mib) MAX_MIB="$2"; shift 2 ;;
    --min-mib) MIN_MIB="$2"; shift 2 ;;
    --iters) ITERS="$2"; shift 2 ;;
    --warmup) WARMUP="$2"; shift 2 ;;
    --dtype) DTYPE="$2"; shift 2 ;;
    -h|--help) usage; exit 0 ;;
    *) echo "Error: unknown argument: $1" >&2; usage; exit 1 ;;
  esac
done

if [[ -n "$CONTAINER" ]]; then
  if [[ -z "$WORKDIR" ]]; then
    echo "Error: --workdir is required with --docker" >&2
    exit 1
  fi
  # Quote every value that is interpolated into the command string run by bash -lc.
  q_workdir="$(shell_quote "$WORKDIR")"
  q_output="$(shell_quote "$OUTPUT")"
  q_max="$(shell_quote "$MAX_MIB")"
  q_min="$(shell_quote "$MIN_MIB")"
  q_iters="$(shell_quote "$ITERS")"
  q_warmup="$(shell_quote "$WARMUP")"
  q_dtype="$(shell_quote "$DTYPE")"
  hip_prefix=""
  if [[ -n "$HIP_VISIBLE" ]]; then
    hip_prefix="HIP_VISIBLE_DEVICES=$(shell_quote "$HIP_VISIBLE") "
  fi
  inner="cd $q_workdir && mkdir -p .humanize/lightop-agent && ${hip_prefix}$PYTHON_BIN - --output $q_output --max-mib $q_max --min-mib $q_min --iters $q_iters --warmup $q_warmup --dtype $q_dtype"
  # Stream the embedded Python program to the container over stdin.
  emit_python | docker exec -i "$CONTAINER" bash -lc "$inner"
else
  if [[ -n "$WORKDIR" ]]; then
    cd "$WORKDIR"
  fi
  mkdir -p "$(dirname "$OUTPUT")"
  if [[ -n "$HIP_VISIBLE" ]]; then
    export HIP_VISIBLE_DEVICES="$HIP_VISIBLE"
  fi
  emit_python | "$PYTHON_BIN" - --output "$OUTPUT" --max-mib "$MAX_MIB" --min-mib "$MIN_MIB" --iters "$ITERS" --warmup "$WARMUP" --dtype "$DTYPE"
fi
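The report written by this script is a simple `key: value` listing with a `results:` section, so downstream tooling can read the measured numbers back without a YAML dependency. The following is a minimal sketch, assuming the layout emitted by the script above; the helper name `parse_results` and the sample values are hypothetical, and the parser deliberately ignores indentation.

```python
def parse_results(text):
    """Extract {benchmark_name: {metric: float}} from a device-bandwidth report.

    Hypothetical helper: relies only on the 'results:' marker, on benchmark
    names appearing as 'name:' lines, and on metrics appearing as 'key: number'.
    """
    results = {}
    in_results = False
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        if not in_results:
            in_results = key == "results" and not value
            continue
        if not value:
            # A bare 'name:' line opens a new benchmark entry.
            current = key
            results[current] = {}
        elif current is not None:
            results[current][key] = float(value)
    return results


if __name__ == "__main__":
    # Made-up sample in the script's output layout, for illustration only.
    sample = """device_bandwidth_calibration:
  dtype: float32
  results:
    write_fill:
      tbps: 1.100000
      us_per_iter: 488.000
      bytes_moved: 536870912
    copy_read_write:
      tbps: 1.250000
      us_per_iter: 859.000
      bytes_moved: 1073741824
"""
    for name, metrics in parse_results(sample).items():
        print(name, metrics["tbps"])
```

Header fields such as `timestamp_utc` are skipped because everything before the `results:` marker is ignored.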
......@@ -70,6 +70,13 @@ logs and repeat the same command:
docker exec <container> bash -lc 'cd /path/in/container/lightop && <command>'
```
When Docker is the named execution environment, the LightOp install/build step
must also be run through that exact `docker exec` shape. Do not build on the
host for that task. The install command inside the container is always
`python setup.py install` (optionally prefixed by `PYTORCH_ROCM_ARCH=...`);
never use `setup_torch29.py`, and never remove `build/` as part of the normal
iteration loop.
If the user provides an image rather than a running container, ask for or infer
the host LightOp path and run with DCU devices exposed:
......@@ -221,39 +228,18 @@ Example:
```bash
hy-smi || rocm-smi
mkdir -p .humanize/lightop-agent
HIP_VISIBLE_DEVICES=<idle-card> python - <<'PY' 2>&1 | tee .humanize/lightop-agent/device-bandwidth.txt
import time, torch
torch.cuda.init()
free, total = torch.cuda.mem_get_info()
bytes_per_buf = max(16 << 20, min(512 << 20, int(free // 5)))
n = bytes_per_buf // 4
a = torch.empty(n, device="cuda", dtype=torch.float32)
b = torch.empty_like(a)
c = torch.empty_like(a)
a.fill_(1.0); b.fill_(2.0); c.zero_(); torch.cuda.synchronize()
def bench(name, fn, bytes_moved, iters=80, warmup=20):
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    dt = (time.perf_counter() - t0) / iters
    print(f"{name}: {bytes_moved / dt / 1e12:.3f} TB/s ({dt * 1e6:.2f} us, bytes={bytes_moved})")
bench("write_fill", lambda: a.fill_(3.0), n * 4)
bench("copy_read_write", lambda: c.copy_(a), n * 4 * 2)
bench("triad_2read_1write", lambda: torch.add(a, b, out=c), n * 4 * 3)
bench("read_reduce", lambda: torch.sum(a), n * 4)
print("buffer_bytes:", n * 4, "total_mem:", total, "free_mem_at_start:", free)
PY
bash "{{HUMANIZE_RUNTIME_ROOT}}/scripts/measure-device-bandwidth.sh" \
--docker <container> \
--workdir <container-lightop> \
--hip-visible-devices <idle-card> \
--output .humanize/lightop-agent/device-bandwidth.txt
HIP_VISIBLE_DEVICES=<idle-card> python test/<family>/benchmark_<op>.py
HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/<family>/benchmark_<op>.py
```
For direct-host execution, omit `--docker` and pass `--workdir <lightop-root>`
or run the script from the LightOp root.
Do not report a performance number as actionable unless the device-selection
gate and device bandwidth calibration were recorded, or the user explicitly
accepts the missing evidence.
......@@ -450,11 +436,22 @@ PyTorch version. Do not switch to `setup_torch29.py`.
PYTORCH_ROCM_ARCH='gfx928;gfx936;gfx938' python setup.py install
```
If the selected execution environment is Docker, wrap the same command in the
user-provided container path:
```bash
docker exec <container> bash -lc 'cd <container-lightop> && PYTORCH_ROCM_ARCH="gfx928;gfx936;gfx938" python setup.py install'
```
Keep the existing `build/` directory between attempts so incremental extension
builds can reuse prior compilation output. Do not delete `build/` as part of
the normal build/test/tune loop unless the user explicitly requests a clean
build or the build cache is proven to be stale or corrupt.
Commands such as `rm -rf build`, `python setup_torch29.py install`, or a host
side `python setup.py install` for a Docker-bound task violate this skill unless
the user explicitly overrides the rule.
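A command filter along these lines could catch the three violations before they run. It is a minimal sketch, not part of the skill: `policy_violations`, the regular expressions, and the reason strings are hypothetical, the matching is deliberately simple, and a user override would have to bypass it explicitly.

```python
import re

# Hypothetical patterns for the commands named in the rule above.
FORBIDDEN = (
    (re.compile(r"\brm\s+-[a-zA-Z]*r[a-zA-Z]*\s+(\./)?build/?(\s|$)"), "deletes build/"),
    (re.compile(r"setup_torch29\.py"), "uses setup_torch29.py"),
)


def policy_violations(command, docker_bound):
    """Return the reasons a shell command breaks the build rules, if any.

    docker_bound: True when the task names a Docker execution environment.
    """
    problems = [reason for pattern, reason in FORBIDDEN if pattern.search(command)]
    # An install that is not wrapped in docker exec counts as a host-side build.
    if docker_bound and "setup.py install" in command and "docker exec" not in command:
        problems.append("host-side install for a Docker-bound task")
    return problems


if __name__ == "__main__":
    candidates = [
        "rm -rf build",
        "python setup_torch29.py install",
        "python setup.py install",
        "docker exec <container> bash -lc 'cd <container-lightop> && python setup.py install'",
    ]
    for cmd in candidates:
        print(cmd, "->", policy_violations(cmd, docker_bound=True))
```

The check is string-based, so it only covers the literal command shapes quoted in this section; anything assembled indirectly (aliases, scripts that call `rm`) would need a different mechanism.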
After install, run an import smoke test in the same environment:
```bash
......
......@@ -140,31 +140,11 @@ Load [examples.md](references/examples.md) for copyable command variants.
Minimal first-pass capture from the LightOp root:
```bash
mkdir -p .humanize/lightop-agent
HIP_VISIBLE_DEVICES=<idle-card> python - <<'PY' 2>&1 | tee .humanize/lightop-agent/device-bandwidth.txt
import time, torch
torch.cuda.init()
free, total = torch.cuda.mem_get_info()
bytes_per_buf = max(16 << 20, min(512 << 20, int(free // 5)))
n = bytes_per_buf // 4
a = torch.empty(n, device="cuda", dtype=torch.float32)
b = torch.empty_like(a)
c = torch.empty_like(a)
a.fill_(1.0); b.fill_(2.0); c.zero_(); torch.cuda.synchronize()
def bench(name, fn, bytes_moved, iters=80, warmup=20):
    for _ in range(warmup): fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters): fn()
    torch.cuda.synchronize()
    dt = (time.perf_counter() - t0) / iters
    print(f"{name}: {bytes_moved / dt / 1e12:.3f} TB/s ({dt * 1e6:.2f} us, bytes={bytes_moved})")
bench("write_fill", lambda: a.fill_(3.0), n * 4)
bench("copy_read_write", lambda: c.copy_(a), n * 4 * 2)
bench("triad_2read_1write", lambda: torch.add(a, b, out=c), n * 4 * 3)
bench("read_reduce", lambda: torch.sum(a), n * 4)
print("buffer_bytes:", n * 4, "total_mem:", total, "free_mem_at_start:", free)
PY
bash <humanize-root>/scripts/measure-device-bandwidth.sh \
--docker <container> \
--workdir <container-lightop> \
--hip-visible-devices <idle-card> \
--output .humanize/lightop-agent/device-bandwidth.txt
mkdir -p .humanize/lightop-agent/profile-artifacts/v000_baseline
hy-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/device-status.txt || \
rocm-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/device-status.txt
......