whlwhlwhl / Lightop-SKIILS
Commit 2ad344b2, authored May 21, 2026 by whlwhlwhl
Add optimization constraints and bandwidth query
parent 067b04c0
Showing 3 changed files with 248 additions and 39 deletions
docs/lightop-skills.zh-CN.md +71 -8
humanize/skills/humanize-kernel-agent-loop/SKILL.md +114 -21
humanize/skills/ncu-report/SKILL.md +63 -10
docs/lightop-skills.zh-CN.md
...
@@ -164,6 +164,38 @@ HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/<family>/benchmark_<op>.py
Choose a card with low HCU utilization and low VRAM usage. Run baseline and candidate on the same card whenever possible.
If the device state was not recorded, do not treat the performance numbers as final, trustworthy evidence.
Before starting optimization, also measure the actual read/write bandwidth once on the currently selected card and write it to `.humanize/lightop-agent/device-bandwidth.txt`. This value is not the theoretical peak. It is a sanity baseline for the current container, current card, and current load state, used to judge whether an operator's effective-bandwidth target is reasonable.
```bash
mkdir -p .humanize/lightop-agent
HIP_VISIBLE_DEVICES=<idle-card> python - <<'PY' 2>&1 | tee .humanize/lightop-agent/device-bandwidth.txt
import time, torch
torch.cuda.init()
free, total = torch.cuda.mem_get_info()
bytes_per_buf = max(16 << 20, min(512 << 20, int(free // 5)))
n = bytes_per_buf // 4
a = torch.empty(n, device="cuda", dtype=torch.float32)
b = torch.empty_like(a)
c = torch.empty_like(a)
a.fill_(1.0); b.fill_(2.0); c.zero_(); torch.cuda.synchronize()
def bench(name, fn, bytes_moved, iters=80, warmup=20):
    for _ in range(warmup): fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters): fn()
    torch.cuda.synchronize()
    dt = (time.perf_counter() - t0) / iters
    print(f"{name}: {bytes_moved / dt / 1e12:.3f} TB/s ({dt * 1e6:.2f} us, bytes={bytes_moved})")
bench("write_fill", lambda: a.fill_(3.0), n * 4)
bench("copy_read_write", lambda: c.copy_(a), n * 4 * 2)
bench("triad_2read_1write", lambda: torch.add(a, b, out=c), n * 4 * 3)
bench("read_reduce", lambda: torch.sum(a), n * 4)
print("buffer_bytes:", n * 4, "total_mem:", total, "free_mem_at_start:", free)
PY
```
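As an illustration of how the calibration file might be consumed, the sketch below parses the `name: X TB/s (...)` lines that the script above prints and checks a target against the best measured rate. The helper names (`parse_bandwidth`, `target_is_plausible`), the `headroom` parameter, and the sample numbers are assumptions made for this example; they are not part of the commit.

```python
import re

# Matches lines printed by the calibration script, for example:
# "copy_read_write: 0.800 TB/s (1342.18 us, bytes=1073741824)"
LINE_RE = re.compile(r"^(\w+):\s+([0-9.]+)\s+TB/s")


def parse_bandwidth(text):
    """Return {benchmark_name: TB/s} for every calibration line in `text`."""
    results = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            results[match.group(1)] = float(match.group(2))
    return results


def target_is_plausible(target_tbps, measured, headroom=1.0):
    """Hypothetical check: a target above the best measured rate is suspect."""
    if not measured:
        raise ValueError("no calibration lines found")
    return target_tbps <= max(measured.values()) * headroom


# Illustrative numbers only, not real measurements.
sample = """write_fill: 0.900 TB/s (596.52 us, bytes=536870912)
copy_read_write: 0.800 TB/s (1342.18 us, bytes=1073741824)
buffer_bytes: 536870912 total_mem: 68702699520 free_mem_at_start: 68000000000"""

measured = parse_bandwidth(sample)
print(measured)
print(target_is_plausible(0.85, measured))
```

The trailing `buffer_bytes:` summary line is skipped because it carries no `TB/s` value.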
### Workflow
Stage 1: Inspect and plan.
...
@@ -180,16 +212,22 @@ Stage 1: Inspect and plan.
Stage 2: Implement and verify.
- Before the first optimization, build a baseline matrix: correctness, benchmark, card status, p50/p90/mean, effective bandwidth, and a noise estimate.
- Before the first optimization, build a baseline matrix: correctness, benchmark, card status, p50/p90/mean, the current card's actual read/write/copy bandwidth, effective bandwidth, and a noise estimate.
- Test only one primary optimization hypothesis per round.
- Build with `python setup.py install`.
- Run import smoke, target correctness, and benchmark in the same environment.
- For every candidate that passes correctness, record shape, dtype, configuration, bandwidth or latency, comparison baseline, and the keep/reject/inconclusive reason.
- After the normal benchmark, every correctness-passing optimization candidate must run `hipprof --pmc` on the same card with the same representative shape. The digest must analyze cache behavior, memory/cache traffic, LDS or bank conflicts, and occupancy/resource pressure, and give one explicit next change. If the current DTK does not support a counter or occupancy signal, record the actual command and the error.
Stage 3: Profiling and tuning.
- When the first correct candidate misses the performance target, move into profiling/tuning. Do not stop.
- After each optimization change, as long as correctness passes, benchmark first and then run the per-candidate `hipprof --pmc`. Use profiler evidence to decide the next round instead of continuing on source-code intuition alone.
- When two consecutive correct candidates miss the target, the next kernel/dispatch edit requires both the `lightop-kernel-knowledge` research conclusions and the `dcu-profiler-report` profile digest.
- If the second correct optimization attempt improves on the parent or baseline by less than 5%, the next step must do deep
...
@@ -208,7 +246,10 @@ Stage 4: Wrap-up.
The performance target is an acceptance contract, not a "try your best" goal. The rules are:
- A performance candidate counts as valid only after correctness passes.
- Before starting optimization, measure the current card's actual read/write bandwidth and include the result in the baseline matrix.
- Every correct candidate must have benchmark data.
- Every correct optimization candidate must have accompanying `hipprof --pmc` evidence that covers at least the cache, LDS/bank-conflict, and occupancy/resource signals, or that records why the tool does not support them.
- An improvement must exceed the noise band defined in the plan. A result inside the noise band can only be recorded as inconclusive or plateau.
- When the first correct candidate misses the target, profiling and tuning are mandatory.
- Try at least 3 evidence-backed optimization lineages, unless the profiler proves the target is unreachable.
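The noise-band rule above can be sketched as a small decision helper. This is a minimal illustration, assuming latency samples in microseconds and a relative noise band of 2%; the function name `classify` and the default threshold are hypothetical, since the real band comes from the plan.

```python
import statistics


def classify(baseline_us, candidate_us, noise_band=0.02):
    """Compare median latencies and apply a relative noise band.

    `noise_band` stands in for the threshold defined in the plan;
    0.02 is only a placeholder value for this sketch.
    """
    base = statistics.median(baseline_us)
    cand = statistics.median(candidate_us)
    speedup = (base - cand) / base  # positive means the candidate is faster
    if speedup > noise_band:
        return "keep", speedup
    if speedup < -noise_band:
        return "reject", speedup
    return "inconclusive", speedup


print(classify([100.0, 101.0, 99.0], [90.0, 91.0, 89.0]))
print(classify([100.0, 100.0, 100.0], [99.5, 99.5, 99.5]))
```

A "keep" from this helper is only a benchmark verdict. Under the rules above, the candidate still needs its correctness result and `hipprof --pmc` evidence before it is accepted.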
...
@@ -270,6 +311,24 @@ python test_<op>.py
If no benchmark exists, add a small one. It must include warmup, fixed shapes, a fixed seed, and an explicit `torch.cuda.synchronize()` before and after the timed region.
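A minimal harness with those properties might look like the following sketch. It uses only the standard library so it runs anywhere; the `sync` argument is a stand-in for `torch.cuda.synchronize`, and the function name, defaults, and the nearest-rank p90 calculation are assumptions for illustration.

```python
import random
import statistics
import time


def run_benchmark(fn, sync=lambda: None, warmup=20, iters=80, seed=0):
    """Time `fn` with warmup, a fixed seed, and a sync before and after each run."""
    random.seed(seed)  # fixed seed so any randomly generated inputs repeat
    for _ in range(warmup):
        fn()
    samples_us = []
    for _ in range(iters):
        sync()  # drain pending device work before starting the clock
        t0 = time.perf_counter()
        fn()
        sync()  # wait for the timed work to finish
        samples_us.append((time.perf_counter() - t0) * 1e6)
    samples_us.sort()
    p90_index = min(len(samples_us) - 1, int(0.9 * len(samples_us)))
    return {
        "p50_us": statistics.median(samples_us),
        "p90_us": samples_us[p90_index],
        "mean_us": statistics.fmean(samples_us),
        "iters": iters,
    }


# CPU-only stand-in workload; on a device, pass sync=torch.cuda.synchronize.
stats = run_benchmark(lambda: sum(range(1000)))
print(stats)
```

Synchronizing per iteration measures individual launches and makes p50/p90 meaningful. The calibration script in this commit instead synchronizes once around the whole loop, which yields only a mean.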
Every correctness-passing optimization candidate must run PMC after the benchmark:
```bash
HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc --pmc-type 3 \
  -o .humanize/lightop-agent/profile-artifacts/<version>/hipprof-pmc-all/pmc \
  python test/<family>/benchmark_<op>.py
HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc-read --pmc-type 3 \
  -o .humanize/lightop-agent/profile-artifacts/<version>/hipprof-pmc-all/pmc-read \
  python test/<family>/benchmark_<op>.py
HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc-write --pmc-type 3 \
  -o .humanize/lightop-agent/profile-artifacts/<version>/hipprof-pmc-all/pmc-write \
  python test/<family>/benchmark_<op>.py
```
Write these results to `.humanize/lightop-agent/profile-artifacts/<version>/digest.md` or `kernel_opt_readme.md`. Use them to assess cache hits, LDS bank conflicts, kernel occupancy, and VGPR/LDS/resource pressure, and to produce a concrete next kernel edit.
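Because the three PMC invocations differ only in one flag and one output path, a loop could generate them per candidate. The sketch below only builds and prints the command lines, using the flags shown above; the function name `pmc_commands` and the example version, benchmark path, and card index are hypothetical.

```python
import shlex

# Output directory name -> hipprof flag, as used in the commands above.
PMC_VARIANTS = {"pmc": "--pmc", "pmc-read": "--pmc-read", "pmc-write": "--pmc-write"}


def pmc_commands(version, benchmark, card,
                 root=".humanize/lightop-agent/profile-artifacts"):
    """Return {variant: (argv, env)} for the three PMC captures of one candidate."""
    commands = {}
    for name, flag in PMC_VARIANTS.items():
        out = f"{root}/{version}/hipprof-pmc-all/{name}"
        argv = ["hipprof", flag, "--pmc-type", "3", "-o", out, "python", benchmark]
        commands[name] = (argv, {"HIP_VISIBLE_DEVICES": str(card)})
    return commands


# Placeholder version, benchmark path, and card index for illustration.
for name, (argv, env) in pmc_commands(
        "v001_candidate", "test/family/benchmark_op.py", 0).items():
    print(f"HIP_VISIBLE_DEVICES={env['HIP_VISIBLE_DEVICES']} {shlex.join(argv)}")
```

Keeping `argv` and the environment separate lets a caller hand them to `subprocess.run` and capture the error output when a PMC mode is unsupported.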
### RLCR launch and failure fallback
Once the refined plan is written and `.humanize*` is confirmed to be ignored, launch from the LightOp root:
...
@@ -409,6 +468,8 @@ python3 scripts/search-index-repos.py <term1> <term2> [<term3>]
- The baseline benchmark has passed, but there is no baseline profile digest yet.
- A correct candidate is within +/-2% of the baseline or the prior best.
- A correctness-passing optimization candidate has finished its benchmark and needs the mandatory `hipprof --pmc` digest to decide the next step.
- The second correct optimization attempt improves on the parent or baseline by less than 5%.
- A candidate regresses on an important shape.
- A candidate is abnormally fast and needs an explanation.
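The numeric triggers in this list can be expressed as a small check. The sketch below is an assumption-laden illustration: the function name, argument names, and the convention that `delta` is a relative improvement are all hypothetical, and the "abnormally fast" trigger is left out because the text gives no threshold for it.

```python
def profiling_triggers(delta, attempt_index, has_baseline_digest, regressed_shapes=()):
    """List the reasons a profile digest is needed for one correct candidate.

    delta: relative improvement over baseline or prior best (0.03 means 3% faster).
    attempt_index: 1 for the first correct optimization attempt, 2 for the second.
    """
    reasons = ["mandatory per-candidate hipprof --pmc digest"]
    if not has_baseline_digest:
        reasons.append("baseline profile digest missing")
    if abs(delta) <= 0.02:
        reasons.append("within +/-2% of baseline or prior best")
    if attempt_index == 2 and delta < 0.05:
        reasons.append("second attempt improved by less than 5%")
    if regressed_shapes:
        reasons.append("regression on important shapes: " + ", ".join(regressed_shapes))
    return reasons


print(profiling_triggers(0.01, 2, True))
```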
...
@@ -428,7 +489,7 @@ python3 scripts/search-index-repos.py <term1> <term2> [<term3>]
device-status.txt
benchmark.log
hipprof.txt
hipprof-pmc-all/
hipprof-pmc-all/        # required for every correct optimization candidate
sqtt-json/
rocprof.csv
rocprof-stats.csv
...
@@ -448,12 +509,15 @@ python3 scripts/search-index-repos.py <term1> <term2> [<term3>]
3. Run the device-selection gate and pin the card with `HIP_VISIBLE_DEVICES=<idle-card>`.
4. Run the normal benchmark first, so that profiler overhead does not become the performance conclusion.
5. Use `hipprof` for first-pass API/kernel/memcpy timing.
6. For deep analysis, collect PMC, SQTT, `dccobjdump`, code-object resource, and LDS/register/occupancy evidence.
6. For every correct optimization candidate, collect whichever of `hipprof --pmc`, `--pmc-read`, and `--pmc-write` the current environment supports, and explain cache, memory/cache traffic, LDS/bank conflict, and occupancy/resource pressure.
7. For deep analysis, collect SQTT, `dccobjdump`, code-object resource, and LDS/register/occupancy evidence.
7. If `hipprof` only shows the hot kernel but cannot explain the cause, move to a deeper ROCm/DTK profiler.
8. If `hipprof` only shows the hot kernel but cannot explain the cause, move to a deeper ROCm/DTK profiler.
8. If codegen is suspected, inspect the AMDGPU ISA or code-object metadata.
9. If codegen is suspected, inspect the AMDGPU ISA or code-object metadata.
9. Compare the candidate against the baseline or parent, not only absolute numbers.
10. Compare the candidate against the baseline or parent, not only absolute numbers.
10. Write `digest.md`. It must end with exactly one explicit next edit.
11. Write `digest.md`. It must end with exactly one explicit next edit.
### Common commands
...
@@ -560,4 +624,3 @@ hipprof --codeobj-analyze <binary-or-so> \
- Prefer modifying only the source, tests, benchmarks, or necessary config related to the target operator.
- Run install, correctness tests, benchmarks, profiling, and tuning directly inside Docker.
```
humanize/skills/humanize-kernel-agent-loop/SKILL.md
...
@@ -211,18 +211,52 @@ execution environment and pin the run to an idle card.
  comparison is deliberately measuring cross-device behavior.
- Record the chosen card, HCU utilization, VRAM use, and exact command in the
  attempt ledger or profile artifact directory.
- Before the first optimization edit, run a device bandwidth calibration on the
  selected card in the same execution environment. Record actual read, write,
  copy/read-write, and simple triad bandwidth, buffer size, dtype, selected
  card, and command in `.humanize/lightop-agent/device-bandwidth.txt`. Treat
  this as a sanity baseline for any user-specified effective-bandwidth target.

Example:
```bash
hy-smi || rocm-smi
mkdir -p .humanize/lightop-agent
HIP_VISIBLE_DEVICES=<idle-card> python - <<'PY' 2>&1 | tee .humanize/lightop-agent/device-bandwidth.txt
import time, torch
torch.cuda.init()
free, total = torch.cuda.mem_get_info()
bytes_per_buf = max(16 << 20, min(512 << 20, int(free // 5)))
n = bytes_per_buf // 4
a = torch.empty(n, device="cuda", dtype=torch.float32)
b = torch.empty_like(a)
c = torch.empty_like(a)
a.fill_(1.0); b.fill_(2.0); c.zero_(); torch.cuda.synchronize()
def bench(name, fn, bytes_moved, iters=80, warmup=20):
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    dt = (time.perf_counter() - t0) / iters
    print(f"{name}: {bytes_moved / dt / 1e12:.3f} TB/s ({dt * 1e6:.2f} us, bytes={bytes_moved})")
bench("write_fill", lambda: a.fill_(3.0), n * 4)
bench("copy_read_write", lambda: c.copy_(a), n * 4 * 2)
bench("triad_2read_1write", lambda: torch.add(a, b, out=c), n * 4 * 3)
bench("read_reduce", lambda: torch.sum(a), n * 4)
print("buffer_bytes:", n * 4, "total_mem:", total, "free_mem_at_start:", free)
PY
HIP_VISIBLE_DEVICES=<idle-card> python test/<family>/benchmark_<op>.py
HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/<family>/benchmark_<op>.py
```
Do not report a performance number as actionable unless the device-selection
gate was recorded or the user explicitly accepts the missing card-state
evidence.
Do not report a performance number as actionable unless the device-selection
gate and device bandwidth calibration were recorded, or the user explicitly
accepts the missing evidence.
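One way to enforce this rule before reporting a number is to check that both evidence files exist and are non-empty. The sketch below assumes the artifact layout used elsewhere in this commit (`device-status.txt` inside the per-version artifact directory, `device-bandwidth.txt` under `.humanize/lightop-agent`); the function name `missing_evidence` is hypothetical.

```python
from pathlib import Path


def missing_evidence(artifact_dir, state_dir=".humanize/lightop-agent"):
    """Return the evidence files that are absent or empty."""
    required = [
        Path(artifact_dir) / "device-status.txt",   # hy-smi or rocm-smi output
        Path(state_dir) / "device-bandwidth.txt",   # bandwidth calibration output
    ]
    return [str(p) for p in required if not p.is_file() or p.stat().st_size == 0]


missing = missing_evidence(".humanize/lightop-agent/profile-artifacts/v000_baseline")
if missing:
    print("performance numbers are non-final; missing:", missing)
else:
    print("device gate and bandwidth calibration are recorded")
```

An empty result only shows that the files exist. Whether the recorded card was actually idle still has to be read from the status output.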
## Workflow
...
@@ -252,10 +286,11 @@ evidence.
### Stage 2: Implement And Verify
1. Build a baseline matrix before the first optimization edit: run correctness,
   then benchmark the target workload on the selected idle card for enough
   repeats to report p50/p90 or mean, effective bandwidth, variance/noise, card
   status, and command line. Store it in the attempt ledger and
   `kernel_opt_readme.md`.
1. Build a baseline matrix before the first optimization edit: run correctness,
   run the selected-card device bandwidth calibration, then benchmark the
   target workload on the selected idle card for enough repeats to report
   p50/p90 or mean, effective bandwidth, variance/noise, card status, device
   read/write/copy bandwidth, and command line. Store it in the attempt ledger
   and `kernel_opt_readme.md`.
2. Make the smallest LightOp source change that can satisfy the current
   task-acceptance pair.
3. Build LightOp with the target arch.
...
@@ -263,9 +298,15 @@ evidence.
   the performance device gate and pin the run with `HIP_VISIBLE_DEVICES`.
5. Record every candidate result: correctness failure, build failure,
   regression, plateau, and improvement.
6. Invoke `dcu-profiler-report` when benchmark evidence is not enough to choose
   the next edit.
6. For every correctness-passing optimization candidate, run the normal
   synchronized benchmark first, then collect `hipprof --pmc` evidence for the
   same representative target shape before making the next kernel or dispatch
   edit. Capture cache-related counters, LDS/bank-conflict clues,
   occupancy/resource signals, and the exact command output or unsupported-tool
   reason. Use `dcu-profiler-report` to turn this evidence into the next edit.
7. Invoke deeper `dcu-profiler-report` analysis when the per-candidate PMC
   evidence and benchmark still do not explain the next edit.
7. If the first correctness-passing candidate misses the required performance
   threshold, continue into profiling and tuning instead of declaring the task
   done.
8. If the first correctness-passing candidate misses the required performance
   threshold, continue into profiling and tuning instead of declaring the task
   done.
...
@@ -282,15 +323,19 @@ evidence.
   regimes need different choices.
4. Re-run correctness across all touched dtypes/layouts/modes.
5. Re-run benchmark cases that define `W` and any nearby regression guards.
6. Re-run per-candidate `hipprof --pmc` captures after each correctness-passing
   optimization edit and summarize cache behavior, LDS/bank conflicts,
   occupancy/resource pressure, and one profiler-backed next action before
   starting the next edit.
6. Reject or revert a candidate lineage in the final chosen path when it fails
   correctness, improves less than the noise/stability threshold, helps only
   non-target shapes, or lacks required profile/resource/ISA evidence after a
   profiling gate. Record the rejected lineage instead of silently overwriting
   it.
7. Reject or revert a candidate lineage in the final chosen path when it fails
   correctness, improves less than the noise/stability threshold, helps only
   non-target shapes, or lacks required profile/resource/ISA evidence after a
   profiling gate. Record the rejected lineage instead of silently overwriting
   it.
7. After reaching the target, run a final guard validation: targeted
   correctness, repeated target benchmark on the selected card, and nearby
   shape/dtype regression checks when relevant.
8. After reaching the target, run a final guard validation: targeted
   correctness, repeated target benchmark on the selected card, and nearby
   shape/dtype regression checks when relevant.
8. Summarize final code paths, fallback behavior, unsupported regimes, and
   remaining risks.
9. Summarize final code paths, fallback behavior, unsupported regimes, and
   remaining risks.
## Performance Target Discipline
...
@@ -304,6 +349,13 @@ that threshold as part of the acceptance contract.
- For every correctness-passing candidate, record shape, dtype, layout/mode,
  kernel or dispatch configuration, measured bandwidth/latency, comparison
  baseline, and the reason it improved, regressed, or plateaued.
- For every correctness-passing optimization candidate, record paired
  `hipprof --pmc` evidence for the representative workload before choosing the
  next edit. The digest must discuss cache behavior, memory/cache traffic,
  LDS or bank-conflict evidence, occupancy/resource pressure, and exactly one
  next LightOp kernel, launcher, dispatcher, config, or benchmark edit. If a
  PMC counter or occupancy signal is unavailable on the installed DTK, record
  the exact command and failure instead of guessing.
- A performance improvement counts as effective only when it exceeds the
  benchmark noise/stability threshold defined in the plan. If the measured
  delta is inside the noise band, record it as inconclusive or plateau, not as
...
@@ -331,8 +383,7 @@ that threshold as part of the acceptance contract.
- Do not claim that an optimization is effective from intuition, source
  inspection, or expected hardware behavior. Promotion to the optimization
  ledger requires measured correctness-passing benchmark data, baseline or
  parent comparison, and, when a profiling gate applies, profiler/resource/ISA
  evidence.
- Do not claim that an optimization is effective from intuition, source
  inspection, or expected hardware behavior. Promotion to the optimization
  ledger requires measured correctness-passing benchmark data, baseline or
  parent comparison, and per-candidate PMC/profile/resource evidence.
- If the target remains unmet after the required tuning lineages, summarize the
  best candidate, failed lineages, profiler bottleneck class, unsupported
  regimes, and the most likely next engineering investment. Do not present the
...
@@ -368,6 +419,8 @@ Use this shape:
- Target workload:
- Effective-bandwidth formula:
- Device gate:
- Device bandwidth calibration: read/write/copy/triad bandwidth, buffer size,
  selected card, command:
- Build/test/benchmark commands:
- Baseline p50/p90/mean, variance/noise:
@@ -380,7 +433,8 @@ Use this shape:
...
@@ -380,7 +433,8 @@ Use this shape:
-
Device status and HIP_VISIBLE_DEVICES:
-
Device status and HIP_VISIBLE_DEVICES:
-
Benchmark table: baseline/parent/candidate, p50/p90/mean, effective BW,
-
Benchmark table: baseline/parent/candidate, p50/p90/mean, effective BW,
delta, noise threshold:
delta, noise threshold:
-
Profile/resource/ISA evidence:
-
Per-candidate PMC/profile/resource evidence: cache behavior, LDS/bank
conflicts, occupancy/resource pressure, unavailable counters:
-
Decision: keep | reject | inconclusive
-
Decision: keep | reject | inconclusive
-
Reason:
-
Reason:
-
Next step:
-
Next step:
...
@@ -431,11 +485,34 @@ capture `hy-smi` or `rocm-smi`, choose a low-utilization/low-VRAM card, and run
the benchmark with `HIP_VISIBLE_DEVICES=<idle-card>`. Reuse the same selected
card for baseline and candidate measurements.

After every correctness-passing optimization candidate, collect PMC evidence
for the same representative benchmark shape before making the next edit. Use
the same selected card, store the artifacts under
`.humanize/lightop-agent/profile-artifacts/<version>/`, and run supported
variants such as:
```bash
HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc --pmc-type 3 \
  -o .humanize/lightop-agent/profile-artifacts/<version>/hipprof-pmc-all/pmc \
  python test/<family>/benchmark_<op>.py
HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc-read --pmc-type 3 \
  -o .humanize/lightop-agent/profile-artifacts/<version>/hipprof-pmc-all/pmc-read \
  python test/<family>/benchmark_<op>.py
HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc-write --pmc-type 3 \
  -o .humanize/lightop-agent/profile-artifacts/<version>/hipprof-pmc-all/pmc-write \
  python test/<family>/benchmark_<op>.py
```
Use the PMC digest to inspect cache behavior, memory/cache traffic, LDS or bank
conflicts, occupancy/resource pressure, and the next concrete edit. If a
counter, `--pmc` mode, or occupancy signal is unsupported in the installed DTK,
record the attempted command and error output in the artifact directory.

Do not claim success from a passing build alone. A LightOp operator change is
complete only after install, import smoke, targeted correctness, benchmark
comparison, and, when the result is near the threshold or surprising, profiler
evidence. Do not claim speedups from Python wall-clock timing unless
asynchronous DCU work is synchronized.
Do not claim success from a passing build alone. A LightOp operator change is
complete only after install, import smoke, targeted correctness, benchmark
comparison, and the required per-candidate PMC/profile evidence. Do not claim
speedups from Python wall-clock timing unless asynchronous DCU work is
synchronized.

Use the same benchmark command, selected card, workload shape(s), and effective
bandwidth formula for baseline and candidate comparisons. If the workload
...
@@ -443,8 +520,14 @@ contract changes, start a new baseline matrix and explain why.
## Profiling
Invoke `dcu-profiler-report` autonomously when profiler evidence is the best
next source of truth. These are heuristics, not user-facing gates:
Invoke `dcu-profiler-report` autonomously after every correctness-passing
optimization candidate to interpret the mandatory `hipprof --pmc` capture and
produce the next edit. The digest may be concise, but it must explain cache
behavior, LDS/bank-conflict clues, occupancy/resource pressure, and exactly one
profiler-backed next action.

Escalate from the mandatory per-candidate PMC pass to deeper profiling when any
of these hold:
- Baseline benchmark has passed and no profile digest exists.
- A correct candidate is within +/-2% of baseline or the prior best.
...
@@ -492,6 +575,10 @@ schema. Include acceptance criteria for:
- Device gate: `hy-smi` or `rocm-smi` command, idle-card selection criteria,
  required `HIP_VISIBLE_DEVICES=<idle-card>` prefix for benchmark/profile, and
  where the card-state output is stored.
- Device bandwidth calibration before the first optimization edit: exact
  command, selected card, buffer size, dtype, measured read/write/copy/triad
  bandwidth, artifact path, and how the result constrains or contextualizes the
  target effective-bandwidth threshold.
- Correctness coverage for `W`, edge cases, dtype/layout/mode boundaries, and
  baseline/reference parity.
- Build command, ROCm/DTK/PyTorch versions, `PYTORCH_ROCM_ARCH`, and device
...
@@ -499,8 +586,14 @@ schema. Include acceptance criteria for:
- Benchmark method with warmup, repeats, synchronization, per-shape timing,
  p50/p90 or mean as appropriate, variance/noise band, minimum effective delta,
  and environment metadata.
- Per-candidate `hipprof --pmc` capture after every correctness-passing
  optimization edit, including artifact path, selected card, representative
  shape, cache counters or unavailable-counter reason, LDS/bank-conflict
  evidence, occupancy/resource interpretation, and the profiler-backed next
  edit.
- Baseline matrix required before the first optimization edit, including card
  status, repeated timing, effective bandwidth, and noise/stability estimate.
- Baseline matrix required before the first optimization edit, including card
  status, device bandwidth calibration, repeated timing, effective bandwidth,
  and noise/stability estimate.
- Iteration discipline: one primary optimization hypothesis per lineage, plus
  explicit keep/reject/inconclusive decision rules.
- Research digest covering local LightOp patterns and any upstream/source
...
humanize/skills/ncu-report/SKILL.md
...
@@ -38,6 +38,9 @@ the next edit.
Invoke `dcu-profiler-report` when any of these hold:
- A correctness-passing LightOp optimization candidate has just been benchmarked
  and the loop needs the mandatory per-candidate `hipprof --pmc` interpretation
  before the next edit.
- A LightOp baseline benchmark has passed and no baseline profile digest exists.
- A correct candidate is within +/-2% of baseline or the prior best.
- The second correctness-passing optimization attempt improves less than 5%
...
@@ -66,17 +69,23 @@ same host/container environment that will run the workload.
- Store the status output and the selected card in the artifact directory and
  digest. If the tools are unavailable, record the failed command and treat the
  resulting performance data as non-final unless the user accepts that caveat.
- Before the first optimization profile or baseline comparison, include the
  selected-card device bandwidth calibration from
  `.humanize/lightop-agent/device-bandwidth.txt`. If it is missing, run the
  calibration in the same environment and record actual read, write,
  copy/read-write, and triad bandwidth before interpreting operator bandwidth.

## Required Artifacts
Store artifacts under the LightOp loop state or user-specified evidence path:
```text
.humanize/lightop-agent/device-bandwidth.txt
.humanize/lightop-agent/profile-artifacts/<version>/
  device-status.txt          # hy-smi or rocm-smi before benchmark/profile
  benchmark.log
  hipprof.txt
  hipprof-pmc-all/           # mandatory for deep-analysis gates
  hipprof-pmc-all/           # mandatory for each correctness-passing optimization candidate
  sqtt-json/                 # mandatory when SQTT is available
  rocprof.csv                # when rocprof/rocprofv3 is used
  rocprof-stats.csv          # when available
...
@@ -98,21 +107,30 @@ the digest.
   shape, fixed seed, and explicit synchronization around timed regions.
3. Run the device selection gate and pin the selected idle card with
   `HIP_VISIBLE_DEVICES`.
4. Capture normal benchmark output before profiling, so profiler overhead does
   not become the performance claim.
4. Before the first optimization edit or baseline profile, run or cite the
   selected-card device bandwidth calibration and compare the operator's
   effective-bandwidth target against the measured read/write/copy ceiling.
5. Capture normal benchmark output before profiling, so profiler overhead does
   not become the performance claim.
5. Run `hipprof` for first-pass API/kernel/memcpy timing.
6. Run `hipprof` for first-pass API/kernel/memcpy timing.
6. For deep-analysis gates, collect `hipprof` PMC all, SQTT JSON when
   supported, `dccobjdump` disassembly, code-object resource usage, and
   LDS/register/occupancy evidence before choosing the next edit.
7. For every correctness-passing optimization candidate, collect `hipprof`
   PMC evidence with supported variants such as `--pmc`, `--pmc-read`, and
   `--pmc-write`. Interpret cache behavior, memory/cache traffic, LDS or
   bank-conflict evidence, and occupancy/resource pressure before choosing the
   next edit. If a counter group or occupancy signal is unavailable, record the
   exact attempted command and error output.
8. For deep-analysis gates, add SQTT JSON when
   supported, `dccobjdump` disassembly, code-object resource usage, and
   LDS/register/occupancy evidence before choosing the next edit.
7. If `hipprof` shows one or a few dominant kernels but does not explain why,
   collect deeper counters with the installed ROCm/DTK profiler.
9. If `hipprof` shows one or a few dominant kernels but does not explain why,
   collect deeper counters with the installed ROCm/DTK profiler.
8. If the issue looks codegen-sensitive, inspect AMDGPU ISA or code-object
   metadata before choosing an edit.
10. If the issue looks codegen-sensitive, inspect AMDGPU ISA or code-object
    metadata before choosing an edit.
9. Compare candidate against baseline or parent, not only absolute metrics.
11. Compare candidate against baseline or parent, not only absolute metrics.
10. Diagnose using the metric groups in [metrics.md](references/metrics.md).
12. Diagnose using the metric groups in [metrics.md](references/metrics.md).
11. Write `digest.md` using the template below. The final section must contain
    exactly one next edit.
13. Write `digest.md` using the template below. The final section must contain
    exactly one next edit.
12. Update the LightOp loop ledger with the digest path, bottleneck class, and
    selected next edit.
14. Update the LightOp loop ledger with the digest path, bottleneck class, and
    selected next edit.
## Common Commands
...
@@ -122,6 +140,31 @@ Load [examples.md](references/examples.md) for copyable command variants.
Minimal first-pass capture from the LightOp root:
```bash
mkdir -p .humanize/lightop-agent
HIP_VISIBLE_DEVICES=<idle-card> python - <<'PY' 2>&1 | tee .humanize/lightop-agent/device-bandwidth.txt
import time, torch
torch.cuda.init()
free, total = torch.cuda.mem_get_info()
bytes_per_buf = max(16 << 20, min(512 << 20, int(free // 5)))
n = bytes_per_buf // 4
a = torch.empty(n, device="cuda", dtype=torch.float32)
b = torch.empty_like(a)
c = torch.empty_like(a)
a.fill_(1.0); b.fill_(2.0); c.zero_(); torch.cuda.synchronize()
def bench(name, fn, bytes_moved, iters=80, warmup=20):
    for _ in range(warmup): fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters): fn()
    torch.cuda.synchronize()
    dt = (time.perf_counter() - t0) / iters
    print(f"{name}: {bytes_moved / dt / 1e12:.3f} TB/s ({dt * 1e6:.2f} us, bytes={bytes_moved})")
bench("write_fill", lambda: a.fill_(3.0), n * 4)
bench("copy_read_write", lambda: c.copy_(a), n * 4 * 2)
bench("triad_2read_1write", lambda: torch.add(a, b, out=c), n * 4 * 3)
bench("read_reduce", lambda: torch.sum(a), n * 4)
print("buffer_bytes:", n * 4, "total_mem:", total, "free_mem_at_start:", free)
PY
mkdir -p .humanize/lightop-agent/profile-artifacts/v000_baseline
hy-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/device-status.txt || \
  rocm-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/device-status.txt
...
@@ -137,6 +180,16 @@ hy-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v001_candidate/devic
  rocm-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v001_candidate/device-status.txt
HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/<family>/benchmark_<op>.py 2>&1 \
  | tee .humanize/lightop-agent/profile-artifacts/v001_candidate/hipprof.txt
mkdir -p .humanize/lightop-agent/profile-artifacts/v001_candidate/hipprof-pmc-all
HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc --pmc-type 3 \
  -o .humanize/lightop-agent/profile-artifacts/v001_candidate/hipprof-pmc-all/pmc \
  python test/<family>/benchmark_<op>.py
HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc-read --pmc-type 3 \
  -o .humanize/lightop-agent/profile-artifacts/v001_candidate/hipprof-pmc-all/pmc-read \
  python test/<family>/benchmark_<op>.py
HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc-write --pmc-type 3 \
  -o .humanize/lightop-agent/profile-artifacts/v001_candidate/hipprof-pmc-all/pmc-write \
  python test/<family>/benchmark_<op>.py
```
Deep-analysis capture, required after a second correctness-passing
...