whlwhlwhl / Lightop-SKIILS
Commit f18f386a, authored May 20, 2026 by whlwhlwhl
add prof constraints
parent 91f343be
Changes: 2
Showing 2 changed files with 151 additions and 12 deletions
humanize/skills/humanize-kernel-agent-loop/SKILL.md +64 -1
humanize/skills/ncu-report/SKILL.md +87 -11
humanize/skills/humanize-kernel-agent-loop/SKILL.md
````diff
@@ -82,6 +82,9 @@ Record these before the first build:
 - Container name or image tag, or `direct-host` when not using Docker.
 - LightOp path from the command's point of view, not only the host path.
 - Visible device env such as `HIP_VISIBLE_DEVICES` or `HSA_VISIBLE_DEVICES`.
+- DCU status command and selected card before performance runs: `hy-smi` or
+  `rocm-smi`, the observed HCU utilization, VRAM use, and the
+  `HIP_VISIBLE_DEVICES=<idle-card>` value used for benchmark/profile commands.
 - `PYTORCH_ROCM_ARCH`, DTK/ROCm version, PyTorch version, HIP version, device
   name, and `gcnArchName`.
 - Exact build/install, import-smoke, correctness, benchmark, and profiler
````
````diff
@@ -174,6 +177,33 @@ PY
 hipcc --version
 ```

+## Performance Device Gate
+
+Before any benchmark or profiling command, check card status in the target
+execution environment and pin the run to an idle card.
+
+- Run `hy-smi` or `rocm-smi` immediately before the performance command.
+- Choose the card with low HCU utilization and low VRAM occupancy. If no card
+  is idle enough for stable results, delay the performance run or record that
+  the result is noisy and not acceptable as final evidence.
+- Prefix benchmark and profiler commands with `HIP_VISIBLE_DEVICES=<card>`.
+  Keep the same card for the paired baseline/candidate comparison unless the
+  comparison is deliberately measuring cross-device behavior.
+- Record the chosen card, HCU utilization, VRAM use, and exact command in the
+  attempt ledger or profile artifact directory.
+
+Example:
+
+```bash
+hy-smi || rocm-smi
+HIP_VISIBLE_DEVICES=<idle-card> python test/<family>/benchmark_<op>.py
+HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/<family>/benchmark_<op>.py
+```
+
+Do not report a performance number as actionable unless the device-selection
+gate was recorded or the user explicitly accepts the missing card-state
+evidence.
+
 ## Workflow

 ### Stage 1: Inspect And Plan
````
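The Performance Device Gate added above amounts to three steps: read card state, pick the least-busy card below some ceiling, and pin the run to it. The following is a minimal Python sketch under stated assumptions. The `CardState` record, the `pick_idle_card` and `pinned_env` helpers, and the 10% utilization and 20% VRAM ceilings are all hypothetical; the skill text says only "low" and gives no numbers. Parsing `hy-smi` or `rocm-smi` output is deliberately left out because the layout depends on the installed tool version.

```python
import os
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CardState:
    """One card's state as read from hy-smi or rocm-smi (hypothetical record)."""
    index: int
    hcu_util_pct: float   # HCU utilization, percent
    vram_used_pct: float  # VRAM occupancy, percent


def pick_idle_card(cards: Sequence[CardState],
                   max_util_pct: float = 10.0,
                   max_vram_pct: float = 20.0) -> Optional[CardState]:
    """Return the least-busy card under both ceilings, or None if none qualifies.

    The ceilings are illustrative assumptions, not values from the skill.
    """
    idle = [c for c in cards
            if c.hcu_util_pct <= max_util_pct and c.vram_used_pct <= max_vram_pct]
    if not idle:
        return None  # caller should delay the run or mark the result as noisy
    return min(idle, key=lambda c: (c.hcu_util_pct, c.vram_used_pct, c.index))


def pinned_env(card: CardState) -> dict:
    """Copy the current environment with HIP_VISIBLE_DEVICES set to the chosen card."""
    env = dict(os.environ)
    env["HIP_VISIBLE_DEVICES"] = str(card.index)
    return env
```

Passing the same `pinned_env(card)` as `env=` to `subprocess.run` for both the baseline and the candidate keeps the paired comparison on one card, which is what the gate asks for.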
````diff
@@ -193,7 +223,8 @@ hipcc --version
 1. Make the smallest LightOp source change that can satisfy the current
    task-acceptance pair.
 2. Build LightOp with the target arch.
-3. Run the targeted correctness test and then the benchmark for `W`.
+3. Run the targeted correctness test. Before the benchmark for `W`, execute
+   the performance device gate and pin the run with `HIP_VISIBLE_DEVICES`.
 4. Record every candidate result: correctness failure, build failure,
    regression, plateau, and improvement.
 5. Invoke `dcu-profiler-report` when benchmark evidence is not enough to choose
````
````diff
@@ -234,9 +265,20 @@ that threshold as part of the acceptance contract.
   layernorm/rmsnorm/fused-norm patterns, relevant ROCm/DCU upstream evidence,
   and any portable reduction/vectorization ideas from the bundled corpus.
 - A `dcu-profiler-report` digest for a representative target shape.
+- If the second correctness-passing optimization attempt improves less than 5%
+  over the relevant parent or baseline, run a deep `dcu-profiler-report`
+  analysis before the next optimization edit. That analysis must include
+  `hipprof` PMC all mode, SQTT JSON when available, `dccobjdump`
+  disassembly, code-object resource usage, and explicit LDS/register/occupancy
+  evidence or a recorded reason any item was unavailable.
 - The next edit after that gate must name exactly one concrete LightOp kernel,
   binding, dispatcher, config, or benchmark change and cite the knowledge and
   profiler evidence that motivated it.
+- Do not claim that an optimization is effective from intuition, source
+  inspection, or expected hardware behavior. Promotion to the optimization
+  ledger requires measured correctness-passing benchmark data, baseline or
+  parent comparison, and, when a profiling gate applies, profiler/resource/ISA
+  evidence.
 - If the target remains unmet after the required tuning lineages, summarize the
   best candidate, failed lineages, profiler bottleneck class, unsupported
   regimes, and the most likely next engineering investment. Do not present the
````
````diff
@@ -300,6 +342,11 @@ baseline and threshold. If no benchmark exists, add a small benchmark that uses
 warmup, fixed shapes, fixed seeds, and explicit `torch.cuda.synchronize()`
 around timed regions.

+Before every benchmark, run the performance device gate from this skill:
+capture `hy-smi` or `rocm-smi`, choose a low-utilization/low-VRAM card, and run
+the benchmark with `HIP_VISIBLE_DEVICES=<idle-card>`. Reuse the same selected
+card for baseline and candidate measurements.
+
 Do not claim success from a passing build alone. A LightOp operator change is
 complete only after install, import smoke, targeted correctness, benchmark
 comparison, and, when the result is near the threshold or surprising, profiler
````
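The benchmark shape this hunk relies on (warmup, fixed seed, explicit synchronization around timed regions) can be sketched in a few lines. This is an illustration only, not the LightOp benchmark: `time_op` and its parameters are hypothetical names, and the `sync` callable defaults to a no-op so the sketch runs without a GPU. On a real DCU run it would be `torch.cuda.synchronize`, as named in the skill text.

```python
import random
import statistics
import time
from typing import Callable, Dict


def time_op(op: Callable[[], object],
            sync: Callable[[], None] = lambda: None,
            warmup: int = 10,
            repeats: int = 50,
            seed: int = 0) -> Dict[str, float]:
    """Time `op` in milliseconds with warmup and synchronization around each call.

    `sync` stands in for torch.cuda.synchronize(); the no-op default exists only
    so this sketch runs without a GPU.
    """
    random.seed(seed)  # fixed seed so randomly generated inputs are reproducible
    for _ in range(warmup):
        op()
    sync()  # drain warmup work before timing starts

    samples = []
    for _ in range(repeats):
        sync()
        start = time.perf_counter()
        op()
        sync()  # wait for the device before reading the clock
        samples.append((time.perf_counter() - start) * 1e3)

    # Nine decile cut points; the last one is the 90th percentile.
    deciles = statistics.quantiles(samples, n=10, method="inclusive")
    return {
        "mean_ms": statistics.fmean(samples),
        "p50_ms": statistics.median(samples),
        "p90_ms": deciles[-1],
        "stdev_ms": statistics.stdev(samples),
    }
```

Reporting p50 and p90 next to the mean makes a noisy card visible in the numbers themselves, which complements the card-state record required by the device gate.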
````diff
@@ -320,6 +367,9 @@ next source of truth. These are heuristics, not user-facing gates:
 - Two consecutive correctness-passing candidates miss the target, in which case
   pair this profile with a `lightop-kernel-knowledge` research pass before the
   next kernel or dispatch edit.
+- The second correctness-passing optimization attempt improves less than 5%
+  over its parent or baseline. This is a mandatory deep-analysis gate, not a
+  heuristic.
 - A candidate is much faster than expected and needs explanation.
 - A reviewer asks for profiling evidence.
````
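The mandatory gate added in this hunk is mechanical enough to check in code. The sketch below assumes a hypothetical `Attempt` ledger record and assumes that "improves" means latency reduction relative to the parent or baseline. The skill text does not define the metric, so treat that as an interpretation, not the author's definition.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Attempt:
    """One optimization attempt from the ledger (hypothetical record)."""
    name: str
    correct: bool        # did the targeted correctness test pass?
    parent_ms: float     # latency of the parent or baseline
    candidate_ms: float  # latency of this attempt


def improvement_pct(attempt: Attempt) -> float:
    """Latency reduction relative to the parent, in percent (negative = regression)."""
    return (attempt.parent_ms - attempt.candidate_ms) / attempt.parent_ms * 100.0


def deep_analysis_required(attempts: Sequence[Attempt],
                           min_gain_pct: float = 5.0) -> bool:
    """True when the second correctness-passing attempt gains less than `min_gain_pct`."""
    passing = [a for a in attempts if a.correct]
    if len(passing) < 2:
        return False  # the gate only applies from the second passing attempt
    return improvement_pct(passing[1]) < min_gain_pct
```

Attempts that fail correctness are skipped when counting, so a broken candidate between two passing ones does not shift which attempt is treated as the second.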
````diff
@@ -327,6 +377,12 @@ Persist profile artifacts under `.humanize/lightop-agent/profile-artifacts/`
 or the user-specified evidence directory. Each digest must end with exactly one
 concrete next kernel edit or a clear reason profiling is not actionable.

+When the <5% second-optimization gate fires, the digest must include
+`hipprof` PMC all, SQTT JSON when supported, `dccobjdump` disassembly,
+code-object resource usage, and LDS/register/occupancy evidence. If any tool is
+missing, record the exact command attempted and do not replace the missing
+evidence with a guess.
+
 ## Plan Requirements

 Write `.humanize/lightop-agent/refined-plan.md` using the Humanize gen-plan
````
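The digest requirement added in this hunk can be backed by a simple completeness check before the digest is written. The sketch below is one possible shape. The artifact names come from the layout in `humanize/skills/ncu-report/SKILL.md` in this same commit, while `DEEP_ANALYSIS_ARTIFACTS` and `missing_deep_artifacts` are hypothetical names introduced for illustration.

```python
from pathlib import Path
from typing import List

# Artifact names follow the layout in humanize/skills/ncu-report/SKILL.md.
DEEP_ANALYSIS_ARTIFACTS = (
    "hipprof-pmc-all",     # directory of hipprof PMC captures
    "sqtt-json",           # directory of SQTT JSON, when SQTT is supported
    "amdgpu-isa.txt",      # dccobjdump disassembly
    "resource-usage.txt",  # VGPR/SGPR/LDS/occupancy/code-object evidence
)


def missing_deep_artifacts(artifact_dir: Path) -> List[str]:
    """List required artifacts that are absent or empty under `artifact_dir`."""
    missing = []
    for name in DEEP_ANALYSIS_ARTIFACTS:
        path = artifact_dir / name
        if path.is_dir():
            empty = not any(path.iterdir())
        elif path.is_file():
            empty = path.stat().st_size == 0
        else:
            empty = True
        if empty:
            missing.append(name)
    return missing
```

An entry that holds only the recorded command and its error output still counts as present here. That matches the rule that a missing tool is documented rather than guessed around, but the digest still has to state which evidence is unavailable.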
````diff
@@ -339,6 +395,9 @@ schema. Include acceptance criteria for:
   the execution environment, visible device selection, install command, smoke
   command, correctness command, benchmark command, profiler command, and pass
   threshold.
+- Device gate: `hy-smi` or `rocm-smi` command, idle-card selection criteria,
+  required `HIP_VISIBLE_DEVICES=<idle-card>` prefix for benchmark/profile, and
+  where the card-state output is stored.
 - Correctness coverage for `W`, edge cases, dtype/layout/mode boundaries, and
   baseline/reference parity.
 - Build command, ROCm/DTK/PyTorch versions, `PYTORCH_ROCM_ARCH`, and device
````
````diff
@@ -353,6 +412,10 @@ schema. Include acceptance criteria for:
   consecutive correctness-passing misses trigger both `lightop-kernel-knowledge`
   research and `dcu-profiler-report` evidence before the next edit, and unmet
   targets cannot be reported as complete.
+- Low-gain discipline: if the second correctness-passing optimization improves
+  less than 5%, the next edit is blocked on deep profiling evidence:
+  `hipprof` PMC all, SQTT JSON if available, `dccobjdump`, code-object
+  resource usage, and LDS/register/occupancy analysis.
 - Tuning decisions and dispatcher/config updates when `W` has multiple
   regimes.
 - Final correctness matrix, benchmark matrix, fallback paths, unsupported
````
humanize/skills/ncu-report/SKILL.md
````diff
@@ -40,6 +40,9 @@ Invoke `dcu-profiler-report` when any of these hold:
 - A LightOp baseline benchmark has passed and no baseline profile digest exists.
 - A correct candidate is within +/-2% of baseline or the prior best.
+- The second correctness-passing optimization attempt improves less than 5%
+  over its parent or baseline. This requires a deep analysis pass before the
+  next optimization edit.
 - A correct candidate regresses on one or more important shapes.
 - A candidate is much faster than expected and needs explanation.
 - The next edit is unclear after benchmark results.
````
````diff
@@ -51,19 +54,36 @@ Invoke `dcu-profiler-report` when any of these hold:
 Do not profile while correctness is failing unless the failure is itself a
 profiler collection problem. Fix correctness first.

+## Device Selection Gate
+
+Before any benchmark or profiling capture, record current card state in the
+same host/container environment that will run the workload.
+
+- Run `hy-smi` or `rocm-smi`.
+- Pick a card with low HCU utilization and low VRAM occupancy.
+- Run benchmark and profiling commands with `HIP_VISIBLE_DEVICES=<idle-card>`.
+- Keep the same selected card for paired baseline/candidate captures.
+- Store the status output and the selected card in the artifact directory and
+  digest. If the tools are unavailable, record the failed command and treat the
+  resulting performance data as non-final unless the user accepts that caveat.
+
 ## Required Artifacts

 Store artifacts under the LightOp loop state or user-specified evidence path:

 ```text
 .humanize/lightop-agent/profile-artifacts/<version>/
+  device-status.txt          # hy-smi or rocm-smi before benchmark/profile
   benchmark.log
   hipprof.txt
+  hipprof-pmc-all/           # mandatory for deep-analysis gates
+  sqtt-json/                 # mandatory when SQTT is available
   rocprof.csv                # when rocprof/rocprofv3 is used
   rocprof-stats.csv          # when available
   rocprof-compute/           # when available
   code-object-metadata.txt   # when extracted
   amdgpu-isa.txt             # when extracted
+  resource-usage.txt         # VGPR/SGPR/LDS/occupancy/code-object evidence
   digest.md
 ```
````
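The "record the failed command" rule in the Device Selection Gate is easy to get wrong when the status capture goes through a shell pipeline. A pipeline's exit status is that of its last command unless `set -o pipefail` is active, so in `hy-smi 2>&1 | tee file || rocm-smi ...` the `||` fallback follows `tee`, not `hy-smi`. The sketch below does the same capture in Python with an explicit fallback. `capture_device_status` is a hypothetical helper, and only the two tool names come from the skill text.

```python
import subprocess
from pathlib import Path
from typing import Optional, Sequence


def capture_device_status(
        out_file: Path,
        commands: Sequence[Sequence[str]] = (("hy-smi",), ("rocm-smi",)),
) -> Optional[str]:
    """Run the first status command that succeeds and save all attempts.

    Returns the name of the command that worked, or None if every one failed.
    Failed attempts are written to `out_file` too, so a missing tool is
    recorded instead of being silently skipped.
    """
    log = []
    winner = None
    for cmd in commands:
        try:
            proc = subprocess.run(list(cmd), capture_output=True, text=True, timeout=60)
        except (OSError, subprocess.TimeoutExpired) as exc:
            log.append(f"$ {' '.join(cmd)}\n[failed: {exc!r}]\n")
            continue
        log.append(f"$ {' '.join(cmd)}\n{proc.stdout}{proc.stderr}")
        if proc.returncode == 0:
            winner = cmd[0]
            break
    out_file.parent.mkdir(parents=True, exist_ok=True)
    out_file.write_text("\n".join(log))
    return winner
```

A `None` return is the case the gate describes as non-final data: the artifact directory still gets a `device-status.txt` that shows exactly which commands were attempted.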
````diff
@@ -76,18 +96,23 @@ the digest.
    exposes the regression, plateau, launch overhead, or suspected bottleneck.
 2. Build a focused benchmark harness with stable warmup, fixed dtype, fixed
    shape, fixed seed, and explicit synchronization around timed regions.
-3. Capture normal benchmark output before profiling, so profiler overhead does
+3. Run the device selection gate and pin the selected idle card with
+   `HIP_VISIBLE_DEVICES`.
+4. Capture normal benchmark output before profiling, so profiler overhead does
    not become the performance claim.
-4. Run `hipprof` for first-pass API/kernel/memcpy timing.
-5. If `hipprof` shows one or a few dominant kernels but does not explain why,
+5. Run `hipprof` for first-pass API/kernel/memcpy timing.
+6. For deep-analysis gates, collect `hipprof` PMC all, SQTT JSON when
+   supported, `dccobjdump` disassembly, code-object resource usage, and
+   LDS/register/occupancy evidence before choosing the next edit.
+7. If `hipprof` shows one or a few dominant kernels but does not explain why,
    collect deeper counters with the installed ROCm/DTK profiler.
-6. If the issue looks codegen-sensitive, inspect AMDGPU ISA or code-object
+8. If the issue looks codegen-sensitive, inspect AMDGPU ISA or code-object
    metadata before choosing an edit.
-7. Compare candidate against baseline or parent, not only absolute metrics.
-8. Diagnose using the metric groups in [metrics.md](references/metrics.md).
-9. Write `digest.md` using the template below. The final section must contain
+9. Compare candidate against baseline or parent, not only absolute metrics.
+10. Diagnose using the metric groups in [metrics.md](references/metrics.md).
+11. Write `digest.md` using the template below. The final section must contain
    exactly one next edit.
-10. Update the LightOp loop ledger with the digest path, bottleneck class, and
+12. Update the LightOp loop ledger with the digest path, bottleneck class, and
     selected next edit.

 ## Common Commands
````
````diff
@@ -98,18 +123,52 @@ Minimal first-pass capture from the LightOp root:
 ```bash
 mkdir -p .humanize/lightop-agent/profile-artifacts/v000_baseline
-python test/test_<op>.py 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/benchmark.log
-hipprof python test/test_<op>.py 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/hipprof.txt
+hy-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/device-status.txt || \
+  rocm-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/device-status.txt
+HIP_VISIBLE_DEVICES=<idle-card> python test/test_<op>.py 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/benchmark.log
+HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/test_<op>.py 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/hipprof.txt
 ```

 For a benchmark script:

 ```bash
 mkdir -p .humanize/lightop-agent/profile-artifacts/v001_candidate
-hipprof python test/<family>/benchmark_<op>.py 2>&1 \
+hy-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v001_candidate/device-status.txt || \
+  rocm-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v001_candidate/device-status.txt
+HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/<family>/benchmark_<op>.py 2>&1 \
   | tee .humanize/lightop-agent/profile-artifacts/v001_candidate/hipprof.txt
 ```

+Deep-analysis capture, required after a second correctness-passing
+optimization improves less than 5%:
+
+```bash
+mkdir -p .humanize/lightop-agent/profile-artifacts/v002_deep/{hipprof-pmc-all,sqtt-json}
+hy-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v002_deep/device-status.txt || \
+  rocm-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v002_deep/device-status.txt
+HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc --pmc-type 3 \
+  -o .humanize/lightop-agent/profile-artifacts/v002_deep/hipprof-pmc-all/pmc \
+  python test/<family>/benchmark_<op>.py
+HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc-read --pmc-type 3 \
+  -o .humanize/lightop-agent/profile-artifacts/v002_deep/hipprof-pmc-all/pmc-read \
+  python test/<family>/benchmark_<op>.py
+HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc-write --pmc-type 3 \
+  -o .humanize/lightop-agent/profile-artifacts/v002_deep/hipprof-pmc-all/pmc-write \
+  python test/<family>/benchmark_<op>.py
+HIP_VISIBLE_DEVICES=<idle-card> hipprof --sqtt --sqtt-type 1 --output-type 0 \
+  -d .humanize/lightop-agent/profile-artifacts/v002_deep/sqtt-json/ \
+  python test/<family>/benchmark_<op>.py
+dccobjdump --inputs=<binary-or-so> --show-sass --show-instruction-encoding \
+  --separate-functions > .humanize/lightop-agent/profile-artifacts/v002_deep/amdgpu-isa.txt
+hipprof --codeobj-analyze <binary-or-so> \
+  > .humanize/lightop-agent/profile-artifacts/v002_deep/resource-usage.txt
+```
+
+Adjust `<binary-or-so>` to the compiled extension, code object, or extracted
+kernel binary produced by the actual build. If a command is unsupported by the
+installed DTK, keep the command output or error in the artifact directory and
+state the missing evidence in the digest.
+
 Optional helper:

 ```bash
````
@@ -131,6 +190,8 @@ full list and interpretation rules:
...
@@ -131,6 +190,8 @@ full list and interpretation rules:
-
Runtime identity: LightOp commit, operator, public API, shape, dtype, gfx
-
Runtime identity: LightOp commit, operator, public API, shape, dtype, gfx
arch, DTK/ROCm/PyTorch version, build command.
arch, DTK/ROCm/PyTorch version, build command.
-
Device selection:
`hy-smi`
or
`rocm-smi`
output, selected
`HIP_VISIBLE_DEVICES`
, HCU utilization, and VRAM occupancy.
-
Benchmark timing: warmup, repeats, p50/p90/mean, synchronization points,
-
Benchmark timing: warmup, repeats, p50/p90/mean, synchronization points,
variance, baseline and candidate deltas.
variance, baseline and candidate deltas.
-
`hipprof`
timing: total HIP API time, kernel time, memcpy/memset time,
-
`hipprof`
timing: total HIP API time, kernel time, memcpy/memset time,
...
@@ -149,6 +210,11 @@ full list and interpretation rules:
...
@@ -149,6 +210,11 @@ full list and interpretation rules:
-
ISA/code object: hot instruction window, vector width, MFMA selection,
-
ISA/code object: hot instruction window, vector width, MFMA selection,
excessive scalarization, scratch loads/stores.
excessive scalarization, scratch loads/stores.
For the mandatory <5% second-optimization gate, the digest must include all of
these sections:
`hipprof`
PMC all, SQTT JSON or the unavailable-tool reason,
`dccobjdump`
disassembly, code-object resource usage, and explicit
LDS/register/occupancy interpretation.
Validate exact tool availability on the target machine:
Validate exact tool availability on the target machine:
```
bash
```
bash
````diff
@@ -168,6 +234,9 @@ Environment
 - LightOp root:
 - Repo commit:
 - GPU / gfx:
+- Device status command:
+- Selected HIP_VISIBLE_DEVICES:
+- HCU utilization / VRAM before run:
 - DTK / ROCm / PyTorch:
 - Build command:
 - Benchmark command:
````
@@ -190,6 +259,13 @@ Evidence
...
@@ -190,6 +259,13 @@ Evidence
Profiler Hotspots
Profiler Hotspots
-
<kernel
/
API
/
copy
/
config
branch
>
:
<measured
signal
>
->
<meaning>
-
<kernel
/
API
/
copy
/
config
branch
>
:
<measured
signal
>
->
<meaning>
Deep Analysis Artifacts
-
hipprof PMC all:
-
SQTT JSON:
-
dccobjdump:
-
code-object resource usage:
-
LDS/register/occupancy:
Counter / ISA Analysis
Counter / ISA Analysis
-
Counter source:
-
Counter source:
-
Hot instruction window:
-
Hot instruction window:
...
...