Commit f18f386a authored by whlwhlwhl's avatar whlwhlwhl
Browse files

add prof 约束

parent 91f343be
...@@ -82,6 +82,9 @@ Record these before the first build: ...@@ -82,6 +82,9 @@ Record these before the first build:
- Container name or image tag, or `direct-host` when not using Docker. - Container name or image tag, or `direct-host` when not using Docker.
- LightOp path from the command's point of view, not only the host path. - LightOp path from the command's point of view, not only the host path.
- Visible device env such as `HIP_VISIBLE_DEVICES` or `HSA_VISIBLE_DEVICES`. - Visible device env such as `HIP_VISIBLE_DEVICES` or `HSA_VISIBLE_DEVICES`.
- DCU status command and selected card before performance runs: `hy-smi` or
`rocm-smi`, the observed HCU utilization, VRAM use, and the
`HIP_VISIBLE_DEVICES=<idle-card>` value used for benchmark/profile commands.
- `PYTORCH_ROCM_ARCH`, DTK/ROCm version, PyTorch version, HIP version, device - `PYTORCH_ROCM_ARCH`, DTK/ROCm version, PyTorch version, HIP version, device
name, and `gcnArchName`. name, and `gcnArchName`.
- Exact build/install, import-smoke, correctness, benchmark, and profiler - Exact build/install, import-smoke, correctness, benchmark, and profiler
...@@ -174,6 +177,33 @@ PY ...@@ -174,6 +177,33 @@ PY
hipcc --version hipcc --version
``` ```
## Performance Device Gate
Before any benchmark or profiling command, check card status in the target
execution environment and pin the run to an idle card.
- Run `hy-smi` or `rocm-smi` immediately before the performance command.
- Choose the card with low HCU utilization and low VRAM occupancy. If no card
is idle enough for stable results, delay the performance run or record that
the result is noisy and not acceptable as final evidence.
- Prefix benchmark and profiler commands with `HIP_VISIBLE_DEVICES=<card>`.
Keep the same card for the paired baseline/candidate comparison unless the
comparison is deliberately measuring cross-device behavior.
- Record the chosen card, HCU utilization, VRAM use, and exact command in the
attempt ledger or profile artifact directory.
Example:
```bash
hy-smi || rocm-smi
HIP_VISIBLE_DEVICES=<idle-card> python test/<family>/benchmark_<op>.py
HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/<family>/benchmark_<op>.py
```
Do not report a performance number as actionable unless the device-selection
gate was recorded or the user explicitly accepts the missing card-state
evidence.
## Workflow ## Workflow
### Stage 1: Inspect And Plan ### Stage 1: Inspect And Plan
...@@ -193,7 +223,8 @@ hipcc --version ...@@ -193,7 +223,8 @@ hipcc --version
1. Make the smallest LightOp source change that can satisfy the current 1. Make the smallest LightOp source change that can satisfy the current
task-acceptance pair. task-acceptance pair.
2. Build LightOp with the target arch. 2. Build LightOp with the target arch.
3. Run the targeted correctness test and then the benchmark for `W`. 3. Run the targeted correctness test. Before the benchmark for `W`, execute
the performance device gate and pin the run with `HIP_VISIBLE_DEVICES`.
4. Record every candidate result: correctness failure, build failure, 4. Record every candidate result: correctness failure, build failure,
regression, plateau, and improvement. regression, plateau, and improvement.
5. Invoke `dcu-profiler-report` when benchmark evidence is not enough to choose 5. Invoke `dcu-profiler-report` when benchmark evidence is not enough to choose
...@@ -234,9 +265,20 @@ that threshold as part of the acceptance contract. ...@@ -234,9 +265,20 @@ that threshold as part of the acceptance contract.
layernorm/rmsnorm/fused-norm patterns, relevant ROCm/DCU upstream evidence, layernorm/rmsnorm/fused-norm patterns, relevant ROCm/DCU upstream evidence,
and any portable reduction/vectorization ideas from the bundled corpus. and any portable reduction/vectorization ideas from the bundled corpus.
- A `dcu-profiler-report` digest for a representative target shape. - A `dcu-profiler-report` digest for a representative target shape.
- If the second correctness-passing optimization attempt improves less than 5%
over the relevant parent or baseline, run a deep `dcu-profiler-report`
analysis before the next optimization edit. That analysis must include
`hipprof` PMC all mode, SQTT JSON when available, `dccobjdump`
disassembly, code-object resource usage, and explicit LDS/register/occupancy
evidence or a recorded reason any item was unavailable.
- The next edit after that gate must name exactly one concrete LightOp kernel, - The next edit after that gate must name exactly one concrete LightOp kernel,
binding, dispatcher, config, or benchmark change and cite the knowledge and binding, dispatcher, config, or benchmark change and cite the knowledge and
profiler evidence that motivated it. profiler evidence that motivated it.
- Do not claim that an optimization is effective from intuition, source
inspection, or expected hardware behavior. Promotion to the optimization
ledger requires measured correctness-passing benchmark data, baseline or
parent comparison, and, when a profiling gate applies, profiler/resource/ISA
evidence.
- If the target remains unmet after the required tuning lineages, summarize the - If the target remains unmet after the required tuning lineages, summarize the
best candidate, failed lineages, profiler bottleneck class, unsupported best candidate, failed lineages, profiler bottleneck class, unsupported
regimes, and the most likely next engineering investment. Do not present the regimes, and the most likely next engineering investment. Do not present the
...@@ -300,6 +342,11 @@ baseline and threshold. If no benchmark exists, add a small benchmark that uses ...@@ -300,6 +342,11 @@ baseline and threshold. If no benchmark exists, add a small benchmark that uses
warmup, fixed shapes, fixed seeds, and explicit `torch.cuda.synchronize()` warmup, fixed shapes, fixed seeds, and explicit `torch.cuda.synchronize()`
around timed regions. around timed regions.
Before every benchmark, run the performance device gate from this skill:
capture `hy-smi` or `rocm-smi`, choose a low-utilization/low-VRAM card, and run
the benchmark with `HIP_VISIBLE_DEVICES=<idle-card>`. Reuse the same selected
card for baseline and candidate measurements.
Do not claim success from a passing build alone. A LightOp operator change is Do not claim success from a passing build alone. A LightOp operator change is
complete only after install, import smoke, targeted correctness, benchmark complete only after install, import smoke, targeted correctness, benchmark
comparison, and, when the result is near the threshold or surprising, profiler comparison, and, when the result is near the threshold or surprising, profiler
...@@ -320,6 +367,9 @@ next source of truth. These are heuristics, not user-facing gates: ...@@ -320,6 +367,9 @@ next source of truth. These are heuristics, not user-facing gates:
- Two consecutive correctness-passing candidates miss the target, in which case - Two consecutive correctness-passing candidates miss the target, in which case
pair this profile with a `lightop-kernel-knowledge` research pass before the pair this profile with a `lightop-kernel-knowledge` research pass before the
next kernel or dispatch edit. next kernel or dispatch edit.
- The second correctness-passing optimization attempt improves less than 5%
over its parent or baseline. This is a mandatory deep-analysis gate, not a
heuristic.
- A candidate is much faster than expected and needs explanation. - A candidate is much faster than expected and needs explanation.
- A reviewer asks for profiling evidence. - A reviewer asks for profiling evidence.
...@@ -327,6 +377,12 @@ Persist profile artifacts under `.humanize/lightop-agent/profile-artifacts/` ...@@ -327,6 +377,12 @@ Persist profile artifacts under `.humanize/lightop-agent/profile-artifacts/`
or the user-specified evidence directory. Each digest must end with exactly one or the user-specified evidence directory. Each digest must end with exactly one
concrete next kernel edit or a clear reason profiling is not actionable. concrete next kernel edit or a clear reason profiling is not actionable.
When the <5% second-optimization gate fires, the digest must include
`hipprof` PMC all, SQTT JSON when supported, `dccobjdump` disassembly,
code-object resource usage, and LDS/register/occupancy evidence. If any tool is
missing, record the exact command attempted and do not replace the missing
evidence with a guess.
## Plan Requirements ## Plan Requirements
Write `.humanize/lightop-agent/refined-plan.md` using the Humanize gen-plan Write `.humanize/lightop-agent/refined-plan.md` using the Humanize gen-plan
...@@ -339,6 +395,9 @@ schema. Include acceptance criteria for: ...@@ -339,6 +395,9 @@ schema. Include acceptance criteria for:
the execution environment, visible device selection, install command, smoke the execution environment, visible device selection, install command, smoke
command, correctness command, benchmark command, profiler command, and pass command, correctness command, benchmark command, profiler command, and pass
threshold. threshold.
- Device gate: `hy-smi` or `rocm-smi` command, idle-card selection criteria,
required `HIP_VISIBLE_DEVICES=<idle-card>` prefix for benchmark/profile, and
where the card-state output is stored.
- Correctness coverage for `W`, edge cases, dtype/layout/mode boundaries, and - Correctness coverage for `W`, edge cases, dtype/layout/mode boundaries, and
baseline/reference parity. baseline/reference parity.
- Build command, ROCm/DTK/PyTorch versions, `PYTORCH_ROCM_ARCH`, and device - Build command, ROCm/DTK/PyTorch versions, `PYTORCH_ROCM_ARCH`, and device
...@@ -353,6 +412,10 @@ schema. Include acceptance criteria for: ...@@ -353,6 +412,10 @@ schema. Include acceptance criteria for:
consecutive correctness-passing misses trigger both `lightop-kernel-knowledge` consecutive correctness-passing misses trigger both `lightop-kernel-knowledge`
research and `dcu-profiler-report` evidence before the next edit, and unmet research and `dcu-profiler-report` evidence before the next edit, and unmet
targets cannot be reported as complete. targets cannot be reported as complete.
- Low-gain discipline: if the second correctness-passing optimization improves
less than 5%, the next edit is blocked on deep profiling evidence:
`hipprof` PMC all, SQTT JSON if available, `dccobjdump`, code-object
resource usage, and LDS/register/occupancy analysis.
- Tuning decisions and dispatcher/config updates when `W` has multiple - Tuning decisions and dispatcher/config updates when `W` has multiple
regimes. regimes.
- Final correctness matrix, benchmark matrix, fallback paths, unsupported - Final correctness matrix, benchmark matrix, fallback paths, unsupported
......
...@@ -40,6 +40,9 @@ Invoke `dcu-profiler-report` when any of these hold: ...@@ -40,6 +40,9 @@ Invoke `dcu-profiler-report` when any of these hold:
- A LightOp baseline benchmark has passed and no baseline profile digest exists. - A LightOp baseline benchmark has passed and no baseline profile digest exists.
- A correct candidate is within +/-2% of baseline or the prior best. - A correct candidate is within +/-2% of baseline or the prior best.
- The second correctness-passing optimization attempt improves less than 5%
over its parent or baseline. This requires a deep analysis pass before the
next optimization edit.
- A correct candidate regresses on one or more important shapes. - A correct candidate regresses on one or more important shapes.
- A candidate is much faster than expected and needs explanation. - A candidate is much faster than expected and needs explanation.
- The next edit is unclear after benchmark results. - The next edit is unclear after benchmark results.
...@@ -51,19 +54,36 @@ Invoke `dcu-profiler-report` when any of these hold: ...@@ -51,19 +54,36 @@ Invoke `dcu-profiler-report` when any of these hold:
Do not profile while correctness is failing unless the failure is itself a Do not profile while correctness is failing unless the failure is itself a
profiler collection problem. Fix correctness first. profiler collection problem. Fix correctness first.
## Device Selection Gate
Before any benchmark or profiling capture, record current card state in the
same host/container environment that will run the workload.
- Run `hy-smi` or `rocm-smi`.
- Pick a card with low HCU utilization and low VRAM occupancy.
- Run benchmark and profiling commands with `HIP_VISIBLE_DEVICES=<idle-card>`.
- Keep the same selected card for paired baseline/candidate captures.
- Store the status output and the selected card in the artifact directory and
digest. If the tools are unavailable, record the failed command and treat the
resulting performance data as non-final unless the user accepts that caveat.
## Required Artifacts ## Required Artifacts
Store artifacts under the LightOp loop state or user-specified evidence path: Store artifacts under the LightOp loop state or user-specified evidence path:
```text ```text
.humanize/lightop-agent/profile-artifacts/<version>/ .humanize/lightop-agent/profile-artifacts/<version>/
device-status.txt # hy-smi or rocm-smi before benchmark/profile
benchmark.log benchmark.log
hipprof.txt hipprof.txt
hipprof-pmc-all/ # mandatory for deep-analysis gates
sqtt-json/ # mandatory when SQTT is available
rocprof.csv # when rocprof/rocprofv3 is used rocprof.csv # when rocprof/rocprofv3 is used
rocprof-stats.csv # when available rocprof-stats.csv # when available
rocprof-compute/ # when available rocprof-compute/ # when available
code-object-metadata.txt # when extracted code-object-metadata.txt # when extracted
amdgpu-isa.txt # when extracted amdgpu-isa.txt # when extracted
resource-usage.txt # VGPR/SGPR/LDS/occupancy/code-object evidence
digest.md digest.md
``` ```
...@@ -76,18 +96,23 @@ the digest. ...@@ -76,18 +96,23 @@ the digest.
exposes the regression, plateau, launch overhead, or suspected bottleneck. exposes the regression, plateau, launch overhead, or suspected bottleneck.
2. Build a focused benchmark harness with stable warmup, fixed dtype, fixed 2. Build a focused benchmark harness with stable warmup, fixed dtype, fixed
shape, fixed seed, and explicit synchronization around timed regions. shape, fixed seed, and explicit synchronization around timed regions.
3. Capture normal benchmark output before profiling, so profiler overhead does 3. Run the device selection gate and pin the selected idle card with
`HIP_VISIBLE_DEVICES`.
4. Capture normal benchmark output before profiling, so profiler overhead does
not become the performance claim. not become the performance claim.
4. Run `hipprof` for first-pass API/kernel/memcpy timing. 5. Run `hipprof` for first-pass API/kernel/memcpy timing.
5. If `hipprof` shows one or a few dominant kernels but does not explain why, 6. For deep-analysis gates, collect `hipprof` PMC all, SQTT JSON when
supported, `dccobjdump` disassembly, code-object resource usage, and
LDS/register/occupancy evidence before choosing the next edit.
7. If `hipprof` shows one or a few dominant kernels but does not explain why,
collect deeper counters with the installed ROCm/DTK profiler. collect deeper counters with the installed ROCm/DTK profiler.
6. If the issue looks codegen-sensitive, inspect AMDGPU ISA or code-object 8. If the issue looks codegen-sensitive, inspect AMDGPU ISA or code-object
metadata before choosing an edit. metadata before choosing an edit.
7. Compare candidate against baseline or parent, not only absolute metrics. 9. Compare candidate against baseline or parent, not only absolute metrics.
8. Diagnose using the metric groups in [metrics.md](references/metrics.md). 10. Diagnose using the metric groups in [metrics.md](references/metrics.md).
9. Write `digest.md` using the template below. The final section must contain 11. Write `digest.md` using the template below. The final section must contain
exactly one next edit. exactly one next edit.
10. Update the LightOp loop ledger with the digest path, bottleneck class, and 12. Update the LightOp loop ledger with the digest path, bottleneck class, and
selected next edit. selected next edit.
## Common Commands ## Common Commands
...@@ -98,18 +123,52 @@ Minimal first-pass capture from the LightOp root: ...@@ -98,18 +123,52 @@ Minimal first-pass capture from the LightOp root:
```bash ```bash
mkdir -p .humanize/lightop-agent/profile-artifacts/v000_baseline mkdir -p .humanize/lightop-agent/profile-artifacts/v000_baseline
python test/test_<op>.py 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/benchmark.log hy-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/device-status.txt || \
hipprof python test/test_<op>.py 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/hipprof.txt rocm-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/device-status.txt
HIP_VISIBLE_DEVICES=<idle-card> python test/test_<op>.py 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/benchmark.log
HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/test_<op>.py 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/hipprof.txt
``` ```
For a benchmark script: For a benchmark script:
```bash ```bash
mkdir -p .humanize/lightop-agent/profile-artifacts/v001_candidate mkdir -p .humanize/lightop-agent/profile-artifacts/v001_candidate
hipprof python test/<family>/benchmark_<op>.py 2>&1 \ hy-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v001_candidate/device-status.txt || \
rocm-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v001_candidate/device-status.txt
HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/<family>/benchmark_<op>.py 2>&1 \
| tee .humanize/lightop-agent/profile-artifacts/v001_candidate/hipprof.txt | tee .humanize/lightop-agent/profile-artifacts/v001_candidate/hipprof.txt
``` ```
Deep-analysis capture, required after a second correctness-passing
optimization improves less than 5%:
```bash
mkdir -p .humanize/lightop-agent/profile-artifacts/v002_deep/{hipprof-pmc-all,sqtt-json}
hy-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v002_deep/device-status.txt || \
rocm-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v002_deep/device-status.txt
HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc --pmc-type 3 \
-o .humanize/lightop-agent/profile-artifacts/v002_deep/hipprof-pmc-all/pmc \
python test/<family>/benchmark_<op>.py
HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc-read --pmc-type 3 \
-o .humanize/lightop-agent/profile-artifacts/v002_deep/hipprof-pmc-all/pmc-read \
python test/<family>/benchmark_<op>.py
HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc-write --pmc-type 3 \
-o .humanize/lightop-agent/profile-artifacts/v002_deep/hipprof-pmc-all/pmc-write \
python test/<family>/benchmark_<op>.py
HIP_VISIBLE_DEVICES=<idle-card> hipprof --sqtt --sqtt-type 1 --output-type 0 \
-d .humanize/lightop-agent/profile-artifacts/v002_deep/sqtt-json/ \
python test/<family>/benchmark_<op>.py
dccobjdump --inputs=<binary-or-so> --show-sass --show-instruction-encoding \
--separate-functions > .humanize/lightop-agent/profile-artifacts/v002_deep/amdgpu-isa.txt
hipprof --codeobj-analyze <binary-or-so> \
> .humanize/lightop-agent/profile-artifacts/v002_deep/resource-usage.txt
```
Adjust `<binary-or-so>` to the compiled extension, code object, or extracted
kernel binary produced by the actual build. If a command is unsupported by the
installed DTK, keep the command output or error in the artifact directory and
state the missing evidence in the digest.
Optional helper: Optional helper:
```bash ```bash
...@@ -131,6 +190,8 @@ full list and interpretation rules: ...@@ -131,6 +190,8 @@ full list and interpretation rules:
- Runtime identity: LightOp commit, operator, public API, shape, dtype, gfx - Runtime identity: LightOp commit, operator, public API, shape, dtype, gfx
arch, DTK/ROCm/PyTorch version, build command. arch, DTK/ROCm/PyTorch version, build command.
- Device selection: `hy-smi` or `rocm-smi` output, selected
`HIP_VISIBLE_DEVICES`, HCU utilization, and VRAM occupancy.
- Benchmark timing: warmup, repeats, p50/p90/mean, synchronization points, - Benchmark timing: warmup, repeats, p50/p90/mean, synchronization points,
variance, baseline and candidate deltas. variance, baseline and candidate deltas.
- `hipprof` timing: total HIP API time, kernel time, memcpy/memset time, - `hipprof` timing: total HIP API time, kernel time, memcpy/memset time,
...@@ -149,6 +210,11 @@ full list and interpretation rules: ...@@ -149,6 +210,11 @@ full list and interpretation rules:
- ISA/code object: hot instruction window, vector width, MFMA selection, - ISA/code object: hot instruction window, vector width, MFMA selection,
excessive scalarization, scratch loads/stores. excessive scalarization, scratch loads/stores.
For the mandatory <5% second-optimization gate, the digest must include all of
these sections: `hipprof` PMC all, SQTT JSON or the unavailable-tool reason,
`dccobjdump` disassembly, code-object resource usage, and explicit
LDS/register/occupancy interpretation.
Validate exact tool availability on the target machine: Validate exact tool availability on the target machine:
```bash ```bash
...@@ -168,6 +234,9 @@ Environment ...@@ -168,6 +234,9 @@ Environment
- LightOp root: - LightOp root:
- Repo commit: - Repo commit:
- GPU / gfx: - GPU / gfx:
- Device status command:
- Selected HIP_VISIBLE_DEVICES:
- HCU utilization / VRAM before run:
- DTK / ROCm / PyTorch: - DTK / ROCm / PyTorch:
- Build command: - Build command:
- Benchmark command: - Benchmark command:
...@@ -190,6 +259,13 @@ Evidence ...@@ -190,6 +259,13 @@ Evidence
Profiler Hotspots Profiler Hotspots
- <kernel/API/copy/config branch>: <measured signal> -> <meaning> - <kernel/API/copy/config branch>: <measured signal> -> <meaning>
Deep Analysis Artifacts
- hipprof PMC all:
- SQTT JSON:
- dccobjdump:
- code-object resource usage:
- LDS/register/occupancy:
Counter / ISA Analysis Counter / ISA Analysis
- Counter source: - Counter source:
- Hot instruction window: - Hot instruction window:
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment