docs(profiler): add replay_optimize sweep README (#8195)

Signed-off-by: PeaBrane <yanrpei@gmail.com>

Commit 9a07ca15 · authored Apr 14, 2026 by Yan Ru Pei · committed by GitHub Apr 14, 2026
2 changed files
--- a/components/src/dynamo/profiler/utils/replay_optimize/README.md
+++ b/components/src/dynamo/profiler/utils/replay_optimize/README.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES.
+All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+
+# Replay Optimize
+
+## Experiment Goal
+
+This experiment searches over disaggregated replay states to answer a concrete question:
+
+- given a fixed GPU budget
+- for a workload with real prefix overlap
+- and latency constraints that still permit meaningful throughput
+
+which `(prefill_tp, decode_tp, prefill_workers, decode_workers, overlap_score_weight)` combination
+produces the best offline replay result?
+
+This is a heuristic search over replay states, not an exact optimizer over all feasible
+configurations.
+
+## Prerequisites
+
+Run from the repository root.
+
+Use the project virtual environment:
+
+```bash
+.venv/bin/python --version
+```
+
+If the Python bindings are not importable yet, build them first:
+
+```bash
+.venv/bin/maturin develop --uv -m lib/bindings/python/Cargo.toml
+```
+
+This example uses replay optimization backed by AIC (`aiconfigurator`) by default:
+
+- AIC is used to enumerate dense TP candidates
+- AIC-backed engine timing is used for the replay candidate configs
+
+Install `aiconfigurator` into the project environment:
+
+```bash
+uv pip install --python .venv/bin/python aiconfigurator
+```
+
+If a regular install fails to load usable perf data, reinstall from a source checkout that has real
+systems data materialized:
+
+```bash
+uv pip install --python .venv/bin/python --force-reinstall /path/to/aiconfigurator
+```
+
+If replay optimization fails with AIC errors about missing perf databases or parse failures such as
+`KeyError: 'gemm_dtype'`, inspect the installed files under:
+
+```text
+.venv/lib/python*/site-packages/aiconfigurator/systems/data/...
+```
+
+If those files begin with `version https://git-lfs.github.com/spec/v1`, you have Git LFS pointer
+stubs instead of real perf tables. In that case, install `aiconfigurator` from a checkout or wheel
+whose `systems/` directory contains the materialized LFS payloads.
+
+When running directly from a source checkout, expose the in-repo Python packages:
+
+```bash
+export PYTHONPATH=lib/bindings/python/src:components/src
+```
+
+If the replay search uses multiple worker processes, prefer a real script file over a heredoc. This
+matters on macOS, where Python's default `spawn` start method makes `ProcessPoolExecutor` child
+workers re-import the main module, so that module needs a stable path.
+
+For KV-router replay logs, this filter keeps the run readable without hiding useful `info` output:
+
+```bash
+export DYN_LOG='info,dynamo_kv_router::scheduling::selector=warn'
+```
+
+## Experiment Setup
+
+This sweep uses:
+
+- model: `Qwen/Qwen3-32B`
+- backend: `vllm`
+- system: `h200_sxm`
+- router mode: `kv_router`
+- workload type: `SyntheticReplayWorkload`
+- GPU budget: `16`
+
+The GPU budget here is a simulated search constraint used by offline replay when it enumerates
+candidate TP and worker configurations. You do not need 16 real GPUs locally to run this search.
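
The budget constraint can be pictured with a short sketch. It assumes a candidate's footprint is `prefill_tp * prefill_workers + decode_tp * decode_workers` and that TP sizes are powers of two. Both are illustrative assumptions, and `ReplayState` and `enumerate_states` are hypothetical names, not part of `replay_optimize`. The real search is heuristic and does not visit every state; the exhaustive loop here only shows which states fit the budget.

```python
from itertools import product
from typing import NamedTuple


class ReplayState(NamedTuple):
    """Hypothetical container for one disaggregated topology."""

    prefill_tp: int
    decode_tp: int
    prefill_workers: int
    decode_workers: int

    @property
    def total_gpus_used(self) -> int:
        # Assumed footprint: each worker occupies `tp` simulated GPUs.
        return (
            self.prefill_tp * self.prefill_workers
            + self.decode_tp * self.decode_workers
        )


def enumerate_states(max_total_gpus: int, tp_sizes=(1, 2, 4, 8)) -> list[ReplayState]:
    """List every topology with at least one prefill and one decode worker."""
    states = []
    for prefill_tp, decode_tp in product(tp_sizes, repeat=2):
        # Reserve room for at least one decode worker.
        max_prefill_workers = (max_total_gpus - decode_tp) // prefill_tp
        for prefill_workers in range(1, max_prefill_workers + 1):
            remaining = max_total_gpus - prefill_tp * prefill_workers
            for decode_workers in range(1, remaining // decode_tp + 1):
                states.append(
                    ReplayState(prefill_tp, decode_tp, prefill_workers, decode_workers)
                )
    return states


states = enumerate_states(max_total_gpus=16)
```

No real GPU is touched anywhere in this enumeration, which is why the budget can exceed what the local machine has.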
+
+The synthetic workload is intentionally large enough to make worker allocation and router settings
+matter:
+
+- `isl=32768`
+- `osl=256`
+- `request_count=5000`
+- `replay_concurrency=200`
+- `shared_prefix_ratio=0.5`
+- `num_prefix_groups=50`
+
+The base engine args stay conservative:
+
+- `block_size=512`
+- `num_gpu_blocks=20000`
+- `enable_prefix_caching=True`
+- explicit `worker_type` for prefill vs decode
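
To get a feel for the scale of these settings, the following sketch works out block counts from the values listed above. It assumes `num_gpu_blocks` is a per-worker KV-cache block count and that `shared_prefix_ratio` is the fraction of each prompt's input tokens shared within a prefix group; neither reading is confirmed here, so treat the numbers as rough orientation only.

```python
BLOCK_SIZE = 512
NUM_GPU_BLOCKS = 20_000
ISL = 32_768
SHARED_PREFIX_RATIO = 0.5

# KV blocks needed to hold one full prompt.
blocks_per_prompt = ISL // BLOCK_SIZE

# Tokens and blocks in the (assumed) shared part of each prompt.
shared_prefix_tokens = int(ISL * SHARED_PREFIX_RATIO)
shared_prefix_blocks = shared_prefix_tokens // BLOCK_SIZE

# Total KV capacity of one worker, in tokens and in whole prompts.
kv_capacity_tokens = BLOCK_SIZE * NUM_GPU_BLOCKS
prompts_resident = NUM_GPU_BLOCKS // blocks_per_prompt

print(f"blocks per prompt:       {blocks_per_prompt}")
print(f"shared prefix blocks:    {shared_prefix_blocks}")
print(f"KV capacity (tokens):    {kv_capacity_tokens}")
print(f"full prompts per worker: {prompts_resident}")
```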
+
+This setup does not force scheduler-specific bottlenecks such as:
+
+- `enable_chunked_prefill`
+- a small `max_num_seqs`
+- a pinned `max_num_batched_tokens`
+
+Only add those when the experiment is specifically about scheduler limits.
+
+## Driver Script
+
+The canonical starting point is [example.py](example.py). Keeping it as a real module avoids
+carrying a large inline snippet in the README, and it also satisfies the macOS
+`ProcessPoolExecutor` requirement for a stable module path.
+
+Treat [example.py](example.py) as a starting point, not a frozen harness. Modify it as needed for
+your search:
+
+- change the workload shape
+- swap `SyntheticReplayWorkload` for `TraceReplayWorkload`
+- change constraints
+- change `overlap_score_weights`
+- print different columns from `result.evaluated_df` or `result.feasible_df`
+- persist the tables to CSV or parquet if you want downstream analysis
+
+If you need to understand which knobs are available, see [models.py](models.py), [search.py](search.py),
+and [evaluate.py](evaluate.py).
+
+The default path in [example.py](example.py) is the synthetic disaggregated sweep documented in
+this README. It also accepts `--trace-file` and `--arrival-speedup-ratio` so the same driver can be
+used for the Mooncake-style replay path below without rewriting the harness from scratch.
+
+## Expected Outputs
+
+The returned object is a `DenseReplayOptimizationResult` with:
+
+- `best_feasible`: best visited state that satisfies all constraints
+- `best_infeasible`: best visited state that misses at least one constraint
+- `evaluated_df`: all visited states
+- `feasible_df`: only the feasible visited states
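
The feasible/infeasible split can be sketched in plain Python. This assumes each constraint is an upper bound on the named metric, which matches the latency limits passed in [example.py](example.py) but is not confirmed against the search code. `split_by_constraints` and the sample rows are hypothetical, and the sketch deliberately does not pick a "best" row, since the ranking criterion is defined by the library.

```python
CONSTRAINTS = {
    "mean_ttft_ms": 50000.0,
    "mean_tpot_ms": 100.0,
    "mean_e2e_latency_ms": 60000.0,
}


def split_by_constraints(rows, constraints):
    """Partition metric rows into (feasible, infeasible) lists.

    A row is feasible when every constrained metric is at or below its limit
    (assumed semantics: constraints are upper bounds).
    """
    feasible, infeasible = [], []
    for row in rows:
        ok = all(row[metric] <= limit for metric, limit in constraints.items())
        (feasible if ok else infeasible).append(row)
    return feasible, infeasible


# Made-up rows, for illustration only.
rows = [
    {"mean_ttft_ms": 4500.0, "mean_tpot_ms": 26.0, "mean_e2e_latency_ms": 11150.0},
    {"mean_ttft_ms": 72000.0, "mean_tpot_ms": 30.0, "mean_e2e_latency_ms": 80000.0},
]
feasible, infeasible = split_by_constraints(rows, CONSTRAINTS)
```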
+
+Useful columns to inspect:
+
+- topology: `prefill_tp`, `decode_tp`, `prefill_workers`, `decode_workers`
+- routing: `router_mode`, `overlap_score_weight`
+- budget: `total_gpus_used`
+  This is the simulated GPU footprint of the candidate replay state, not a count of GPUs actually
+  allocated on the machine running the search.
+- throughput: `output_throughput_tok_s`
+- cache behavior: `prefix_cache_reused_ratio`
+- latency: `mean_ttft_ms`, `mean_tpot_ms`, `mean_e2e_latency_ms`
+
+In local testing, this setup produced a non-trivial winner on mean E2E latency at approximately:
+
+- `prefill_tp=2`
+- `decode_tp=1`
+- `prefill_workers=2`
+- `decode_workers=4`
+- `overlap_score_weight=0.5`
+
+Ballpark metrics for that point were:
+
+- `prefix_cache_reused_ratio ~= 0.5`
+- `output_throughput_tok_s ~= 4500`
+- `mean_ttft_ms ~= 4500`
+- `mean_tpot_ms ~= 26`
+- `mean_e2e_latency_ms ~= 11150`
+
+Treat those as sanity-check ranges, not fixed assertions.
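
One way to sanity-check numbers like these is to see whether they agree with each other. The sketch below assumes mean E2E latency is roughly TTFT plus one TPOT per remaining output token, and that the replay keeps all `replay_concurrency` slots busy so Little's law gives the output throughput. Both are back-of-the-envelope approximations, not how `replay_optimize` computes its metrics, and the helper names are hypothetical.

```python
def approx_e2e_ms(ttft_ms: float, tpot_ms: float, osl: int) -> float:
    """First token after TTFT, then (osl - 1) further tokens at TPOT each."""
    return ttft_ms + tpot_ms * (osl - 1)


def approx_throughput_tok_s(concurrency: int, osl: int, e2e_ms: float) -> float:
    """Little's law: requests in flight / latency, times tokens per request."""
    return concurrency * osl / (e2e_ms / 1000.0)


e2e_estimate = approx_e2e_ms(ttft_ms=4500, tpot_ms=26, osl=256)
throughput_estimate = approx_throughput_tok_s(
    concurrency=200, osl=256, e2e_ms=11150
)

print(f"estimated mean E2E latency: {e2e_estimate:.0f} ms")
print(f"estimated output throughput: {throughput_estimate:.0f} tok/s")
```

Under these assumptions both estimates land within a few percent of the ballpark figures above. A run whose metrics disagree badly with each other in this way is worth a second look.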
+
+## Tuning This Sweep
+
+To broaden or shift the search, vary one axis at a time:
+
+- `max_total_gpus`
+- `overlap_score_weights`
+- `shared_prefix_ratio`
+- `num_prefix_groups`
+- base prefill/decode engine args
+
+If you want to compare routing strategies directly, use `router_mode="both"` instead of the default
+KV-router-only search.
+
+## Real Traffic Replay
+
+`replay_optimize` also supports trace-driven replay. In
+[evaluate.py](evaluate.py), `TraceReplayWorkload` goes through `run_trace_replay(...)`, while
+`SyntheticReplayWorkload` goes through `run_synthetic_trace_replay(...)`.
+
+Use a separate trace-driven experiment when you want to evaluate the same search structure against a
+real Mooncake-style workload instead of the synthetic shared-prefix workload above.
+
+### Download a Mooncake Trace
+
+For a public starting point, use the FAST'25 toolagent trace:
+
+```bash
+curl -sL \
+  https://raw.githubusercontent.com/kvcache-ai/Mooncake/refs/heads/main/FAST25-release/traces/toolagent_trace.jsonl \
+  -o /tmp/toolagent_trace.jsonl
+```
+
+Alternatively, download the same file with `wget`:
+
+```bash
+wget -O /tmp/toolagent_trace.jsonl \
+  https://raw.githubusercontent.com/kvcache-ai/Mooncake/refs/heads/main/FAST25-release/traces/toolagent_trace.jsonl
+```
+
+### Replace the Synthetic Workload
+
+If you use [example.py](example.py), pass `--trace-file /tmp/toolagent_trace.jsonl` and optionally
+`--arrival-speedup-ratio 0.8`.
+
+If you want to edit the driver directly, replace:
+
+```python
+workload=SyntheticReplayWorkload(
+    isl=32768,
+    osl=256,
+    request_count=5000,
+    replay_concurrency=200,
+    shared_prefix_ratio=0.5,
+    num_prefix_groups=50,
+),
+```
+
+with:
+
+```python
+from dynamo.profiler.utils.replay_optimize import TraceReplayWorkload
+
+workload=TraceReplayWorkload(
+    trace_file="/tmp/toolagent_trace.jsonl",
+    arrival_speedup_ratio=1.0,
+),
+```
+
+To replay the same trace at `0.8x` of its original arrival rate (slower than recorded), keep the
+same file and set:
+
+```python
+workload=TraceReplayWorkload(
+    trace_file="/tmp/toolagent_trace.jsonl",
+    arrival_speedup_ratio=0.8,
+),
+```
+
+The main behavioral change is that the workload stops generating requests in memory and instead
+replays request arrivals from the JSONL trace. In this path:
+
+- `trace_file` points at the Mooncake-style JSONL input
+- `arrival_speedup_ratio` compresses or stretches the trace arrival process
+- synthetic-only knobs such as `isl`, `osl`, `request_count`, `replay_concurrency`,
+  `shared_prefix_ratio`, and `num_prefix_groups` no longer apply at the workload level
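
As a rough mental model of `arrival_speedup_ratio`, the sketch below divides each arrival timestamp by the ratio, so a ratio of `0.8` spreads arrivals over 1.25x the original time span and lowers the arrival rate to 0.8x. This is an assumption that matches the description above; the actual handling inside `run_trace_replay(...)` is not shown in this README and may differ. `rescale_arrivals` is a hypothetical helper.

```python
def rescale_arrivals(timestamps_ms, arrival_speedup_ratio):
    """Return arrival timestamps compressed (>1) or stretched (<1) by the ratio."""
    if arrival_speedup_ratio <= 0:
        raise ValueError("arrival_speedup_ratio must be positive")
    return [t / arrival_speedup_ratio for t in timestamps_ms]


original = [0, 1000, 2000, 3000]
slower = rescale_arrivals(original, 0.8)   # arrivals spaced further apart
faster = rescale_arrivals(original, 2.0)   # arrivals spaced closer together
```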
+
+Important notes for the public toolagent trace:
+
+- the dataset uses Mooncake-style `hash_ids` with `512` tokens per block
+- the underlying `run_trace_replay(...)` API defaults `trace_block_size` to `512`
+- the current `TraceReplayWorkload` wrapper does not expose a separate `trace_block_size` field
+- the prefix-data-generator tools in
+  [Prefix Data Generator](../../../../../../benchmarks/prefix_data_generator/README.md)
+  are useful if you want to inspect the trace first or synthesize a larger derivative trace before
+  running this search
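
For a quick look at a trace before running the search, a small standalone summary can help. The sketch below assumes each JSONL record carries `input_length`, `output_length`, and `hash_ids` fields, as in the public Mooncake traces; check the field names against your file. `summarize_trace` is a hypothetical helper, and its reuse ratio counts block references already seen earlier in the trace, which is only an upper bound on what a finite cache could reuse.

```python
import json


def summarize_trace(lines):
    """Summarize request count, mean lengths, and block-level prefix reuse."""
    seen_blocks = set()
    requests = total_input = total_output = 0
    total_blocks = reused_blocks = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        requests += 1
        total_input += record["input_length"]
        total_output += record["output_length"]
        for block_hash in record["hash_ids"]:
            total_blocks += 1
            if block_hash in seen_blocks:
                reused_blocks += 1
            else:
                seen_blocks.add(block_hash)
    return {
        "requests": requests,
        "mean_input_length": total_input / requests if requests else 0.0,
        "mean_output_length": total_output / requests if requests else 0.0,
        "block_reuse_ratio": reused_blocks / total_blocks if total_blocks else 0.0,
    }


# Tiny in-memory sample in the assumed format.
sample = [
    '{"timestamp": 0, "input_length": 1024, "output_length": 10, "hash_ids": [1, 2]}',
    '{"timestamp": 500, "input_length": 1536, "output_length": 20, "hash_ids": [1, 2, 3]}',
]
summary = summarize_trace(sample)
```

To run it on the downloaded file, pass an open file handle instead of the sample list, for example `summarize_trace(open("/tmp/toolagent_trace.jsonl"))`.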
--- a/components/src/dynamo/profiler/utils/replay_optimize/example.py
+++ b/components/src/dynamo/profiler/utils/replay_optimize/example.py
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+from __future__ import annotations
+
+import argparse
+from collections.abc import Sequence
+
+from dynamo.llm import KvRouterConfig, MockEngineArgs
+from dynamo.profiler.utils.replay_optimize import (
+    SyntheticReplayWorkload,
+    TraceReplayWorkload,
+    optimize_dense_disagg_with_replay,
+)
+
+MODEL = "Qwen/Qwen3-32B"
+BACKEND = "vllm"
+SYSTEM = "h200_sxm"
+MAX_TOTAL_GPUS = 16
+OVERLAP_SCORE_WEIGHTS = (0.0, 0.5, 1.0, 2.0)
+RESULT_COLUMNS: Sequence[str] = (
+    "prefill_tp",
+    "decode_tp",
+    "prefill_workers",
+    "decode_workers",
+    "overlap_score_weight",
+    "total_gpus_used",
+    "output_throughput_tok_s",
+    "prefix_cache_reused_ratio",
+    "mean_ttft_ms",
+    "mean_tpot_ms",
+    "mean_e2e_latency_ms",
+)
+
+
+def _build_workload(
+    *,
+    trace_file: str | None,
+    arrival_speedup_ratio: float,
+) -> SyntheticReplayWorkload | TraceReplayWorkload:
+    """Return a trace-driven workload if trace_file is set, else the synthetic workload."""
+    if trace_file is not None:
+        return TraceReplayWorkload(
+            trace_file=trace_file,
+            arrival_speedup_ratio=arrival_speedup_ratio,
+        )
+
+    return SyntheticReplayWorkload(
+        isl=32768,
+        osl=256,
+        request_count=5000,
+        replay_concurrency=200,
+        shared_prefix_ratio=0.5,
+        num_prefix_groups=50,
+    )
+
+
+def _build_engine_args(*, worker_type: str) -> MockEngineArgs:
+    """Build the shared base engine args for a prefill or decode worker."""
+    return MockEngineArgs(
+        block_size=512,
+        num_gpu_blocks=20000,
+        enable_prefix_caching=True,
+        worker_type=worker_type,
+    )
+
+
+def run_example(
+    *,
+    trace_file: str | None = None,
+    arrival_speedup_ratio: float = 1.0,
+    max_parallel_evals: int = 1,
+) -> None:
+    """Run the replay search and print the best feasible state and first ten feasible rows."""
+    result = optimize_dense_disagg_with_replay(
+        model=MODEL,
+        backend=BACKEND,
+        system=SYSTEM,
+        workload=_build_workload(
+            trace_file=trace_file,
+            arrival_speedup_ratio=arrival_speedup_ratio,
+        ),
+        base_prefill_engine_args=_build_engine_args(worker_type="prefill"),
+        base_decode_engine_args=_build_engine_args(worker_type="decode"),
+        base_router_config=KvRouterConfig(),
+        max_total_gpus=MAX_TOTAL_GPUS,
+        constraints={
+            "mean_ttft_ms": 50000.0,
+            "mean_tpot_ms": 100.0,
+            "mean_e2e_latency_ms": 60000.0,
+        },
+        overlap_score_weights=OVERLAP_SCORE_WEIGHTS,
+        max_parallel_evals=max_parallel_evals,
+    )
+
+    print("Best feasible:")
+    print(result.best_feasible)
+    print()
+
+    print("Top feasible states:")
+    print(result.feasible_df[list(RESULT_COLUMNS)].head(10).to_string(index=False))
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(
+        description="Run the replay_optimize disaggregated KV-router example."
+    )
+    parser.add_argument(
+        "--trace-file",
+        help="Optional Mooncake-style JSONL trace. If omitted, runs the synthetic workload.",
+    )
+    parser.add_argument(
+        "--arrival-speedup-ratio",
+        type=float,
+        default=1.0,
+        help="Arrival speedup ratio applied to --trace-file; ignored for the synthetic workload.",
+    )
+    parser.add_argument(
+        "--max-parallel-evals",
+        type=int,
+        default=1,
+        help="Number of concurrent replay state evaluations.",
+    )
+    args = parser.parse_args()
+    run_example(
+        trace_file=args.trace_file,
+        arrival_speedup_ratio=args.arrival_speedup_ratio,
+        max_parallel_evals=args.max_parallel_evals,
+    )
+
+
+if __name__ == "__main__":
+    main()