The GPU budget here is a simulated search constraint used by offline replay when it enumerates
candidate TP and worker configurations. You do not need 16 real GPUs locally to run this search.
...
...
@@ -99,10 +120,10 @@ matter:
- `isl=32768`
- `osl=256`
- `request_count=5000`
- `replay_concurrency=200`
- `shared_prefix_ratio=0.5`
- `num_prefix_groups=50`
- `requestCount=5000`
- `concurrency=200`
- `sharedPrefixRatio=0.5`
- `numPrefixGroups=50`
The base engine args stay conservative:
...
...
@@ -119,23 +140,37 @@ This setup does not force scheduler-specific bottlenecks such as:
Only add those when the experiment is specifically about scheduler limits.
## Search Strategy
`replay_optimize` runs a coordinate descent over three dimensions per round, iterating until the incumbent stops moving or `DEFAULT_SEARCH_ROUNDS` is reached:
B --> C["Router search<br/>choose routing mode<br/>and overlap_score_weight"]
C --> A
```
Each step calls [`evaluate._evaluate_states`](evaluate.py), which replays the candidate state through `run_synthetic_trace_replay` or `run_trace_replay` (see [Mocker Trace Replay](../../../../../../docs/benchmarks/mocker-trace-replay.md) for the underlying harness) and ranks the resulting records with `scoring._pick_best_record`. The ranking key is `spec.objective` (throughput, mean_ttft, or mean_e2e_latency) subject to `spec.sla` bounds and `spec.hardware.totalGpus` as a feasibility gate.
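The ranking rule can be sketched as follows. This is a minimal illustration, assuming that throughput is maximized, that the latency objectives are minimized, and that a record is feasible only when it fits the GPU budget and stays under every SLA bound. `Record`, `is_feasible`, `pick_best`, and the field names are hypothetical stand-ins, not the signature of `scoring._pick_best_record`.

```python
from dataclasses import dataclass


# Hypothetical record shape; the real columns live in result.evaluated_df.
@dataclass(frozen=True)
class Record:
    name: str
    gpus: int
    throughput: float
    mean_ttft: float
    mean_e2e_latency: float


# Assumption: throughput is maximized, latency objectives are minimized.
HIGHER_IS_BETTER = {"throughput": True, "mean_ttft": False, "mean_e2e_latency": False}


def is_feasible(rec, sla, total_gpus):
    """Feasible means within the GPU budget and under every SLA upper bound."""
    if rec.gpus > total_gpus:
        return False
    return all(getattr(rec, metric) <= bound for metric, bound in sla.items())


def pick_best(records, objective, sla, total_gpus):
    """Return (best_feasible, best_infeasible); either may be None."""
    sign = 1.0 if HIGHER_IS_BETTER[objective] else -1.0

    def key(rec):
        return sign * getattr(rec, objective)

    feasible = [r for r in records if is_feasible(r, sla, total_gpus)]
    infeasible = [r for r in records if not is_feasible(r, sla, total_gpus)]
    best_f = max(feasible, key=key) if feasible else None
    best_i = max(infeasible, key=key) if infeasible else None
    return best_f, best_i


records = [
    Record("tp2x8", gpus=16, throughput=900.0, mean_ttft=450.0, mean_e2e_latency=3000.0),
    Record("tp4x4", gpus=16, throughput=800.0, mean_ttft=250.0, mean_e2e_latency=2600.0),
    Record("tp8x4", gpus=32, throughput=1500.0, mean_ttft=200.0, mean_e2e_latency=2000.0),
]
best_feasible, best_infeasible = pick_best(
    records, objective="throughput", sla={"mean_ttft": 300.0}, total_gpus=16
)
```

In this toy data, `tp2x8` has the highest in-budget throughput but misses the TTFT bound, so the lower-throughput `tp4x4` is the best feasible record.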
The descent is budget-focused: each step prunes to near-budget-edge states so the sweep ends up at a TP/worker shape that actually consumes the available GPU budget, rather than at a throughput-per-GPU Pareto point. Aggregated replay (`optimize_dense_agg_with_replay`) collapses dimensions 1 and 2 into `(tp, workers)` but is otherwise identical; see [`search.py`](search.py) for both entrypoints.
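As a rough illustration of the loop shape only, here is a toy coordinate descent. It assumes the three dimensions are a prefill `(tp, workers)` shape, a decode `(tp, workers)` shape, and a router setting. The candidate lists, `toy_score`, and `MAX_ROUNDS` are invented for the sketch, and the GPU budget is applied as a plain feasibility gate rather than the near-budget-edge pruning that `search.py` performs.

```python
from itertools import product

MAX_ROUNDS = 4   # stand-in for DEFAULT_SEARCH_ROUNDS; the real value is not shown here
TOTAL_GPUS = 16  # simulated budget

# Hypothetical candidate values per dimension.
SHAPES = [(tp, workers) for tp, workers in product([1, 2, 4, 8], [1, 2, 4, 8])]
ROUTERS = [("round_robin", None), ("kv_router", 0.5), ("kv_router", 1.0)]


def gpus_used(state):
    (p_tp, p_w), (d_tp, d_w), _router = state
    return p_tp * p_w + d_tp * d_w


def toy_score(state):
    """Made-up objective standing in for a replay run; higher is better."""
    (p_tp, p_w), (d_tp, d_w), (mode, weight) = state
    prefill_rate = p_w * p_tp ** 0.8
    decode_rate = d_w * d_tp ** 0.9
    bonus = 1.0 if mode == "round_robin" else 1.0 + 0.2 * weight
    return bonus / (1.0 / prefill_rate + 1.0 / decode_rate)


def coordinate_descent(start):
    incumbent = start
    for _ in range(MAX_ROUNDS):
        before = incumbent
        for dim, choices in enumerate([SHAPES, SHAPES, ROUTERS]):
            candidates = [incumbent]
            for choice in choices:
                state = list(incumbent)
                state[dim] = choice  # vary one dimension, hold the others fixed
                state = tuple(state)
                if gpus_used(state) <= TOTAL_GPUS:  # feasibility gate
                    candidates.append(state)
            incumbent = max(candidates, key=toy_score)
        if incumbent == before:  # incumbent stopped moving
            break
    return incumbent


start = ((1, 1), (1, 1), ("round_robin", None))
best = coordinate_descent(start)
```

Because each step only changes one coordinate, the result depends on the starting state and on the order of the dimensions. That is a property of coordinate descent in general, not a claim about how `replay_optimize` seeds its search.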
## Driver Script
The canonical starting point now lives in [example.py](example.py). Keeping it as a real module is
The canonical starting point lives in [example.py](example.py). Keeping it as a real module is
better than carrying a large inline snippet in the README, and it also satisfies the macOS
`ProcessPoolExecutor` requirement for a stable module path.
Treat [example.py](example.py) as a starting point, not a frozen harness. Modify it as needed for
your search:
- change the workload shape
- swap `SyntheticReplayWorkload` for `TraceReplayWorkload`
- change constraints
- change `overlap_score_weights`
- change the `WorkloadSpec` shape (or switch to a trace source with `traceFile=...`)
- add SLA bounds on `SLASpec` (`ttft`, `itl`, `e2eLatency`, or their p95 variants)
- change `RouterSpec.overlapWeights`
- print different columns from `result.evaluated_df` or `result.feasible_df`
- persist the tables to CSV or parquet if you want downstream analysis
If you need to understand which knobs are available, see [models.py](models.py), [search.py](search.py),
If you need to understand which knobs are available, see [specs.py](specs.py), [search.py](search.py),
and [evaluate.py](evaluate.py).
The default path in [example.py](example.py) is the synthetic disaggregated sweep documented in
...
...
@@ -146,8 +181,8 @@ used for the Mooncake-style replay path below without rewriting the harness from
The returned object is a `DenseReplayOptimizationResult` with:
- `best_feasible`: best visited state that satisfies all constraints
- `best_infeasible`: best visited state that misses at least one constraint
- `best_feasible`: best visited state that satisfies all SLAs and the GPU budget
- `best_infeasible`: best visited state that misses at least one SLA bound or the budget
- compare timing and cache behavior across mocker configurations
- validate replay logic in CI without bringing up a distributed stack
## Harness Overview
The replay harness wires a load driver (trace file or synthetic workload generator) into one or more mocker engine simulations and tees request/token timing into a trace collector.
```mermaid
flowchart LR
LD[Load Driver] --> H[Replay Harness]
H --> SES[Single Engine Simulation]
H --> MES[Multi Engine Simulation]
SES --> H
MES --> H
H --> TC[Trace Collector]
```
The load driver is either a Mooncake-style JSONL trace (timestamps, ISL/OSL, `hash_ids`) or a synthetic generator parameterized by `isl`/`osl`/`concurrency`. Single-engine simulation (`SES`) is the fast path for `num_workers == 1` with the vLLM engine; multi-engine simulation (`MES`) covers aggregated multi-worker replay, disaggregated prefill/decode replay, and KV-router replay. The trace collector produces the AIPerf-style summary table, the JSON report, and the per-request timing fields consumed by downstream analysis.
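For orientation, a trace-driven load can be pictured as one JSON object per line. The sketch below assumes the field names `timestamp`, `input_length`, `output_length`, and `hash_ids`; these are illustrative guesses at a Mooncake-style record and should be checked against the actual trace files before use. `load_trace` and `shared_prefix_blocks` are hypothetical helpers.

```python
import json

# Two illustrative records; the field names are assumed, not taken from this document.
TRACE = """\
{"timestamp": 0, "input_length": 1536, "output_length": 64, "hash_ids": [11, 12, 13]}
{"timestamp": 250, "input_length": 2048, "output_length": 32, "hash_ids": [11, 12, 27, 28]}
"""


def load_trace(text):
    """Parse a JSONL trace into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def shared_prefix_blocks(a, b):
    """Count the leading hash_ids that two requests have in common."""
    count = 0
    for x, y in zip(a["hash_ids"], b["hash_ids"]):
        if x != y:
            break
        count += 1
    return count


requests = load_trace(TRACE)
overlap = shared_prefix_blocks(requests[0], requests[1])
```

The leading run of equal `hash_ids` is what lets a prefix cache or KV router treat the second request as partially cached.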
Each simulation composes a different set of components. SES drives the engine core directly (scheduler + forward-pass modeling). MES composes multiple engine cores with KV transfer/offloading, KV routing, and planner simulation layered on top:
```mermaid
flowchart TD
subgraph SEC[Single Engine Core]
subgraph SCH[Scheduler Modeling]
F[Fwd Pass Modeling]
end
end
KV[KV Transfer + Offloading Simulation]
KR[KV Router Simulation]
P[Planner Simulation]
SES[Single Engine Simulation]
MES[Multi Engine Simulation]
SES --> SEC
MES --> SEC
MES --> KV
MES --> KR
MES --> P
```
See [`lib/mocker/src/replay/offline/README.md`](../../lib/mocker/src/replay/offline/README.md) for offline-harness internals (logical clock, event queue, worker model) and [`docs/mocker/mocker.md`](../mocker/mocker.md) for engine-core details (scheduler, KV block manager).
## Quick Start
Run offline replay through the dedicated replay CLI:
| `--sglang-page-size` | 1 | SGLang radix-cache page size in tokens. Also becomes the effective block size when `--engine-type sglang` and `--block-size` is omitted |
| `--sglang-max-prefill-tokens` | 16384 | SGLang max prefill-token budget per batch |
The mocker is organized into several cooperating components that mirror the internal architecture of production LLM inference engines.
The mocker is organized into several cooperating components that mirror the internal architecture of production LLM inference engines. The scheduler (vLLM-style and SGLang-style variants) and KV block manager live inside the engine core. Multi-engine behavior — KV transfer/offloading simulation, KV router simulation, planner simulation — is added by the replay harness on top of multiple engine cores; see [Mocker Trace Replay](../benchmarks/mocker-trace-replay.md) for the component-level diagram and for offline replay internals under [`lib/mocker/src/replay/offline/`](../../lib/mocker/src/replay/offline/README.md).
### Scheduler
...
...
@@ -388,43 +388,46 @@ When resources become constrained, the mocker simulates the engine's real recove
### KV Block Manager
The block manager tracks KV cache blocks using reference counting and an LRU eviction policy. Blocks exist in one of two pools:
The mocker's KV block manager is now built on [`kvbm-logical::BlockManager<G1>`](../../lib/kvbm-logical/), the same logical block manager the real Dynamo runtime uses. The mocker wraps it in [`lib/mocker/src/kv_manager/kvbm_backend.rs`](../../lib/mocker/src/kv_manager/kvbm_backend.rs) and translates its own `MoveBlock` protocol onto kvbm-logical's RAII lifecycle (`allocate → stage → register → drop`).
- **Active Pool** - Blocks currently in use by one or more sequences, tracked with reference counts
- **Inactive Pool** - Blocks no longer actively referenced but kept for potential reuse (prefix caching)
Blocks still conceptually live in one of two pools:
When a sequence needs blocks, the manager first checks if they already exist (cache hit). If not, it allocates new blocks, potentially evicting the least-recently-used inactive blocks to make room. When a sequence completes or is preempted, its blocks are either moved to the inactive pool (for potential reuse) or freed entirely.
- **Active**: blocks currently held by at least one sequence. Partial (still-filling) blocks are held as `MutableBlock<G1>`; full blocks are held as `ImmutableBlock<G1>` clones (the clone vec length is the mocker's refcount, one per `Use`).
- **Inactive**: blocks no longer referenced by any sequence but kept for prefix-cache reuse. Handled entirely by kvbm-logical's inactive pool; the mocker never tracks them manually.
The following diagram illustrates the block lifecycle, based on vLLM's block manager design:
The lifecycle is RAII: dropping the last `ImmutableBlock` clone transitions the block from active to inactive (kvbm-logical's `reset` pool), with no explicit `deref`/`evict` bookkeeping on the mocker side. When a sequence completes or is preempted, the mocker simply drops its handles; kvbm-logical recovers the capacity.
```mermaid
stateDiagram-v2
[*] --> Active : alloc
Active --> Inactive : deref
Inactive --> Active : cache hit (reuse)
Inactive --> Freed : evict
Active --> Freed : destroy (preemption)
[*] --> Active : allocate + stage + register
Active --> Inactive : last handle dropped (RAII)
Inactive --> Active : match_blocks(PLH) reuse
Inactive --> Freed : evicted by backend
Active --> Freed : explicit Removed (Destroy)
Freed --> [*]
state Active {
[*] --> Tracked : ref_count tracked
}
state Inactive {
[*] --> Ordered : LRU order
[*] --> Partial : MutableBlock<G1>
Partial --> Full : promote (PLH / SequenceHash)
[*] --> Full : ImmutableBlock<G1> clones
}
```
### Evictor
Three `Use` outcomes are tracked for KV-event emission: `ActiveHit` (bump refcount on an already-pinned block), `InactiveHit` (reactivate via `match_blocks(plh)`), and `NewStore` (fresh allocation). Only `NewStore` emits a `Stored` KV event — the router radix tree already knows about the other two and only forgets on explicit `Removed`.
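The three outcomes can be modeled with a small language-neutral sketch, written here in Python. `ToyBlockPool` is a hypothetical stand-in that uses explicit refcounts and a `release` call in place of the RAII handles the mocker actually holds; it mirrors only the classification and the "only `NewStore` emits `Stored`" rule described above.

```python
from enum import Enum


class UseOutcome(Enum):
    ACTIVE_HIT = "ActiveHit"
    INACTIVE_HIT = "InactiveHit"
    NEW_STORE = "NewStore"


class ToyBlockPool:
    """Toy model: refcounts for active blocks, a set for inactive ones."""

    def __init__(self):
        self.active = {}       # block hash -> refcount
        self.inactive = set()  # block hashes kept for prefix-cache reuse
        self.events = []       # KV events the router would see

    def use(self, block_hash):
        if block_hash in self.active:
            self.active[block_hash] += 1
            return UseOutcome.ACTIVE_HIT
        if block_hash in self.inactive:
            self.inactive.remove(block_hash)
            self.active[block_hash] = 1
            return UseOutcome.INACTIVE_HIT
        self.active[block_hash] = 1
        self.events.append(("Stored", block_hash))  # only NewStore emits an event
        return UseOutcome.NEW_STORE

    def release(self, block_hash):
        """Drop one handle; the last drop moves the block to the inactive set."""
        self.active[block_hash] -= 1
        if self.active[block_hash] == 0:
            del self.active[block_hash]
            self.inactive.add(block_hash)


pool = ToyBlockPool()
outcomes = [pool.use("a"), pool.use("a")]  # fresh store, then a hit on a pinned block
pool.release("a")
pool.release("a")                          # last handle dropped: block goes inactive
outcomes.append(pool.use("a"))             # reactivated from the inactive set
```

Across the three `use` calls only one `Stored` event is recorded, which matches the idea that the router already knows about a block until it is explicitly removed.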
### Eviction Backends
The LRU evictor maintains blocks ordered by a monotonic counter, enabling O(log n) eviction of the lowest-priority block. Each `insert` assigns the next counter value, so blocks inserted later have higher counters and survive longer.
The kvbm-logical inactive pool selects eviction victims via one of three backends, exposed as `MockerEvictionBackend` in [`lib/mocker/src/common/protocols.rs`](../../lib/mocker/src/common/protocols.rs):
This produces a **depth-aware eviction policy**: when a sequence completes, `free_signal` releases its blocks in reverse order (tail first). Deeper suffix blocks therefore receive lower counters and are evicted before shallower prefix blocks. This keeps shared prefixes cached longer, improving cache hit rates across requests with common prefixes.
- **`Lineage`** (default): parent-chain aware: evicts leaf blocks first, preserving shared prefix chains. Subsumes the preemption-priority behavior the old hand-rolled `LRUEvictor::push_front` used to provide.
- **`Lru`**: plain recency-based LRU.
- **`MultiLru`**: 4-tier frequency-aware LRU built on a TinyLFU tracker.
The evictor also supports front-insertion (negative counters) for marking blocks for immediate eviction, though this is not currently used in the scheduler.
All three give the same "suffix blocks evicted before shared prefixes" outcome that the old evictor was designed to produce; `Lineage` does it structurally (via the block parent chain) rather than via monotonic counters.
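The "leaf blocks first" idea behind `Lineage` can be illustrated with a toy victim picker. This is a sketch under assumptions, not kvbm-logical's implementation: `pick_lineage_victim` is a hypothetical helper, and ties between leaves are broken alphabetically here purely to keep the example deterministic.

```python
def pick_lineage_victim(parent_of, inactive):
    """Return an inactive block that has no inactive child (a leaf), or None.

    parent_of maps block id -> parent block id (None for a root block).
    """
    has_inactive_child = {parent_of[b] for b in inactive if parent_of[b] is not None}
    leaves = sorted(b for b in inactive if b not in has_inactive_child)
    return leaves[0] if leaves else None


# Shared prefix p0 -> p1, with two suffix branches s_a and s_b hanging off p1.
parent_of = {"p0": None, "p1": "p0", "s_a": "p1", "s_b": "p1"}
inactive = {"p0", "p1", "s_a", "s_b"}

eviction_order = []
while inactive:
    victim = pick_lineage_victim(parent_of, inactive)
    eviction_order.append(victim)
    inactive.remove(victim)
```

Both suffix blocks are evicted before either prefix block, so the shared chain `p0 -> p1` survives the longest.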
### Sequence Tracking
Each active request is tracked as a sequence, managing its token blocks and generation state. As tokens are generated, the sequence tracks which blocks are partial (still being filled) versus full (complete and hashable for prefix caching). When a partial block fills up, it gets "promoted" to a full block with a content-based hash, enabling future cache hits from requests with matching prefixes.
Each active request is tracked as a sequence, managing its token blocks and generation state. As tokens are generated, the sequence tracks which blocks are partial (`MutableBlock<G1>`, still being filled) versus full (`ImmutableBlock<G1>`, complete and hashable for prefix caching). When a partial block fills up, it gets "promoted" to a full block with a content-based `SequenceHash` (or collapses onto an existing registered handle if the PLH is already present), enabling future cache hits from requests with matching prefixes.
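Promotion from a partial block to a full, hashable block can be sketched as below. The sketch assumes a block size of 4 tokens and uses a truncated SHA-256 chained over the parent block's hash as a stand-in; the mocker's actual `SequenceHash` and PLH computation are not shown in this document. `ToySequence`, `chained_hash`, and `run` are hypothetical names.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; illustrative value


def chained_hash(parent_hash, tokens):
    """Hash a full block together with its parent's hash, so the hash is prefix-aware."""
    payload = f"{parent_hash}:{','.join(map(str, tokens))}".encode()
    return hashlib.sha256(payload).hexdigest()[:16]


class ToySequence:
    def __init__(self):
        self.full_hashes = []  # one chained hash per completed block
        self.partial = []      # tokens of the block still being filled

    def append(self, token):
        self.partial.append(token)
        if len(self.partial) == BLOCK_SIZE:
            # The partial block is full: promote it and start a new partial block.
            parent = self.full_hashes[-1] if self.full_hashes else ""
            self.full_hashes.append(chained_hash(parent, self.partial))
            self.partial = []


def run(tokens):
    seq = ToySequence()
    for token in tokens:
        seq.append(token)
    return seq


a = run([1, 2, 3, 4, 5, 6, 7, 8, 9])
b = run([1, 2, 3, 4, 5, 6, 7, 8, 42, 43])
```

The two sequences share their first eight tokens, so their full-block hashes are identical and a second request could reuse those blocks; the trailing partial blocks are not hashed and therefore never match.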
@@ -4,6 +4,8 @@ This directory contains the in-process offline replay harness used by `dynamo_mo
The goal is to simulate trace execution without spinning up async runtimes, network planes, or real worker tasks. Instead, the harness advances a logical clock, steps mock engine cores directly, and records request/token timing into `TraceCollector` in `lib/mocker/src/replay/collector.rs`.
For the harness-level picture (load driver → harness → SES/MES → trace collector) and operator-facing CLI docs, see [`docs/benchmarks/mocker-trace-replay.md`](../../../../../docs/benchmarks/mocker-trace-replay.md). This README dives into the offline-specific internals: logical clock, event queue, per-worker state machine.
## Where It Sits
The public replay entrypoints live one level up in `lib/mocker/src/replay/entrypoints.rs`. They:
...
...
@@ -34,9 +36,20 @@ Offline replay starts in `lib/mocker/src/replay/offline/mod.rs`.
- `lib/mocker/src/replay/offline/state.rs`
Per-worker wrapper around `EngineCore`, including optional KV event capture.
- `lib/mocker/src/replay/offline/events.rs`
Priority-queue event type used by the multi-worker harness.
`SimulationEvent` + `SimulationEventKind` priority-queue types used by the multi-worker harness.
- `lib/mocker/src/replay/offline/core.rs`
Small `ReplayWorkerCore` wrapper used by the single-worker path.
- `lib/mocker/src/replay/offline/runtime_utils.rs`
Shared helpers used by `agg.rs` and `disagg.rs`: `WorkerCompletionPayload`, event scheduling, `next_timestamp`.
- `lib/mocker/src/replay/offline/progress.rs`
`ReplayProgress`, the indicatif-based progress bar used by the harnesses.
@@ -117,21 +130,25 @@ So offline replay is not a toy simulator. It reuses the real per-pass mocker sch
## Completion Event Queue
The multi-worker and disagg harnesses use `SimulationEvent` from `lib/mocker/src/replay/offline/events.rs` as a min-time priority queue implemented with `BinaryHeap`.
Right now the only scheduled event type is:
The multi-worker and disagg harnesses use `SimulationEvent` from `lib/mocker/src/replay/offline/events.rs` as a min-time priority queue implemented with `BinaryHeap`. The event itself is a small struct carrying the scheduled timestamp, a sequence number for tie-breaking, and a typed payload:
- `WorkerCompletion`
```rust
pub(crate) struct SimulationEvent {
    pub(crate) at_ms: f64,
    pub(crate) seq_no: u64,
    pub(crate) kind: SimulationEventKind,
}
```
That event carries:
- worker `stage` (`aggregated`, `prefill`, or `decode`)
Those are emitted after a worker pass is executed and then applied later when the harness clock reaches `pass.end_ms`.
- `WorkerCompletion` is emitted after a worker pass is executed and applied when the harness clock reaches `pass.end_ms`. It carries the `stage` (`Aggregated`, `Prefill`, or `Decode`), `worker_idx`, `completed_requests`, `output_signals`, and router-visible `kv_events`.
- `DecodeHandoff` is used by the disaggregated harness to move a request from prefill to decode at the same logical timestamp (see below).
- `WorkerReady` marks the point at which a worker returns to the admission pool after a pass completes.
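The ordering behavior of such a queue can be sketched in a few lines of Python. This is an illustration, not the harness code: it assumes events are popped in ascending `at_ms` and that `seq_no` breaks ties in insertion order. Rust's `BinaryHeap` is a max-heap, so the real `Ord` implementation has to reverse the comparison to get min-time behavior; the exact tie-break direction used there is an assumption here. `EventQueue` is a hypothetical name.

```python
import heapq
from itertools import count


class EventQueue:
    """Min-time queue; seq_no breaks ties in insertion order."""

    def __init__(self):
        self._heap = []
        self._seq = count()

    def push(self, at_ms, kind, payload=None):
        # seq_no is unique, so tuple comparison never reaches kind or payload.
        heapq.heappush(self._heap, (at_ms, next(self._seq), kind, payload))

    def pop(self):
        at_ms, _seq_no, kind, payload = heapq.heappop(self._heap)
        return at_ms, kind, payload

    def __len__(self):
        return len(self._heap)


queue = EventQueue()
queue.push(12.5, "WorkerCompletion", {"worker_idx": 1})
queue.push(7.0, "WorkerCompletion", {"worker_idx": 0})
queue.push(7.0, "DecodeHandoff", {"worker_idx": 0})
queue.push(7.0, "WorkerReady", {"worker_idx": 0})

order = []
while queue:
    at_ms, kind, _payload = queue.pop()
    order.append((at_ms, kind))
```

The sequence number matters because several events can share one logical timestamp, as with a `DecodeHandoff` scheduled at the same time as the completion that triggered it; without a tie-breaker the pop order among them would be unspecified.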
## Router Integration
...
...
@@ -140,7 +157,7 @@ Offline replay can run in:
- `round_robin`
- `kv_router`
The router implementation for offline mode lives in `lib/mocker/src/replay/router/offline.rs`.
The router implementation for offline mode lives in `lib/mocker/src/replay/offline/components/router.rs` (`OfflineReplayRouter`).