OpenDAS / dynamo
Commit b2c59aa4 (Unverified)
Authored Mar 24, 2026 by Yan Ru Pei. Committed by GitHub, Mar 24, 2026.
feat(replay): add shared loadgen workload paths [DYN-2510] (#7593)
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Parent: 2b36b175
Changes: 36. Showing 20 changed files with 2818 additions and 495 deletions.
docs/benchmarks/mocker-trace-replay.md (+85 -16)
docs/mocker/mocker.md (+36 -2)
lib/bench/Cargo.toml (+1 -1)
lib/bench/kv_router/active_sequences_bench.rs (+111 -167)
lib/bench/kv_router/common/mod.rs (+308 -131)
lib/bench/kv_router/mooncake_bench.rs (+71 -80)
lib/bench/src/bin/multiturn_bench.rs (+49 -14)
lib/bindings/python/rust/llm/replay.rs (+138 -1)
lib/bindings/python/src/dynamo/_core.pyi (+4 -0)
lib/bindings/python/src/dynamo/replay/api.py (+8 -0)
lib/bindings/python/src/dynamo/replay/main.py (+21 -2)
lib/bindings/python/tests/test_replay.py (+217 -0)
lib/kv-router/src/sequences/multi_worker.rs (+97 -35)
lib/llm/benches/kv_router_bench.rs (+39 -46)
lib/mocker/src/lib.rs (+1 -0)
lib/mocker/src/loadgen/driver.rs (+237 -0)
lib/mocker/src/loadgen/mod.rs (+15 -0)
lib/mocker/src/loadgen/tests.rs (+487 -0)
lib/mocker/src/loadgen/trace.rs (+793 -0)
lib/mocker/src/loadgen/types.rs (+100 -0)
docs/benchmarks/mocker-trace-replay.md
View file @
b2c59aa4
...
...
@@ -7,8 +7,8 @@ subtitle: Replay Mooncake-style traces through the mocker in offline or online m
 This guide covers trace replay support for Mooncake-style JSONL traces via `python -m dynamo.replay`,
 which prints an AIPerf-style summary table, writes the full replay report JSON to disk, and exposes
-`offline|online`, `round_robin|kv_router`, `arrival_speedup_ratio`, and synthetic replay inputs
-directly.
+`offline|online`, `round_robin|kv_router`, `arrival_speedup_ratio`, closed-loop concurrency, and
+synthetic workload inputs directly.

 Unlike normal `dynamo.mocker` usage, offline replay does not launch workers, register endpoints, or
 require NATS, etcd, or a frontend. Online replay does exercise the live mock-worker runtime path.
...
...
@@ -47,6 +47,24 @@ python -m dynamo.replay \
   --report-json /tmp/replay-report.json
 ```

+Run synthetic workload replay when you want shared-prefix or multi-turn structure without a trace
+file:
+
+```bash
+python -m dynamo.replay \
+  --input-tokens 5000 \
+  --output-tokens 500 \
+  --request-count 200 \
+  --turns-per-session 3 \
+  --shared-prefix-ratio 0.5 \
+  --num-prefix-groups 8 \
+  --inter-turn-delay-ms 250 \
+  --replay-mode offline \
+  --replay-concurrency 32 \
+  --extra-engine-args '{"block_size":512}' \
+  --report-json /tmp/replay-report.json
+```
+
 `python -m dynamo.replay` prints an AIPerf-style summary table to stdout and writes the full replay
 report JSON to disk.
...
...
@@ -65,12 +83,29 @@ Example:
 {"timestamp": 0, "input_length": 6755, "output_length": 500, "hash_ids": [0, 1, 2, 3]}
 ```

-The mocker synthesizes token blocks from `hash_ids` using the configured `--block-size`, so the
+Replay also supports multi-turn sessions. Use the same `session_id` on all turns in a session. The
+first turn uses `timestamp` or `created_time`; later turns may use either:
+
+- `delay` or `delay_ms` directly
+- or an absolute later `timestamp`, in which case replay infers the inter-turn delay from the
+  previous turn timestamp
+
+Example:
+
+```json
+{"session_id": "session-a", "timestamp": 1000, "input_length": 2048, "output_length": 128, "hash_ids": [1, 2, 3, 4]}
+{"session_id": "session-a", "delay": 250, "input_length": 2560, "output_length": 128, "hash_ids": [1, 2, 3, 4, 5]}
+{"session_id": "session-b", "timestamp": 1010, "input_length": 1024, "output_length": 64, "hash_ids": [9, 10]}
+{"session_id": "session-b", "delay_ms": 50, "input_length": 1536, "output_length": 64, "hash_ids": [9, 10, 11]}
+```
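
One way to read the timing fields in the multi-turn records, as a minimal sketch and not the replay implementation: group records by `session_id`, take the first turn's `timestamp` (or `created_time`) as the session start, and for each later turn use `delay` or `delay_ms` when present, otherwise infer the delay from the gap between absolute timestamps. The helper name `resolve_turn_delays` is hypothetical, and the sketch assumes that timestamps and delays share one time unit and that every record carries `session_id`.

```python
import json
from collections import defaultdict

# Records copied from the multi-turn trace example in the documentation hunk.
EXAMPLE_TRACE = """\
{"session_id": "session-a", "timestamp": 1000, "input_length": 2048, "output_length": 128, "hash_ids": [1, 2, 3, 4]}
{"session_id": "session-a", "delay": 250, "input_length": 2560, "output_length": 128, "hash_ids": [1, 2, 3, 4, 5]}
{"session_id": "session-b", "timestamp": 1010, "input_length": 1024, "output_length": 64, "hash_ids": [9, 10]}
{"session_id": "session-b", "delay_ms": 50, "input_length": 1536, "output_length": 64, "hash_ids": [9, 10, 11]}
"""


def resolve_turn_delays(jsonl_text):
    """Map each session_id to (first_turn_time, [inter_turn_delay, ...]).

    Hypothetical helper for illustration only; it is not part of dynamo.replay.
    """
    sessions = defaultdict(list)
    for line in jsonl_text.splitlines():
        if line.strip():
            record = json.loads(line)
            sessions[record["session_id"]].append(record)

    resolved = {}
    for session_id, turns in sessions.items():
        first = turns[0]
        start = first.get("timestamp", first.get("created_time"))
        previous_timestamp = start
        delays = []
        for turn in turns[1:]:
            if "delay" in turn or "delay_ms" in turn:
                # An explicit delay is used directly.
                delays.append(turn["delay"] if "delay" in turn else turn["delay_ms"])
                previous_timestamp = turn.get("timestamp")
            else:
                # Absolute timestamp: infer the delay from the previous turn's timestamp.
                if previous_timestamp is None:
                    raise ValueError(f"{session_id}: no previous timestamp to infer a delay from")
                delays.append(turn["timestamp"] - previous_timestamp)
                previous_timestamp = turn["timestamp"]
        resolved[session_id] = (start, delays)
    return resolved
```

Calling `resolve_turn_delays(EXAMPLE_TRACE)` gives `session-a` a start of 1000 with one delay of 250, and `session-b` a start of 1010 with one delay of 50.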
+The mocker synthesizes token blocks from `hash_ids` using the configured mocker `block_size`, so the
 replay block size must match the block size used when the trace was generated. Public Mooncake
 traces are commonly block-level hashes at `512` tokens per hash ID, so replaying them with the
-default mocker `block_size=64` will fail once `input_length > len(hash_ids) * 64`. For
-`engine_type=sglang`, replay still uses canonical `block_size` internally; `sglang.page_size` is
-accepted as a compatibility alias and is normalized into `block_size` before replay starts.
+default mocker `block_size=64` will fail once `input_length > len(hash_ids) * 64`. Set that
+through `--extra-engine-args '{"block_size":512}'`. For `engine_type=sglang`, replay still uses
+canonical `block_size` internally; `sglang.page_size` is accepted as a compatibility alias and is
+normalized into `block_size` before replay starts.
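
The failure condition quoted in this paragraph can be checked before a run. The following is a minimal sketch that only encodes the stated inequality; `fits_block_size` is a hypothetical helper, not part of `dynamo.replay`, and the real validation may differ.

```python
def fits_block_size(input_length, hash_ids, block_size):
    """True when the hash IDs cover the prompt: input_length <= len(hash_ids) * block_size."""
    return input_length <= len(hash_ids) * block_size


# First turn of the multi-turn trace example: 2048 input tokens, 4 hash IDs.
turn = {"input_length": 2048, "hash_ids": [1, 2, 3, 4]}

# 2048 <= 4 * 512, so a 512-token block size covers the prompt.
ok_at_512 = fits_block_size(turn["input_length"], turn["hash_ids"], 512)

# 2048 > 4 * 64, so the default block size of 64 would hit the failure case.
ok_at_64 = fits_block_size(turn["input_length"], turn["hash_ids"], 64)
```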
## Replay Surfaces
...
...
@@ -85,10 +120,19 @@ The dedicated replay CLI exposes:
 - `--replay-concurrency`
 - `--arrival-interval-ms`
 - `--arrival-speedup-ratio`
+- `--turns-per-session`
+- `--shared-prefix-ratio`
+- `--num-prefix-groups`
+- `--inter-turn-delay-ms`
 - `--extra-engine-args` (JSON string)
 - `--router-config` (JSON string)
 - `--report-json`

+Defaults:
+
+- `--replay-mode offline`
+- `--router-mode round_robin`
+
 Example:

 ```bash
...
...
@@ -115,9 +159,10 @@ SGLang replay uses the same CLI surface. A minimal extra-engine-args file can us
}
```
-Both `--extra-engine-args` and `--router-config` accept partial JSON objects. Unspecified fields
-fall back to the same defaults used by `MockEngineArgs::default()` and `KvRouterConfig::default()`.
+Both `--extra-engine-args` and `--router-config` accept partial JSON objects. Engine settings such
+as `block_size`, `engine_type`, `dp_size`, `speedup_ratio`, and `decode_speedup_ratio` belong in
+`--extra-engine-args`, not as top-level replay CLI flags. Unspecified fields fall back to the same
+defaults used by `MockEngineArgs::default()` and `KvRouterConfig::default()`.
### Synthetic Replay
...
...
@@ -137,6 +182,19 @@ python -m dynamo.replay \
This is useful for parameter sweeps where Mooncake-style prefix structure is not required.
+When `--turns-per-session > 1`, `--request-count` is interpreted as the number of sessions rather
+than the total number of emitted turns. The total completed request count becomes:
+
+- `request_count * turns_per_session`
+
+Synthetic workload options:
+
+- `--turns-per-session`: number of turns in each synthetic session
+- `--shared-prefix-ratio`: fraction of prompt blocks shared inside a prefix group
+- `--num-prefix-groups`: number of shared-prefix groups; `0` disables grouping
+- `--inter-turn-delay-ms`: constant delay applied after each completed turn before the next turn in
+  the same session becomes eligible
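
As a quick worked instance of the request-count rule, here is a minimal sketch; `total_completed_requests` is a hypothetical helper that restates the formula and is not part of the CLI.

```python
def total_completed_requests(request_count, turns_per_session=1):
    """Sessions times turns; with one turn per session this equals request_count."""
    if request_count < 0 or turns_per_session < 1:
        raise ValueError("request_count must be >= 0 and turns_per_session >= 1")
    return request_count * turns_per_session


# The synthetic example uses --request-count 200 and --turns-per-session 3.
total = total_completed_requests(200, 3)
```

With those example values, `total` is 600 completed requests across 200 sessions.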
## Modes
### Fixed-Schedule Replay
...
...
@@ -155,8 +213,8 @@ This is the right mode when you want deterministic replay of the original arriva
### Closed-Loop Concurrency Replay
-Use `--replay-concurrency` to ignore trace arrival timing and keep a fixed number of requests in
-flight:
+Use `--replay-concurrency` to ignore first-turn trace arrival timing and keep a fixed number of
+requests in flight:

 ```bash
 python -m dynamo.replay /path/to/mooncake_trace.jsonl \
...
...
@@ -167,6 +225,13 @@ python -m dynamo.replay /path/to/mooncake_trace.jsonl \
This mode is useful when you want to compare scheduler behavior under a fixed offered concurrency rather than the original trace schedule.
+For multi-turn sessions, concurrency mode still enforces session order and inter-turn delays:
+
+- first-turn timestamps are ignored
+- turn `n+1` is not eligible until turn `n` completes
+- `delay` / `delay_ms` / synthetic `--inter-turn-delay-ms` are still applied after completion
+- TTFT is measured from actual dispatch under the cap, not from the ignored trace timestamp
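
The admission rules in this list can be illustrated with a small discrete-event sketch. It is not the replay driver: `simulate_closed_loop` is a hypothetical function, each turn is given a fixed service time as an assumption, and time is in milliseconds. It only shows the ordering rules: every first turn is eligible at time zero, a later turn becomes eligible at the previous turn's completion plus its delay, and nothing is dispatched while the in-flight cap is reached.

```python
import heapq


def simulate_closed_loop(sessions, concurrency):
    """Simulate closed-loop admission.

    sessions: {session_id: [(service_ms, delay_before_ms), ...]}; the delay of a
    first turn is ignored. Returns (session_id, turn_index, dispatch_ms,
    complete_ms) tuples in dispatch order.
    """
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")

    eligible = []   # min-heap of (ready_ms, session_order, session_id, turn_index)
    in_flight = []  # min-heap of (complete_ms, session_order, session_id, turn_index)
    for order, session_id in enumerate(sessions):
        # First-turn timestamps are ignored: every session is ready at time 0.
        heapq.heappush(eligible, (0.0, order, session_id, 0))

    now = 0.0
    log = []
    while eligible or in_flight:
        # Dispatch ready turns while there is room under the cap.
        while eligible and len(in_flight) < concurrency and eligible[0][0] <= now:
            _, order, session_id, index = heapq.heappop(eligible)
            service_ms, _ = sessions[session_id][index]
            complete = now + service_ms
            heapq.heappush(in_flight, (complete, order, session_id, index))
            log.append((session_id, index, now, complete))

        # Advance to the next event: a completion, or a turn becoming ready
        # while a slot is free.
        next_times = []
        if in_flight:
            next_times.append(in_flight[0][0])
        if eligible and len(in_flight) < concurrency:
            next_times.append(eligible[0][0])
        if next_times:
            now = max(now, min(next_times))

        # Retire completed turns; the next turn of the same session becomes
        # eligible only after completion plus its inter-turn delay.
        while in_flight and in_flight[0][0] <= now:
            complete, order, session_id, index = heapq.heappop(in_flight)
            if index + 1 < len(sessions[session_id]):
                delay_ms = sessions[session_id][index + 1][1]
                heapq.heappush(eligible, (complete + delay_ms, order, session_id, index + 1))
    return log
```

In this sketch the `dispatch_ms` column is the point from which a TTFT clock would start, matching the last bullet. With a cap of 1, a second session's first turn waits for the first session's turn to complete even though both are ready at time zero.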
### Online Replay
Online replay launches the mock workers and replays the trace against the live runtime path. This
...
...
@@ -256,14 +321,15 @@ If `--report-json` is not provided, `python -m dynamo.replay` writes a timestamp
Shared replay constraints:
 - aggregated mode
-- `--engine-type vllm|sglang`
-- `--data-parallel-size 1`
+- `extra_engine_args.engine_type` must be `vllm` or `sglang`
+- `extra_engine_args.dp_size` must be `1`

 Additional offline constraints:

 - offline `kv_router` requires `num_workers > 1`
-- public single-worker offline replay still uses the legacy single-worker runtime for `vllm`
-  while `sglang` goes through the shared multi-worker replay runtime even when `num_workers=1`
+- single-worker offline replay is still a dedicated fast path for `vllm`, but it now supports both
+  flat request replay and workload-driven multi-turn replay
+- `sglang` still goes through the shared multi-worker replay runtime even when `num_workers=1`
Additional online constraints:
...
...
@@ -276,9 +342,12 @@ If you violate those constraints, replay fails immediately with a validation err
 - `python -m dynamo.replay` requires exactly one of: either a trace file, or all of
   `--input-tokens`, `--output-tokens`, and `--request-count`
 - `--replay-concurrency` works with both trace replay and synthetic replay
-- `--speedup-ratio` still affects simulated timing
+- mocker compute-speed knobs such as `speedup_ratio` still affect simulated timing when passed via
+  `--extra-engine-args`
 - `--arrival-speedup-ratio` affects trace timestamps, not worker compute speed
 - `--arrival-interval-ms` only applies to synthetic replay
+- `--turns-per-session`, `--shared-prefix-ratio`, `--num-prefix-groups`, and
+  `--inter-turn-delay-ms` only apply to synthetic replay
 - `--extra-engine-args` and `--router-config` are JSON strings on the standalone replay CLI
 - offline replay does not need planner runtime setup, router registration, or external event transport
 - the replay block size should match the trace block size, because token synthesis expands `hash_ids`
...
...
docs/mocker/mocker.md
View file @
b2c59aa4
...
...
@@ -125,8 +125,11 @@ python -m dynamo.mocker \
## Trace Replay
 The mocker supports replaying Mooncake-style traces through the dedicated replay CLI, which exposes
-`offline|online`, `round_robin|kv_router`, `arrival_speedup_ratio`, and the synthetic replay path
-directly:
+`offline|online`, `round_robin|kv_router`, `arrival_speedup_ratio`, closed-loop concurrency
+admission, and synthetic workload generation directly:
+
+The replay CLI defaults to `--replay-mode offline` and `--router-mode round_robin`. Engine settings
+such as `block_size`, `engine_type`, and compute speedups still belong in `--extra-engine-args`.

 ```bash
 python -m dynamo.replay /path/to/mooncake_trace.jsonl \
...
...
@@ -154,9 +157,40 @@ python -m dynamo.replay \
   --report-json /tmp/replay-report.json
 ```

+Synthetic replay also supports workload-style generation for shared-prefix and multi-turn tests:
+
+```bash
+python -m dynamo.replay \
+  --input-tokens 5000 \
+  --output-tokens 500 \
+  --request-count 200 \
+  --turns-per-session 3 \
+  --shared-prefix-ratio 0.5 \
+  --num-prefix-groups 8 \
+  --inter-turn-delay-ms 250 \
+  --replay-mode offline \
+  --replay-concurrency 32 \
+  --extra-engine-args '{"block_size":512}' \
+  --report-json /tmp/replay-report.json
+```
+
+For trace files, replay also understands multi-turn sessions when records share `session_id`. The
+first turn uses `timestamp` / `created_time`; later turns can use `delay` or `delay_ms`:
+
+```json
+{"session_id": "session-a", "timestamp": 1000, "input_length": 2048, "output_length": 128, "hash_ids": [1, 2, 3, 4]}
+{"session_id": "session-a", "delay": 250, "input_length": 2560, "output_length": 128, "hash_ids": [1, 2, 3, 4, 5]}
+```
+
 The standalone replay CLI prints an AIPerf-style summary table to stdout and writes the full replay
 report JSON to disk.

+Timing semantics:
+
+- trace mode honors first-turn timestamps and inter-turn delays
+- concurrency mode ignores first-turn timestamps but still enforces inter-turn delays
+- in concurrency mode, TTFT is measured from actual dispatch under the in-flight cap
+
 For full usage, constraints, and benchmarking guidance, see [Mocker Trace Replay](../benchmarks/mocker-trace-replay.md).

 Replay supports aggregated `vllm` and `sglang` engine configs. Internally replay uses canonical
...
...
lib/bench/Cargo.toml
View file @
b2c59aa4
...
...
@@ -40,11 +40,11 @@ reqwest = { workspace = true }
 serde = { workspace = true }
 serde_json = { workspace = true }
 tokio = { workspace = true }
+dynamo-mocker = { workspace = true }

 [dev-dependencies]
 async-trait = { workspace = true }
 dynamo-kv-router = { workspace = true, features = ["bench"] }
-dynamo-mocker = { workspace = true }
 dynamo-tokens = { workspace = true }
 minstant = "0.1.7"
 plotters = { version = "0.3", default-features = false, features = ["svg_backend", "line_series", "point_series", "full_palette"] }
...
...
lib/bench/kv_router/active_sequences_bench.rs
View file @
b2c59aa4
...
...
@@ -9,16 +9,11 @@ use clap::Parser;
use
common
::
NoopSequencePublisher
;
use
dynamo_kv_router
::
protocols
::
WorkerWithDpRank
;
use
dynamo_kv_router
::{
ActiveSequencesMultiWorker
,
OverlapScores
,
SequenceRequest
};
use
dynamo_mocker
::
common
::
protocols
::{
DirectRequest
,
KvEventPublishers
,
OutputSignal
};
use
dynamo_mocker
::
scheduler
::
Scheduler
;
use
dynamo_mocker
::
scheduler
::
SchedulerHandle
;
use
dynamo_mocker
::
loadgen
::
Trace
;
use
dynamo_tokens
::
SequenceHash
;
use
std
::
collections
::
HashMap
;
use
std
::
sync
::
Arc
;
use
tokio
::
sync
::
mpsc
;
use
tokio
::
task
::
JoinHandle
;
use
tokio
::
time
::{
Duration
,
Instant
};
use
uuid
::
Uuid
;
#[derive(Parser,
Debug)]
#[clap(
...
...
@@ -76,69 +71,46 @@ struct SequenceTrace {
/// completed=true → Free
/// 4. Collect timestamps for later replay
async
fn
generate_sequence_events
(
traces
:
&
[
Vec
<
MooncakeRequest
>
],
traces
:
&
[
Trace
],
num_gpu_blocks
:
usize
,
block_size
:
u32
,
trace_simulation_duration_ms
:
u64
,
)
->
anyhow
::
Result
<
Vec
<
Vec
<
SequenceTrace
>>>
{
println!
(
"Generating sequence events..."
);
let
sched_args
=
default_mock_engine_args
(
num_gpu_blocks
,
block_size
as
usize
)
?
;
let
scaled_traces
:
Vec
<
_
>
=
traces
.iter
()
.map
(|
worker_trace
|
scale_mooncake_trace
(
worker_trace
,
trace_simulation_duration_ms
))
.collect
();
let
progress
=
make_progress_bar
(
Some
(
traces
.iter
()
.map
(|
w
|
w
.len
()
as
u64
)
.sum
::
<
u64
>
()));
let
mut
tasks
:
Vec
<
JoinHandle
<
anyhow
::
Result
<
Vec
<
SequenceTrace
>>>>
=
Vec
::
new
();
for
worker_trace
in
scaled_traces
{
let
sched_args
=
sched_args
.clone
();
let
progress
=
progress
.clone
();
tasks
.push
(
tokio
::
spawn
(
async
move
{
let
(
output_tx
,
mut
output_rx
)
=
mpsc
::
unbounded_channel
::
<
OutputSignal
>
();
// No KvCacheEventSink — we only need output signals
let
scheduler
=
Scheduler
::
new
(
sched_args
,
0
,
Some
(
output_tx
),
KvEventPublishers
::
default
(),
None
,
);
let
artifacts
=
generate_replay_artifacts
(
traces
,
num_gpu_blocks
,
block_size
,
trace_simulation_duration_ms
,
)
.await
?
;
let
mut
all_traces
=
Vec
::
with_capacity
(
artifacts
.len
());
// Pre-compute metadata for each request before submission
let
mut
metadata
:
HashMap
<
Uuid
,
RequestMetadata
>
=
HashMap
::
new
();
for
req
in
&
worker_trace
{
let
block_hashes
:
Vec
<
SequenceHash
>
=
req
.hash_ids
for
artifact
in
artifacts
{
let
metadata
=
artifact
.requests
.iter
()
.map
(|
&
id
|
local_block_hash_from_id
(
id
,
block_size
)
.0
)
.collect
();
let
isl
=
req
.hash_ids
.len
()
*
block_size
as
usize
;
metadata
.insert
(
req
.uuid
,
.map
(|
request
|
{
(
request
.uuid
,
RequestMetadata
{
block_hashes
,
isl
,
output_length
:
req
.output_length
,
block_hashes
:
request
.replay_hashes.sequence_hashes
.clone
()
,
isl
:
request
.input_length
,
output_length
:
req
uest
.output_length
as
u64
,
},
);
}
)
})
.collect
::
<
HashMap
<
_
,
_
>>
();
// Spawn drain task that converts OutputSignals → SequenceTrace entries
let
drain_handle
:
JoinHandle
<
Vec
<
SequenceTrace
>>
=
tokio
::
spawn
(
async
move
{
let
mut
entries
=
Vec
::
new
();
let
mut
seen
:
HashMap
<
Uuid
,
bool
>
=
HashMap
::
new
();
let
mut
seen
=
HashMap
::
new
();
while
let
Some
(
signal
)
=
output_rx
.recv
()
.await
{
for
timed_signal
in
artifact
.output_signals
{
let
signal
=
timed_signal
.signal
;
let
request_id
=
signal
.uuid
.to_string
();
if
let
std
::
collections
::
hash_map
::
Entry
::
Vacant
(
e
)
=
seen
.entry
(
signal
.uuid
)
{
e
.insert
(
false
);
if
let
std
::
collections
::
hash_map
::
Entry
::
Vacant
(
entry
)
=
seen
.entry
(
signal
.uuid
)
{
entry
.insert
(());
if
let
Some
(
meta
)
=
metadata
.get
(
&
signal
.uuid
)
{
entries
.push
(
SequenceTrace
{
entry
:
SequenceTraceEntry
::
Add
{
...
...
@@ -147,92 +119,26 @@ async fn generate_sequence_events(
isl
:
meta
.isl
,
output_length
:
meta
.output_length
,
},
timestamp_us
:
0
,
// rescaled later
timestamp_us
:
timed_signal
.timestamp_us
,
});
entries
.push
(
SequenceTrace
{
entry
:
SequenceTraceEntry
::
PrefillComplete
{
request_id
:
request_id
.clone
(),
},
timestamp_us
:
0
,
timestamp_us
:
timed_signal
.timestamp_us
,
});
}
}
if
signal
.completed
{
seen
.insert
(
signal
.uuid
,
true
);
entries
.push
(
SequenceTrace
{
entry
:
SequenceTraceEntry
::
Free
{
request_id
},
timestamp_us
:
0
,
});
}
}
entries
});
// Submit requests at scaled timing
let
mut
i
=
0
;
let
mut
target
=
Instant
::
now
();
let
start
=
target
;
while
i
<
worker_trace
.len
()
{
let
prev_i
=
i
;
scheduler
.receive
(
DirectRequest
{
tokens
:
tokens_from_request
(
&
worker_trace
[
i
],
block_size
),
max_output_tokens
:
worker_trace
[
i
]
.output_length
as
usize
,
uuid
:
Some
(
worker_trace
[
i
]
.uuid
),
dp_rank
:
0
,
arrival_timestamp_ms
:
None
,
});
i
+=
1
;
while
i
<
worker_trace
.len
()
&&
worker_trace
[
i
]
.timestamp
==
worker_trace
[
i
-
1
]
.timestamp
{
scheduler
.receive
(
DirectRequest
{
tokens
:
tokens_from_request
(
&
worker_trace
[
i
],
block_size
),
max_output_tokens
:
worker_trace
[
i
]
.output_length
as
usize
,
uuid
:
Some
(
worker_trace
[
i
]
.uuid
),
dp_rank
:
0
,
arrival_timestamp_ms
:
None
,
timestamp_us
:
timed_signal
.timestamp_us
,
});
i
+=
1
;
}
if
i
<
worker_trace
.len
()
{
target
+=
Duration
::
from_millis
(
worker_trace
[
i
]
.timestamp
-
worker_trace
[
i
-
1
]
.timestamp
,
);
}
tokio
::
time
::
sleep_until
(
target
)
.await
;
progress
.inc
((
i
-
prev_i
)
as
u64
);
}
// Drop scheduler → CancelGuard fires → background task exits →
// output_tx dropped → drain task sees None
drop
(
scheduler
);
let
mut
entries
=
drain_handle
.await
?
;
// Assign monotonically increasing timestamps based on entry order
let
total_us
=
(
Instant
::
now
()
-
start
)
.as_micros
()
as
u64
;
let
num_entries
=
entries
.len
()
as
u64
;
for
(
idx
,
entry
)
in
entries
.iter_mut
()
.enumerate
()
{
entry
.timestamp_us
=
if
num_entries
>
1
{
idx
as
u64
*
total_us
/
(
num_entries
-
1
)
}
else
{
0
};
}
Ok
(
entries
)
}));
}
let
mut
all_traces
=
Vec
::
new
();
for
task
in
tasks
{
all_traces
.push
(
task
.await
??
);
all_traces
.push
(
entries
);
}
let
total_adds
=
all_traces
...
...
@@ -503,30 +409,44 @@ async fn run_tests() -> anyhow::Result<()> {
));
{
let
mut
f
=
File
::
create
(
&
path
)
?
;
for
(
i
,
(
hash_ids
,
output_length
))
in
[(
&
[
0u64
,
1
,
2
]
as
&
[
u64
],
10u64
),
(
&
[
0
,
1
,
3
,
4
],
10
)]
.iter
()
.enumerate
()
{
writeln!
(
f
,
"{}"
,
serde_json
::
json!
({
"timestamp"
:
i
as
u64
,
"hash_ids"
:
hash_ids
,
"output_length"
:
output_length
,
"session_id"
:
"session-a"
,
"timestamp"
:
0
,
"input_length"
:
4
,
"hash_ids"
:
[
0u64
,
1
,
2
,
3
],
"output_length"
:
10u64
,
})
)
?
;
writeln!
(
f
,
"{}"
,
serde_json
::
json!
({
"session_id"
:
"session-a"
,
"delay"
:
5.0
,
"input_length"
:
4
,
"hash_ids"
:
[
4u64
,
5
,
6
,
7
],
"output_length"
:
10u64
,
})
)
?
;
}
}
let
traces
=
process_mooncake_trace
(
path
.to_str
()
.unwrap
(),
1
,
1
,
2
,
42
)
?
;
let
traces
=
process_mooncake_trace
(
path
.to_str
()
.unwrap
(),
512
,
1
,
1
,
1
,
42
)
?
;
std
::
fs
::
remove_file
(
&
path
)
.ok
();
println!
(
"Loaded {} workers, {} total requests"
,
traces
.len
(),
traces
.iter
()
.map
(|
t
|
t
.len
())
.sum
::
<
usize
>
()
traces
.iter
()
.map
(|
trace
|
trace
.sessions
.iter
()
.map
(|
session
|
session
.turns
.len
())
.sum
::
<
usize
>
())
.sum
::
<
usize
>
()
);
let
seq_traces
=
generate_sequence_events
(
&
traces
,
1048576
,
512
,
100
)
.await
?
;
...
...
@@ -545,6 +465,29 @@ async fn run_tests() -> anyhow::Result<()> {
     assert!(total_adds > 0, "expected at least one Add event");
     assert!(total_frees > 0, "expected at least one Free event");
     assert_eq!(total_adds, total_frees, "adds and frees should match");
+    for trace in &seq_traces {
+        assert!(
+            trace
+                .windows(2)
+                .all(|window| window[1].timestamp_us >= window[0].timestamp_us)
+        );
+    }
+    let first_free_us = seq_traces[0]
+        .iter()
+        .find_map(|entry| match entry.entry {
+            SequenceTraceEntry::Free { .. } => Some(entry.timestamp_us),
+            _ => None,
+        })
+        .unwrap();
+    let second_add_us = seq_traces[0]
+        .iter()
+        .filter_map(|entry| match entry.entry {
+            SequenceTraceEntry::Add { .. } => Some(entry.timestamp_us),
+            _ => None,
+        })
+        .nth(1)
+        .unwrap();
+    assert!(second_add_us >= first_free_us);

     println!("All tests passed.");
     Ok(())
...
...
@@ -567,6 +510,7 @@ async fn main() -> anyhow::Result<()> {
     };
     let traces = process_mooncake_trace(
         path,
+        args.common.block_size,
         args.common.trace_length_factor,
         args.common.trace_duplication_factor,
         args.common.num_unique_inference_workers,
...
...
lib/bench/kv_router/common/mod.rs
View file @
b2c59aa4
...
...
@@ -12,7 +12,11 @@ use dynamo_kv_router::protocols::{
 };
 pub use dynamo_kv_router::test_utils::{NoopSequencePublisher, SimpleWorkerConfig};
 use dynamo_mocker::common::protocols::{
-    DirectRequest, KvCacheEventSink, KvEventPublishers, MockEngineArgs,
+    DirectRequest, KvCacheEventSink, KvEventPublishers, MockEngineArgs, OutputSignal,
+};
+use dynamo_mocker::loadgen::{
+    ArrivalSpec, DelaySpec, LengthSpec, ReplayRequestHashes, RouterSequence, SequenceHashMode,
+    SessionPartitionSpec, SyntheticTraceSpec, Trace,
 };
 use dynamo_mocker::scheduler::Scheduler;
 use dynamo_mocker::scheduler::SchedulerHandle;
...
...
@@ -24,6 +28,7 @@ use serde::{Deserialize, Serialize};
 use std::fs::File;
 use std::io::{BufRead, BufReader};
 use std::sync::{Arc, Mutex};
+use tokio::sync::mpsc;
 use tokio::task::JoinHandle;
 use tokio::time::Instant;
 use uuid::Uuid;
...
...
@@ -101,6 +106,8 @@ pub struct MooncakeRequest {
     #[serde(default = "Uuid::new_v4")]
     pub uuid: uuid::Uuid,
     pub timestamp: u64,
+    #[serde(default)]
+    pub input_length: usize,
     pub hash_ids: Vec<u64>,
     pub output_length: u64,
 }
...
...
@@ -133,6 +140,35 @@ impl KvCacheEventSink for EventCollector {
     }
 }

+#[derive(Clone)]
+pub struct TimedReplayRequest {
+    pub uuid: Uuid,
+    pub timestamp_us: u64,
+    pub scheduled_ready_at_ms: f64,
+    pub input_length: usize,
+    pub output_length: usize,
+    pub replay_hashes: ReplayRequestHashes,
+}
+
+#[derive(Clone)]
+pub struct TimedOutputSignal {
+    pub signal: OutputSignal,
+    pub timestamp_us: u64,
+}
+
+#[derive(Clone)]
+pub struct TimedKvEvent {
+    pub event: KvCacheEvent,
+    pub timestamp_us: u64,
+}
+
+#[derive(Clone)]
+pub struct WorkerReplayArtifacts {
+    pub requests: Vec<TimedReplayRequest>,
+    pub output_signals: Vec<TimedOutputSignal>,
+    pub kv_events: Vec<TimedKvEvent>,
+}
+
 /// Load the mooncake trace from disk into a flat list of requests.
 pub fn load_mooncake_trace(path: &str) -> anyhow::Result<Vec<MooncakeRequest>> {
     let file = File::open(path)?;
...
...
@@ -257,11 +293,15 @@ pub fn duplicate_traces(requests: Vec<MooncakeRequest>, factor: usize) -> Vec<Mo
 /// Expand a request's block-level hash_ids into per-token IDs by repeating each
 /// hash_id `block_size` times.
 pub fn tokens_from_request(request: &MooncakeRequest, block_size: u32) -> Vec<u32> {
-    request
+    let mut tokens = request
         .hash_ids
         .iter()
         .flat_map(|id| (0..block_size).map(|_| *id as u32))
-        .collect()
+        .collect::<Vec<_>>();
+    if request.input_length > 0 && request.input_length < tokens.len() {
+        tokens.truncate(request.input_length);
+    }
+    tokens
 }
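
For readers following the Python-facing docs, the expansion performed by `tokens_from_request` can be restated as a short Python sketch. This is an illustrative port, not code from the commit: `tokens_from_hash_ids` is a hypothetical name, and the `& 0xFFFFFFFF` mask is an assumption standing in for the Rust `as u32` cast.

```python
def tokens_from_hash_ids(hash_ids, block_size, input_length=0):
    """Repeat each hash ID block_size times, then truncate to input_length if it is set.

    input_length == 0 means "not provided", mirroring the serde default in the Rust struct.
    """
    # Each block-level hash ID stands in for block_size identical token IDs.
    tokens = [hash_id & 0xFFFFFFFF for hash_id in hash_ids for _ in range(block_size)]
    # Trim only when a positive input_length is shorter than the expanded prompt.
    if 0 < input_length < len(tokens):
        del tokens[input_length:]
    return tokens


full = tokens_from_hash_ids([7, 8], block_size=3)
trimmed = tokens_from_hash_ids([7, 8], block_size=3, input_length=4)
```

Here `full` holds six tokens (three copies of each ID) and `trimmed` keeps only the first four, so a partially filled final block is represented by cutting the expanded prompt short.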
/// Compute the LocalBlockHash for a block-level hash_id the same way the mock
...
...
@@ -304,15 +344,19 @@ pub struct BenchmarkResults {
 /// Load, transform, and partition the mooncake trace into per-worker request lists.
 pub fn process_mooncake_trace(
     path: &str,
+    block_size: u32,
     trace_length_factor: usize,
     trace_duplication_factor: usize,
     num_workers: usize,
     seed: u64,
-) -> anyhow::Result<Vec<Vec<MooncakeRequest>>> {
-    let requests = load_mooncake_trace(path)?;
-    let requests = expand_trace_lengths(requests, trace_length_factor);
-    let requests = duplicate_traces(requests, trace_duplication_factor);
-    Ok(partition_trace(requests, num_workers, seed))
+) -> anyhow::Result<Vec<Trace>> {
+    let trace = Trace::from_mooncake(std::path::Path::new(path), block_size as usize)?
+        .expand_hash_prefix_depth(trace_length_factor)
+        .duplicate_hash_space(trace_duplication_factor);
+    Ok(trace.partition_by_session(SessionPartitionSpec::Random {
+        num_partitions: num_workers,
+        seed,
+    }))
 }
/// Build default MockEngineArgs suitable for event generation.
...
...
@@ -330,98 +374,155 @@ pub fn default_mock_engine_args(
.build
()
?
)
}
/// Replay each worker's request trace through a mock engine in real-time to
/// produce the KV cache events (store/remove/clear) that the engine would emit.
///
/// Returns one event list per worker, each entry paired with the wall-clock
/// instant it was produced.
pub
async
fn
generate_kv_events
(
traces
:
&
[
Vec
<
MooncakeRequest
>
],
num_gpu_blocks
:
usize
,
block_size
:
u32
,
async
fn
replay_worker_trace
(
trace
:
Trace
,
sched_args
:
MockEngineArgs
,
trace_simulation_duration_ms
:
u64
,
)
->
anyhow
::
Result
<
Vec
<
Vec
<
(
KvCacheEvent
,
Instant
)
>>>
{
println!
(
"Generating events..."
);
let
sched_args
=
default_mock_engine_args
(
num_gpu_blocks
,
block_size
as
usize
)
?
;
let
scaled_traces
=
traces
progress
:
ProgressBar
,
)
->
anyhow
::
Result
<
WorkerReplayArtifacts
>
{
let
total_turns
=
trace
.sessions
.iter
()
.map
(|
worker_trace
|
scale_mooncake_trace
(
worker_trace
,
trace_simulation_duration_ms
));
let
progress
=
make_progress_bar
(
Some
(
traces
.iter
()
.map
(|
worker
|
worker
.len
()
as
u64
)
.sum
::
<
u64
>
(),
));
let
mut
tasks
:
Vec
<
JoinHandle
<
Vec
<
(
KvCacheEvent
,
Instant
)
>>>
=
Vec
::
new
();
for
worker_trace
in
scaled_traces
{
let
sched_args
=
sched_args
.clone
();
let
progress
=
progress
.clone
();
tasks
.push
(
tokio
::
spawn
(
async
move
{
.map
(|
session
|
session
.turns
.len
())
.sum
::
<
usize
>
();
let
mut
driver
=
trace
.rescale_ready_span
(
trace_simulation_duration_ms
)
?
.into_trace_driver
()
?
;
let
collector
=
EventCollector
::
new
();
let
(
output_tx
,
mut
output_rx
)
=
mpsc
::
unbounded_channel
::
<
OutputSignal
>
();
let
scheduler
=
Scheduler
::
new
(
sched_args
,
0
,
None
,
Some
(
output_tx
)
,
KvEventPublishers
::
new
(
Some
(
collector
.clone
()),
None
),
None
,
);
let
start
=
Instant
::
now
();
let
mut
requests
=
Vec
::
with_capacity
(
total_turns
);
let
mut
output_signals
=
Vec
::
new
();
let
mut
completed_turns
=
0u
size
;
while
completed_turns
<
total_turns
{
let
now_ms
=
start
.elapsed
()
.as_secs_f64
()
*
1000.0
;
for
ready_turn
in
driver
.pop_ready
(
now_ms
,
usize
::
MAX
)
{
let
replay_hashes
=
ready_turn
.replay_hashes
.ok_or_else
(||
{
anyhow
::
anyhow!
(
"bench replay requires synthesized request hashes"
)
})
?
;
requests
.push
(
TimedReplayRequest
{
uuid
:
ready_turn
.request_uuid
,
timestamp_us
:
start
.elapsed
()
.as_micros
()
as
u64
,
scheduled_ready_at_ms
:
ready_turn
.scheduled_ready_at_ms
,
input_length
:
ready_turn
.request.tokens
.len
(),
output_length
:
ready_turn
.request.max_output_tokens
,
replay_hashes
,
});
scheduler
.receive
(
ready_turn
.request
);
progress
.inc
(
1
);
}
let
mut
i
=
0
;
let
mut
target
=
Instant
::
now
();
if
completed_turns
>=
total_turns
{
break
;
}
while
i
<
worker_trace
.len
()
{
let
prev_i
=
i
;
scheduler
.receive
(
DirectRequest
{
tokens
:
tokens_from_request
(
&
worker_trace
[
i
],
block_size
),
max_output_tokens
:
worker_trace
[
i
]
.output_length
as
usize
,
uuid
:
Some
(
worker_trace
[
i
]
.uuid
),
dp_rank
:
0
,
arrival_timestamp_ms
:
None
,
match
driver
.next_ready_time_ms
()
{
Some
(
next_ready_ms
)
=>
{
let
deadline
=
start
+
Duration
::
from_secs_f64
((
next_ready_ms
.max
(
0.0
))
/
1000.0
);
tokio
::
select!
{
maybe_signal
=
output_rx
.recv
()
=>
{
let
Some
(
signal
)
=
maybe_signal
else
{
anyhow
::
bail!
(
"scheduler ended before workload replay drained"
);
};
output_signals
.push
(
TimedOutputSignal
{
signal
:
signal
.clone
(),
timestamp_us
:
start
.elapsed
()
.as_micros
()
as
u64
,
});
i
+=
1
;
while
i
<
worker_trace
.len
()
&&
worker_trace
[
i
]
.timestamp
==
worker_trace
[
i
-
1
]
.timestamp
{
scheduler
.receive
(
DirectRequest
{
tokens
:
tokens_from_request
(
&
worker_trace
[
i
],
block_size
),
max_output_tokens
:
worker_trace
[
i
]
.output_length
as
usize
,
uuid
:
Some
(
worker_trace
[
i
]
.uuid
),
dp_rank
:
0
,
arrival_timestamp_ms
:
None
,
if
signal
.completed
{
completed_turns
+=
1
;
driver
.on_complete
(
signal
.uuid
,
start
.elapsed
()
.as_secs_f64
()
*
1000.0
)
?
;
}
}
_
=
tokio
::
time
::
sleep_until
(
deadline
)
=>
{}
}
}
None
=>
{
let
Some
(
signal
)
=
output_rx
.recv
()
.await
else
{
anyhow
::
bail!
(
"scheduler ended before workload replay drained"
);
};
output_signals
.push
(
TimedOutputSignal
{
signal
:
signal
.clone
(),
timestamp_us
:
start
.elapsed
()
.as_micros
()
as
u64
,
});
i
+=
1
;
if
signal
.completed
{
completed_turns
+=
1
;
driver
.on_complete
(
signal
.uuid
,
start
.elapsed
()
.as_secs_f64
()
*
1000.0
)
?
;
}
if
i
<
worker_trace
.len
()
{
target
+=
Duration
::
from_millis
(
worker_trace
[
i
]
.timestamp
-
worker_trace
[
i
-
1
]
.timestamp
,
);
}
tokio
::
time
::
sleep_until
(
target
)
.await
;
progress
.inc
((
i
-
prev_i
)
as
u64
);
}
}
drop
(
scheduler
);
Ok
(
WorkerReplayArtifacts
{
requests
,
output_signals
,
kv_events
:
collector
.get_events
()
.into_iter
()
.map
(|(
event
,
timestamp
)|
TimedKvEvent
{
event
,
timestamp_us
:
timestamp
.saturating_duration_since
(
start
)
.as_micros
()
as
u64
,
})
.collect
(),
})
}
pub
async
fn
generate_replay_artifacts
(
traces
:
&
[
Trace
],
num_gpu_blocks
:
usize
,
block_size
:
u32
,
trace_simulation_duration_ms
:
u64
,
)
->
anyhow
::
Result
<
Vec
<
WorkerReplayArtifacts
>>
{
println!
(
"Generating events..."
);
let
sched_args
=
default_mock_engine_args
(
num_gpu_blocks
,
block_size
as
usize
)
?
;
let
progress
=
make_progress_bar
(
Some
(
traces
.iter
()
.map
(|
trace
|
{
trace
.sessions
.iter
()
.map
(|
session
|
session
.turns
.len
()
as
u64
)
.sum
::
<
u64
>
()
})
.sum
::
<
u64
>
(),
));
collector
.get_events
()
let
mut
tasks
:
Vec
<
JoinHandle
<
anyhow
::
Result
<
WorkerReplayArtifacts
>>>
=
Vec
::
new
();
for
trace
in
traces
.iter
()
.cloned
()
{
let
sched_args
=
sched_args
.clone
();
let
progress
=
progress
.clone
();
tasks
.push
(
tokio
::
spawn
(
async
move
{
replay_worker_trace
(
trace
,
sched_args
,
trace_simulation_duration_ms
,
progress
)
.await
}));
}
let
mut
events
=
Vec
::
new
();
let
mut
artifacts
=
Vec
::
new
();
for
task
in
tasks
{
events
.push
(
task
.await
?
);
artifacts
.push
(
task
.await
?
?
);
}
for
worker_events
in
&
events
{
for
worker_events
in
artifacts
.iter
()
.map
(|
artifact
|
&
artifact
.kv_events
)
{
for
i
in
1
..
worker_events
.len
()
{
assert!(worker_events[i].1 >= worker_events[i - 1].1);
assert!(worker_events[i].timestamp_us >= worker_events[i - 1].timestamp_us);
}
}
println!
(
"Generated {} events. Processing..."
,
events
.iter
()
.map
(|
e
|
e
.len
())
.sum
::
<
usize
>
()
artifacts
.iter
()
.map
(|
artifact
|
artifact
.kv_events
.len
())
.sum
::
<
usize
>
()
);
if
progress
.elapsed
()
>
Duration
::
from_millis
(
trace_simulation_duration_ms
*
11
/
10
)
{
...
...
@@ -432,8 +533,11 @@ pub async fn generate_kv_events(
let
mut
num_stored_events
=
0
;
let
mut
num_removed_events
=
0
;
for
event
in
events
.iter
()
.flatten
()
{
match
event
.0
.data
{
for
event
in
artifacts
.iter
()
.flat_map
(|
artifact
|
artifact
.kv_events
.iter
())
{
match
event
.event.data
{
KvCacheEventData
::
Stored
(
_
)
=>
num_stored_events
+=
1
,
KvCacheEventData
::
Removed
(
_
)
=>
num_removed_events
+=
1
,
_
=>
(),
...
...
@@ -443,7 +547,25 @@ pub async fn generate_kv_events(
println!
(
"Store events: {}"
,
num_stored_events
);
println!
(
"Remove events: {}"
,
num_removed_events
);
Ok
(
events
)
Ok
(
artifacts
)
}
pub async fn generate_kv_events(
    traces: &[Trace],
    num_gpu_blocks: usize,
    block_size: u32,
    trace_simulation_duration_ms: u64,
) -> anyhow::Result<Vec<Vec<TimedKvEvent>>> {
    Ok(generate_replay_artifacts(
        traces,
        num_gpu_blocks,
        block_size,
        trace_simulation_duration_ms,
    )
    .await?
    .into_iter()
    .map(|artifact| artifact.kv_events)
    .collect())
}
pub
fn
plot_sweep
(
...
...
@@ -591,6 +713,16 @@ pub struct SequenceData {
pub
external_hashes
:
Vec
<
ExternalSequenceBlockHash
>
,
}
impl From<RouterSequence> for SequenceData {
    fn from(sequence: RouterSequence) -> Self {
        Self {
            worker_id: sequence.worker_id,
            local_hashes: sequence.local_hashes,
            external_hashes: sequence.external_hashes,
        }
    }
}
impl
SequenceData
{
/// Create a new sequence with synthetic hashes based on sequence ID.
pub
fn
new
(
seq_id
:
u64
,
worker_id
:
WorkerId
,
depth
:
usize
)
->
Self
{
...
...
@@ -673,58 +805,46 @@ pub fn generate_sequences(
seed
:
u64
,
use_cumulative_hash
:
bool
,
)
->
Vec
<
SequenceData
>
{
let
mut
sequences
=
Vec
::
with_capacity
(
num_sequences
);
let
prefix_length
=
(
depth
as
f64
*
prefix_ratio
)
.round
()
as
usize
;
let
mut
rng
:
StdRng
=
StdRng
::
seed_from_u64
(
seed
);
for
seq_id
in
0
..
num_sequences
{
let
seq_id_u64
=
seq_id
as
u64
;
let
worker_id
=
(
seq_id
%
num_workers
)
as
WorkerId
;
let
group_id
=
if
num_prefix_groups
>
0
&&
prefix_length
>
0
{
Some
(
rng
.random_range
(
0
..
num_prefix_groups
)
as
u64
)
let
trace
=
Trace
::
synthetic
(
SyntheticTraceSpec
{
block_size
:
1
,
num_sessions
:
num_sequences
,
turns_per_session
:
1
,
input_tokens
:
LengthSpec
{
mean
:
depth
,
stddev
:
0.0
,
},
output_tokens
:
LengthSpec
{
mean
:
1
,
stddev
:
0.0
,
},
shared_prefix_ratio
:
prefix_ratio
,
num_prefix_groups
,
first_turn_arrivals
:
ArrivalSpec
::
Burst
,
inter_turn_delays
:
DelaySpec
::
None
,
seed
,
})
.expect
(
"sequence generation spec must be valid"
);
let
hash_mode
=
if
use_cumulative_hash
{
SequenceHashMode
::
Cumulative
}
else
{
None
SequenceHashMode
::
Raw
};
let
local_hashes
:
Vec
<
LocalBlockHash
>
=
(
0
..
depth
)
.map
(|
block_idx
|
{
let
block_idx_u64
=
block_idx
as
u64
;
if
let
Some
(
gid
)
=
group_id
&&
block_idx
<
prefix_length
{
return
LocalBlockHash
(
0xDEAD_BEEF_0000_0000
|
(
gid
<<
32
)
|
block_idx_u64
);
}
LocalBlockHash
((
seq_id_u64
<<
32
)
|
block_idx_u64
)
trace
.partition_by_session
(
SessionPartitionSpec
::
RoundRobin
{
num_partitions
:
num_workers
,
})
.collect
();
if
use_cumulative_hash
{
sequences
.push
(
SequenceData
::
from_local_hashes
(
worker_id
,
local_hashes
));
}
else
{
let
external_hashes
:
Vec
<
ExternalSequenceBlockHash
>
=
(
0
..
depth
)
.map
(|
block_idx
|
{
let
block_idx_u64
=
block_idx
as
u64
;
if
let
Some
(
gid
)
=
group_id
&&
block_idx
<
prefix_length
{
return
ExternalSequenceBlockHash
(
0xDEAD_BEEF_0000_0000
|
(
gid
<<
32
)
|
block_idx_u64
,
);
}
ExternalSequenceBlockHash
((
seq_id_u64
<<
32
)
|
block_idx_u64
)
.into_iter
()
.enumerate
()
.flat_map
(|(
worker_idx
,
partition
)|
{
partition
.to_router_sequences
(
worker_idx
as
WorkerId
,
hash_mode
)
.expect
(
"synthetic trace conversion must succeed"
)
.into_iter
()
.map
(
SequenceData
::
from
)
.collect
::
<
Vec
<
_
>>
()
})
.collect
();
sequences
.push
(
SequenceData
{
worker_id
,
local_hashes
,
external_hashes
,
});
}
}
sequences
.collect
()
}
/// Compute median of durations.
...
...
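The rewritten `generate_sequences` assigns sessions to workers through `SessionPartitionSpec::RoundRobin`, where the removed loop used `seq_id % num_workers`. The following is a minimal Python sketch of that assignment rule, assuming sessions are dealt out in their original order. The helper name `partition_round_robin` is hypothetical, and the real `partition_by_session` may differ in details that are not visible in this diff.

```python
def partition_round_robin(sessions, num_partitions):
    """Deal sessions out to partitions in order: 0, 1, ..., n-1, 0, 1, ...

    Hypothetical helper for illustration; not part of the commit.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be at least 1")
    partitions = [[] for _ in range(num_partitions)]
    for index, session in enumerate(sessions):
        # Same rule as the removed `seq_id % num_workers` assignment.
        partitions[index % num_partitions].append(session)
    return partitions
```

With five sessions and two partitions, the first partition receives the sessions at even positions and the second receives the rest, so every turn of a given session stays on one worker.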
@@ -736,3 +856,60 @@ pub fn median(durations: &[Duration]) -> Duration {
sorted
.sort
();
sorted
[
sorted
.len
()
/
2
]
}
#[cfg(test)]
mod tests {
    use super::*;

    fn multiturn_trace() -> Trace {
        Trace {
            block_size: 2,
            sessions: vec![dynamo_mocker::loadgen::SessionTrace {
                session_id: "session-a".to_string(),
                first_arrival_timestamp_ms: Some(0.0),
                turns: vec![
                    dynamo_mocker::loadgen::TurnTrace {
                        input_length: 4,
                        max_output_tokens: 2,
                        hash_ids: vec![1, 2],
                        delay_after_previous_ms: 0.0,
                    },
                    dynamo_mocker::loadgen::TurnTrace {
                        input_length: 4,
                        max_output_tokens: 2,
                        hash_ids: vec![3, 4],
                        delay_after_previous_ms: 5.0,
                    },
                ],
            }],
        }
    }

    #[tokio::test]
    async fn test_replay_worker_trace_releases_follow_up_turn_after_completion_delay() {
        let artifacts = replay_worker_trace(
            multiturn_trace(),
            default_mock_engine_args(1024, 2).unwrap(),
            5,
            make_progress_bar(Some(2)),
        )
        .await
        .unwrap();

        assert_eq!(artifacts.requests.len(), 2);
        let first_uuid = artifacts.requests[0].uuid;
        let first_completion_ms = artifacts
            .output_signals
            .iter()
            .find(|signal| signal.signal.uuid == first_uuid && signal.signal.completed)
            .unwrap()
            .timestamp_us as f64
            / 1000.0;
        assert!(
            artifacts.requests[1].scheduled_ready_at_ms + 0.1 >= first_completion_ms + 5.0,
            "expected follow-up turn to wait for completion plus delay, got ready_at={} completion_at={}",
            artifacts.requests[1].scheduled_ready_at_ms,
            first_completion_ms
        );
    }
}
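The test above checks that a follow-up turn becomes ready only once the previous turn has completed and `delay_after_previous_ms` has elapsed. A rough Python sketch of that release rule follows. It assumes each turn starts the moment it is ready, with no queueing, and uses a made-up `service_time_ms` field to stand in for the mock engine; both are illustrative assumptions, not the driver's actual model.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    delay_after_previous_ms: float
    service_time_ms: float  # hypothetical stand-in for the engine's latency


def schedule_session(first_arrival_ms, turns):
    """Return a list of (ready_at_ms, completed_at_ms), one pair per turn."""
    schedule = []
    ready_at = first_arrival_ms
    for index, turn in enumerate(turns):
        if index > 0:
            # A follow-up turn waits for the previous completion plus its delay.
            previous_completion = schedule[-1][1]
            ready_at = previous_completion + turn.delay_after_previous_ms
        completed_at = ready_at + turn.service_time_ms
        schedule.append((ready_at, completed_at))
    return schedule
```

For two turns that each take 3 ms, with a 5 ms delay before the second, the second turn is ready at 8 ms, which is the same relation the Rust assertion checks with a 0.1 ms tolerance.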
lib/bench/kv_router/mooncake_bench.rs
...
...
@@ -14,6 +14,7 @@ use dynamo_kv_router::protocols::{KvCacheEvent, KvCacheEventData, RouterEvent};
use
dynamo_kv_router
::{
ConcurrentRadixTree
,
ConcurrentRadixTreeCompressed
,
PositionalIndexer
,
ThreadPoolIndexer
,
};
use
dynamo_mocker
::
loadgen
::
Trace
;
use
serde
::
Serialize
;
use
std
::
sync
::
Arc
;
use
tokio
::
time
::{
Duration
,
Instant
};
...
...
@@ -194,68 +195,33 @@ struct WorkerTrace {
/// Timestamps are rescaled from the original trace / simulation durations
/// into the benchmark duration (microseconds).
fn
prepare_worker_traces
(
traces
:
Vec
<
Vec
<
MooncakeRequest
>>
,
events
:
Vec
<
Vec
<
(
KvCacheEvent
,
Instant
)
>>
,
block_size
:
u32
,
artifacts
:
Vec
<
WorkerReplayArtifacts
>
,
benchmark_duration_ms
:
u64
,
trace_simulation_duration_ms
:
u64
,
)
->
Vec
<
Vec
<
WorkerTrace
>>
{
assert!
(
traces
.len
()
==
events
.len
());
let
scaled_request_traces
:
Vec
<
_
>
=
traces
artifacts
.into_iter
()
.map
(|
trace
|
{
let
Some
(
first
)
=
trace
.first
()
else
{
return
Vec
::
new
();
};
let
first_ts
=
first
.timestamp
;
let
trace_duration_ms
=
trace
.last
()
.unwrap
()
.timestamp
-
first_ts
;
trace
.map
(|
artifact
|
{
let
mut
merged
=
artifact
.requests
.into_iter
()
.map
(|
request
|
WorkerTrace
{
timestamp_us
:
if
trace_duration_ms
==
0
{
timestamp_us
:
request
.timestamp_us
,
entry
:
WorkerTraceEntry
::
Request
(
request
.replay_hashes.local_block_hashes
),
})
.chain
(
artifact
.kv_events
.into_iter
()
.map
(|
event
|
WorkerTrace
{
timestamp_us
:
event
.timestamp_us
,
entry
:
WorkerTraceEntry
::
Event
(
event
.event
),
}))
.collect
::
<
Vec
<
_
>>
();
merged
.sort_by_key
(|
entry
|
entry
.timestamp_us
);
let
max_timestamp_us
=
merged
.last
()
.map
(|
entry
|
entry
.timestamp_us
)
.unwrap_or
(
0
);
for
entry
in
&
mut
merged
{
entry
.timestamp_us
=
if
max_timestamp_us
==
0
{
0
}
else
{
(
request
.timestamp
-
first_ts
)
*
1000
*
benchmark_duration_ms
/
trace_duration_ms
},
entry
:
WorkerTraceEntry
::
Request
(
request
.hash_ids
.iter
()
.map
(|
id
|
local_block_hash_from_id
(
*
id
,
block_size
))
.collect
(),
),
})
.collect
::
<
Vec
<
_
>>
()
})
.collect
();
let
scaled_event_traces
:
Vec
<
_
>
=
events
.into_iter
()
.map
(|
worker_events
|
{
let
Some
(
&
(
_
,
start_instant
))
=
worker_events
.first
()
else
{
return
Vec
::
new
();
entry
.timestamp_us
*
benchmark_duration_ms
*
1000
/
max_timestamp_us
};
worker_events
.into_iter
()
.map
(|(
event
,
timestamp
)|
WorkerTrace
{
timestamp_us
:
(
timestamp
-
start_instant
)
.as_micros
()
as
u64
*
benchmark_duration_ms
/
trace_simulation_duration_ms
,
entry
:
WorkerTraceEntry
::
Event
(
event
),
})
.collect
::
<
Vec
<
_
>>
()
})
.collect
();
scaled_request_traces
.into_iter
()
.zip
(
scaled_event_traces
)
.map
(|(
request_trace
,
event_trace
)|
{
let
mut
merged
:
Vec
<
WorkerTrace
>
=
request_trace
.into_iter
()
.chain
(
event_trace
)
.collect
();
merged
.sort_by_key
(|
entry
|
entry
.timestamp_us
);
}
merged
})
.collect
()
...
...
@@ -276,19 +242,12 @@ struct SweepStepResult {
/// flushed and latency percentiles / throughput stats are printed.
async
fn
run_benchmark
(
indexer
:
Arc
<
dyn
KvIndexerInterface
+
Send
+
Sync
>
,
traces
:
Vec
<
Vec
<
MooncakeRequest
>>
,
events
:
Vec
<
Vec
<
(
KvCacheEvent
,
Instant
)
>>
,
artifacts
:
Vec
<
WorkerReplayArtifacts
>
,
args
:
&
Args
,
benchmark_duration_ms
:
u64
,
count_events
:
bool
,
)
->
anyhow
::
Result
<
BenchmarkResults
>
{
let
worker_traces
=
prepare_worker_traces
(
traces
,
events
,
args
.common.block_size
,
benchmark_duration_ms
,
args
.common.trace_simulation_duration_ms
,
);
let
worker_traces
=
prepare_worker_traces
(
artifacts
,
benchmark_duration_ms
);
let
worker_traces
=
worker_traces
.into_iter
()
.map
(
Arc
::
new
)
.collect
::
<
Vec
<
_
>>
();
let
progress
=
make_progress_bar
(
Some
(
...
...
@@ -460,7 +419,7 @@ async fn run_benchmark(
})
}
fn
run_tests
()
->
anyhow
::
Result
<
()
>
{
async
fn
run_tests
()
->
anyhow
::
Result
<
()
>
{
use
std
::
collections
::
HashSet
;
use
std
::
fs
::
File
;
use
std
::
io
::
Write
;
...
...
@@ -479,6 +438,7 @@ fn run_tests() -> anyhow::Result<()> {
"{}"
,
serde_json
::
json!
({
"timestamp"
:
i
as
u64
,
"input_length"
:
hash_ids
.len
(),
"hash_ids"
:
hash_ids
,
"output_length"
:
output_length
,
})
...
...
@@ -486,12 +446,13 @@ fn run_tests() -> anyhow::Result<()> {
}
}
let
traces
=
process_mooncake_trace
(
path
.to_str
()
.unwrap
(),
2
,
2
,
2
,
42
)
?
;
let
traces
=
process_mooncake_trace
(
path
.to_str
()
.unwrap
(),
512
,
2
,
2
,
2
,
42
)
?
;
std
::
fs
::
remove_file
(
&
path
)
.ok
();
let
mut
all_hashes
:
Vec
<
Vec
<
u64
>>
=
traces
.into_iter
()
.flat_map
(|
w
|
w
.into_iter
()
.map
(|
r
|
r
.hash_ids
))
.flat_map
(|
worker
|
worker
.sessions
.into_iter
())
.flat_map
(|
session
|
session
.turns
.into_iter
()
.map
(|
turn
|
turn
.hash_ids
))
.collect
();
all_hashes
.sort
();
...
...
@@ -519,6 +480,43 @@ fn run_tests() -> anyhow::Result<()> {
let
set1
:
HashSet
<
u64
>
=
copy1
.iter
()
.flat_map
(|
h
|
h
.iter
()
.copied
())
.collect
();
assert!
(
set0
.is_disjoint
(
&
set1
),
"copies are not hash-disjoint"
);
    let replay_trace = Trace {
        block_size: 2,
        sessions: vec![dynamo_mocker::loadgen::SessionTrace {
            session_id: "session-a".to_string(),
            first_arrival_timestamp_ms: Some(0.0),
            turns: vec![
                dynamo_mocker::loadgen::TurnTrace {
                    input_length: 4,
                    max_output_tokens: 2,
                    hash_ids: vec![1, 2],
                    delay_after_previous_ms: 0.0,
                },
                dynamo_mocker::loadgen::TurnTrace {
                    input_length: 4,
                    max_output_tokens: 2,
                    hash_ids: vec![3, 4],
                    delay_after_previous_ms: 5.0,
                },
            ],
        }],
    };
    let artifacts = generate_replay_artifacts(&[replay_trace], 1024, 2, 5).await?;
    assert_eq!(artifacts.len(), 1);
    assert_eq!(artifacts[0].requests.len(), 2);
    let first_uuid = artifacts[0].requests[0].uuid;
    let first_completion_ms = artifacts[0]
        .output_signals
        .iter()
        .find(|signal| signal.signal.uuid == first_uuid && signal.signal.completed)
        .expect("first request must complete")
        .timestamp_us as f64
        / 1000.0;
    assert!(
        artifacts[0].requests[1].scheduled_ready_at_ms + 0.1 >= first_completion_ms + 5.0,
        "expected second request to wait for completion plus delay"
    );
println!
(
"All tests passed."
);
Ok
(())
}
...
...
@@ -528,7 +526,7 @@ async fn main() -> anyhow::Result<()> {
let
args
=
Args
::
parse
();
if
args
.common.test
{
return
run_tests
();
return
run_tests
()
.await
;
}
let
path
=
match
args
.common.mooncake_trace_path
.as_deref
()
{
...
...
@@ -540,12 +538,13 @@ async fn main() -> anyhow::Result<()> {
};
let
traces
=
process_mooncake_trace
(
path
,
args
.common.block_size
,
args
.common.trace_length_factor
,
args
.common.trace_duplication_factor
,
args
.common.num_unique_inference_workers
,
args
.common.seed
,
)
?
;
let
events
=
generate_kv_events
(
let
artifacts
=
generate_replay_artifacts
(
&
traces
,
args
.common.num_gpu_blocks
,
args
.common.block_size
,
...
...
@@ -599,15 +598,8 @@ async fn main() -> anyhow::Result<()> {
IndexerArgs
::
from_name
(
name
,
args
.common.block_size
,
args
.num_event_workers
)
?
};
let
count_events
=
IndexerArgs
::
supports_remove
(
name
);
let
result
=
run_benchmark
(
indexer
,
traces
.clone
(),
events
.clone
(),
&
args
,
dur_ms
,
count_events
,
)
.await
?
;
let
result
=
run_benchmark
(
indexer
,
artifacts
.clone
(),
&
args
,
dur_ms
,
count_events
)
.await
?
;
if
multi_threaded
{
if
result
.block_throughput
>=
result
.offered_block_throughput
*
0.95
{
...
...
@@ -674,8 +666,7 @@ async fn main() -> anyhow::Result<()> {
let
count_events
=
IndexerArgs
::
supports_remove
(
name
);
run_benchmark
(
indexer
,
traces
.clone
(),
events
.clone
(),
artifacts
.clone
(),
&
args
,
args
.common.benchmark_duration_ms
,
count_events
,
...
...
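In this file, `prepare_worker_traces` now merges each worker's requests and KV events into one list, sorts it by `timestamp_us`, and rescales every timestamp so that the latest entry lands at `benchmark_duration_ms * 1000` microseconds. The Python sketch below mirrors that arithmetic on bare integers. The function name is hypothetical, and Python integers do not overflow the way `u64` multiplication can in Rust.

```python
def rescale_timestamps_us(timestamps_us, benchmark_duration_ms):
    """Sort timestamps and stretch or shrink them into the benchmark window."""
    merged = sorted(timestamps_us)
    max_timestamp_us = merged[-1] if merged else 0
    if max_timestamp_us == 0:
        # Mirrors the Rust branch that maps everything to 0 in this case.
        return [0 for _ in merged]
    return [
        # Integer division, matching u64 `/` in the Rust code.
        ts * benchmark_duration_ms * 1000 // max_timestamp_us
        for ts in merged
    ]
```

Because requests and events are scaled by the same factor after merging, their relative order within a worker is preserved.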
lib/bench/src/bin/multiturn_bench.rs
...
...
@@ -13,6 +13,9 @@
use
anyhow
::{
Context
,
Result
};
use
clap
::
Parser
;
use
dynamo_bench
::
common
::{
ChatMessage
,
LatencyStats
,
fetch_model_name
};
use
dynamo_mocker
::
loadgen
::{
ArrivalSpec
,
DelaySpec
,
LengthSpec
,
SessionTrace
,
SyntheticTraceSpec
,
Trace
,
};
use
futures_util
::
StreamExt
;
use
indicatif
::{
ProgressBar
,
ProgressStyle
};
use
rand
::
rngs
::
StdRng
;
...
...
@@ -283,10 +286,10 @@ async fn run_user(
model
:
String
,
args
:
Arc
<
Args
>
,
user_id
:
usize
,
session
:
SessionTrace
,
progress
:
ProgressBar
,
)
->
Vec
<
TurnResult
>
{
let
mut
rng
=
StdRng
::
seed_from_u64
(
args
.seed
.wrapping_add
(
user_id
as
u64
));
let
mean_delay
=
args
.mean_delay_ms
as
f64
;
let
system_prompt
=
generate_system_prompt
(
user_id
);
let
mut
messages
=
vec!
[
ChatMessage
{
...
...
@@ -294,11 +297,10 @@ async fn run_user(
content
:
system_prompt
,
}];
let
mut
results
=
Vec
::
with_capacity
(
args
.num_turns
);
let
mut
results
=
Vec
::
with_capacity
(
session
.turns
.len
()
);
for
turn
in
0
..
args
.num_turns
{
// Generate user prompt
let
user_text
=
generate_lorem
(
&
mut
rng
,
args
.num_user_tokens
);
for
(
turn
,
turn_spec
)
in
session
.turns
.iter
()
.enumerate
()
{
let
user_text
=
generate_lorem
(
&
mut
rng
,
turn_spec
.input_length
);
messages
.push
(
ChatMessage
{
role
:
"user"
.to_string
(),
content
:
user_text
,
...
...
@@ -307,7 +309,7 @@ async fn run_user(
let
body
=
MultiturnRequest
{
model
:
model
.clone
(),
messages
:
messages
.clone
(),
max_completion_tokens
:
args
.max_completion_tokens
,
max_completion_tokens
:
turn_spec
.max_output_tokens
as
u32
,
ignore_eos
:
if
args
.ignore_eos
{
Some
(
true
)
}
else
{
None
},
stream
:
true
,
nvext
:
if
args
.speculative_prefill
{
...
...
@@ -392,7 +394,7 @@ async fn run_user(
" [user {}][turn {}/{}] ttft={:.1}ms total={:.1}s ok={}"
,
user_id
,
turn
+
1
,
args
.num_turns
,
session
.turns
.len
()
,
result
.ttft_us
as
f64
/
1000.0
,
result
.total_latency_us
as
f64
/
1_000_000.0
,
result
.success
,
...
...
@@ -404,10 +406,13 @@ async fn run_user(
// Exponential inter-turn delay (skip after last turn)
// Exp(1/mean) = -mean * ln(U), U ~ Uniform(0,1)
if
turn
+
1
<
args
.num_turns
{
let
u
:
f64
=
rng
.random
();
let
delay_ms
=
(
-
mean_delay
*
u
.ln
())
.max
(
0.0
);
tokio
::
time
::
sleep
(
Duration
::
from_millis
(
delay_ms
as
u64
))
.await
;
if
let
Some
(
next_turn
)
=
session
.turns
.get
(
turn
+
1
)
&&
next_turn
.delay_after_previous_ms
>
0.0
{
tokio
::
time
::
sleep
(
Duration
::
from_secs_f64
(
next_turn
.delay_after_previous_ms
/
1000.0
,
))
.await
;
}
}
...
...
@@ -569,6 +574,32 @@ async fn main() -> Result<()> {
.build
()
.context
(
"Failed to create HTTP client"
)
?
;
    let workload = Trace::synthetic(SyntheticTraceSpec {
        block_size: 1,
        num_sessions: args.num_users,
        turns_per_session: args.num_turns,
        input_tokens: LengthSpec {
            mean: args.num_user_tokens,
            stddev: 0.0,
        },
        output_tokens: LengthSpec {
            mean: args.max_completion_tokens as usize,
            stddev: 0.0,
        },
        shared_prefix_ratio: 0.0,
        num_prefix_groups: 0,
        first_turn_arrivals: ArrivalSpec::Burst,
        inter_turn_delays: if args.mean_delay_ms == 0 {
            DelaySpec::None
        } else {
            DelaySpec::ExponentialMs {
                mean_ms: args.mean_delay_ms as f64,
            }
        },
        seed: args.seed,
    })?;

    let sessions = workload.sessions;
let
args
=
Arc
::
new
(
args
);
let
chat_url
=
format!
(
"{}/v1/chat/completions"
,
args
.url
);
...
...
@@ -592,14 +623,18 @@ async fn main() -> Result<()> {
.progress_chars
(
"#>-"
),
);
let
handles
:
Vec
<
_
>
=
(
0
..
args
.num_users
)
.map
(|
user_id
|
{
let
handles
:
Vec
<
_
>
=
sessions
.into_iter
()
.enumerate
()
.map
(|(
user_id
,
session
)|
{
let
client
=
client
.clone
();
let
url
=
chat_url
.clone
();
let
model
=
model
.clone
();
let
args
=
args
.clone
();
let
progress
=
progress
.clone
();
tokio
::
spawn
(
async
move
{
run_user
(
client
,
url
,
model
,
args
,
user_id
,
progress
)
.await
})
tokio
::
spawn
(
async
move
{
run_user
(
client
,
url
,
model
,
args
,
user_id
,
session
,
progress
)
.await
})
})
.collect
();
...
...
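The removed inter-turn delay code sampled an exponential delay as `-mean * ln(U)` with `U ~ Uniform(0, 1)`; the new code moves that choice into `DelaySpec::ExponentialMs`. How the loadgen module samples internally is not shown in this diff, so the Python sketch below only illustrates the inverse-transform formula from the removed comment. The function name is an assumption.

```python
import math
import random


def sample_exponential_delay_ms(rng, mean_ms):
    """Draw one exponential delay with the given mean via inverse transform."""
    if mean_ms <= 0:
        return 0.0
    # random() is in [0, 1); flipping it gives (0, 1] and avoids log(0).
    u = 1.0 - rng.random()
    return max(0.0, -mean_ms * math.log(u))
```

Seeding the generator, for example with `random.Random(42)`, makes the delays reproducible, which is the same reason the Rust workload takes a `seed`.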
lib/bindings/python/rust/llm/replay.rs
...
...
@@ -10,6 +10,9 @@ use dynamo_mocker::common::protocols::{
PreemptionMode
as
RsPreemptionMode
,
ReasoningConfig
as
RsReasoningConfig
,
SglangArgs
as
RsSglangArgs
,
WorkerType
as
RsWorkerType
,
};
use
dynamo_mocker
::
loadgen
::{
ArrivalSpec
,
DelaySpec
,
LengthSpec
,
SyntheticTraceSpec
,
Trace
as
RsTrace
,
};
use
pyo3
::{
exceptions
::
PyException
,
prelude
::
*
};
use
pythonize
::
pythonize
;
use
uuid
::
Uuid
;
...
...
@@ -356,7 +359,7 @@ pub fn run_mocker_trace_replay(
}
#[pyfunction]
#[pyo3(signature
=
(input_tokens,
output_tokens,
request_count,
extra_engine_args=None,
router_config=None,
num_workers=
1
,
replay_concurrency=None,
replay_mode=
"offline"
,
router_mode=
"round_robin"
,
arrival_speedup_ratio=
1.0
,
arrival_interval_ms=
1.0
))]
#[pyo3(signature
=
(input_tokens,
output_tokens,
request_count,
extra_engine_args=None,
router_config=None,
num_workers=
1
,
replay_concurrency=None,
replay_mode=
"offline"
,
router_mode=
"round_robin"
,
arrival_speedup_ratio=
1.0
,
arrival_interval_ms=
1.0
,
turns_per_session=
1
,
shared_prefix_ratio=
0.0
,
num_prefix_groups=
0
,
inter_turn_delay_ms=
0.0
))]
#[allow(clippy::too_many_arguments)]
pub
fn
run_mocker_synthetic_trace_replay
(
py
:
Python
<
'_
>
,
...
...
@@ -371,6 +374,10 @@ pub fn run_mocker_synthetic_trace_replay(
router_mode
:
&
str
,
arrival_speedup_ratio
:
f64
,
arrival_interval_ms
:
f64
,
turns_per_session
:
usize
,
shared_prefix_ratio
:
f64
,
num_prefix_groups
:
usize
,
inter_turn_delay_ms
:
f64
,
)
->
PyResult
<
PyObject
>
{
let
args
=
load_replay_mocker_args
(
py
,
extra_engine_args
)
?
;
let
router_config
=
load_replay_router_config
(
router_config
);
...
...
@@ -378,6 +385,73 @@ pub fn run_mocker_synthetic_trace_replay(
let
router_mode
=
parse_replay_router_mode
(
router_mode
)
?
;
let
report
=
py
.allow_threads
(
move
||
{
let
replay_concurrency
=
parse_replay_concurrency
(
replay_concurrency
)
?
;
        let use_workload = turns_per_session > 1
            || shared_prefix_ratio > 0.0
            || num_prefix_groups > 0
            || inter_turn_delay_ms > 0.0;
        if use_workload {
            let mut trace = build_synthetic_workload(
                args.block_size.max(1),
                input_tokens,
                output_tokens,
                request_count,
                arrival_interval_ms,
                turns_per_session,
                shared_prefix_ratio,
                num_prefix_groups,
                inter_turn_delay_ms,
            )?;
            if replay_concurrency.is_none() {
                trace = trace.speed_up_timing(arrival_speedup_ratio)?;
            }
            return match (replay_mode.as_str(), replay_concurrency) {
                ("offline", Some(max_in_flight)) => {
                    dynamo_mocker::replay::simulate_concurrency_workload_with_router_mode(
                        args,
                        router_config.clone(),
                        trace,
                        max_in_flight,
                        num_workers,
                        router_mode,
                    )
                }
                ("offline", None) => {
                    dynamo_mocker::replay::simulate_trace_workload_with_router_mode(
                        args,
                        router_config.clone(),
                        trace,
                        num_workers,
                        router_mode,
                    )
                }
                ("online", Some(max_in_flight)) => {
                    dynamo_mocker::replay::simulate_concurrency_live_workload_with_router_mode(
                        args,
                        router_config.clone(),
                        trace,
                        max_in_flight,
                        num_workers,
                        router_mode,
                    )
                }
                ("online", None) => {
                    dynamo_mocker::replay::simulate_trace_live_workload_with_router_mode(
                        args,
                        router_config.clone(),
                        trace,
                        num_workers,
                        router_mode,
                    )
                }
                (other, _) => anyhow::bail!(
                    "replay_mode must be either 'offline' or 'online', got '{}'",
                    other
                ),
            };
        }
let
requests
=
build_synthetic_requests
(
input_tokens
,
output_tokens
,
...
...
@@ -509,6 +583,69 @@ fn parse_replay_concurrency(replay_concurrency: Option<isize>) -> anyhow::Result
}
}
#[allow(clippy::too_many_arguments)]
fn build_synthetic_workload(
    block_size: usize,
    input_tokens: usize,
    output_tokens: usize,
    request_count: usize,
    arrival_interval_ms: f64,
    turns_per_session: usize,
    shared_prefix_ratio: f64,
    num_prefix_groups: usize,
    inter_turn_delay_ms: f64,
) -> anyhow::Result<RsTrace> {
    if input_tokens == 0 {
        anyhow::bail!("input_tokens must be at least 1");
    }
    if output_tokens == 0 {
        anyhow::bail!("output_tokens must be at least 1");
    }
    if request_count == 0 {
        anyhow::bail!("request_count must be at least 1");
    }
    if turns_per_session == 0 {
        anyhow::bail!("turns_per_session must be at least 1");
    }
    if !arrival_interval_ms.is_finite() || arrival_interval_ms < 0.0 {
        anyhow::bail!("arrival_interval_ms must be a finite non-negative number");
    }
    if !inter_turn_delay_ms.is_finite() || inter_turn_delay_ms < 0.0 {
        anyhow::bail!("inter_turn_delay_ms must be a finite non-negative number");
    }

    let first_turn_arrivals = if arrival_interval_ms == 0.0 {
        ArrivalSpec::Burst
    } else {
        ArrivalSpec::ConstantQps {
            qps: 1000.0 / arrival_interval_ms,
        }
    };

    RsTrace::synthetic(SyntheticTraceSpec {
        block_size,
        num_sessions: request_count,
        turns_per_session,
        input_tokens: LengthSpec {
            mean: input_tokens,
            stddev: 0.0,
        },
        output_tokens: LengthSpec {
            mean: output_tokens,
            stddev: 0.0,
        },
        shared_prefix_ratio,
        num_prefix_groups,
        first_turn_arrivals,
        inter_turn_delays: if inter_turn_delay_ms == 0.0 {
            DelaySpec::None
        } else {
            DelaySpec::ConstantMs(inter_turn_delay_ms)
        },
        seed: 42,
    })
}
fn
build_synthetic_requests
(
input_tokens
:
usize
,
output_tokens
:
usize
,
...
...
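Two small decisions in this file are easy to check in isolation: when the binding switches to the workload path, and how `arrival_interval_ms` maps to an arrival spec (`Burst` at 0, otherwise a constant rate of `1000 / arrival_interval_ms` requests per second). The Python sketch below restates both. The function names and the tuple return shape are assumptions made for illustration; only the conditions and the formula come from the Rust code.

```python
import math


def uses_workload_path(turns_per_session, shared_prefix_ratio,
                       num_prefix_groups, inter_turn_delay_ms):
    """True when any multi-turn or shared-prefix option leaves its default."""
    return (
        turns_per_session > 1
        or shared_prefix_ratio > 0.0
        or num_prefix_groups > 0
        or inter_turn_delay_ms > 0.0
    )


def first_turn_arrivals(arrival_interval_ms):
    """Map an inter-arrival gap in milliseconds to (kind, qps)."""
    if not math.isfinite(arrival_interval_ms) or arrival_interval_ms < 0.0:
        raise ValueError("arrival_interval_ms must be a finite non-negative number")
    if arrival_interval_ms == 0.0:
        return ("burst", None)
    # 1000 ms per second divided by the gap gives requests per second.
    return ("constant_qps", 1000.0 / arrival_interval_ms)
```

With the default `arrival_interval_ms=1.0`, first turns arrive at 1000 requests per second, and all four new keyword arguments at their defaults keep the original single-turn request path.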
lib/bindings/python/src/dynamo/_core.pyi
...
...
@@ -1388,6 +1388,10 @@ def run_mocker_synthetic_trace_replay(
router_mode: Literal["round_robin", "kv_router"] = "round_robin",
arrival_speedup_ratio: float = 1.0,
arrival_interval_ms: float = 1.0,
turns_per_session: int = 1,
shared_prefix_ratio: float = 0.0,
num_prefix_groups: int = 0,
inter_turn_delay_ms: float = 0.0,
) -> Dict[str, Any]:
"""Replay a synthetic mocker workload without requiring a trace file."""
...
...
...
lib/bindings/python/src/dynamo/replay/api.py
...
...
@@ -43,6 +43,10 @@ def run_synthetic_trace_replay(
router_mode
=
"round_robin"
,
arrival_speedup_ratio
=
1.0
,
arrival_interval_ms
=
1.0
,
turns_per_session
=
1
,
shared_prefix_ratio
=
0.0
,
num_prefix_groups
=
0
,
inter_turn_delay_ms
=
0.0
,
):
return
_run_mocker_synthetic_trace_replay
(
input_tokens
,
...
...
@@ -56,4 +60,8 @@ def run_synthetic_trace_replay(
router_mode
=
router_mode
,
arrival_speedup_ratio
=
arrival_speedup_ratio
,
arrival_interval_ms
=
arrival_interval_ms
,
turns_per_session
=
turns_per_session
,
shared_prefix_ratio
=
shared_prefix_ratio
,
num_prefix_groups
=
num_prefix_groups
,
inter_turn_delay_ms
=
inter_turn_delay_ms
,
)
lib/bindings/python/src/dynamo/replay/main.py
...
...
@@ -22,8 +22,16 @@ def main(argv: Sequence[str] | None = None) -> int:
parser
.
add_argument
(
"--router-config"
)
parser
.
add_argument
(
"--input-tokens"
,
type
=
int
)
parser
.
add_argument
(
"--output-tokens"
,
type
=
int
)
parser
.
add_argument
(
"--request-count"
,
type
=
int
)
parser
.
add_argument
(
"--request-count"
,
type
=
int
,
help
=
"number of synthetic requests; when --turns-per-session > 1, this is the number of sessions"
,
)
parser
.
add_argument
(
"--arrival-interval-ms"
,
type
=
float
,
default
=
1.0
)
parser
.
add_argument
(
"--turns-per-session"
,
type
=
int
,
default
=
1
)
parser
.
add_argument
(
"--shared-prefix-ratio"
,
type
=
float
,
default
=
0.0
)
parser
.
add_argument
(
"--num-prefix-groups"
,
type
=
int
,
default
=
0
)
parser
.
add_argument
(
"--inter-turn-delay-ms"
,
type
=
float
,
default
=
0.0
)
parser
.
add_argument
(
"--num-workers"
,
type
=
int
,
default
=
1
)
parser
.
add_argument
(
"--replay-concurrency"
,
type
=
int
)
parser
.
add_argument
(
...
...
@@ -45,7 +53,14 @@ def main(argv: Sequence[str] | None = None) -> int:
using_trace_file
=
args
.
trace_file
is
not
None
synthetic_args
=
(
args
.
input_tokens
,
args
.
output_tokens
,
args
.
request_count
)
using_synthetic
=
any
(
value
is
not
None
for
value
in
synthetic_args
)
using_synthetic
=
any
(
value
is
not
None
for
value
in
synthetic_args
)
or
any
(
(
args
.
turns_per_session
!=
1
,
args
.
shared_prefix_ratio
!=
0.0
,
args
.
num_prefix_groups
!=
0
,
args
.
inter_turn_delay_ms
!=
0.0
,
)
)
if
using_trace_file
==
using_synthetic
:
parser
.
error
(
...
...
@@ -91,6 +106,10 @@ def main(argv: Sequence[str] | None = None) -> int:
router_mode
=
args
.
router_mode
,
arrival_speedup_ratio
=
args
.
arrival_speedup_ratio
,
arrival_interval_ms
=
args
.
arrival_interval_ms
,
turns_per_session
=
args
.
turns_per_session
,
shared_prefix_ratio
=
args
.
shared_prefix_ratio
,
num_prefix_groups
=
args
.
num_prefix_groups
,
inter_turn_delay_ms
=
args
.
inter_turn_delay_ms
,
)
report_path
=
write_report_json
(
report
,
args
.
report_json
)
...
...
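The CLI change above treats a run as synthetic when any of the three core synthetic arguments is set, or when any of the four new multi-turn options leaves its default, and it rejects runs that are both or neither of trace-file and synthetic. The standalone `argparse` sketch below restates that check. The positional `trace_file` declaration, the `prog` name, and the error wording are assumptions, since the real parser setup and message are elided in the diff.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="replay-sketch")  # hypothetical name
    parser.add_argument("trace_file", nargs="?")  # assumed to be positional
    parser.add_argument("--input-tokens", type=int)
    parser.add_argument("--output-tokens", type=int)
    parser.add_argument("--request-count", type=int)
    parser.add_argument("--turns-per-session", type=int, default=1)
    parser.add_argument("--shared-prefix-ratio", type=float, default=0.0)
    parser.add_argument("--num-prefix-groups", type=int, default=0)
    parser.add_argument("--inter-turn-delay-ms", type=float, default=0.0)
    return parser


def classify_source(argv):
    """Return "trace" or "synthetic"; exit via parser.error otherwise."""
    parser = build_parser()
    args = parser.parse_args(argv)
    using_trace_file = args.trace_file is not None
    synthetic_args = (args.input_tokens, args.output_tokens, args.request_count)
    using_synthetic = any(value is not None for value in synthetic_args) or any(
        (
            args.turns_per_session != 1,
            args.shared_prefix_ratio != 0.0,
            args.num_prefix_groups != 0,
            args.inter_turn_delay_ms != 0.0,
        )
    )
    if using_trace_file == using_synthetic:
        # Placeholder wording; the real message is not shown in the diff.
        parser.error("choose exactly one of a trace file or synthetic options")
    return "trace" if using_trace_file else "synthetic"
```

One consequence of this rule is that passing `--turns-per-session 2` together with a trace file is rejected, because the multi-turn flag alone marks the run as synthetic.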
lib/bindings/python/tests/test_replay.py
...
...
@@ -110,6 +110,45 @@ def _write_trace_and_args(tmp_path):
return
trace_path
def _write_multiturn_trace(tmp_path):
    trace_path = tmp_path / "multiturn_trace.jsonl"
    records = [
        {
            "session_id": "session-a",
            "timestamp": 1000.0,
            "input_length": 64,
            "output_length": 2,
            "hash_ids": [101],
        },
        {
            "session_id": "session-b",
            "timestamp": 1002.0,
            "input_length": 64,
            "output_length": 2,
            "hash_ids": [202],
        },
        {
            "session_id": "session-a",
            "delay": 5.0,
            "input_length": 64,
            "output_length": 2,
            "hash_ids": [303],
        },
        {
            "session_id": "session-b",
            "delay": 1.0,
            "input_length": 64,
            "output_length": 2,
            "hash_ids": [404],
        },
    ]
    trace_path.write_text(
        "\n".join(json.dumps(record) for record in records) + "\n",
        encoding="utf-8",
    )
    return trace_path


def
_write_cli_smoke_trace
(
tmp_path
):
trace_path
=
tmp_path
/
"cli_smoke_trace.jsonl"
records
=
[]
...
...
@@ -283,6 +322,26 @@ def test_run_trace_replay_invariant_counts_match(tmp_path, engine_type, replay_m
assert
single
[
field
]
==
multi_kv_router
[
field
]
@
pytest
.
mark
.
parametrize
(
"replay_mode"
,
[
"offline"
,
"online"
])
def
test_run_trace_replay_supports_multiturn_sessions
(
tmp_path
,
replay_mode
):
trace_path
=
_write_multiturn_trace
(
tmp_path
)
report
=
run_trace_replay
(
trace_path
,
extra_engine_args
=
_vllm_args
(),
num_workers
=
2
,
replay_mode
=
replay_mode
,
router_mode
=
"kv_router"
,
)
_assert_basic_report_counts
(
report
,
num_requests
=
4
,
input_tokens
=
64
,
output_tokens
=
2
,
)
@
pytest
.
mark
.
parametrize
(
"engine_type"
,
[
"vllm"
,
"sglang"
])
@
pytest
.
mark
.
parametrize
(
"replay_mode"
,
[
"offline"
,
"online"
])
@
pytest
.
mark
.
parametrize
(
"router_mode"
,
[
"round_robin"
,
"kv_router"
])
...
...
@@ -358,6 +417,53 @@ def test_run_synthetic_trace_replay_invariant_counts_match(
assert
single
[
field
]
==
multi_kv_router
[
field
]
@
pytest
.
mark
.
parametrize
(
"replay_mode"
,
[
"offline"
,
"online"
])
def
test_run_synthetic_trace_replay_supports_multiturn_workloads
(
tmp_path
,
replay_mode
):
report
=
run_synthetic_trace_replay
(
64
,
2
,
3
,
extra_engine_args
=
_vllm_args
(),
num_workers
=
2
,
replay_mode
=
replay_mode
,
router_mode
=
"kv_router"
,
turns_per_session
=
2
,
inter_turn_delay_ms
=
5.0
,
shared_prefix_ratio
=
0.5
,
num_prefix_groups
=
2
,
)
_assert_basic_report_counts
(
report
,
num_requests
=
6
,
input_tokens
=
64
,
output_tokens
=
2
,
)
@
pytest
.
mark
.
parametrize
(
(
"input_tokens"
,
"output_tokens"
,
"expected_message"
),
[
(
0
,
2
,
"input_tokens must be at least 1"
),
(
2
,
0
,
"output_tokens must be at least 1"
),
],
)
def
test_run_synthetic_trace_replay_workload_validates_zero_token_lengths
(
input_tokens
,
output_tokens
,
expected_message
):
with
pytest
.
raises
(
Exception
,
match
=
expected_message
):
run_synthetic_trace_replay
(
input_tokens
,
output_tokens
,
2
,
extra_engine_args
=
_vllm_args
(),
num_workers
=
2
,
replay_mode
=
"offline"
,
router_mode
=
"kv_router"
,
turns_per_session
=
2
,
)
@
pytest
.
mark
.
parametrize
(
"engine_type"
,
[
"vllm"
,
"sglang"
])
@
pytest
.
mark
.
parametrize
(
"replay_mode"
,
[
"offline"
,
"online"
])
def
test_run_synthetic_concurrency_replay_counts_match
(
...
...
@@ -551,6 +657,48 @@ def test_replay_cli_prints_table_and_saves_json(tmp_path, monkeypatch, capsys):
assert
json
.
loads
(
report_path
.
read_text
(
encoding
=
"utf-8"
))
==
report
def test_replay_cli_passes_multiturn_workload_kwargs(monkeypatch):
    captured = {}

    def fake_run(*args, **kwargs):
        captured["args"] = args
        captured["kwargs"] = kwargs
        return {
            "completed_requests": 4,
            "request_throughput_rps": 1.0,
            "output_throughput_tok_s": 1.0,
        }

    monkeypatch.setattr("dynamo.replay.main.run_synthetic_trace_replay", fake_run)

    exit_code = main(
        [
            "--input-tokens",
            "16",
            "--output-tokens",
            "8",
            "--request-count",
            "2",
            "--turns-per-session",
            "2",
            "--shared-prefix-ratio",
            "0.5",
            "--num-prefix-groups",
            "3",
            "--inter-turn-delay-ms",
            "7.0",
        ]
    )

    assert exit_code == 0
    assert captured["args"] == (16, 8, 2)
    assert captured["kwargs"]["turns_per_session"] == 2
    assert captured["kwargs"]["shared_prefix_ratio"] == 0.5
    assert captured["kwargs"]["num_prefix_groups"] == 3
    assert captured["kwargs"]["inter_turn_delay_ms"] == 7.0


@
pytest
.
mark
.
timeout
(
30
)
def
test_replay_cli_subprocess_synthetic_smoke
(
tmp_path
):
report_path
=
tmp_path
/
"synthetic_report.json"
...
...
@@ -582,6 +730,45 @@ def test_replay_cli_subprocess_synthetic_smoke(tmp_path):
     _assert_basic_report_metrics(report)


+@pytest.mark.timeout(30)
+def test_replay_cli_subprocess_synthetic_multiturn_smoke(tmp_path):
+    report_path = tmp_path / "synthetic_multiturn_report.json"
+    completed = _run_replay_cli(
+        tmp_path,
+        "--input-tokens",
+        "64",
+        "--output-tokens",
+        "4",
+        "--request-count",
+        "3",
+        "--turns-per-session",
+        "2",
+        "--shared-prefix-ratio",
+        "0.5",
+        "--num-prefix-groups",
+        "2",
+        "--inter-turn-delay-ms",
+        "5.0",
+        "--num-workers",
+        "2",
+        "--report-json",
+        str(report_path),
+        "--extra-engine-args",
+        '{"block_size":64,"speedup_ratio":1000.0}',
+    )
+
+    report = _assert_replay_cli_outputs(completed, report_path)
+    _assert_basic_report_counts(
+        report,
+        num_requests=6,
+        input_tokens=64,
+        output_tokens=4,
+    )
+    _assert_basic_report_metrics(report)
+
+
 @pytest.mark.timeout(30)
 def test_replay_cli_subprocess_trace_smoke(tmp_path):
     trace_path = _write_cli_smoke_trace(tmp_path)
     report_path = tmp_path / "trace_report.json"
...
...
@@ -609,3 +796,33 @@ def test_replay_cli_subprocess_trace_smoke(tmp_path):
         output_tokens=25,
     )
     _assert_basic_report_metrics(report)
+
+
+@pytest.mark.timeout(30)
+def test_replay_cli_subprocess_multiturn_trace_smoke(tmp_path):
+    trace_path = _write_multiturn_trace(tmp_path)
+    report_path = tmp_path / "multiturn_trace_report.json"
+    completed = _run_replay_cli(
+        tmp_path,
+        str(trace_path),
+        "--replay-mode",
+        "online",
+        "--router-mode",
+        "kv_router",
+        "--num-workers",
+        "2",
+        "--report-json",
+        str(report_path),
+        "--extra-engine-args",
+        '{"block_size":64,"speedup_ratio":1000.0}',
+    )
+
+    report = _assert_replay_cli_outputs(completed, report_path)
+    _assert_basic_report_counts(
+        report,
+        num_requests=4,
+        input_tokens=64,
+        output_tokens=2,
+    )
+    _assert_basic_report_metrics(report)
lib/kv-router/src/sequences/multi_worker.rs
...
...
@@ -86,6 +86,9 @@ pub enum SequenceError {
     #[error("Failed to publish event: {0}")]
     PublishFailed(#[from] anyhow::Error),
+
+    #[error("Synchronous mutation requires replica_sync=false")]
+    SyncMutationRequiresNoReplicaSync,
 }

 /// Bundled parameters for adding a request to the sequence tracker.
...
...
@@ -364,7 +367,14 @@ impl<P: SequencePublisher + 'static> ActiveSequencesMultiWorker<P> {
         }
     }

-    pub async fn add_request(&self, req: SequenceRequest) -> Result<(), SequenceError> {
+    fn ensure_sync_mutation_allowed(&self) -> Result<(), SequenceError> {
+        if self.replica_sync {
+            return Err(SequenceError::SyncMutationRequiresNoReplicaSync);
+        }
+        Ok(())
+    }
+
+    fn add_request_local(&self, req: SequenceRequest) -> Result<(), SequenceError> {
         let SequenceRequest {
             request_id,
             token_sequence,
...
...
@@ -386,22 +396,6 @@ impl<P: SequencePublisher + 'static> ActiveSequencesMultiWorker<P> {
             });
         }

-        if self.replica_sync {
-            let event = ActiveSequenceEvent {
-                request_id: request_id.clone(),
-                worker,
-                data: ActiveSequenceEventData::AddRequest {
-                    token_sequence: token_sequence.clone(),
-                    isl,
-                    overlap,
-                    expected_output_tokens,
-                },
-                router_id: self.router_id,
-                lora_name: lora_name.clone(),
-            };
-            self.publisher.publish_event(&event).await?;
-        }
-
         self.request_to_worker.insert(request_id.clone(), worker);

         if let Some(lora) = lora_name {
...
...
@@ -434,12 +428,36 @@ impl<P: SequencePublisher + 'static> ActiveSequencesMultiWorker<P> {
         Ok(())
     }

+    pub async fn add_request(&self, req: SequenceRequest) -> Result<(), SequenceError> {
+        if self.replica_sync {
+            let event = ActiveSequenceEvent {
+                request_id: req.request_id.clone(),
+                worker: req.worker,
+                data: ActiveSequenceEventData::AddRequest {
+                    token_sequence: req.token_sequence.clone(),
+                    isl: req.isl,
+                    overlap: req.overlap,
+                    expected_output_tokens: req.expected_output_tokens,
+                },
+                router_id: self.router_id,
+                lora_name: req.lora_name.clone(),
+            };
+            self.publisher.publish_event(&event).await?;
+        }
+
+        self.add_request_local(req)
+    }
+
+    pub fn add_request_sync(&self, req: SequenceRequest) -> Result<(), SequenceError> {
+        self.ensure_sync_mutation_allowed()?;
+        self.add_request_local(req)
+    }
+
     /// Send a mutation to the worker assigned to a request, optionally publishing
     /// a replica-sync event and cleaning up request mappings afterward.
-    async fn mutate_request_worker(
+    fn mutate_request_worker_local(
         &self,
         request_id: &RequestId,
-        event_data: ActiveSequenceEventData,
         mutate_fn: impl FnOnce(&mut ActiveSequences, &RequestId),
         remove_mapping: bool,
     ) -> Result<(), SequenceError> {
...
...
@@ -451,22 +469,6 @@ impl<P: SequencePublisher + 'static> ActiveSequencesMultiWorker<P> {
                 request_id: request_id.clone(),
             })?;

-        if self.replica_sync {
-            let lora_name = self
-                .request_to_lora
-                .get(request_id)
-                .map(|entry| entry.value().clone());
-            let event = ActiveSequenceEvent {
-                request_id: request_id.clone(),
-                worker,
-                data: event_data,
-                router_id: self.router_id,
-                lora_name,
-            };
-            self.publisher.publish_event(&event).await?;
-        }
-
         {
             let table = self.workers.read();
             let &idx = table
...
...
@@ -487,6 +489,40 @@ impl<P: SequencePublisher + 'static> ActiveSequencesMultiWorker<P> {
         Ok(())
     }

+    async fn mutate_request_worker(
+        &self,
+        request_id: &RequestId,
+        event_data: ActiveSequenceEventData,
+        mutate_fn: impl FnOnce(&mut ActiveSequences, &RequestId),
+        remove_mapping: bool,
+    ) -> Result<(), SequenceError> {
+        let worker = self
+            .request_to_worker
+            .get(request_id)
+            .map(|entry| *entry)
+            .ok_or_else(|| SequenceError::RequestNotFound {
+                request_id: request_id.clone(),
+            })?;
+
+        if self.replica_sync {
+            let lora_name = self
+                .request_to_lora
+                .get(request_id)
+                .map(|entry| entry.value().clone());
+            let event = ActiveSequenceEvent {
+                request_id: request_id.clone(),
+                worker,
+                data: event_data,
+                router_id: self.router_id,
+                lora_name,
+            };
+            self.publisher.publish_event(&event).await?;
+        }
+
+        self.mutate_request_worker_local(request_id, mutate_fn, remove_mapping)
+    }
+
     /// Free all blocks associated with a request.
     ///
     /// Note: This operation is idempotent. Calling it multiple times for the same request
...
...
@@ -508,6 +544,21 @@ impl<P: SequencePublisher + 'static> ActiveSequencesMultiWorker<P> {
         .await
     }

+    pub fn free_sync(&self, request_id: &RequestId) -> Result<(), SequenceError> {
+        self.ensure_sync_mutation_allowed()?;
+        if !self.request_to_worker.contains_key(request_id) {
+            tracing::debug!("Request {request_id} not found, already freed (idempotent)");
+            return Ok(());
+        }
+        self.mutate_request_worker_local(
+            request_id,
+            |seqs, rid| {
+                seqs.free(rid);
+            },
+            true,
+        )
+    }
+
     /// Mark prefill as completed for a request.
     ///
     /// Note: Calling this multiple times for the same request is allowed and will be a no-op
...
...
@@ -527,6 +578,17 @@ impl<P: SequencePublisher + 'static> ActiveSequencesMultiWorker<P> {
         .await
     }

+    pub fn mark_prefill_completed_sync(&self, request_id: &RequestId) -> Result<(), SequenceError> {
+        self.ensure_sync_mutation_allowed()?;
+        self.mutate_request_worker_local(
+            request_id,
+            |seqs, rid| {
+                seqs.mark_prefill_completed(rid);
+            },
+            false,
+        )
+    }
+
     /// Add an output block with optional fractional decay weight.
     ///
     /// This is used during generation to track output blocks as they are created.
...
...
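The hunks above split each mutation into a local step plus an optional publish step, and gate the new `*_sync` entry points on `replica_sync` being disabled. The guard pattern can be sketched without any dependencies as shown here. This is only an illustration under stated assumptions: `Tracker`, its `HashMap` state, the `String` request IDs, and the `u64` worker IDs are hypothetical stand-ins rather than the crate's `ActiveSequencesMultiWorker`; the sketch takes `&mut self` where the diff uses `&self`; and event publishing is omitted entirely.

```rust
use std::collections::HashMap;

/// Hypothetical error type; only the variant name mirrors the diff.
#[derive(Debug, PartialEq, Eq)]
pub enum SequenceError {
    SyncMutationRequiresNoReplicaSync,
}

/// Hypothetical stand-in for the multi-worker sequence tracker.
pub struct Tracker {
    replica_sync: bool,
    request_to_worker: HashMap<String, u64>,
}

impl Tracker {
    pub fn new(replica_sync: bool) -> Self {
        Self {
            replica_sync,
            request_to_worker: HashMap::new(),
        }
    }

    /// Reject synchronous mutations when replica sync is on, because
    /// replica sync needs an (async) publish step that is skipped here.
    fn ensure_sync_mutation_allowed(&self) -> Result<(), SequenceError> {
        if self.replica_sync {
            return Err(SequenceError::SyncMutationRequiresNoReplicaSync);
        }
        Ok(())
    }

    /// Local bookkeeping that a sync and an async entry point could share.
    fn add_request_local(&mut self, request_id: &str, worker: u64) {
        self.request_to_worker.insert(request_id.to_string(), worker);
    }

    pub fn add_request_sync(&mut self, request_id: &str, worker: u64) -> Result<(), SequenceError> {
        self.ensure_sync_mutation_allowed()?;
        self.add_request_local(request_id, worker);
        Ok(())
    }

    /// Idempotent: freeing an unknown request is a no-op.
    pub fn free_sync(&mut self, request_id: &str) -> Result<(), SequenceError> {
        self.ensure_sync_mutation_allowed()?;
        self.request_to_worker.remove(request_id);
        Ok(())
    }

    pub fn worker_for(&self, request_id: &str) -> Option<u64> {
        self.request_to_worker.get(request_id).copied()
    }
}
```

With `Tracker::new(false)` the sync calls succeed and a repeated `free_sync` stays a no-op; with `Tracker::new(true)` every sync call returns `SyncMutationRequiresNoReplicaSync`.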
lib/llm/benches/kv_router_bench.rs
...
...
@@ -40,6 +40,7 @@ use dynamo_llm::model_card::ModelDeploymentCard;
 use dynamo_llm::preprocessor::prompt::{
     ChatTemplate, ContextMixins, OAIChatLikeRequest, PromptFormatter,
 };
+use dynamo_mocker::loadgen::RouterSequence;

 /// KV Router event subject suffix (appended to Component.subject())
 /// Full subject format: namespace.{namespace}.component.{component}.kv-events
...
...
@@ -532,41 +533,34 @@ impl PrefixData {
 }

 /// Pre-generated sequence data for benchmarking
-#[derive(Clone)]
-struct SequenceData {
-    worker_id: WorkerId,
-    local_hashes: Vec<LocalBlockHash>,
-    external_hashes: Vec<ExternalSequenceBlockHash>,
-}
+type SequenceData = RouterSequence;

-impl SequenceData {
 /// Create a sequence from the exact request content.
-    fn from_request_content(
+fn sequence_from_request_content(
     content: &str,
     worker_id: WorkerId,
     kv_block_size: u32,
     tokenizer: &Tokenizer,
     prompt_renderer: Option<&PromptRenderer>,
-    ) -> Result<Self> {
+) -> Result<SequenceData> {
     let (local_hashes, external_hashes) =
         compute_hashes_for_content(content, tokenizer, kv_block_size, prompt_renderer)?;
-        Ok(Self {
+    Ok(SequenceData {
         worker_id,
         local_hashes,
         external_hashes,
     })
 }
-}

-    fn to_router_event(&self, event_id: u64) -> RouterEvent {
+fn sequence_to_router_event(sequence: &SequenceData, event_id: u64) -> RouterEvent {
     let kv_event = KvCacheEvent {
         event_id,
         data: KvCacheEventData::Stored(KvCacheStoreData {
             parent_hash: None,
-                blocks: self
+            blocks: sequence
                 .local_hashes
                 .iter()
-                    .zip(self.external_hashes.iter())
+                .zip(sequence.external_hashes.iter())
                 .map(|(local, ext)| KvCacheStoredBlockData {
                     block_hash: *ext,
                     tokens_hash: *local,
...
...
@@ -576,8 +570,7 @@ impl SequenceData {
         }),
         dp_rank: 0,
     };
-        RouterEvent::new(self.worker_id, kv_event)
-    }
+    RouterEvent::new(sequence.worker_id, kv_event)
 }

 /// Response from the frontend's /health endpoint
...
...
@@ -692,7 +685,7 @@ fn generate_sequences_for_requests(
             num_prefix_prompts,
             seed,
         );
-        let seq = SequenceData::from_request_content(
+        let seq = sequence_from_request_content(
             &content,
             worker_id,
             kv_block_size,
...
...
@@ -749,7 +742,7 @@ async fn build_tree_via_nats(
     };

     for (event_id, seq) in sequences.iter().enumerate() {
-        let event = seq.to_router_event(event_id as u64);
+        let event = sequence_to_router_event(seq, event_id as u64);
         let data = encode_event_with_envelope(&event, KV_EVENT_SUBJECT)?;
         nats_client
             .publish(subject.clone(), data.into())
...
...
@@ -1165,7 +1158,7 @@ async fn publish_events_at_rate(
     while start.elapsed() < duration {
         let seq = &sequences[(event_id as usize) % sequences.len()];
-        let event = seq.to_router_event(event_id);
+        let event = sequence_to_router_event(seq, event_id);
         match encode_event_with_envelope(&event, KV_EVENT_SUBJECT) {
             Ok(data) => {
...
...
lib/mocker/src/lib.rs
...
...
@@ -11,5 +11,6 @@ pub mod cache;
 pub mod common;
 pub mod engine;
 pub mod kv_manager;
+pub mod loadgen;
 pub mod replay;
 pub mod scheduler;
lib/mocker/src/loadgen/driver.rs
0 → 100644
// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
// SPDX-License-Identifier: Apache-2.0

use std::cmp::Ordering;
use std::collections::{BinaryHeap, HashMap};

use anyhow::{Result, anyhow, bail};
use uuid::Uuid;

use super::types::{ReadyTurn, Trace, TurnTrace};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DriverMode {
    Trace,
    Concurrency,
}

#[derive(Debug)]
struct SessionRuntime {
    session_id: String,
    turns: Vec<TurnTrace>,
    next_turn_index: usize,
    next_ready_at_ms: Option<f64>,
    in_flight: Option<Uuid>,
}

#[derive(Debug)]
struct InFlightTurn {
    session_index: usize,
    turn_index: usize,
}

#[derive(Debug, Clone, Copy)]
struct ReadySession {
    ready_at_ms: f64,
    session_index: usize,
    turn_index: usize,
}

impl PartialEq for ReadySession {
    fn eq(&self, other: &Self) -> bool {
        self.ready_at_ms.to_bits() == other.ready_at_ms.to_bits()
            && self.session_index == other.session_index
            && self.turn_index == other.turn_index
    }
}

impl Eq for ReadySession {}

impl Ord for ReadySession {
    fn cmp(&self, other: &Self) -> Ordering {
        other
            .ready_at_ms
            .total_cmp(&self.ready_at_ms)
            .then_with(|| other.session_index.cmp(&self.session_index))
            .then_with(|| other.turn_index.cmp(&self.turn_index))
    }
}

impl PartialOrd for ReadySession {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

#[derive(Debug)]
pub struct WorkloadDriver {
    mode: DriverMode,
    block_size: usize,
    sessions: Vec<SessionRuntime>,
    in_flight: HashMap<Uuid, InFlightTurn>,
    ready_sessions: BinaryHeap<ReadySession>,
}

impl WorkloadDriver {
    pub(crate) fn new_trace(trace: Trace) -> Result<Self> {
        Self::new(trace, DriverMode::Trace)
    }

    pub(crate) fn new_concurrency(trace: Trace) -> Result<Self> {
        Self::new(trace, DriverMode::Concurrency)
    }

    fn new(trace: Trace, mode: DriverMode) -> Result<Self> {
        let sessions: Vec<SessionRuntime> = trace
            .sessions
            .into_iter()
            .map(|session| SessionRuntime {
                session_id: session.session_id,
                turns: session.turns,
                next_turn_index: 0,
                next_ready_at_ms: Some(match mode {
                    DriverMode::Trace => session.first_arrival_timestamp_ms.unwrap_or(0.0),
                    DriverMode::Concurrency => 0.0,
                }),
                in_flight: None,
            })
            .collect();
        let ready_sessions = sessions
            .iter()
            .enumerate()
            .filter_map(|(session_index, session)| {
                Some(ReadySession {
                    ready_at_ms: session.next_ready_at_ms?,
                    session_index,
                    turn_index: session.next_turn_index,
                })
            })
            .collect();

        Ok(Self {
            mode,
            block_size: trace.block_size,
            sessions,
            in_flight: HashMap::new(),
            ready_sessions,
        })
    }

    pub fn pop_ready(&mut self, now_ms: f64, limit: usize) -> Vec<ReadyTurn> {
        if limit == 0 {
            return Vec::new();
        }

        let mut emitted = Vec::new();
        while emitted.len() < limit {
            let Some(ready_session) = self.ready_sessions.pop() else {
                break;
            };
            if ready_session.ready_at_ms > now_ms {
                self.ready_sessions.push(ready_session);
                break;
            }

            let session_index = ready_session.session_index;
            let session = &mut self.sessions[session_index];
            if session.in_flight.is_some()
                || session.next_turn_index != ready_session.turn_index
                || session.next_ready_at_ms != Some(ready_session.ready_at_ms)
            {
                continue;
            }

            let turn_index = session.next_turn_index;
            let scheduled_ready_at_ms = session
                .next_ready_at_ms
                .expect("ready session must have a timestamp");
            let request_uuid = Uuid::new_v4();
            let replay_hashes = session.turns[turn_index]
                .to_replay_hashes(self.block_size)
                .expect("validated trace should always synthesize replay hashes");
            let arrival_timestamp_ms = match self.mode {
                DriverMode::Trace => Some(scheduled_ready_at_ms),
                DriverMode::Concurrency => None,
            };
            let request = session.turns[turn_index]
                .to_direct_request(self.block_size, request_uuid, arrival_timestamp_ms)
                .expect("validated trace should always synthesize into a direct request");

            session.in_flight = Some(request_uuid);
            session.next_ready_at_ms = None;
            self.in_flight.insert(
                request_uuid,
                InFlightTurn {
                    session_index,
                    turn_index,
                },
            );
            emitted.push(ReadyTurn {
                request_uuid,
                session_id: session.session_id.clone(),
                turn_index,
                scheduled_ready_at_ms,
                replay_hashes: Some(replay_hashes),
                request,
            });
        }

        emitted
    }

    pub fn on_complete(&mut self, request_uuid: Uuid, now_ms: f64) -> Result<()> {
        let in_flight = self
            .in_flight
            .remove(&request_uuid)
            .ok_or_else(|| anyhow!("unknown workload request completion for {request_uuid}"))?;
        let session = self
            .sessions
            .get_mut(in_flight.session_index)
            .ok_or_else(|| anyhow!("unknown workload session {}", in_flight.session_index))?;

        if session.in_flight != Some(request_uuid) {
            bail!(
                "session {} completion for {} does not match in-flight request {:?}",
                session.session_id,
                request_uuid,
                session.in_flight
            );
        }

        session.in_flight = None;
        session.next_turn_index = in_flight.turn_index + 1;
        if session.next_turn_index < session.turns.len() {
            let ready_at_ms =
                now_ms + session.turns[session.next_turn_index].delay_after_previous_ms;
            session.next_ready_at_ms = Some(ready_at_ms);
            self.ready_sessions.push(ReadySession {
                ready_at_ms,
                session_index: in_flight.session_index,
                turn_index: session.next_turn_index,
            });
        } else {
            session.next_ready_at_ms = None;
        }

        Ok(())
    }

    pub fn next_ready_time_ms(&mut self) -> Option<f64> {
        loop {
            let ready_session = *self.ready_sessions.peek()?;
            let session = &self.sessions[ready_session.session_index];
            if session.in_flight.is_some()
                || session.next_turn_index != ready_session.turn_index
                || session.next_ready_at_ms != Some(ready_session.ready_at_ms)
            {
                self.ready_sessions.pop();
                continue;
            }
            return Some(ready_session.ready_at_ms);
        }
    }

    pub fn is_drained(&self) -> bool {
        self.in_flight.is_empty()
            && self
                .sessions
                .iter()
                .all(|session| session.next_turn_index >= session.turns.len())
    }
}
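`WorkloadDriver` keeps pending turns in a `BinaryHeap`, which is a max-heap, so `ReadySession::cmp` compares `other` against `self` to surface the earliest `ready_at_ms` first, and both `pop_ready` and `next_ready_time_ms` discard entries that no longer match the session's current state. A reduced sketch of those two ideas follows. It is not the driver itself: `Entry`, `pop_due`, and the `current` slice are hypothetical names introduced for illustration, and the in-flight and turn-index checks are collapsed into a single timestamp comparison.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Hypothetical heap entry: a session that becomes ready at `ready_at_ms`.
#[derive(Debug, Clone, Copy)]
pub struct Entry {
    pub ready_at_ms: f64,
    pub session: usize,
}

impl PartialEq for Entry {
    fn eq(&self, other: &Self) -> bool {
        // Compare bit patterns so equality agrees with `total_cmp`.
        self.ready_at_ms.to_bits() == other.ready_at_ms.to_bits()
            && self.session == other.session
    }
}

impl Eq for Entry {}

impl Ord for Entry {
    fn cmp(&self, other: &Self) -> Ordering {
        // Reversed operands: the smallest timestamp becomes the heap maximum.
        other
            .ready_at_ms
            .total_cmp(&self.ready_at_ms)
            .then_with(|| other.session.cmp(&self.session))
    }
}

impl PartialOrd for Entry {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

/// Pop every session whose entry is due at `now_ms`, skipping stale entries.
/// `current[s]` holds the timestamp session `s` is actually waiting on, if any.
pub fn pop_due(
    heap: &mut BinaryHeap<Entry>,
    current: &mut [Option<f64>],
    now_ms: f64,
) -> Vec<usize> {
    let mut due = Vec::new();
    while let Some(entry) = heap.pop() {
        if entry.ready_at_ms > now_ms {
            // Not due yet: put it back and stop, since later entries are later still.
            heap.push(entry);
            break;
        }
        if current[entry.session] != Some(entry.ready_at_ms) {
            // Stale entry: the session was rescheduled or is already running.
            continue;
        }
        current[entry.session] = None;
        due.push(entry.session);
    }
    due
}
```

Lazy invalidation avoids searching the heap when a session's schedule changes; the cost is that stale entries linger until they reach the top.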
lib/mocker/src/loadgen/mod.rs
0 → 100644
// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
// SPDX-License-Identifier: Apache-2.0

mod driver;
mod trace;
mod types;

pub use driver::WorkloadDriver;
pub use types::{
    ArrivalSpec, DelaySpec, LengthSpec, ReadyTurn, ReplayRequestHashes, RouterSequence,
    SequenceHashMode, SessionPartitionSpec, SessionTrace, SyntheticTraceSpec, Trace, TurnTrace,
};

#[cfg(test)]
mod tests;
lib/mocker/src/loadgen/tests.rs
0 → 100644
// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
// SPDX-License-Identifier: Apache-2.0

use dynamo_kv_router::protocols::{compute_block_hash_for_seq, compute_seq_hash_for_block};
use tempfile::NamedTempFile;
use uuid::Uuid;

use super::*;

fn write_trace(lines: &[serde_json::Value]) -> NamedTempFile {
    let mut file = NamedTempFile::new().unwrap();
    for line in lines {
        use std::io::Write;
        writeln!(file, "{}", serde_json::to_string(line).unwrap()).unwrap();
    }
    file
}

#[test]
fn test_from_mooncake_single_turn_preserves_fields() {
    let file = write_trace(&[serde_json::json!({
        "timestamp": 123.0,
        "input_length": 8,
        "output_length": 4,
        "hash_ids": [7, 8],
    })]);

    let trace = Trace::from_mooncake(file.path(), 4).unwrap();
    assert_eq!(trace.sessions.len(), 1);
    let session = &trace.sessions[0];
    assert_eq!(session.first_arrival_timestamp_ms, Some(123.0));
    assert_eq!(session.turns.len(), 1);
    assert_eq!(session.turns[0].input_length, 8);
    assert_eq!(session.turns[0].max_output_tokens, 4);
    assert_eq!(session.turns[0].hash_ids, vec![7, 8]);
}

#[test]
fn test_from_mooncake_multi_turn_uses_session_id_and_delay() {
    let file = write_trace(&[
        serde_json::json!({
            "session_id": "a",
            "timestamp": 10.0,
            "input_length": 4,
            "output_length": 1,
            "hash_ids": [1],
        }),
        serde_json::json!({
            "session_id": "a",
            "delay": 25.0,
            "input_length": 8,
            "output_length": 2,
            "hash_ids": [1, 2],
        }),
        serde_json::json!({
            "session_id": "b",
            "timestamp": 20.0,
            "input_length": 4,
            "output_length": 1,
            "hash_ids": [3],
        }),
    ]);

    let trace = Trace::from_mooncake(file.path(), 4).unwrap();
    assert_eq!(trace.sessions.len(), 2);
    assert_eq!(trace.sessions[0].session_id, "a");
    assert_eq!(trace.sessions[0].turns.len(), 2);
    assert_eq!(trace.sessions[0].turns[1].delay_after_previous_ms, 25.0);
    assert_eq!(trace.sessions[1].session_id, "b");
}

#[test]
fn test_from_mooncake_defaults_missing_input_length_from_hash_capacity() {
    let file = write_trace(&[serde_json::json!({
        "timestamp": 7.0,
        "output_length": 3,
        "hash_ids": [5, 6],
    })]);

    let trace = Trace::from_mooncake(file.path(), 4).unwrap();
    assert_eq!(trace.sessions.len(), 1);
    assert_eq!(trace.sessions[0].turns[0].input_length, 8);
}

#[test]
fn test_turn_to_direct_request_repeats_hash_ids_by_block_size() {
    let turn = TurnTrace {
        input_length: 6,
        max_output_tokens: 3,
        hash_ids: vec![1, 2],
        delay_after_previous_ms: 0.0,
    };

    let request = turn
        .to_direct_request(4, Uuid::from_u128(1), Some(5.0))
        .unwrap();
    assert_eq!(request.tokens, vec![1, 1, 1, 1, 2, 2]);
    assert_eq!(request.arrival_timestamp_ms, Some(5.0));
}

#[test]
fn test_turn_replay_hashes_match_full_blocks_only() {
    let turn = TurnTrace {
        input_length: 6,
        max_output_tokens: 3,
        hash_ids: vec![1, 2],
        delay_after_previous_ms: 0.0,
    };

    let request = turn
        .to_direct_request(4, Uuid::from_u128(1), Some(5.0))
        .unwrap();
    let replay_hashes = turn.to_replay_hashes(4).unwrap();
    let expected_local = compute_block_hash_for_seq(&request.tokens, 4, None, None);

    assert_eq!(replay_hashes.local_block_hashes, expected_local);
    assert_eq!(
        replay_hashes.sequence_hashes,
        compute_seq_hash_for_block(&expected_local)
    );
    assert_eq!(replay_hashes.local_block_hashes.len(), 1);
}

#[test]
fn test_partition_by_session_round_robin_keeps_sessions_intact() {
    let trace = Trace::synthetic(SyntheticTraceSpec {
        block_size: 4,
        num_sessions: 4,
        turns_per_session: 2,
        input_tokens: LengthSpec {
            mean: 8,
            stddev: 0.0,
        },
        output_tokens: LengthSpec {
            mean: 2,
            stddev: 0.0,
        },
        shared_prefix_ratio: 0.5,
        num_prefix_groups: 2,
        first_turn_arrivals: ArrivalSpec::Burst,
        inter_turn_delays: DelaySpec::ConstantMs(5.0),
        seed: 7,
    })
    .unwrap();

    let partitions =
        trace.partition_by_session(SessionPartitionSpec::RoundRobin { num_partitions: 2 });
    assert_eq!(partitions.len(), 2);
    assert_eq!(partitions[0].sessions.len(), 2);
    assert_eq!(partitions[1].sessions.len(), 2);
    assert!(
        partitions
            .iter()
            .flat_map(|partition| partition.sessions.iter())
            .all(|session| session.turns.len() == 2)
    );
}

#[test]
fn test_synthetic_prefix_groups_share_prefixes_within_group() {
    let trace = Trace::synthetic(SyntheticTraceSpec {
        block_size: 4,
        num_sessions: 6,
        turns_per_session: 1,
        input_tokens: LengthSpec {
            mean: 16,
            stddev: 0.0,
        },
        output_tokens: LengthSpec {
            mean: 2,
            stddev: 0.0,
        },
        shared_prefix_ratio: 0.5,
        num_prefix_groups: 2,
        first_turn_arrivals: ArrivalSpec::Burst,
        inter_turn_delays: DelaySpec::None,
        seed: 42,
    })
    .unwrap();

    let prefix_len = 2;
    let prefixes = trace
        .sessions
        .iter()
        .map(|session| session.turns[0].hash_ids[..prefix_len].to_vec())
        .collect::<Vec<_>>();
    assert!(prefixes.windows(2).any(|window| window[0] == window[1]));
}

#[test]
fn test_expand_hash_prefix_depth_scales_hashes_and_input_length() {
    let trace = Trace {
        block_size: 4,
        sessions: vec![SessionTrace {
            session_id: "session".to_string(),
            first_arrival_timestamp_ms: Some(10.0),
            turns: vec![TurnTrace {
                input_length: 6,
                max_output_tokens: 2,
                hash_ids: vec![7, 8],
                delay_after_previous_ms: 0.0,
            }],
        }],
    }
    .expand_hash_prefix_depth(3);

    let turn = &trace.sessions[0].turns[0];
    assert_eq!(turn.input_length, 18);
    assert_eq!(turn.hash_ids, vec![21, 22, 23, 24, 25, 26]);

    let request = turn
        .to_direct_request(trace.block_size, Uuid::from_u128(2), Some(10.0))
        .unwrap();
    assert_eq!(request.tokens.len(), 18);
}

#[test]
fn test_rescale_ready_span_scales_session_starts_and_inter_turn_delays() {
    let trace = Trace {
        block_size: 4,
        sessions: vec![
            SessionTrace {
                session_id: "a".to_string(),
                first_arrival_timestamp_ms: Some(10.0),
                turns: vec![
                    TurnTrace {
                        input_length: 4,
                        max_output_tokens: 1,
                        hash_ids: vec![1],
                        delay_after_previous_ms: 0.0,
                    },
                    TurnTrace {
                        input_length: 4,
                        max_output_tokens: 1,
                        hash_ids: vec![2],
                        delay_after_previous_ms: 20.0,
                    },
                ],
            },
            SessionTrace {
                session_id: "b".to_string(),
                first_arrival_timestamp_ms: Some(30.0),
                turns: vec![TurnTrace {
                    input_length: 4,
                    max_output_tokens: 1,
                    hash_ids: vec![3],
                    delay_after_previous_ms: 0.0,
                }],
            },
        ],
    }
    .rescale_ready_span(100)
    .unwrap();

    assert_eq!(trace.sessions[0].first_arrival_timestamp_ms, Some(0.0));
    assert_eq!(trace.sessions[1].first_arrival_timestamp_ms, Some(100.0));
    assert_eq!(trace.sessions[0].turns[1].delay_after_previous_ms, 100.0);
}

#[test]
fn test_driver_requires_completion_before_follow_up_turn() {
    let trace = Trace {
        block_size: 4,
        sessions: vec![SessionTrace {
            session_id: "s".to_string(),
            first_arrival_timestamp_ms: Some(0.0),
            turns: vec![
                TurnTrace {
                    input_length: 4,
                    max_output_tokens: 1,
                    hash_ids: vec![1],
                    delay_after_previous_ms: 0.0,
                },
                TurnTrace {
                    input_length: 4,
                    max_output_tokens: 1,
                    hash_ids: vec![2],
                    delay_after_previous_ms: 10.0,
                },
            ],
        }],
    };

    let mut driver = trace.into_trace_driver().unwrap();
    let first = driver.pop_ready(0.0, 1);
    assert_eq!(first.len(), 1);
    assert!(driver.pop_ready(100.0, 1).is_empty());

    driver.on_complete(first[0].request_uuid, 5.0).unwrap();
    assert!(driver.pop_ready(14.0, 1).is_empty());
    let second = driver.pop_ready(15.0, 1);
    assert_eq!(second.len(), 1);
    assert_eq!(second[0].turn_index, 1);
}

#[test]
fn test_driver_next_ready_time_tracks_earliest_pending_turn() {
    let trace = Trace {
        block_size: 4,
        sessions: vec![
            SessionTrace {
                session_id: "a".to_string(),
                first_arrival_timestamp_ms: Some(10.0),
                turns: vec![
                    TurnTrace {
                        input_length: 4,
                        max_output_tokens: 1,
                        hash_ids: vec![1],
                        delay_after_previous_ms: 0.0,
                    },
                    TurnTrace {
                        input_length: 4,
                        max_output_tokens: 1,
                        hash_ids: vec![2],
                        delay_after_previous_ms: 5.0,
                    },
                ],
            },
            SessionTrace {
                session_id: "b".to_string(),
                first_arrival_timestamp_ms: Some(20.0),
                turns: vec![TurnTrace {
                    input_length: 4,
                    max_output_tokens: 1,
                    hash_ids: vec![3],
                    delay_after_previous_ms: 0.0,
                }],
            },
        ],
    };

    let mut driver = trace.into_trace_driver().unwrap();
    assert_eq!(driver.next_ready_time_ms(), Some(10.0));
    let first = driver.pop_ready(10.0, 1);
    assert_eq!(first.len(), 1);
    assert_eq!(driver.next_ready_time_ms(), Some(20.0));

    driver.on_complete(first[0].request_uuid, 25.0).unwrap();
    assert_eq!(driver.next_ready_time_ms(), Some(20.0));
    let second = driver.pop_ready(20.0, 1);
    assert_eq!(second.len(), 1);
    assert_eq!(driver.next_ready_time_ms(), Some(30.0));
}

#[test]
fn test_trace_driver_round_trips_turn_semantics_into_ready_requests() {
    let trace = Trace {
        block_size: 2,
        sessions: vec![
            SessionTrace {
                session_id: "session-a".to_string(),
                first_arrival_timestamp_ms: Some(10.0),
                turns: vec![
                    TurnTrace {
                        input_length: 4,
                        max_output_tokens: 2,
                        hash_ids: vec![1, 2],
                        delay_after_previous_ms: 0.0,
                    },
                    TurnTrace {
                        input_length: 2,
                        max_output_tokens: 3,
                        hash_ids: vec![3],
                        delay_after_previous_ms: 5.0,
                    },
                ],
            },
            SessionTrace {
                session_id: "session-b".to_string(),
                first_arrival_timestamp_ms: Some(12.0),
                turns: vec![TurnTrace {
                    input_length: 2,
                    max_output_tokens: 1,
                    hash_ids: vec![4],
                    delay_after_previous_ms: 0.0,
                }],
            },
        ],
    };
    let expected = trace.clone();
    let mut driver = trace.into_trace_driver().unwrap();

    assert!(driver.pop_ready(9.0, usize::MAX).is_empty());

    let first = driver.pop_ready(10.0, usize::MAX);
    assert_eq!(first.len(), 1);
    let first = &first[0];
    assert_eq!(first.session_id, "session-a");
    assert_eq!(first.turn_index, 0);
    assert_eq!(first.scheduled_ready_at_ms, 10.0);
    assert_eq!(
        first.request.tokens.len(),
        expected.sessions[0].turns[0].input_length
    );
    assert_eq!(
        first.request.max_output_tokens,
        expected.sessions[0].turns[0].max_output_tokens
    );
    assert_eq!(first.request.arrival_timestamp_ms, Some(10.0));
    assert_eq!(
        first.replay_hashes.as_ref(),
        Some(
            &expected.sessions[0].turns[0]
                .to_replay_hashes(expected.block_size)
                .unwrap()
        )
    );
    let expected_first_request = expected.sessions[0].turns[0]
        .to_direct_request(expected.block_size, first.request_uuid, Some(10.0))
        .unwrap();
    assert_eq!(first.request.tokens, expected_first_request.tokens);
    assert_eq!(
        first.request.max_output_tokens,
        expected_first_request.max_output_tokens
    );
    assert_eq!(first.request.uuid, expected_first_request.uuid);
    assert_eq!(
        first.request.arrival_timestamp_ms,
        expected_first_request.arrival_timestamp_ms
    );

    let second = driver.pop_ready(12.0, usize::MAX);
    assert_eq!(second.len(), 1);
    let second = &second[0];
    assert_eq!(second.session_id, "session-b");
    assert_eq!(second.turn_index, 0);
    assert_eq!(second.scheduled_ready_at_ms, 12.0);
    assert_eq!(
        second.request.tokens.len(),
        expected.sessions[1].turns[0].input_length
    );
    assert_eq!(
        second.request.max_output_tokens,
        expected.sessions[1].turns[0].max_output_tokens
    );
    assert_eq!(second.request.arrival_timestamp_ms, Some(12.0));
    assert_eq!(
        second.replay_hashes.as_ref(),
        Some(
            &expected.sessions[1].turns[0]
                .to_replay_hashes(expected.block_size)
                .unwrap()
        )
    );

    driver.on_complete(first.request_uuid, 20.0).unwrap();
    assert!(driver.pop_ready(24.0, usize::MAX).is_empty());

    let third = driver.pop_ready(25.0, usize::MAX);
    assert_eq!(third.len(), 1);
    let third = &third[0];
    assert_eq!(third.session_id, "session-a");
    assert_eq!(third.turn_index, 1);
    assert_eq!(third.scheduled_ready_at_ms, 25.0);
    assert_eq!(
        third.request.tokens.len(),
        expected.sessions[0].turns[1].input_length
    );
    assert_eq!(
        third.request.max_output_tokens,
        expected.sessions[0].turns[1].max_output_tokens
    );
    assert_eq!(third.request.arrival_timestamp_ms, Some(25.0));
    assert_eq!(
        third.replay_hashes.as_ref(),
        Some(
            &expected.sessions[0].turns[1]
                .to_replay_hashes(expected.block_size)
                .unwrap()
        )
    );
    let expected_third_request = expected.sessions[0].turns[1]
        .to_direct_request(expected.block_size, third.request_uuid, Some(25.0))
        .unwrap();
    assert_eq!(third.request.tokens, expected_third_request.tokens);
    assert_eq!(
        third.request.max_output_tokens,
        expected_third_request.max_output_tokens
    );
    assert_eq!(third.request.uuid, expected_third_request.uuid);
    assert_eq!(
        third.request.arrival_timestamp_ms,
        expected_third_request.arrival_timestamp_ms
    );
}
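The tests above pin down how a turn becomes a request: each hash ID is repeated `block_size` times as a token ID, the result is truncated to `input_length`, and only fully populated blocks contribute a block hash. A standalone sketch of that arithmetic follows, using the values from `test_turn_to_direct_request_repeats_hash_ids_by_block_size`. The free functions `synthesize_tokens` and `num_full_blocks` are hypothetical helpers written for this illustration; the crate implements the same steps as `TurnTrace` methods returning `Result`, and no actual hashing is done here.

```rust
/// Expand each hash ID into `block_size` identical token IDs, then truncate
/// to `input_length`. Returns `None` when the IDs cannot cover the request.
pub fn synthesize_tokens(
    hash_ids: &[u64],
    block_size: usize,
    input_length: usize,
) -> Option<Vec<u32>> {
    if block_size == 0 || hash_ids.len() * block_size < input_length {
        return None;
    }
    let mut tokens = Vec::with_capacity(input_length);
    for &hash_id in hash_ids {
        // The hash ID doubles as the token ID for the whole block.
        tokens.extend(std::iter::repeat(hash_id as u32).take(block_size));
        if tokens.len() >= input_length {
            tokens.truncate(input_length);
            break;
        }
    }
    Some(tokens)
}

/// Number of completely filled blocks; a trailing partial block is ignored.
/// Returns 0 for a zero `block_size` instead of dividing by zero.
pub fn num_full_blocks(input_length: usize, block_size: usize) -> usize {
    input_length.checked_div(block_size).unwrap_or(0)
}
```

For `hash_ids = [1, 2]`, `block_size = 4`, and `input_length = 6`, the second block is only half filled, which is why the replay-hash test expects a single local block hash.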
lib/mocker/src/loadgen/trace.rs
0 → 100644
// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
// SPDX-License-Identifier: Apache-2.0

use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::path::Path;

use anyhow::{Context, Result, anyhow, bail};
use dynamo_kv_router::LocalBlockHash;
use dynamo_kv_router::protocols::{
    ExternalSequenceBlockHash, WorkerId, XXH3_SEED, compute_seq_hash_for_block,
};
use dynamo_tokens::compute_hash_v2;
use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};
use serde::Deserialize;
use uuid::Uuid;

use super::driver::WorkloadDriver;
use super::types::{
    ArrivalSpec, DelaySpec, LengthSpec, ReplayRequestHashes, RouterSequence, SequenceHashMode,
    SessionPartitionSpec, SessionTrace, SyntheticTraceSpec, Trace, TurnTrace,
};
use crate::common::protocols::DirectRequest;

#[derive(Debug, Deserialize)]
struct RawMooncakeRecord {
    #[serde(default)]
    session_id: Option<String>,
    #[serde(default)]
    timestamp: Option<f64>,
    #[serde(default)]
    created_time: Option<f64>,
    #[serde(default, alias = "input_tokens")]
    input_length: Option<usize>,
    #[serde(default, alias = "output_tokens")]
    output_length: Option<usize>,
    #[serde(default)]
    hash_ids: Option<Vec<u64>>,
    #[serde(default)]
    delay: Option<f64>,
    #[serde(default)]
    delay_ms: Option<f64>,
}
impl
TurnTrace
{
fn
validate_block_size_and_capacity
(
&
self
,
block_size
:
usize
)
->
Result
<
()
>
{
if
block_size
==
0
{
bail!
(
"block_size must be greater than 0"
);
}
if
self
.hash_ids
.len
()
*
block_size
<
self
.input_length
{
bail!
(
"input_length {} exceeds synthesized capacity {}"
,
self
.input_length
,
self
.hash_ids
.len
()
*
block_size
);
}
Ok
(())
}

    pub fn to_direct_request(
        &self,
        block_size: usize,
        request_uuid: Uuid,
        arrival_timestamp_ms: Option<f64>,
    ) -> Result<DirectRequest> {
        self.validate_block_size_and_capacity(block_size)?;

        let mut tokens = Vec::with_capacity(self.input_length);
        for &hash_id in &self.hash_ids {
            let token_id = hash_id as u32;
            tokens.extend((0..block_size).map(|_| token_id));
            if tokens.len() >= self.input_length {
                tokens.truncate(self.input_length);
                break;
            }
        }
        if tokens.len() != self.input_length {
            bail!(
                "failed to synthesize {} tokens from {} hash_ids",
                self.input_length,
                self.hash_ids.len()
            );
        }

        Ok(DirectRequest {
            tokens,
            max_output_tokens: self.max_output_tokens,
            uuid: Some(request_uuid),
            dp_rank: 0,
            arrival_timestamp_ms,
        })
    }

    pub fn to_replay_hashes(&self, block_size: usize) -> Result<ReplayRequestHashes> {
        self.validate_block_size_and_capacity(block_size)?;
        let num_full_blocks = self.input_length / block_size;
        let local_block_hashes = self
            .hash_ids
            .iter()
            .take(num_full_blocks)
            .map(|&hash_id| local_block_hash_from_id(hash_id, block_size))
            .collect::<Vec<_>>();
        let sequence_hashes = compute_seq_hash_for_block(&local_block_hashes);
        Ok(ReplayRequestHashes {
            local_block_hashes,
            sequence_hashes,
        })
    }
}

impl Trace {
    pub fn from_mooncake(path: &Path, block_size: usize) -> Result<Self> {
        if block_size == 0 {
            bail!("block_size must be greater than 0");
        }

        let file = File::open(path)
            .with_context(|| format!("failed to open trace file {}", path.display()))?;
        let reader = BufReader::new(file);
        let mut sessions = Vec::new();
        let mut session_indices = HashMap::new();
        let mut last_timestamps: Vec<Option<f64>> = Vec::new();

        for (line_idx, line) in reader.lines().enumerate() {
            let line = line.with_context(|| {
                format!(
                    "failed to read line {} from {}",
                    line_idx + 1,
                    path.display()
                )
            })?;
            if line.trim().is_empty() {
                continue;
            }

            let raw: RawMooncakeRecord = serde_json::from_str(&line).with_context(|| {
                format!(
                    "failed to parse line {} from {} as JSON",
                    line_idx + 1,
                    path.display()
                )
            })?;
            let session_id = raw
                .session_id
                .unwrap_or_else(|| format!("request_{}", line_idx + 1));
            let hash_ids = raw
                .hash_ids
                .ok_or_else(|| anyhow!("trace line {} is missing hash_ids", line_idx + 1))?;
            let input_length = raw.input_length.unwrap_or(hash_ids.len() * block_size);
            let output_length = raw
                .output_length
                .ok_or_else(|| anyhow!("trace line {} is missing output_length", line_idx + 1))?;
            let timestamp_ms = raw.timestamp.or(raw.created_time);
            let explicit_delay_ms = raw.delay.or(raw.delay_ms);

            let session_index = *session_indices
                .entry(session_id.clone())
                .or_insert_with(|| {
                    let idx = sessions.len();
                    sessions.push(SessionTrace {
                        session_id: session_id.clone(),
                        first_arrival_timestamp_ms: timestamp_ms,
                        turns: Vec::new(),
                    });
                    last_timestamps.push(timestamp_ms);
                    idx
                });
            let session = sessions
                .get_mut(session_index)
                .expect("newly inserted session must exist");
            let turn_idx = session.turns.len();

            let delay_after_previous_ms = if turn_idx == 0 {
                let delay = explicit_delay_ms.unwrap_or(0.0);
                if delay != 0.0 {
                    bail!(
                        "trace line {} sets delay on the first turn of session {}",
                        line_idx + 1,
                        session.session_id
                    );
                }
                0.0
            } else if let Some(delay_ms) = explicit_delay_ms {
                delay_ms
            } else if let Some(timestamp_ms) = timestamp_ms {
                let previous_timestamp_ms = last_timestamps[session_index].ok_or_else(|| {
                    anyhow!(
                        "trace line {} for session {} cannot infer delay without a previous timestamp",
                        line_idx + 1,
                        session.session_id
                    )
                })?;
                timestamp_ms - previous_timestamp_ms
            } else {
                0.0
            };

            if !delay_after_previous_ms.is_finite() || delay_after_previous_ms < 0.0 {
                bail!(
                    "trace line {} has invalid delay {}",
                    line_idx + 1,
                    delay_after_previous_ms
                );
            }
            if hash_ids.len() * block_size < input_length {
                bail!(
                    "trace line {} input_length {} exceeds synthesized capacity {}",
                    line_idx + 1,
                    input_length,
                    hash_ids.len() * block_size
                );
            }

            session.turns.push(TurnTrace {
                input_length,
                max_output_tokens: output_length,
                hash_ids,
                delay_after_previous_ms,
            });
            if let Some(timestamp_ms) = timestamp_ms {
                last_timestamps[session_index] = Some(timestamp_ms);
            }
        }

        if sessions.is_empty() {
            bail!("trace file {} did not contain any requests", path.display());
        }

        Ok(Self {
            block_size,
            sessions,
        })
    }

    pub fn synthetic(spec: SyntheticTraceSpec) -> Result<Self> {
        if spec.block_size == 0 {
            bail!("block_size must be greater than 0");
        }
        if spec.num_sessions == 0 {
            bail!("num_sessions must be greater than 0");
        }
        if spec.turns_per_session == 0 {
            bail!("turns_per_session must be greater than 0");
        }
        if !(0.0..=1.0).contains(&spec.shared_prefix_ratio) {
            bail!(
                "shared_prefix_ratio must be between 0.0 and 1.0, got {}",
                spec.shared_prefix_ratio
            );
        }

        let mut rng = StdRng::seed_from_u64(spec.seed);
        let mut sessions = Vec::with_capacity(spec.num_sessions);
        let mut first_arrivals = Vec::with_capacity(spec.num_sessions);
        let mean_gap_ms = arrival_spec_mean_gap_ms(&spec.first_turn_arrivals)?;
        let mut next_arrival_ms = 0.0;
        for session_idx in 0..spec.num_sessions {
            if session_idx == 0 {
                first_arrivals.push(0.0);
                continue;
            }
            next_arrival_ms +=
                sample_arrival_gap_ms(&spec.first_turn_arrivals, mean_gap_ms, &mut rng)?;
            first_arrivals.push(next_arrival_ms);
        }

        let mut next_unique_hash = 1_u64;
        for (session_idx, first_arrival_timestamp_ms) in first_arrivals.into_iter().enumerate() {
            let group_id = if spec.num_prefix_groups > 0 && spec.shared_prefix_ratio > 0.0 {
                Some(rng.random_range(0..spec.num_prefix_groups) as u64)
            } else {
                None
            };
            let mut turns = Vec::with_capacity(spec.turns_per_session);
            for turn_idx in 0..spec.turns_per_session {
                let input_length = sample_length(&spec.input_tokens, 1, &mut rng);
                let max_output_tokens = sample_length(&spec.output_tokens, 1, &mut rng);
                let num_blocks = input_length.div_ceil(spec.block_size);
                let prefix_blocks =
                    ((num_blocks as f64) * spec.shared_prefix_ratio).round() as usize;
                let prefix_blocks = prefix_blocks.min(num_blocks);
                let mut hash_ids = Vec::with_capacity(num_blocks);
                for block_idx in 0..prefix_blocks {
                    if let Some(group_id) = group_id {
                        hash_ids.push(0xD00D_0000_0000_0000 | (group_id << 32) | block_idx as u64);
                    }
                }
                while hash_ids.len() < num_blocks {
                    hash_ids.push(next_unique_hash);
                    next_unique_hash = next_unique_hash
                        .checked_add(1)
                        .expect("synthetic hash id overflow");
                }

                turns.push(TurnTrace {
                    input_length,
                    max_output_tokens,
                    hash_ids,
                    delay_after_previous_ms: if turn_idx == 0 {
                        0.0
                    } else {
                        sample_delay_ms(&spec.inter_turn_delays, &mut rng)?
                    },
                });
            }
            sessions.push(SessionTrace {
                session_id: format!("session_{session_idx}"),
                first_arrival_timestamp_ms: Some(first_arrival_timestamp_ms),
                turns,
            });
        }

        Ok(Self {
            block_size: spec.block_size,
            sessions,
        })
    }

    pub fn validate_for_trace_mode(&self) -> Result<()> {
        self.validate(false)
    }

    pub fn validate_for_concurrency_mode(&self) -> Result<()> {
        self.validate(true)
    }

    pub fn normalize_session_starts(mut self) -> Result<Self> {
        let Some(min_timestamp_ms) = self
            .sessions
            .iter()
            .filter_map(|session| session.first_arrival_timestamp_ms)
            .min_by(|left, right| left.total_cmp(right))
        else {
            return Ok(self);
        };

        for session in &mut self.sessions {
            if let Some(timestamp_ms) = session.first_arrival_timestamp_ms.as_mut() {
                *timestamp_ms -= min_timestamp_ms;
            }
        }

        Ok(self)
    }

    pub fn speed_up_timing(mut self, ratio: f64) -> Result<Self> {
        if !ratio.is_finite() || ratio <= 0.0 {
            bail!("ratio must be a finite positive number, got {ratio}");
        }

        for session in &mut self.sessions {
            if let Some(timestamp_ms) = session.first_arrival_timestamp_ms.as_mut() {
                *timestamp_ms /= ratio;
            }
            for turn in &mut session.turns {
                turn.delay_after_previous_ms /= ratio;
            }
        }

        Ok(self)
    }

    pub fn rescale_session_start_span(mut self, duration_ms: u64) -> Result<Self> {
        let Some(min_timestamp_ms) = self
            .sessions
            .iter()
            .filter_map(|session| session.first_arrival_timestamp_ms)
            .min_by(|left, right| left.total_cmp(right))
        else {
            return Ok(self);
        };
        let Some(max_timestamp_ms) = self
            .sessions
            .iter()
            .filter_map(|session| session.first_arrival_timestamp_ms)
            .max_by(|left, right| left.total_cmp(right))
        else {
            return Ok(self);
        };

        let target_span_ms = duration_ms as f64;
        let source_span_ms = max_timestamp_ms - min_timestamp_ms;
        for session in &mut self.sessions {
            if let Some(timestamp_ms) = session.first_arrival_timestamp_ms.as_mut() {
                *timestamp_ms = if source_span_ms == 0.0 {
                    0.0
                } else {
                    (*timestamp_ms - min_timestamp_ms) * target_span_ms / source_span_ms
                };
            }
        }

        Ok(self)
    }

    pub fn rescale_ready_span(mut self, duration_ms: u64) -> Result<Self> {
        let Some(min_start_ms) = self
            .sessions
            .iter()
            .map(|session| session.first_arrival_timestamp_ms.unwrap_or(0.0))
            .min_by(|left, right| left.total_cmp(right))
        else {
            return Ok(self);
        };
        let Some(max_ready_ms) = self
            .sessions
            .iter()
            .map(|session| {
                session.first_arrival_timestamp_ms.unwrap_or(0.0)
                    + session
                        .turns
                        .iter()
                        .enumerate()
                        .filter(|(turn_idx, _)| *turn_idx > 0)
                        .map(|(_, turn)| turn.delay_after_previous_ms)
                        .sum::<f64>()
            })
            .max_by(|left, right| left.total_cmp(right))
        else {
            return Ok(self);
        };

        let ratio = duration_ms as f64 / (max_ready_ms - min_start_ms).max(1.0);
        for session in &mut self.sessions {
            if let Some(start_ms) = session.first_arrival_timestamp_ms.as_mut() {
                *start_ms = (*start_ms - min_start_ms) * ratio;
            }
            for (turn_idx, turn) in session.turns.iter_mut().enumerate() {
                if turn_idx > 0 {
                    turn.delay_after_previous_ms *= ratio;
                }
            }
        }

        Ok(self)
    }

    pub fn expand_hash_prefix_depth(mut self, factor: usize) -> Self {
        if factor <= 1 {
            return self;
        }

        for session in &mut self.sessions {
            for turn in &mut session.turns {
                turn.input_length = turn
                    .input_length
                    .checked_mul(factor)
                    .expect("input_length expansion overflow");
                turn.hash_ids = turn
                    .hash_ids
                    .iter()
                    .flat_map(|&hash_id| {
                        let base = hash_id
                            .checked_mul(factor as u64)
                            .expect("hash prefix expansion overflow");
                        (0..factor as u64).map(move |offset| base + offset)
                    })
                    .collect();
            }
        }

        self
    }

    pub fn duplicate_hash_space(mut self, copies: usize) -> Self {
        if copies <= 1 {
            return self;
        }

        let max_hash_id = self
            .sessions
            .iter()
            .flat_map(|session| session.turns.iter())
            .flat_map(|turn| turn.hash_ids.iter().copied())
            .max()
            .unwrap_or(0);
        let offset_base = max_hash_id + 1;
        let original_sessions = self.sessions.clone();
        self.sessions.clear();

        for copy_idx in 0..copies {
            let offset = offset_base * copy_idx as u64;
            for session in &original_sessions {
                let mut duplicated = session.clone();
                duplicated.session_id = format!("{}:copy_{copy_idx}", session.session_id);
                for turn in &mut duplicated.turns {
                    turn.hash_ids = turn
                        .hash_ids
                        .iter()
                        .map(|&hash_id| {
                            hash_id
                                .checked_add(offset)
                                .expect("hash duplication overflow")
                        })
                        .collect();
                }
                self.sessions.push(duplicated);
            }
        }

        self
    }

    pub fn partition_by_session(&self, spec: SessionPartitionSpec) -> Vec<Self> {
        let num_partitions = match spec {
            SessionPartitionSpec::Random { num_partitions, .. } => num_partitions,
            SessionPartitionSpec::RoundRobin { num_partitions } => num_partitions,
        }
        .max(1);
        let mut partitions = vec![
            Self {
                block_size: self.block_size,
                sessions: Vec::new(),
            };
            num_partitions
        ];
        let mut rng = match spec {
            SessionPartitionSpec::Random { seed, .. } => Some(StdRng::seed_from_u64(seed)),
            SessionPartitionSpec::RoundRobin { .. } => None,
        };

        for (session_idx, session) in self.sessions.iter().cloned().enumerate() {
            let partition_idx = match spec {
                SessionPartitionSpec::Random { .. } => rng
                    .as_mut()
                    .expect("random partitioner must exist")
                    .random_range(0..num_partitions),
                SessionPartitionSpec::RoundRobin { .. } => session_idx % num_partitions,
            };
            partitions[partition_idx].sessions.push(session);
        }

        partitions
    }

    pub fn to_single_turn_requests(&self) -> Result<Vec<DirectRequest>> {
        let mut requests = Vec::with_capacity(self.sessions.len());
        for session in &self.sessions {
            if session.turns.len() != 1 {
                bail!(
                    "to_single_turn_requests requires exactly one turn per session, but session {} has {} turns",
                    session.session_id,
                    session.turns.len()
                );
            }
            requests.push(session.turns[0].to_direct_request(
                self.block_size,
                Uuid::new_v4(),
                session.first_arrival_timestamp_ms,
            )?);
        }
        Ok(requests)
    }

    pub fn to_router_sequences(
        &self,
        worker_id: WorkerId,
        hash_mode: SequenceHashMode,
    ) -> Result<Vec<RouterSequence>> {
        let mut sequences = Vec::new();
        for session in &self.sessions {
            for turn in &session.turns {
                let local_hashes = turn
                    .hash_ids
                    .iter()
                    .map(|&hash_id| local_block_hash_from_id(hash_id, self.block_size))
                    .collect::<Vec<_>>();
                let external_hashes = match hash_mode {
                    SequenceHashMode::Raw => local_hashes
                        .iter()
                        .map(|hash| ExternalSequenceBlockHash(hash.0))
                        .collect(),
                    SequenceHashMode::Cumulative => compute_seq_hash_for_block(&local_hashes)
                        .into_iter()
                        .map(ExternalSequenceBlockHash)
                        .collect(),
                };
                sequences.push(RouterSequence {
                    worker_id,
                    local_hashes,
                    external_hashes,
                });
            }
        }
        Ok(sequences)
    }

    pub fn into_trace_driver(self) -> Result<WorkloadDriver> {
        self.validate_for_trace_mode()?;
        WorkloadDriver::new_trace(self)
    }

    pub fn into_concurrency_driver(self) -> Result<WorkloadDriver> {
        self.validate_for_concurrency_mode()?;
        WorkloadDriver::new_concurrency(self)
    }

    fn validate(&self, allow_missing_first_timestamp: bool) -> Result<()> {
        if self.block_size == 0 {
            bail!("block_size must be greater than 0");
        }
        if self.sessions.is_empty() {
            bail!("trace must contain at least one session");
        }

        for session in &self.sessions {
            if session.turns.is_empty() {
                bail!(
                    "session {} must contain at least one turn",
                    session.session_id
                );
            }
            if !allow_missing_first_timestamp {
                let timestamp_ms = session.first_arrival_timestamp_ms.ok_or_else(|| {
                    anyhow!(
                        "trace mode requires first_arrival_timestamp_ms for session {}",
                        session.session_id
                    )
                })?;
                if !timestamp_ms.is_finite() || timestamp_ms < 0.0 {
                    bail!(
                        "session {} has invalid first_arrival_timestamp_ms {}",
                        session.session_id,
                        timestamp_ms
                    );
                }
            } else if let Some(timestamp_ms) = session.first_arrival_timestamp_ms
                && (!timestamp_ms.is_finite() || timestamp_ms < 0.0)
            {
                bail!(
                    "session {} has invalid first_arrival_timestamp_ms {}",
                    session.session_id,
                    timestamp_ms
                );
            }

            for (turn_idx, turn) in session.turns.iter().enumerate() {
                if turn.input_length == 0 {
                    bail!(
                        "session {} turn {} must have a positive input_length",
                        session.session_id,
                        turn_idx
                    );
                }
                if turn.hash_ids.is_empty() {
                    bail!(
                        "session {} turn {} must contain at least one hash id",
                        session.session_id,
                        turn_idx
                    );
                }
                if turn.hash_ids.len() * self.block_size < turn.input_length {
                    bail!(
                        "session {} turn {} input_length {} exceeds synthesized capacity {}",
                        session.session_id,
                        turn_idx,
                        turn.input_length,
                        turn.hash_ids.len() * self.block_size
                    );
                }
                if !turn.delay_after_previous_ms.is_finite() || turn.delay_after_previous_ms < 0.0 {
                    bail!(
                        "session {} turn {} has invalid delay {}",
                        session.session_id,
                        turn_idx,
                        turn.delay_after_previous_ms
                    );
                }
                if turn_idx == 0 && turn.delay_after_previous_ms != 0.0 {
                    bail!(
                        "session {} first turn must have delay_after_previous_ms == 0.0",
                        session.session_id
                    );
                }
            }
        }

        Ok(())
    }
}

fn arrival_spec_mean_gap_ms(spec: &ArrivalSpec) -> Result<f64> {
    match spec {
        ArrivalSpec::Burst => Ok(0.0),
        ArrivalSpec::ConstantQps { qps }
        | ArrivalSpec::PoissonQps { qps }
        | ArrivalSpec::GammaQps { qps, .. } => {
            if !qps.is_finite() || *qps <= 0.0 {
                bail!("qps must be a finite positive number, got {qps}");
            }
            Ok(1000.0 / qps)
        }
    }
}

fn sample_arrival_gap_ms(spec: &ArrivalSpec, mean_gap_ms: f64, rng: &mut StdRng) -> Result<f64> {
    match spec {
        ArrivalSpec::Burst => Ok(0.0),
        ArrivalSpec::ConstantQps { .. } => Ok(mean_gap_ms),
        ArrivalSpec::PoissonQps { .. } => Ok(sample_exponential_ms(mean_gap_ms, rng)),
        ArrivalSpec::GammaQps { smoothness, .. } => {
            if !smoothness.is_finite() || *smoothness <= 0.0 {
                bail!("gamma smoothness must be a finite positive number, got {smoothness}");
            }
            Ok(sample_gamma_ms(*smoothness, mean_gap_ms / smoothness, rng))
        }
    }
}

fn sample_delay_ms(spec: &DelaySpec, rng: &mut StdRng) -> Result<f64> {
    match spec {
        DelaySpec::None => Ok(0.0),
        DelaySpec::ConstantMs(delay_ms) => {
            if !delay_ms.is_finite() || *delay_ms < 0.0 {
                bail!("delay must be a finite non-negative number, got {delay_ms}");
            }
            Ok(*delay_ms)
        }
        DelaySpec::ExponentialMs { mean_ms } => {
            if !mean_ms.is_finite() || *mean_ms < 0.0 {
                bail!("mean_ms must be a finite non-negative number, got {mean_ms}");
            }
            Ok(sample_exponential_ms(*mean_ms, rng))
        }
    }
}

// Draws a normally distributed length via the Box-Muller transform,
// rounded and clamped below at `min_value`.
fn sample_length(spec: &LengthSpec, min_value: usize, rng: &mut StdRng) -> usize {
    if spec.stddev == 0.0 {
        return spec.mean.max(min_value);
    }

    let stddev = spec.stddev.abs();
    // Map the uniform draw from [0, 1) to (0, 1] so that `ln` stays finite.
    let u1 = (1.0 - rng.random::<f64>()).clamp(f64::MIN_POSITIVE, 1.0);
    let u2 = rng.random::<f64>();
    let z0 = (-2.0 * u1.ln()).sqrt() * (std::f64::consts::TAU * u2).cos();
    let sample = spec.mean as f64 + z0 * stddev;
    sample.round().max(min_value as f64) as usize
}

// Inverse-transform sampling of an exponential distribution with mean `mean_ms`.
fn sample_exponential_ms(mean_ms: f64, rng: &mut StdRng) -> f64 {
    if mean_ms == 0.0 {
        return 0.0;
    }
    let u = (1.0 - rng.random::<f64>()).clamp(f64::MIN_POSITIVE, 1.0);
    -mean_ms * u.ln()
}

// Gamma sampling by Marsaglia-Tsang squeeze rejection for `shape >= 1`;
// for `shape < 1` a Gamma(shape + 1) draw is scaled by `u^(1 / shape)`.
fn sample_gamma_ms(shape: f64, scale: f64, rng: &mut StdRng) -> f64 {
    if scale == 0.0 {
        return 0.0;
    }
    if shape < 1.0 {
        let u = (1.0 - rng.random::<f64>()).clamp(f64::MIN_POSITIVE, 1.0);
        return sample_gamma_ms(shape + 1.0, scale, rng) * u.powf(1.0 / shape);
    }

    let d = shape - 1.0 / 3.0;
    let c = (1.0 / (9.0 * d)).sqrt();
    loop {
        // Standard normal draw via Box-Muller.
        let u1 = (1.0 - rng.random::<f64>()).clamp(f64::MIN_POSITIVE, 1.0);
        let u2 = rng.random::<f64>();
        let z = (-2.0 * u1.ln()).sqrt() * (std::f64::consts::TAU * u2).cos();
        let v = (1.0 + c * z).powi(3);
        if v <= 0.0 {
            continue;
        }
        let u = rng.random::<f64>();
        // Cheap squeeze test first, then the exact log acceptance test.
        if u < 1.0 - 0.0331 * z.powi(4) {
            return d * v * scale;
        }
        if u.ln() < 0.5 * z * z + d * (1.0 - v + v.ln()) {
            return d * v * scale;
        }
    }
}

fn local_block_hash_from_id(hash_id: u64, block_size: usize) -> LocalBlockHash {
    let tokens: Vec<u32> = (0..block_size).map(|_| hash_id as u32).collect();
    // SAFETY: `tokens` is a live, initialized `Vec<u32>`; `u32` has no padding,
    // `u8` has alignment 1, and the byte length is taken from the slice itself.
    let bytes = unsafe {
        std::slice::from_raw_parts(
            tokens.as_ptr() as *const u8,
            std::mem::size_of_val(tokens.as_slice()),
        )
    };
    LocalBlockHash(compute_hash_v2(bytes, XXH3_SEED))
}
lib/mocker/src/loadgen/types.rs
0 → 100644
// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
// SPDX-License-Identifier: Apache-2.0

use dynamo_kv_router::LocalBlockHash;
use dynamo_kv_router::protocols::{ExternalSequenceBlockHash, WorkerId};
use dynamo_tokens::SequenceHash;
use uuid::Uuid;

use crate::common::protocols::DirectRequest;

#[derive(Debug, Clone)]
pub struct Trace {
    pub block_size: usize,
    pub sessions: Vec<SessionTrace>,
}

#[derive(Debug, Clone)]
pub struct SessionTrace {
    pub session_id: String,
    pub first_arrival_timestamp_ms: Option<f64>,
    pub turns: Vec<TurnTrace>,
}

#[derive(Debug, Clone)]
pub struct TurnTrace {
    pub input_length: usize,
    pub max_output_tokens: usize,
    pub hash_ids: Vec<u64>,
    pub delay_after_previous_ms: f64,
}

#[derive(Debug, Clone)]
pub struct LengthSpec {
    pub mean: usize,
    pub stddev: f64,
}

#[derive(Debug, Clone)]
pub enum ArrivalSpec {
    Burst,
    ConstantQps { qps: f64 },
    PoissonQps { qps: f64 },
    GammaQps { qps: f64, smoothness: f64 },
}

#[derive(Debug, Clone)]
pub enum DelaySpec {
    None,
    ConstantMs(f64),
    ExponentialMs { mean_ms: f64 },
}

#[derive(Debug, Clone)]
pub struct SyntheticTraceSpec {
    pub block_size: usize,
    pub num_sessions: usize,
    pub turns_per_session: usize,
    pub input_tokens: LengthSpec,
    pub output_tokens: LengthSpec,
    pub shared_prefix_ratio: f64,
    pub num_prefix_groups: usize,
    pub first_turn_arrivals: ArrivalSpec,
    pub inter_turn_delays: DelaySpec,
    pub seed: u64,
}

#[derive(Debug, Clone, Copy)]
pub enum SequenceHashMode {
    Raw,
    Cumulative,
}

#[derive(Debug, Clone, Copy)]
pub enum SessionPartitionSpec {
    Random { num_partitions: usize, seed: u64 },
    RoundRobin { num_partitions: usize },
}

#[derive(Debug, Clone)]
pub struct RouterSequence {
    pub worker_id: WorkerId,
    pub local_hashes: Vec<LocalBlockHash>,
    pub external_hashes: Vec<ExternalSequenceBlockHash>,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ReplayRequestHashes {
    pub local_block_hashes: Vec<LocalBlockHash>,
    pub sequence_hashes: Vec<SequenceHash>,
}

#[derive(Debug, Clone)]
pub struct ReadyTurn {
    pub request_uuid: Uuid,
    pub session_id: String,
    pub turn_index: usize,
    pub scheduled_ready_at_ms: f64,
    pub replay_hashes: Option<ReplayRequestHashes>,
    pub request: DirectRequest,
}
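
To illustrate how `first_arrival_timestamp_ms` and `delay_after_previous_ms` combine, the sketch below computes per-turn ready offsets for one session. It is an illustration under assumptions, not code from the commit: `Turn`, `Session`, and `ready_offsets_ms` are hypothetical stand-ins for the types above, and request service time is ignored, which matches the sum that `rescale_ready_span` takes over inter-turn delays but may differ from how the driver schedules turns at run time.

```rust
/// Hypothetical stand-in for `TurnTrace`, keeping only the timing field.
#[derive(Debug, Clone, PartialEq)]
struct Turn {
    delay_after_previous_ms: f64,
}

/// Hypothetical stand-in for `SessionTrace`, keeping only the timing fields.
#[derive(Debug, Clone, PartialEq)]
struct Session {
    first_arrival_timestamp_ms: Option<f64>,
    turns: Vec<Turn>,
}

/// Offset at which each turn becomes ready, ignoring request service time.
/// A missing first arrival is treated as 0.0, and the first turn's delay is
/// skipped because it is required to be zero.
fn ready_offsets_ms(session: &Session) -> Vec<f64> {
    let mut ready_at = session.first_arrival_timestamp_ms.unwrap_or(0.0);
    session
        .turns
        .iter()
        .enumerate()
        .map(|(turn_idx, turn)| {
            if turn_idx > 0 {
                ready_at += turn.delay_after_previous_ms;
            }
            ready_at
        })
        .collect()
}
```

For a session that starts at 100 ms with delays of 0, 50, and 25 ms, the offsets are 100, 150, and 175 ms.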