OpenDAS / dynamo
Commit d4e30a59 (Unverified)
Authored Mar 23, 2026 by Yan Ru Pei
Committed by GitHub, Mar 23, 2026

test(mocker): cut replay helper sleeps (#7577)

Signed-off-by: PeaBrane <yanrpei@gmail.com>
parent 1c65a588
Showing 19 changed files with 542 additions and 140 deletions
docs/benchmarks/mocker-trace-replay.md  +34 -28
docs/mocker/mocker.md  +9 -6
lib/bindings/python/src/dynamo/replay/main.py  +8 -2
lib/bindings/python/src/dynamo/replay/reporting.py  +142 -0
lib/bindings/python/tests/cancellation/test_example.py  +5 -1
lib/bindings/python/tests/test_bindings_install.py  +6 -1
lib/bindings/python/tests/test_example_hello_world.py  +5 -1
lib/bindings/python/tests/test_http_error.py  +6 -1
lib/bindings/python/tests/test_http_server.py  +5 -1
lib/bindings/python/tests/test_kserve_grpc.py  +5 -1
lib/bindings/python/tests/test_kv_bindings.py  +6 -1
lib/bindings/python/tests/test_lora_utils.py  +7 -0
lib/bindings/python/tests/test_mm_kv_router.py  +6 -1
lib/bindings/python/tests/test_parsers.py  +8 -0
lib/bindings/python/tests/test_replay.py  +278 -88
lib/bindings/python/tests/test_tensor.py  +6 -0
lib/mocker/src/scheduler/sglang/tests.rs  +3 -4
lib/mocker/src/scheduler/test_utils.rs  +2 -3
lib/mocker/src/scheduler/vllm/tests.rs  +1 -1
docs/benchmarks/mocker-trace-replay.md
@@ -9,9 +9,9 @@ This guide covers the mocker's trace replay support for Mooncake-style JSONL tra
 surface is available in two forms:

 - `python -m dynamo.mocker --trace-file ...`, which writes a report file and prints a replay summary
-- `python -m dynamo.replay ...`, which returns the replay report JSON on stdout and exposes
-  `offline|online`, `round_robin|kv_router`, `arrival_speedup_ratio`, and synthetic replay inputs
-  directly
+- `python -m dynamo.replay ...`, which prints an AIPerf-style summary table, writes the full
+  replay report JSON to disk, and exposes `offline|online`, `round_robin|kv_router`,
+  `arrival_speedup_ratio`, and synthetic replay inputs directly

 Unlike normal `dynamo.mocker` usage, offline replay does not launch workers, register endpoints, or
 require NATS, etcd, or a frontend. Online replay does exercise the live mock-worker runtime path.
@@ -31,7 +31,8 @@ python -m dynamo.replay /path/to/mooncake_trace.jsonl \
   --num-workers 4 \
   --replay-mode offline \
   --router-mode round_robin \
-  --extra-engine-args /path/to/mocker_args.json
+  --extra-engine-args '{"block_size":512,"speedup_ratio":1000.0}' \
+  --report-json /tmp/replay-report.json
 ```

 Run synthetic replay through the same CLI when you want fixed request shapes without a trace file:
@@ -45,7 +46,8 @@ python -m dynamo.replay \
   --num-workers 1 \
   --replay-mode offline \
   --replay-concurrency 100 \
-  --extra-engine-args /path/to/mocker_args.json
+  --extra-engine-args '{"block_size":512,"speedup_ratio":1000.0}' \
+  --report-json /tmp/replay-report.json
 ```

 You can also run replay through the mocker CLI by passing `--trace-file`:
@@ -62,8 +64,9 @@ This writes a JSON report next to the trace file by default:
 /path/to/mooncake_trace.replay.json
 ```

-`python -m dynamo.replay` prints the replay report JSON directly to stdout. The mocker CLI prints a
-`Replay Summary` table to stdout and writes the report JSON to disk.
+`python -m dynamo.replay` prints an AIPerf-style summary table to stdout and writes the full replay
+report JSON to disk. The mocker CLI prints a `Replay Summary` table to stdout and writes the report
+JSON to disk.

 ## Input Format
@@ -96,15 +99,13 @@ The dedicated replay CLI exposes:
 - either a positional `trace_file`, or all of `--input-tokens`, `--output-tokens`, and `--request-count`
 - `--replay-mode offline|online`
 - `--router-mode round_robin|kv_router`
-- `--router-queue-policy fcfs|wspt|lcfs`
 - `--num-workers`
 - `--replay-concurrency`
 - `--arrival-interval-ms`
 - `--arrival-speedup-ratio`
-- `--extra-engine-args`
-- `--extra-engine-args-json`
-- `--router-config`
-- `--router-config-json`
+- `--extra-engine-args` (JSON string)
+- `--router-config` (JSON string)
+- `--report-json`

 Example:
@@ -114,8 +115,9 @@ python -m dynamo.replay /path/to/mooncake_trace.jsonl \
   --router-mode kv_router \
   --num-workers 4 \
   --arrival-speedup-ratio 10 \
-  --extra-engine-args-json '{"block_size":64,"speedup_ratio":1000.0}' \
-  --router-config-json '{"router_queue_policy":"fcfs","router_temperature":0.0}'
+  --extra-engine-args '{"block_size":512,"speedup_ratio":1000.0}' \
+  --router-config '{"router_queue_policy":"fcfs","router_temperature":0.0}' \
+  --report-json /tmp/replay-report.json
 ```

 SGLang replay uses the same CLI surface. A minimal extra-engine-args file can use either
@@ -132,9 +134,9 @@ SGLang replay uses the same CLI surface. A minimal extra-engine-args file can us
 }
 ```

-For both `--extra-engine-args-json` and `--router-config-json`, replay accepts partial JSON
-objects. Unspecified fields fall back to the same defaults used by `MockEngineArgs::default()`
-and `KvRouterConfig::default()`.
+Both `--extra-engine-args` and `--router-config` accept partial JSON objects. Unspecified fields
+fall back to the same defaults used by `MockEngineArgs::default()` and
+`KvRouterConfig::default()`.
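
The partial-JSON behavior described in this hunk can be pictured as overlaying user-supplied fields on a defaults object. The sketch below is only an illustration under stated assumptions: `SketchEngineArgs`, its default values, and `from_partial_json` are hypothetical stand-ins and do not reflect the real `MockEngineArgs` fields or defaults.

```python
import json
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class SketchEngineArgs:
    # Hypothetical defaults for illustration; not the real MockEngineArgs values.
    block_size: int = 64
    speedup_ratio: float = 1.0
    num_gpu_blocks: int = 16384


def from_partial_json(payload: str) -> SketchEngineArgs:
    """Overlay the fields present in `payload` on top of the defaults."""
    overrides = json.loads(payload)
    known = {f.name for f in fields(SketchEngineArgs)}
    unknown = set(overrides) - known
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return replace(SketchEngineArgs(), **overrides)


args = from_partial_json('{"block_size":512,"speedup_ratio":1000.0}')
print(args)
```

Fields absent from the JSON string (here `num_gpu_blocks`) keep their default, which is the behavior the documentation change describes for both flags.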

 ### `python -m dynamo.mocker --trace-file`
@@ -154,7 +156,7 @@ python -m dynamo.replay \
   --arrival-interval-ms 0.5 \
   --replay-mode offline \
   --replay-concurrency 50 \
-  --extra-engine-args /path/to/mocker_args.json
+  --extra-engine-args '{"block_size":512,"speedup_ratio":1000.0}'
 ```

 This is useful for parameter sweeps where Mooncake-style prefix structure is not required.
@@ -170,7 +172,7 @@ those timestamps:
 python -m dynamo.replay /path/to/mooncake_trace.jsonl \
   --replay-mode offline \
   --num-workers 4 \
-  --extra-engine-args /path/to/mocker_args.json
+  --extra-engine-args '{"block_size":512,"speedup_ratio":1000.0}'
 ```

 This is the right mode when you want deterministic replay of the original arrival pattern.
@@ -201,7 +203,7 @@ python -m dynamo.replay /path/to/mooncake_trace.jsonl \
   --router-mode kv_router \
   --num-workers 4 \
   --arrival-speedup-ratio 10 \
-  --extra-engine-args /path/to/mocker_args.json
+  --extra-engine-args '{"block_size":512,"speedup_ratio":1000.0}'
 ```

 ### Arrival Speedup
@@ -214,7 +216,7 @@ python -m dynamo.replay /path/to/mooncake_trace.jsonl \
   --replay-mode offline \
   --num-workers 4 \
   --arrival-speedup-ratio 5 \
-  --extra-engine-args /path/to/mocker_args.json
+  --extra-engine-args '{"block_size":512,"speedup_ratio":1000.0}'
 ```

 ### Router Modes
@@ -224,7 +226,8 @@ Replay currently supports:
 - `round_robin`
 - `kv_router`

-`kv_router` uses the shared local scheduler and an in-process KV indexer. In offline replay:
+`kv_router` uses the shared local scheduler and an in-process KV indexer. Router policy tuning is
+provided through `--router-config`, not a dedicated top-level replay flag. In offline replay:

 - `kv_router` is supported only when `num_workers > 1`
 - router queueing is enabled and uses simulation time rather than wall-clock time
@@ -233,22 +236,22 @@ Replay currently supports:
 - transient in-pass prefill occupancy is still approximated at the router level rather than modeled exactly

 To compare queue policies manually, keep the same trace and engine args fixed and swap only
-`--router-queue-policy`:
+`router_queue_policy` inside `--router-config`:

 ```bash
 python -m dynamo.replay /path/to/mooncake_trace.jsonl \
   --replay-mode offline \
   --router-mode kv_router \
-  --router-queue-policy fcfs \
   --num-workers 4 \
-  --extra-engine-args /path/to/mocker_args.json
+  --extra-engine-args '{"block_size":512,"speedup_ratio":1000.0}' \
+  --router-config '{"router_queue_policy":"fcfs"}'

 python -m dynamo.replay /path/to/mooncake_trace.jsonl \
   --replay-mode offline \
   --router-mode kv_router \
-  --router-queue-policy lcfs \
   --num-workers 4 \
-  --extra-engine-args /path/to/mocker_args.json
+  --extra-engine-args '{"block_size":512,"speedup_ratio":1000.0}' \
+  --router-config '{"router_queue_policy":"lcfs"}'
 ```
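
The manual comparison above can also be scripted. The following is a minimal sketch that only builds the argument lists for each queue policy, using the flags shown in this diff; `replay_command` is a hypothetical helper name, and actually running the commands (for example with `subprocess.run`) assumes `dynamo.replay` is installed.

```python
import json
import sys


def replay_command(trace_path: str, queue_policy: str) -> list[str]:
    """Build one offline kv_router replay invocation for the given queue policy."""
    return [
        sys.executable, "-m", "dynamo.replay", trace_path,
        "--replay-mode", "offline",
        "--router-mode", "kv_router",
        "--num-workers", "4",
        "--extra-engine-args", json.dumps({"block_size": 512, "speedup_ratio": 1000.0}),
        # Only this argument varies between runs; trace and engine args stay fixed.
        "--router-config", json.dumps({"router_queue_policy": queue_policy}),
    ]


commands = [
    replay_command("/path/to/mooncake_trace.jsonl", policy)
    for policy in ("fcfs", "wspt", "lcfs")
]
for command in commands:
    print(" ".join(command))
```

Building the JSON with `json.dumps` avoids shell-quoting mistakes when the router config grows beyond a single key.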

 `lcfs` is intentionally a worse comparison policy under saturation; use it for experiments, not as
@@ -280,6 +283,9 @@ The report contains:
 The dedicated replay CLI returns the same report schema as the Python APIs
 `dynamo.replay.run_trace_replay(...)` and `dynamo.replay.run_synthetic_trace_replay(...)`.

+If `--report-json` is not provided, `python -m dynamo.replay` writes a timestamped
+`dynamo_replay_report_*.json` file in the current working directory.
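
Because the default report name embeds a timestamp, a script that runs replay without `--report-json` has to locate the file afterwards. The sketch below assumes the `%Y%m%d_%H%M%S` timestamp format used by `reporting.py` in this commit, so that lexicographic order matches chronological order; `load_latest_report` is a hypothetical helper, not part of `dynamo.replay`.

```python
import json
from pathlib import Path
from typing import Optional


def load_latest_report(directory: Path) -> Optional[dict]:
    """Return the newest dynamo_replay_report_*.json in `directory`, or None."""
    # Zero-padded timestamps sort lexicographically in chronological order.
    candidates = sorted(directory.glob("dynamo_replay_report_*.json"))
    if not candidates:
        return None
    return json.loads(candidates[-1].read_text(encoding="utf-8"))


report = load_latest_report(Path.cwd())
print("no report found" if report is None else sorted(report))
```

Passing an explicit `--report-json` path is the simpler option when the caller controls the command line.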

 ## Replay Constraints

 Shared replay constraints:
@@ -308,7 +314,7 @@ If you violate those constraints, replay fails immediately with a validation err
 - `--speedup-ratio` still affects simulated timing
 - `--arrival-speedup-ratio` affects trace timestamps, not worker compute speed
 - `--arrival-interval-ms` only applies to synthetic replay
-- `--extra-engine-args` can be used to provide a full mocker config JSON instead of individual CLI flags
+- `--extra-engine-args` and `--router-config` are JSON strings on the standalone replay CLI
 - offline replay does not need planner runtime setup, router registration, or external event transport
 - the replay block size should match the trace block size, because token synthesis expands `hash_ids`
   using the configured block size
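
The last constraint can be checked before a run. This is a rough sketch, assuming each entry in `hash_ids` stands for one block of up to `block_size` tokens, so a record should carry `ceil(input_length / block_size)` hash IDs. That relationship is an assumption inferred from the trace records quoted in this commit's tests, and `block_size_matches` is a hypothetical helper.

```python
import math


def block_size_matches(record: dict, block_size: int) -> bool:
    """Check that the number of hash IDs is consistent with the block size."""
    expected_blocks = math.ceil(record["input_length"] / block_size)
    return expected_blocks == len(record["hash_ids"])


# First record of the Mooncake sample used in test_replay.py.
record = {
    "timestamp": 0,
    "input_length": 6755,
    "output_length": 500,
    "hash_ids": list(range(14)),
}
print(block_size_matches(record, 512))
print(block_size_matches(record, 64))
```

A mismatch suggests that the `block_size` passed in `--extra-engine-args` differs from the one the trace was generated with.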

docs/mocker/mocker.md
@@ -139,16 +139,17 @@ python -m dynamo.mocker \
 ```

 For the standalone replay CLI, which exposes `offline|online`, `round_robin|kv_router`,
-`arrival_speedup_ratio`, `router_queue_policy`, and the synthetic replay path directly:
+`arrival_speedup_ratio`, and the synthetic replay path directly:

 ```bash
 python -m dynamo.replay /path/to/mooncake_trace.jsonl \
   --num-workers 4 \
   --replay-mode offline \
   --router-mode kv_router \
-  --router-queue-policy fcfs \
   --arrival-speedup-ratio 5 \
-  --extra-engine-args /path/to/mocker_args.json
+  --extra-engine-args '{"block_size":512,"speedup_ratio":1000.0}' \
+  --router-config '{"router_queue_policy":"fcfs"}' \
+  --report-json /tmp/replay-report.json
 ```

 The same CLI also supports synthetic replay without a trace file:
@@ -162,11 +163,13 @@ python -m dynamo.replay \
   --num-workers 1 \
   --replay-mode offline \
   --replay-concurrency 100 \
-  --extra-engine-args /path/to/mocker_args.json
+  --extra-engine-args '{"block_size":512,"speedup_ratio":1000.0}' \
+  --report-json /tmp/replay-report.json
 ```

-The standalone replay CLI prints the replay report JSON directly to stdout. The `dynamo.mocker`
-trace-file flow still writes a report file and prints a `Replay Summary` table.
+The standalone replay CLI prints an AIPerf-style summary table to stdout and writes the full replay
+report JSON to disk. The `dynamo.mocker` trace-file flow still writes a report file and prints a
+`Replay Summary` table.

 For full usage, constraints, and benchmarking guidance, see [Mocker Trace Replay](../benchmarks/mocker-trace-replay.md).

lib/bindings/python/src/dynamo/replay/main.py
@@ -4,7 +4,6 @@
 from __future__ import annotations

 import argparse
-import json
 import os
 import sys
 from collections.abc import Sequence
@@ -13,6 +12,7 @@ os.environ.setdefault("DYNAMO_SKIP_PYTHON_LOG_INIT", "1")
 from dynamo.llm import KvRouterConfig, MockEngineArgs
 from dynamo.replay import run_synthetic_trace_replay, run_trace_replay
+from dynamo.replay.reporting import format_report_table, write_report_json


 def main(argv: Sequence[str] | None = None) -> int:
@@ -37,6 +37,10 @@ def main(argv: Sequence[str] | None = None) -> int:
         default="round_robin",
     )
     parser.add_argument("--arrival-speedup-ratio", type=float, default=1.0)
+    parser.add_argument(
+        "--report-json",
+        help="path to save the full replay report JSON; defaults to a timestamped file in the current directory",
+    )
     args = parser.parse_args(list(sys.argv[1:] if argv is None else argv))

     using_trace_file = args.trace_file is not None
@@ -89,6 +93,8 @@ def main(argv: Sequence[str] | None = None) -> int:
             arrival_interval_ms=args.arrival_interval_ms,
         )

-    json.dump(report, sys.stdout, indent=2, sort_keys=True)
+    report_path = write_report_json(report, args.report_json)
+    sys.stdout.write(format_report_table(report))
     sys.stdout.write("\n")
+    sys.stdout.write(f"Saved full report to: {report_path}\n")
     return 0

lib/bindings/python/src/dynamo/replay/reporting.py (new file, mode 100644)
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+from __future__ import annotations
+
+import json
+from datetime import datetime
+from pathlib import Path
+
+TITLE = "NVIDIA AIPerf | LLM Metrics"
+STAT_COLUMNS = ("avg", "min", "max", "p99", "p90", "p75", "std")
+
+
+def default_report_path() -> Path:
+    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+    return Path.cwd() / f"dynamo_replay_report_{timestamp}.json"
+
+
+def write_report_json(report: dict[str, object], output_path: str | Path | None) -> Path:
+    path = Path(output_path) if output_path is not None else default_report_path()
+    if path.exists() and path.is_dir():
+        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+        path = path / f"dynamo_replay_report_{timestamp}.json"
+    path.parent.mkdir(parents=True, exist_ok=True)
+    path.write_text(json.dumps(report, indent=2, sort_keys=True) + "\n", encoding="utf-8")
+    return path
+
+
+def format_report_table(report: dict[str, object]) -> str:
+    rows = _build_rows(report)
+    table = _render_table(rows)
+    lines = [TITLE, table]
+
+    wall_time_ms = report.get("wall_time_ms")
+    if isinstance(wall_time_ms, int | float):
+        lines.append(f"Wall Time (ms): {_format_value(wall_time_ms)}")
+
+    prefix_cache_reused_ratio = report.get("prefix_cache_reused_ratio")
+    if isinstance(prefix_cache_reused_ratio, int | float):
+        lines.append(
+            f"Prefix Cache Reused Ratio: {_format_value(prefix_cache_reused_ratio)}"
+        )
+
+    return "\n".join(lines)
+
+
+def _build_rows(report: dict[str, object]) -> list[list[str]]:
+    rows: list[list[str]] = []
+    _append_stat_row(rows, report, "Time to First Token (ms)", "ttft_ms")
+    _append_stat_row(rows, report, "Time to Second Token (ms)", "ttst_ms")
+    _append_stat_row(rows, report, "Request Latency (ms)", "e2e_latency_ms")
+    _append_stat_row(rows, report, "Inter Token Latency (ms)", "itl_ms")
+    _append_stat_row(
+        rows,
+        report,
+        "Output Token Throughput Per User (tokens/sec/user)",
+        "output_token_throughput_per_user",
+    )
+    rows.append(
+        [
+            "Output Token Throughput (tokens/sec)",
+            _format_value(report.get("output_throughput_tok_s")),
+            *["N/A"] * (len(STAT_COLUMNS) - 1),
+        ]
+    )
+    rows.append(
+        [
+            "Request Throughput (requests/sec)",
+            _format_value(report.get("request_throughput_rps")),
+            *["N/A"] * (len(STAT_COLUMNS) - 1),
+        ]
+    )
+    rows.append(
+        [
+            "Request Count (requests)",
+            _format_value(report.get("completed_requests", report.get("num_requests"))),
+            *["N/A"] * (len(STAT_COLUMNS) - 1),
+        ]
+    )
+    return rows
+
+
+def _append_stat_row(
+    rows: list[list[str]], report: dict[str, object], label: str, metric_suffix: str
+) -> None:
+    mean_key = f"mean_{metric_suffix}"
+    if mean_key not in report:
+        return
+    rows.append(
+        [
+            label,
+            _format_value(report.get(mean_key)),
+            _format_value(report.get(f"min_{metric_suffix}")),
+            _format_value(report.get(f"max_{metric_suffix}")),
+            _format_value(report.get(f"p99_{metric_suffix}")),
+            _format_value(report.get(f"p90_{metric_suffix}")),
+            _format_value(report.get(f"p75_{metric_suffix}")),
+            _format_value(report.get(f"std_{metric_suffix}")),
+        ]
+    )
+
+
+def _render_table(rows: list[list[str]]) -> str:
+    headers = ["Metric", *STAT_COLUMNS]
+    widths = [len(header) for header in headers]
+    for row in rows:
+        for index, value in enumerate(row):
+            widths[index] = max(widths[index], len(value))
+
+    def render_separator(left: str, mid: str, right: str) -> str:
+        return left + mid.join("━" * (width + 2) for width in widths) + right
+
+    def render_row(row: list[str]) -> str:
+        padded = []
+        for index, value in enumerate(row):
+            if index == 0:
+                padded.append(f" {value.ljust(widths[index])} ")
+                continue
+            padded.append(f" {value.rjust(widths[index])} ")
+        return "┃" + "┃".join(padded) + "┃"
+
+    lines = [
+        render_separator("┏", "┳", "┓"),
+        render_row(headers),
+        render_separator("┡", "╇", "┩"),
+    ]
+    lines.extend(render_row(row) for row in rows)
+    lines.append(render_separator("└", "┴", "┘"))
+    return "\n".join(lines)
+
+
+def _format_value(value: object) -> str:
+    if value is None:
+        return "N/A"
+    if isinstance(value, int | float):
+        return f"{value:,.2f}"
+    return str(value)
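
The table builder above reads report keys named `<stat>_<metric_suffix>` (for example `mean_ttft_ms`, `p99_ttft_ms`). The same convention can be used to pull one metric's statistics out of a saved report. In this minimal sketch, `metric_stats` is a hypothetical helper and the sample report values are made up for illustration, not measured results.

```python
STAT_KEYS = ("mean", "min", "max", "p99", "p90", "p75", "std")


def metric_stats(report: dict, metric_suffix: str) -> dict:
    """Collect one metric's statistics using the `<stat>_<suffix>` key convention."""
    return {stat: report.get(f"{stat}_{metric_suffix}") for stat in STAT_KEYS}


# Illustrative values only.
sample_report = {"mean_ttft_ms": 12.5, "p99_ttft_ms": 40.0, "num_requests": 20}
print(metric_stats(sample_report, "ttft_ms"))
```

Missing statistics come back as `None`, mirroring how `_format_value` renders absent values as `N/A`.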

lib/bindings/python/tests/cancellation/test_example.py
@@ -11,7 +11,11 @@ import subprocess
 import pytest

-pytestmark = pytest.mark.pre_merge
+pytestmark = [
+    pytest.mark.gpu_0,
+    pytest.mark.pre_merge,
+    pytest.mark.integration,
+]


 @pytest.fixture(scope="module")

lib/bindings/python/tests/test_bindings_install.py
@@ -15,7 +15,12 @@
 import pytest

-pytestmark = pytest.mark.pre_merge
+pytestmark = [
+    pytest.mark.gpu_0,
+    pytest.mark.parallel,
+    pytest.mark.pre_merge,
+    pytest.mark.unit,
+]


 def test_bindings_install():

lib/bindings/python/tests/test_example_hello_world.py
@@ -11,7 +11,11 @@ import subprocess
 import pytest

-pytestmark = pytest.mark.pre_merge
+pytestmark = [
+    pytest.mark.gpu_0,
+    pytest.mark.pre_merge,
+    pytest.mark.integration,
+]


 @pytest.fixture(scope="module")

lib/bindings/python/tests/test_http_error.py
@@ -5,7 +5,12 @@ import pytest
 from dynamo.llm import HttpError

-pytestmark = pytest.mark.pre_merge
+pytestmark = [
+    pytest.mark.gpu_0,
+    pytest.mark.parallel,
+    pytest.mark.pre_merge,
+    pytest.mark.unit,
+]


 def test_raise_http_error():

lib/bindings/python/tests/test_http_server.py
@@ -29,7 +29,11 @@ from dynamo.runtime import DistributedRuntime
 MSG_CONTAINS_ERROR = "This message contains an 400error."
 MSG_CONTAINS_INTERNAL_ERROR = "This message contains an internal server error."

-pytestmark = pytest.mark.pre_merge
+pytestmark = [
+    pytest.mark.gpu_0,
+    pytest.mark.pre_merge,
+    pytest.mark.integration,
+]


 class MockHttpEngine:

lib/bindings/python/tests/test_kserve_grpc.py
@@ -17,7 +17,11 @@ except ImportError:
 from dynamo.llm import KserveGrpcService, ModelRuntimeConfig, PythonAsyncEngine

-pytestmark = pytest.mark.pre_merge
+pytestmark = [
+    pytest.mark.gpu_0,
+    pytest.mark.pre_merge,
+    pytest.mark.integration,
+]


 async def _fetch_model_config(

lib/bindings/python/tests/test_kv_bindings.py
@@ -21,7 +21,12 @@ import pytest
 from dynamo.llm import RadixTree

-pytestmark = pytest.mark.pre_merge
+pytestmark = [
+    pytest.mark.gpu_0,
+    pytest.mark.parallel,
+    pytest.mark.pre_merge,
+    pytest.mark.unit,
+]


 @pytest.mark.timeout(5)  # Expected: ~1s, timeout set to 5x for safety

lib/bindings/python/tests/test_lora_utils.py
@@ -5,6 +5,13 @@ import pytest
 from dynamo.llm import lora_name_to_id

+pytestmark = [
+    pytest.mark.gpu_0,
+    pytest.mark.parallel,
+    pytest.mark.pre_merge,
+    pytest.mark.unit,
+]
+
 max_int32 = 0x7FFFFFFF

lib/bindings/python/tests/test_mm_kv_router.py
@@ -25,7 +25,12 @@ import pytest
 from dynamo.llm import RadixTree, compute_block_hash_for_seq

-pytestmark = pytest.mark.pre_merge
+pytestmark = [
+    pytest.mark.gpu_0,
+    pytest.mark.parallel,
+    pytest.mark.pre_merge,
+    pytest.mark.unit,
+]

 # Constants for testing
 DEFAULT_BLOCK_SIZE = 32

lib/bindings/python/tests/test_parsers.py
@@ -13,9 +13,17 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

+import pytest
+
 from dynamo._core import get_reasoning_parser_names, get_tool_parser_names

+pytestmark = [
+    pytest.mark.gpu_0,
+    pytest.mark.parallel,
+    pytest.mark.pre_merge,
+    pytest.mark.unit,
+]

 def test_get_tool_parser_names():
     parsers = get_tool_parser_names()

lib/bindings/python/tests/test_replay.py
@@ -2,16 +2,22 @@
 # SPDX-License-Identifier: Apache-2.0

 import json
+import os
+import subprocess
+import sys

 import pytest

 from dynamo.llm import KvRouterConfig, MockEngineArgs
 from dynamo.replay import run_synthetic_trace_replay, run_trace_replay
+from dynamo.replay.main import main
+from dynamo.replay.reporting import format_report_table, write_report_json

 pytestmark = [
     pytest.mark.gpu_0,
     pytest.mark.parallel,
     pytest.mark.pre_merge,
+    pytest.mark.unit,
 ]

 MOONCAKE_TRACE_FIRST20 = """{"timestamp": 0, "input_length": 6755, "output_length": 500, "hash_ids": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]}
@@ -37,6 +43,50 @@ MOONCAKE_TRACE_FIRST20 = """{"timestamp": 0, "input_length": 6755, "output_lengt
 """


+def _vllm_args_payload():
+    return {
+        "block_size": 64,
+        "speedup_ratio": 1000.0,
+    }
+
+
+def _sglang_args_payload():
+    return {
+        "engine_type": "sglang",
+        "num_gpu_blocks": 512,
+        "block_size": 64,
+        "speedup_ratio": 1000.0,
+        "sglang": {
+            "page_size": 64,
+        },
+    }
+
+
+def _router_config_payload():
+    return {
+        "router_queue_threshold": 1.25,
+        "router_event_threads": 1,
+        "router_queue_policy": "wspt",
+        "router_temperature": 0.0,
+        "overlap_score_weight": 1.0,
+        "use_kv_events": True,
+        "durable_kv_events": False,
+        "router_replica_sync": False,
+        "router_track_active_blocks": True,
+        "router_track_output_blocks": False,
+        "router_assume_kv_reuse": True,
+        "router_snapshot_threshold": 1000000,
+        "router_reset_states": False,
+        "router_ttl_secs": 120.0,
+        "router_max_tree_size": 1048576,
+        "router_prune_target_ratio": 0.8,
+        "router_enable_cache_control": False,
+        "skip_initial_worker_wait": False,
+        "min_initial_workers": 1,
+        "remote_indexer_component": None,
+    }
+
+
 def _write_trace_and_args(tmp_path):
     trace_path = tmp_path / "trace.jsonl"
     records = [
...
@@ -60,125 +110,62 @@ def _write_trace_and_args(tmp_path):
...
@@ -60,125 +110,62 @@ def _write_trace_and_args(tmp_path):
return
trace_path
return
trace_path
def
_write_cli_smoke_trace
(
tmp_path
):
trace_path
=
tmp_path
/
"cli_smoke_trace.jsonl"
records
=
[]
for
index
in
range
(
10
):
records
.
append
(
{
"timestamp"
:
1000.0
+
index
,
"input_length"
:
250
,
"output_length"
:
25
,
"hash_ids"
:
[
index
,
index
+
1
,
index
+
2
,
index
+
3
],
}
)
trace_path
.
write_text
(
"
\n
"
.
join
(
json
.
dumps
(
record
)
for
record
in
records
)
+
"
\n
"
,
encoding
=
"utf-8"
,
)
return
trace_path
 def _write_vllm_args(tmp_path):
     args_path = tmp_path / "args.json"
     args_path.write_text(
-        json.dumps(
-            {
-                "block_size": 64,
-                "speedup_ratio": 1000.0,
-            }
-        ),
+        json.dumps(_vllm_args_payload()),
         encoding="utf-8",
     )
     return args_path


 def _vllm_args():
-    return MockEngineArgs.from_json(
-        json.dumps(
-            {
-                "block_size": 64,
-                "speedup_ratio": 1000.0,
-            }
-        )
-    )
+    return MockEngineArgs.from_json(json.dumps(_vllm_args_payload()))
 def _write_sglang_args(tmp_path):
     args_path = tmp_path / "sglang_args.json"
     args_path.write_text(
-        json.dumps(
-            {
-                "engine_type": "sglang",
-                "num_gpu_blocks": 512,
-                "block_size": 64,
-                "speedup_ratio": 1000.0,
-                "sglang": {
-                    "page_size": 64,
-                },
-            }
-        ),
+        json.dumps(_sglang_args_payload()),
         encoding="utf-8",
     )
     return args_path


 def _sglang_args():
-    return MockEngineArgs.from_json(
-        json.dumps(
-            {
-                "engine_type": "sglang",
-                "num_gpu_blocks": 512,
-                "block_size": 64,
-                "speedup_ratio": 1000.0,
-                "sglang": {
-                    "page_size": 64,
-                },
-            }
-        )
-    )
+    return MockEngineArgs.from_json(json.dumps(_sglang_args_payload()))
 def _write_router_config(tmp_path):
     config_path = tmp_path / "router_config.json"
     config_path.write_text(
-        json.dumps(
-            {
-                "router_queue_threshold": 1.25,
-                "router_event_threads": 1,
-                "router_queue_policy": "wspt",
-                "router_temperature": 0.0,
-                "overlap_score_weight": 1.0,
-                "use_kv_events": True,
-                "durable_kv_events": False,
-                "router_replica_sync": False,
-                "router_track_active_blocks": True,
-                "router_track_output_blocks": False,
-                "router_assume_kv_reuse": True,
-                "router_snapshot_threshold": 1000000,
-                "router_reset_states": False,
-                "router_ttl_secs": 120.0,
-                "router_max_tree_size": 1048576,
-                "router_prune_target_ratio": 0.8,
-                "router_enable_cache_control": False,
-                "skip_initial_worker_wait": False,
-                "min_initial_workers": 1,
-                "remote_indexer_component": None,
-            }
-        ),
+        json.dumps(_router_config_payload()),
         encoding="utf-8",
     )
     return config_path


 def _router_config():
-    return KvRouterConfig.from_json(
-        json.dumps(
-            {
-                "router_queue_threshold": 1.25,
-                "router_event_threads": 1,
-                "router_queue_policy": "wspt",
-                "router_temperature": 0.0,
-                "overlap_score_weight": 1.0,
-                "use_kv_events": True,
-                "durable_kv_events": False,
-                "router_replica_sync": False,
-                "router_track_active_blocks": True,
-                "router_track_output_blocks": False,
-                "router_assume_kv_reuse": True,
-                "router_snapshot_threshold": 1000000,
-                "router_reset_states": False,
-                "router_ttl_secs": 120.0,
-                "router_max_tree_size": 1048576,
-                "router_prune_target_ratio": 0.8,
-                "router_enable_cache_control": False,
-                "skip_initial_worker_wait": False,
-                "min_initial_workers": 1,
-                "remote_indexer_component": None,
-            }
-        )
-    )
+    return KvRouterConfig.from_json(json.dumps(_router_config_payload()))
 def _partial_router_config():
...
@@ -196,6 +183,45 @@ def _assert_basic_report_counts(report, *, num_requests, input_tokens, output_to
     assert report["total_output_tokens"] == num_requests * output_tokens


+def _assert_basic_report_metrics(report):
+    assert report["request_throughput_rps"] > 0
+    assert report["output_throughput_tok_s"] > 0
+    assert report["duration_ms"] > 0
+
+
+def _replay_cli_env() -> dict[str, str]:
+    env = os.environ.copy()
+    pythonpath_entries = ["lib/bindings/python/src", "components/src"]
+    existing_pythonpath = env.get("PYTHONPATH")
+    if existing_pythonpath:
+        pythonpath_entries.append(existing_pythonpath)
+    env["PYTHONPATH"] = ":".join(pythonpath_entries)
+    return env
+
+
+def _run_replay_cli(tmp_path, *args):
+    return subprocess.run(
+        [
+            sys.executable,
+            "-m",
+            "dynamo.replay",
+            *args,
+        ],
+        capture_output=True,
+        check=True,
+        cwd=str(tmp_path),
+        env=_replay_cli_env(),
+        text=True,
+    )
+
+
+def _assert_replay_cli_outputs(completed, report_path):
+    assert "NVIDIA AIPerf | LLM Metrics" in completed.stdout
+    assert "Saved full report to:" in completed.stdout
+    assert '"completed_requests"' not in completed.stdout
+    return json.loads(report_path.read_text(encoding="utf-8"))
+
+
 @pytest.mark.parametrize("engine_type", ["vllm", "sglang"])
 @pytest.mark.parametrize("replay_mode", ["offline", "online"])
 @pytest.mark.parametrize("router_mode", ["round_robin", "kv_router"])
...
@@ -419,3 +445,167 @@ def test_run_trace_replay_accepts_partial_extra_engine_args_json(tmp_path, repla
         input_tokens=64,
         output_tokens=2,
     )
+
+
+def test_format_report_table_matches_aiperf_shape():
+    report = {
+        "mean_ttft_ms": 18.26,
+        "min_ttft_ms": 11.22,
+        "max_ttft_ms": 106.32,
+        "p99_ttft_ms": 68.82,
+        "p90_ttft_ms": 27.76,
+        "p75_ttft_ms": 16.62,
+        "std_ttft_ms": 12.07,
+        "mean_ttst_ms": 11.40,
+        "min_ttst_ms": 0.02,
+        "max_ttst_ms": 85.91,
+        "p99_ttst_ms": 34.54,
+        "p90_ttst_ms": 12.59,
+        "p75_ttst_ms": 11.65,
+        "std_ttst_ms": 7.01,
+        "mean_e2e_latency_ms": 487.30,
+        "min_e2e_latency_ms": 267.07,
+        "max_e2e_latency_ms": 769.57,
+        "p99_e2e_latency_ms": 715.99,
+        "p90_e2e_latency_ms": 580.83,
+        "p75_e2e_latency_ms": 536.17,
+        "std_e2e_latency_ms": 79.60,
+        "mean_itl_ms": 11.23,
+        "min_itl_ms": 8.80,
+        "max_itl_ms": 13.17,
+        "p99_itl_ms": 12.48,
+        "p90_itl_ms": 11.73,
+        "p75_itl_ms": 11.37,
+        "std_itl_ms": 0.45,
+        "mean_output_token_throughput_per_user": 89.23,
+        "min_output_token_throughput_per_user": 75.93,
+        "max_output_token_throughput_per_user": 113.60,
+        "p99_output_token_throughput_per_user": 102.28,
+        "p90_output_token_throughput_per_user": 90.91,
+        "p75_output_token_throughput_per_user": 90.29,
+        "std_output_token_throughput_per_user": 3.70,
+        "output_throughput_tok_s": 10944.03,
+        "request_throughput_rps": 255.54,
+        "completed_requests": 711,
+        "wall_time_ms": 4046.31,
+        "prefix_cache_reused_ratio": 0.3587,
+    }
+    rendered = format_report_table(report)
+    assert "NVIDIA AIPerf | LLM Metrics" in rendered
+    assert "Time to First Token (ms)" in rendered
+    assert "Output Token Throughput (tokens/sec)" in rendered
+    assert "Request Throughput (requests/sec)" in rendered
+    assert "Prefix Cache Reused Ratio: 0.36" in rendered
+    assert "10,944.03" in rendered
+    assert "255.54" in rendered
+    assert "N/A" in rendered
+
+
+def test_write_report_json_creates_file(tmp_path):
+    report_path = write_report_json({"completed_requests": 2}, tmp_path / "report.json")
+    assert (
+        report_path.read_text(encoding="utf-8") == '{\n  "completed_requests": 2\n}\n'
+    )
+
+
+def test_replay_cli_prints_table_and_saves_json(tmp_path, monkeypatch, capsys):
+    report = {
+        "mean_ttft_ms": 10.0,
+        "min_ttft_ms": 9.0,
+        "max_ttft_ms": 12.0,
+        "p99_ttft_ms": 12.0,
+        "p90_ttft_ms": 11.0,
+        "p75_ttft_ms": 10.5,
+        "std_ttft_ms": 1.0,
+        "output_throughput_tok_s": 123.0,
+        "request_throughput_rps": 4.0,
+        "completed_requests": 3,
+    }
+
+    def fake_run(*args, **kwargs):
+        return report
+
+    monkeypatch.setattr("dynamo.replay.main.run_synthetic_trace_replay", fake_run)
+    report_path = tmp_path / "cli_report.json"
+    exit_code = main(
+        [
+            "--input-tokens",
+            "16",
+            "--output-tokens",
+            "8",
+            "--request-count",
+            "3",
+            "--report-json",
+            str(report_path),
+        ]
+    )
+    assert exit_code == 0
+    stdout = capsys.readouterr().out
+    assert "NVIDIA AIPerf | LLM Metrics" in stdout
+    assert "Saved full report to:" in stdout
+    assert '"completed_requests"' not in stdout
+    assert json.loads(report_path.read_text(encoding="utf-8")) == report
+
+
+def test_replay_cli_subprocess_synthetic_smoke(tmp_path):
+    report_path = tmp_path / "synthetic_report.json"
+    completed = _run_replay_cli(
+        tmp_path,
+        "--input-tokens",
+        "250",
+        "--output-tokens",
+        "25",
+        "--request-count",
+        "10",
+        "--num-workers",
+        "4",
+        "--replay-concurrency",
+        "4",
+        "--report-json",
+        str(report_path),
+        "--extra-engine-args",
+        '{"block_size":64,"speedup_ratio":1000.0}',
+    )
+    report = _assert_replay_cli_outputs(completed, report_path)
+    _assert_basic_report_counts(
+        report,
+        num_requests=10,
+        input_tokens=250,
+        output_tokens=25,
+    )
+    _assert_basic_report_metrics(report)
+
+
+def test_replay_cli_subprocess_trace_smoke(tmp_path):
+    trace_path = _write_cli_smoke_trace(tmp_path)
+    report_path = tmp_path / "trace_report.json"
+    completed = _run_replay_cli(
+        tmp_path,
+        str(trace_path),
+        "--replay-mode",
+        "offline",
+        "--router-mode",
+        "kv_router",
+        "--num-workers",
+        "4",
+        "--report-json",
+        str(report_path),
+        "--extra-engine-args",
+        '{"block_size":64,"speedup_ratio":1000.0}',
+    )
+    report = _assert_replay_cli_outputs(completed, report_path)
+    _assert_basic_report_counts(
+        report,
+        num_requests=10,
+        input_tokens=250,
+        output_tokens=25,
+    )
+    _assert_basic_report_metrics(report)
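Review note: the `test_replay.py` changes above replace inline dict literals with shared `_*_payload()` helpers, so the on-disk JSON fixtures and the in-process `from_json` objects are built from the same data. The sketch below illustrates that pattern in isolation. It is a minimal illustration, not code from this commit: `engine_args_payload`, `write_engine_args`, `engine_args_json`, and `check_round_trip` are hypothetical names, and it deliberately avoids the `dynamo` bindings so it runs with the standard library only.

```python
import json
import tempfile
from pathlib import Path


def engine_args_payload() -> dict:
    """Hypothetical single source of truth for the test engine arguments."""
    # A fresh dict is returned on every call so that one caller cannot
    # mutate the payload seen by another.
    return {"block_size": 64, "speedup_ratio": 1000.0}


def write_engine_args(directory: Path) -> Path:
    """Serialize the shared payload to a JSON file (file-based consumers)."""
    args_path = directory / "args.json"
    args_path.write_text(json.dumps(engine_args_payload()), encoding="utf-8")
    return args_path


def engine_args_json() -> str:
    """Serialize the shared payload to a string (in-process consumers)."""
    return json.dumps(engine_args_payload())


def check_round_trip() -> bool:
    """Confirm that both serialization paths describe the same payload."""
    with tempfile.TemporaryDirectory() as tmp:
        on_disk = json.loads(write_engine_args(Path(tmp)).read_text(encoding="utf-8"))
    return on_disk == json.loads(engine_args_json()) == engine_args_payload()
```

Returning a new dict from the helper, rather than sharing a module-level constant, keeps individual tests free to tweak a copy without affecting the others.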
lib/bindings/python/tests/test_tensor.py
...
@@ -13,6 +13,12 @@ from dynamo.runtime import DistributedRuntime
 TEST_END_TO_END = os.environ.get("TEST_END_TO_END", 0)

+pytestmark = [
+    pytest.mark.gpu_0,
+    pytest.mark.pre_merge,
+    pytest.mark.integration,
+]
+

 @pytest.mark.asyncio
 async def test_register(runtime: DistributedRuntime):
...
lib/mocker/src/scheduler/sglang/tests.rs
...
@@ -561,7 +561,7 @@ async fn assert_sglang_scheduler_completes_all(
     let expected_tokens = num_requests * max_output_tokens;
     let mut received_tokens = 0;
-    let timeout = tokio::time::sleep(Duration::from_secs(2));
+    let timeout = tokio::time::sleep(Duration::from_millis(200));
     tokio::pin!(timeout);
     loop {
...
@@ -572,7 +572,7 @@ async fn assert_sglang_scheduler_completes_all(
                 if received_tokens >= expected_tokens {
                     break;
                 }
-                timeout.set(tokio::time::sleep(Duration::from_secs(2)));
+                timeout.set(tokio::time::sleep(Duration::from_millis(200)));
             }
             _ = &mut timeout => break,
         }
...
@@ -580,7 +580,6 @@ async fn assert_sglang_scheduler_completes_all(
     assert_eq!(received_tokens, expected_tokens);
-    tokio::time::sleep(Duration::from_millis(100)).await;
     let metrics = scheduler.metrics_receiver().borrow().clone();
     assert!(metrics.active_decode_blocks > 0);
     assert!(metrics.total_blocks > 0);
...
@@ -609,7 +608,7 @@ mod router_events {
         let args = MockEngineArgs::builder()
             .num_gpu_blocks(500)
             .block_size(64)
-            .speedup_ratio(10.0)
+            .speedup_ratio(1000.0)
             .sglang(Some(SglangArgs {
                 schedule_policy: Some(schedule_policy.to_string()),
                 page_size: Some(page_size),
...
lib/mocker/src/scheduler/test_utils.rs
...
@@ -254,7 +254,7 @@ pub(crate) async fn assert_scheduler_completes_all(
     let expected_tokens = num_requests * max_output_tokens;
     let mut received_tokens = 0;
-    let timeout = tokio::time::sleep(Duration::from_secs(2));
+    let timeout = tokio::time::sleep(Duration::from_millis(200));
     tokio::pin!(timeout);
     loop {
...
@@ -265,7 +265,7 @@ pub(crate) async fn assert_scheduler_completes_all(
                 if received_tokens >= expected_tokens {
                     break;
                 }
-                timeout.set(tokio::time::sleep(Duration::from_secs(2)));
+                timeout.set(tokio::time::sleep(Duration::from_millis(200)));
             }
             _ = &mut timeout => break,
         }
...
@@ -276,7 +276,6 @@ pub(crate) async fn assert_scheduler_completes_all(
         "Expected {expected_tokens} output signals, got {received_tokens}"
     );
-    tokio::time::sleep(Duration::from_millis(100)).await;
     let metrics = scheduler.metrics_receiver().borrow().clone();
     assert_eq!(
         metrics.active_decode_blocks, 0,
...
lib/mocker/src/scheduler/vllm/tests.rs
...
@@ -477,7 +477,7 @@ mod live_scheduler {
         let args = MockEngineArgs::builder()
             .num_gpu_blocks(500)
             .block_size(64)
-            .speedup_ratio(10.0)
+            .speedup_ratio(1000.0)
             .enable_prefix_caching(enable_prefix_caching)
             .enable_chunked_prefill(enable_chunked_prefill)
             .build()
...
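Review note: the Rust scheduler helpers changed above (`assert_scheduler_completes_all` and `assert_sglang_scheduler_completes_all`) wait for output signals with an idle timeout that is re-armed after each received signal, and this commit shortens that timeout from 2 s to 200 ms. The sketch below shows the same idea with Python's `asyncio`, purely as an analogue for readers less familiar with `tokio::select!`. It is not code from this repository: `drain_until_idle`, `_demo`, and `run_demo` are hypothetical names, and the queue stands in for the scheduler's output channel.

```python
import asyncio


async def drain_until_idle(queue, expected, idle_timeout=0.2):
    """Count items until `expected` arrive or the queue stays idle too long."""
    received = 0
    while received < expected:
        try:
            # Every call starts a fresh timer, so the deadline is re-armed
            # after each item, mirroring `timeout.set(...)` in the Rust loop.
            await asyncio.wait_for(queue.get(), timeout=idle_timeout)
        except asyncio.TimeoutError:
            # Nothing arrived within the idle window: stop waiting.
            break
        received += 1
    return received


async def _demo():
    full = asyncio.Queue()
    for _ in range(5):
        full.put_nowait(object())
    short = asyncio.Queue()
    short.put_nowait(object())
    complete = await drain_until_idle(full, expected=5)
    partial = await drain_until_idle(short, expected=5, idle_timeout=0.05)
    return complete, partial


def run_demo():
    return asyncio.run(_demo())
```

With a per-item timer, total wait time is bounded by the gap between items rather than by the whole run, so a shorter idle window mainly affects how quickly a stalled run is detected.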