# Bench Entrypoints `multiturn_bench` simulates concurrent multi-turn conversations against an OpenAI-compatible chat endpoint and reports per-turn TTFT and total latency statistics. It can optionally enable **speculative prefill** — a technique that pre-warms the KV cache with the predicted next-turn prefix after each assistant response, cutting TTFT on subsequent turns. `offline_replay_bench` runs the Rust-native replay loop directly for profiling and throughput measurements without going through the Python wrapper. It uses the mocker's internal polynomial perf model so the results stay focused on replay overhead instead of external timing backends. ## Quick start ```bash # Smoke test (1 user, 1 turn, ~50 tokens) cargo bench --package dynamo-bench --bench multiturn_bench -- --ping ``` ## Speculative prefill demo Speculative prefill works best with multi-turn workloads where the conversation grows incrementally (e.g. reasoning models in agentic loops). After each assistant turn the frontend constructs the next-turn prompt prefix and sends a `max_tokens=1` request to warm the KV cache, so the real follow-up hits a warm cache and gets a much lower TTFT. ### 1. Launch the backend and frontend ```bash # Terminal 1 — backend (vLLM example, any supported backend works) python -m dynamo.vllm \ --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B # Terminal 2 — frontend with KV router python -m dynamo.frontend \ --router-mode kv \ --http-port 8000 ``` ### 2. Run baseline (no speculative prefill) ```bash cargo bench --package dynamo-bench --bench multiturn_bench -- \ --url http://localhost:8000 \ --num-users 10 \ --num-turns 5 \ --num-user-tokens 128 \ --max-completion-tokens 256 \ --mean-delay-ms 5000 \ --output baseline.json \ --verbose ``` ### 3. Run with speculative prefill ```bash cargo bench --package dynamo-bench --bench multiturn_bench -- \ --url http://localhost:8000 \ --num-users 10 \ --num-turns 5 \ --num-user-tokens 128 \ --max-completion-tokens 256 \ --mean-delay-ms 5000 \ --speculative-prefill \ --output specprefill.json \ --verbose ``` Compare the per-turn TTFT columns: turns 2+ should show a significant TTFT reduction (up to ~3x) because the KV cache is already warm when the real request arrives. ## CLI reference | Flag | Default | Description | |------|---------|-------------| | `--url` | `http://localhost:8000` | Frontend HTTP endpoint | | `--model` | auto-detected | Model name (queries `/v1/models` if omitted) | | `--num-users` | `10` | Concurrent simulated users | | `--num-turns` | `5` | Conversation turns per user | | `--num-user-tokens` | `128` | Approximate user-prompt token count per turn | | `--max-completion-tokens` | `1000` | Output sequence length cap | | `--ignore-eos` | `true` | Force generation to max tokens | | `--mean-delay-ms` | `5000` | Mean inter-turn delay (exponential distribution) | | `--speculative-prefill` | `false` | Enable speculative prefill via `nvext.agent_hints` | | `--output ` | none | Write results to JSON file | | `--verbose` / `-v` | `false` | Print per-turn logging | | `--seed` | `42` | Random seed | | `--ping` | `false` | Smoke-test mode (1 user, 1 turn, ~50 tokens, no delay) | ## How speculative prefill works 1. The client sends `{"nvext": {"agent_hints": {"speculative_prefill": true}}}` in each request. 2. As the assistant response streams back, the frontend accumulates the full response text. 3. Once `finish_reason` is set, a background task constructs the next-turn prompt (conversation history + assistant response, thinking content stripped) and sends a `max_tokens=1` prefill-only request through the pipeline. 4. The KV router routes the speculative request to the same worker, warming its cache. 5. When the real next-turn request arrives, the KV router sees high cache overlap on that worker and routes there, yielding a much lower TTFT. See also: [Agent Hints documentation](../../docs/components/frontend/nvext.md#agent-hints) ## Offline replay ```bash cargo bench --package dynamo-bench --bench offline_replay_bench -- \ /path/to/mooncake_trace.jsonl \ --num-workers 4 \ --router-mode kv-router \ --arrival-speedup-ratio 4 \ --trace-block-size 512 \ --block-size 64 ``` Use `--speedup-ratio` and `--decode-speedup-ratio` if you want a simple scaling knob while keeping the same internal polynomial model.