# Dynamo Run

`dynamo-run` is a Rust binary that lets you easily run a model and explore the Dynamo components, and that demonstrates the Rust API. It supports the `mistral.rs` and `llama.cpp` engines. `mistralrs` is the default for safetensors, `llamacpp` for GGUF files.

It is primarily for development and rapid prototyping. For production use, we recommend the Python-wrapped components; see the main project README.

## Basics

Usage: See `dynamo-run --help`

Example: `dynamo-run Qwen/Qwen3-0.6B`

Set the environment variable `DYN_LOG` to adjust the logging level; for example, `export DYN_LOG=debug`. It has the same syntax as `RUST_LOG`.

To adjust verbosity, use `-v` to enable debug logging or `-vv` to enable full trace logging. For example:

```bash
dynamo-run in=http out=mistralrs <model> -v  # enables debug logging
dynamo-run in=text out=llamacpp <model> -vv  # enables full trace logging
```

### Use model from Hugging Face

To automatically download Qwen3 4B from Hugging Face (16 GiB download) and to start it in interactive text mode:
```
dynamo-run Qwen/Qwen3-4B
```

The general format for HF download follows this pattern:
```
dynamo-run out=<engine> <HUGGING_FACE_ORGANIZATION/MODEL_NAME>
```

For gated models (such as meta-llama/Llama-3.2-3B-Instruct), you must set an `HF_TOKEN` environment variable.

The parameter can be the ID of a Hugging Face repository (which will be downloaded), a GGUF (GPT-Generated Unified Format) file, or a folder containing safetensors, `config.json`, and similar files (for example, a locally checked-out Hugging Face repository).

### Run a model from local file

To run a model from local file:
- Download the model from Hugging Face
- Run the model from local file

See the following sections for details.

#### Download model from Hugging Face
The following model from Hugging Face should be high quality and fast on almost any machine: https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF
For example, try https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/blob/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf

To download the model file:
```
curl -L -o Llama-3.2-3B-Instruct-Q4_K_M.gguf "https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf?download=true"
```

To run the model:

*Text interface*
```
dynamo-run Llama-3.2-3B-Instruct-Q4_K_M.gguf # or path to a Hugging Face repo checkout instead of the GGUF file
```

You can also pipe a prompt into `dynamo-run`:
```
echo 'What is the capital of Tuvalu?' | dynamo-run ~/llms/Qwen3-0.6B-Q8_0.gguf --context-length 4096
```

*HTTP interface*
```
dynamo-run in=http out=mistralrs Llama-3.2-3B-Instruct-Q4_K_M.gguf
```
You can also list models or send a request:

*List the models*
```
curl localhost:8080/v1/models
```

*Send a request*
```
curl -d '{"model": "Llama-3.2-3B-Instruct-Q4_K_M", "max_completion_tokens": 2049, "messages":[{"role":"user", "content": "What is the capital of South Africa?" }]}' -H 'Content-Type: application/json' http://localhost:8080/v1/chat/completions
```
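
The same request can be sent from Python. The following is a minimal sketch using only the standard library. The helper names `build_chat_request` and `send_chat_request` are illustrative, and the response parsing assumes the usual OpenAI chat-completion shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_completion_tokens=2049):
    """Build (without sending) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "max_completion_tokens": max_completion_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request):
    """Send the request and return the reply text.

    Assumption: the server answers with an OpenAI-style response body.
    """
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the HTTP server from above running, `send_chat_request(build_chat_request("http://localhost:8080", "Llama-3.2-3B-Instruct-Q4_K_M", "What is the capital of South Africa?"))` should return the model's answer.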

## Distributed System

You can run the ingress side (HTTP server and pre-processing) on one machine, for example a CPU node, and the worker on a different machine (a GPU node).

You will need [etcd](https://etcd.io/) and [NATS](https://nats.io) with JetStream installed and accessible from both nodes. For development I run NATS like this: `nats-server -js --trace --store_dir $(mktemp -d)`.

**Node 1:** OpenAI-compliant HTTP server, optional pre-processing, worker discovery:

```
dynamo-run in=http out=auto
```

**Node 2:** Engine. Receives and returns requests over the network:

```
dynamo-run in=dyn://llama3B.backend.generate out=mistralrs ~/llms/Llama-3.2-3B-Instruct
```

This uses etcd to auto-discover the model and NATS to talk to it. You can run multiple instances on the same endpoint; Dynamo picks one based on the `--router-mode` (round-robin by default if left unspecified).

Run `dynamo-run --help` for more options.

### Network names

The `in=dyn://` URLs have the format `dyn://namespace.component.endpoint`. For a quick start, use any string, such as `dyn://test`; `dynamo-run` will default any missing parts for you. The pieces matter for a larger system.

* *Namespace*: A pipeline. Usually a model, e.g. "llama_8b". Just a name.
* *Component*: A load balanced service needed to run that pipeline. "backend", "prefill", "decode", "preprocessor", "draft", etc. This typically has some configuration (which model to use, for example).
* *Endpoint*: Like a URL. "generate", "load_metrics".
* *Instance*: A process. Unique. Dynamo assigns each one a unique instance_id. The thing that is running is always an instance. Namespace/component/endpoint can refer to multiple instances.

If you run two models, that is two pipelines. An exception would be if doing speculative decoding. The draft model is part of the pipeline of a bigger model.

If you run two instances of the same model ("data parallel") they are the same namespace+component+endpoint but different instances. The router will spread traffic over all the instances of a namespace+component+endpoint. If you have four prefill workers in a pipeline, they all have the same namespace+component+endpoint and are automatically assigned unique instance_ids.
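
To make the address format concrete, here is a small sketch that splits a `dyn://` URL into its parts. The `DynAddress` type and `parse_dyn_url` function are hypothetical helpers written for this illustration and are not part of Dynamo. Missing parts are returned as `None` because the defaults that `dynamo-run` fills in are not listed here.

```python
from typing import NamedTuple, Optional


class DynAddress(NamedTuple):
    namespace: str
    component: Optional[str]
    endpoint: Optional[str]


def parse_dyn_url(url: str) -> DynAddress:
    """Split dyn://namespace.component.endpoint into its parts."""
    prefix = "dyn://"
    if not url.startswith(prefix):
        raise ValueError(f"not a dyn:// URL: {url!r}")
    parts = url[len(prefix):].split(".")
    if len(parts) > 3 or any(part == "" for part in parts):
        raise ValueError(f"expected dyn://namespace.component.endpoint, got {url!r}")
    # Pad with None for any parts that were left out.
    padded = parts + [None] * (3 - len(parts))
    return DynAddress(*padded)
```

In this sketch a dot inside a name would be read as a separator, so `parse_dyn_url("dyn://qwen3-32b.backend.generate")` yields namespace `qwen3-32b`, component `backend` and endpoint `generate`.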

Example 1: Data parallel load balanced, one model one pipeline two instances.
```
Node 1: dynamo-run in=dyn://qwen3-32b.backend.generate /data/Qwen3-32B
Node 2: dynamo-run in=dyn://qwen3-32b.backend.generate /data/Qwen3-32B
```

Example 2: Two models, two pipelines.
```
Node 1: dynamo-run in=dyn://qwen3-32b.backend.generate /data/Qwen3-32B
Node 2: dynamo-run in=dyn://llama3-1-8b.backend.generate /data/Llama-3.1-8B-Instruct/
```

Example 3: Different endpoints.

The KV metrics publisher in vLLM adds a `load_metrics` endpoint to the current component. If the `llama3-1-8b.backend` component above is using patched vLLM, it will also expose `llama3-1-8b.backend.load_metrics`.

Example 4: Multiple components in a pipeline.

In the P/D disaggregated setup you would have `deepseek-distill-llama8b.prefill.generate` (possibly multiple instances of this) and `deepseek-distill-llama8b.decode.generate`.

For output it is always only `out=auto`. This tells Dynamo to auto-discover the instances, group them by model, and load balance appropriately (depending on the `--router-mode` flag). The exception is static workers; see [Static workers without etcd](#static-workers-without-etcd).

### Static workers without etcd

Normally in the distributed system the frontend uses etcd to discover workers. Alternatively, you can use a static endpoint without etcd.

```
Node 1: dynamo-run in=http out=dyn://dynamo.backend.generate --model-name Qwen3-0.6B-Q8_0.gguf --model-path ~/llms/Qwen3-0.6B
Node 2: dynamo-run in=dyn://dynamo.backend.generate out=llamacpp ~/llms/Qwen3-0.6B-Q8_0.gguf --static-worker --context-length 4096
```

Note how `out=` points to a single endpoint, which must match the worker. The model's name and config (to do pre-processing) are usually discovered by the frontend via etcd. Now we must pass them in (`--model-name` and `--model-path`).

### KV-aware routing

```
dynamo-run in=http out=auto --router-mode kv
```

The only difference from the distributed system above is `--router-mode kv`. vLLM announces when a KV block is created or removed. The Dynamo router finds the worker with the best match for those KV blocks and directs the traffic to that node.

For performance testing, compare a typical workload with `--router-mode random|round-robin` to see if it can benefit from KV-aware routing.

The KV-aware routing arguments:

- `--kv-overlap-score-weight`: Sets the amount of weighting on overlaps with prefix caches, which directly contributes to the prefill cost. A large weight is expected to yield a better TTFT (at the expense of worse ITL). When set to 0, prefix caches are not considered at all (falling back to pure load balancing behavior on the active blocks).

- `--router-temperature`: Sets the temperature when randomly selecting workers to route to via softmax sampling on the router cost logits. Setting it to 0 recovers the deterministic behavior where the min logit is picked.

- `--use-kv-events`: Sets whether to listen to KV events for maintaining the global view of cached blocks. If true, then we use the `KvIndexer` to listen to the block creation and deletion events. If false, `ApproxKvIndexer`, which assumes the kv cache of historical prompts exists for fixed time durations (hard-coded to 120s), is used to predict the kv cache hit ratio in each engine. Set false if your backend engine does not emit KV events.
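
The effect of `--router-temperature` can be illustrated with a short sketch. This is not Dynamo's router code. It assumes each worker has a scalar cost logit where lower is better, and it shows the two behaviors described above: temperature 0 picks the minimum, and a positive temperature samples via softmax. The `pick_worker` function and the example costs are hypothetical.

```python
import math
import random


def pick_worker(costs, temperature, rng=random):
    """Pick a worker id from {worker_id: cost_logit}; lower cost is better."""
    if not costs:
        raise ValueError("no workers to choose from")
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Deterministic: the worker with the minimum logit wins.
        return min(costs, key=costs.get)
    workers = list(costs)
    # Softmax over negated costs so that cheaper workers get more weight.
    scaled = [-costs[w] / temperature for w in workers]
    largest = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - largest) for s in scaled]
    return rng.choices(workers, weights=weights, k=1)[0]
```

With a higher temperature the weights flatten and traffic spreads more evenly across workers. As the temperature approaches 0, the choice converges on the cheapest worker.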

### Request Migration

In a [Distributed System](#distributed-system), you can enable [request migration](../architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:

```bash
dynamo-run in=dyn://... out=<engine> ... --migration-limit=3
```

This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../architecture/request_migration.md) documentation for details on how this works.
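
As a rough illustration of the counting only (the real mechanism is described in the linked architecture document), the sketch below retries a request on other workers and gives up once the limit is exceeded. `WorkerFailed` and `run_with_migration` are hypothetical names, and workers are modeled as plain callables.

```python
class WorkerFailed(Exception):
    """Hypothetical error standing in for a worker failure."""


def run_with_migration(request, workers, migration_limit):
    """Try workers in order, migrating at most `migration_limit` times."""
    migrations = 0
    last_error = None
    for worker in workers:
        try:
            return worker(request)
        except WorkerFailed as err:
            last_error = err
            if migrations >= migration_limit:
                break  # limit reached: the request fails
            migrations += 1  # move on to the next worker
    raise WorkerFailed("request failed") from last_error
```

Under these assumptions, a limit of 3 allows one initial attempt plus up to three migrations, so at most four workers see the request.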

## Development

`dynamo-run` is also an example of what can be built in Rust with the `dynamo-llm` and `dynamo-runtime` crates. The following guide shows how to build from source with all the features.

### Step 1: Install libraries
**Ubuntu:**
```
sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libssl-dev libclang-dev protobuf-compiler python3-dev cmake
```

**macOS:**
- [Homebrew](https://brew.sh/)
```
# if brew is not installed on your system, install it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```
- [Xcode](https://developer.apple.com/xcode/)

```
brew install cmake protobuf

## Check that Metal is accessible
xcrun -sdk macosx metal
```
If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly.

### Step 2: Install Rust
```
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
```

### Step 3: Build

- Linux with GPU and CUDA (tested on Ubuntu):
```
cargo build --features cuda
```

- macOS with Metal:
```
cargo build --features metal
```

- CPU only:
```
cargo build
```

Optionally you can run `cargo build` from any location with arguments:

```
--target-dir /path/to/target_directory # specify target_directory with write privileges
--manifest-path /path/to/project/Cargo.toml # if cargo build is run outside of `launch/` directory
```

The binary is called `dynamo-run` and is in `target/debug`:
```
cd target/debug
```

Build with `--release` for a smaller binary and better performance, but longer build times. The binary will be in `target/release`.

## Engines

The input defaults to `in=text`. The output defaults to the `mistralrs` engine, unless it is disabled with `--no-default-features`, in which case an engine that echoes back your input is used.

### mistralrs

[mistral.rs](https://github.com/EricLBuehler/mistral.rs) is a pure Rust engine that is fast to run, fast to load, supports GGUF as well as safetensors, and runs well on CPU as well as GPU. For those reasons it is the default engine.

```
dynamo-run Qwen/Qwen3-4B
```

is equivalent to

```
dynamo-run in=text out=mistralrs Qwen/Qwen3-4B
```

If you have multiple GPUs, `mistral.rs` does automatic tensor parallelism. You do not need to pass any extra flags to dynamo-run to enable it.

### llamacpp

[llama.cpp](https://github.com/ggml-org/llama.cpp) is built for CPU by default. For an optimized build pass the appropriate feature flag (highly recommended):

```
cargo build --features cuda|metal|vulkan -p dynamo-run
```

For GNU OpenMP support add the `openmp` feature. On Ubuntu this requires `libgomp1` (part of `build-essential`) at build and runtime.

```
cargo build --features cuda,openmp -p dynamo-run
```

```
dynamo-run out=llamacpp ~/llms/gemma-3-1b-it-q4_0.gguf
dynamo-run out=llamacpp ~/llms/Qwen3-0.6B-Q8_0.gguf # From https://huggingface.co/ggml-org
```

Note that in some cases we are unable to extract the tokenizer from the GGUF, and so a Hugging Face checkout of a matching model must also be passed. Dynamo uses the weights from the GGUF and the pre-processor (`tokenizer.json`, etc) from the `--model-config`:
```
dynamo-run out=llamacpp ~/llms/Llama-4-Scout-17B-16E-Instruct-UD-IQ1_S.gguf --context-length 32768 --model-config ~/llms/Llama-4-Scout-17B-16E-Instruct
```

If you have multiple GPUs, llama.cpp does automatic tensor parallelism. You do not need to pass any extra flags to `dynamo-run` to enable it.

### Mocker engine

The mocker engine is a mock vLLM implementation designed for testing and development purposes. It simulates realistic token generation timing without requiring actual model inference, making it useful for:

- Testing distributed system components without GPU resources
- Benchmarking infrastructure and networking overhead
- Developing and debugging Dynamo components
- Load testing and performance analysis

**Basic usage:**

The `--model-path` is required but can point to any valid model path; the mocker doesn't actually load the model weights (but the pre-processor needs the tokenizer). The arguments `block_size`, `num_gpu_blocks`, `max_num_seqs`, `max_num_batched_tokens`, `enable_prefix_caching`, and `enable_chunked_prefill` are common arguments shared with the real vLLM engine.

The following arguments are mocker-specific:
- `speedup_ratio`: Speed multiplier for token generation (default: 1.0). Higher values make the simulated engine run faster.
- `dp_size`: Number of data parallel workers to simulate (default: 1).
- `watermark`: KV cache watermark threshold as a fraction (default: 0.01). This argument also exists for the real vLLM engine but cannot be passed as an engine arg.

```bash
echo '{"speedup_ratio": 10.0}' > mocker_args.json
dynamo-run in=dyn://dynamo.mocker.generate out=mocker --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 --extra-engine-args mocker_args.json
dynamo-run in=http out=auto --router-mode kv
```
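
If you prefer to generate the arguments file from a script, the sketch below writes the same kind of JSON file. It assumes that the other mocker-specific arguments listed above can be passed in the same file as `speedup_ratio`. The `write_mocker_args` helper is hypothetical and its defaults simply mirror the documented ones.

```python
import json
from pathlib import Path


def write_mocker_args(path, speedup_ratio=1.0, dp_size=1, watermark=0.01):
    """Write a JSON file for --extra-engine-args and return the dict written."""
    args = {
        "speedup_ratio": speedup_ratio,  # speed multiplier for token generation
        "dp_size": dp_size,              # simulated data parallel workers
        "watermark": watermark,          # KV cache watermark, as a fraction
    }
    Path(path).write_text(json.dumps(args, indent=2) + "\n", encoding="utf-8")
    return args
```

For example, `write_mocker_args("mocker_args.json", speedup_ratio=10.0)` produces a file that can be passed to the `dynamo-run` command shown above.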

### echo_full

The `echo_full` engine accepts un-processed requests and echoes the prompt back as the response.

```
dynamo-run in=http out=echo_full --model-name my_model
```

### echo_core

The `echo_core` engine accepts pre-processed requests and echoes the tokens back as the response. This is useful for testing pre-processing functionality as the response includes the full prompt template.

```
dynamo-run in=http out=echo_core --model-path <hf-repo-checkout>
```

Note that to use it with `in=http` you need to tell the post processor to ignore stop tokens from the template by adding `nvext.ignore_eos` like this:
```
curl -N -d '{"nvext": {"ignore_eos": true}, "stream": true, "model": "Qwen2.5-3B-Instruct", "max_completion_tokens": 4096, "messages":[{"role":"user", "content": "Tell me a story" }]}' ...
```

The default `in=text` sets that for you.

### Echo Configuration

Both echo engines use a configurable delay between tokens to simulate generation speed. You can adjust this using the `DYN_TOKEN_ECHO_DELAY_MS` environment variable:

```
# Set token echo delay to 1ms (1000 tokens per second)
DYN_TOKEN_ECHO_DELAY_MS=1 dynamo-run in=http out=echo_full
```

The default delay is 10ms, which produces approximately 100 tokens per second.
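
The relationship between the delay and the token rate is simple arithmetic, sketched below. Both function names are illustrative. The rounding to whole milliseconds is an assumption about what the variable accepts.

```python
def tokens_per_second(delay_ms):
    """Approximate echo rate for a given per-token delay in milliseconds."""
    if delay_ms <= 0:
        raise ValueError("delay_ms must be positive")
    return 1000.0 / delay_ms


def delay_ms_for_rate(target_tokens_per_second):
    """Whole-millisecond delay that approximates a target rate (at least 1 ms)."""
    if target_tokens_per_second <= 0:
        raise ValueError("target rate must be positive")
    return max(1, round(1000.0 / target_tokens_per_second))
```

For example, to simulate roughly 250 tokens per second you would set `DYN_TOKEN_ECHO_DELAY_MS` to `delay_ms_for_rate(250)`, which is 4.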

### Other engines, multi-node, production

The production-grade `vllm`, `sglang` and `trtllm` engines are available in `components/backends`. They run as Python components, using the Rust bindings. See the main README.

`dynamo-run` is an exploration, development and prototyping tool, as well as an example of using the Rust API. Multi-node and production setups should use the main engine components.

## Batch mode

`dynamo-run` can take a jsonl file full of prompts and evaluate them all:

```
dynamo-run in=batch:prompts.jsonl out=llamacpp <model>
```

The input file should look like this:
```
{"text": "What is the capital of France?"}
{"text": "What is the capital of Spain?"}
```

Each one is passed as a prompt to the model. The output is written back to the same folder in `output.jsonl`. At the end of the run some statistics are printed.
The output looks like this:
```
{"text":"What is the capital of France?","response":"The capital of France is Paris.","tokens_in":7,"tokens_out":7,"elapsed_ms":1566}
{"text":"What is the capital of Spain?","response":".The capital of Spain is Madrid.","tokens_in":7,"tokens_out":7,"elapsed_ms":855}
```
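
The input and output files are plain JSON Lines, so they are easy to produce and post-process from a script. The sketch below writes a prompts file in the format shown above and totals the per-request fields from an output file. The helper names are illustrative, and the field names are taken from the sample output.

```python
import json
from pathlib import Path


def write_prompts(path, prompts):
    """Write one {"text": ...} JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for prompt in prompts:
            f.write(json.dumps({"text": prompt}) + "\n")


def summarize_output(path):
    """Total the per-request counters found in an output.jsonl file."""
    totals = {"requests": 0, "tokens_in": 0, "tokens_out": 0, "elapsed_ms": 0}
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            totals["requests"] += 1
            for key in ("tokens_in", "tokens_out", "elapsed_ms"):
                totals[key] += record[key]
    return totals
```

Call `write_prompts("prompts.jsonl", [...])` before the batch run and `summarize_output("output.jsonl")` afterwards.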

## Writing your own engine in Python

The [dynamo](https://pypi.org/project/ai-dynamo/) Python library allows you to build your own engine and attach it to Dynamo. All of the main backend components in `components/backends/` work like this.

The Python file must do three things:
1. Decorate a function to get the runtime
2. Register on the network
3. Attach a request handler

```
import asyncio

import uvloop
from dynamo.llm import ModelType, register_llm
from dynamo.runtime import DistributedRuntime, dynamo_worker


# 1. Decorate a function to get the runtime
#
@dynamo_worker(static=False)
async def worker(runtime: DistributedRuntime):

    # 2. Register ourselves on the network
    #
    component = runtime.namespace("namespace").component("component")
    await component.create_service()
    model_path = "Qwen/Qwen3-0.6B"  # or "/data/models/Qwen3-0.6B"
    model_type = ModelType.Backend
    endpoint = component.endpoint("endpoint")
    # Optional last param to register_llm is model_name. If not present derives it from model_path
    await register_llm(model_type, endpoint, model_path)

    # Initialize your engine here
    # engine = ...

    # 3. Attach request handler
    #
    await endpoint.serve_endpoint(RequestHandler(engine).generate)


class RequestHandler:

    def __init__(self, engine):
        ...

    async def generate(self, request):
        # Call the engine
        # yield result dict
        ...


if __name__ == "__main__":
    uvloop.install()
    asyncio.run(worker())
```


The `model_path` can be:
- A Hugging Face repo ID, optionally prefixed with `hf://`. It is downloaded and cached locally.
- The path to a checkout of a Hugging Face repo: any folder containing safetensors files as well as `config.json`, `tokenizer.json` and `tokenizer_config.json`.
- The path to a GGUF file, if your engine supports that.

The `model_type` can be:
- ModelType.Backend. Dynamo handles pre-processing. Your `generate` method receives a `request` dict containing a `token_ids` array of int. It must return a dict also containing a `token_ids` array and an optional `finish_reason` string.
- ModelType.Chat. Your `generate` method receives a `request` and must return a response dict of type [OpenAI Chat Completion](https://platform.openai.com/docs/api-reference/chat). Your engine handles pre-processing.
- ModelType.Completion. Your `generate` method receives a `request` and must return a response dict of the older [Completions](https://platform.openai.com/docs/api-reference/completions). Your engine handles pre-processing.
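
For `ModelType.Backend`, a handler can be as small as the sketch below, which echoes the prompt tokens back one at a time. It is an illustration only: `EchoTokensHandler` and `collect` are hypothetical names, the `"stop"` value for `finish_reason` is an assumption, and so is the choice to yield one token per chunk rather than the whole array at once.

```python
import asyncio


class EchoTokensHandler:
    """Toy Backend-style handler: streams the prompt's token ids back."""

    async def generate(self, request):
        token_ids = request["token_ids"]
        if not token_ids:
            yield {"token_ids": [], "finish_reason": "stop"}
            return
        last = len(token_ids) - 1
        for index, token_id in enumerate(token_ids):
            chunk = {"token_ids": [token_id]}
            if index == last:
                chunk["finish_reason"] = "stop"  # assumed value
            yield chunk
            await asyncio.sleep(0)  # give other tasks a chance to run


async def collect(handler, request):
    """Gather every chunk the handler yields (handy for local testing)."""
    return [chunk async for chunk in handler.generate(request)]
```

Running `asyncio.run(collect(EchoTokensHandler(), {"token_ids": [1, 2, 3]}))` lets you inspect the chunks without a running Dynamo deployment.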

`register_llm` can also take the following kwargs:
- `model_name`: The name to call the model. The model name in incoming HTTP requests must match this. Defaults to the Hugging Face repo name, the folder name, or the GGUF file name.
- `context_length`: Max model length in tokens. Defaults to the model's set max. Only set this if you need to reduce KV cache allocation to fit into VRAM.
- `kv_cache_block_size`: Size of a KV block for the engine, in tokens. Defaults to 16.
- `user_data`: Optional dictionary containing custom metadata for worker behavior (e.g., LoRA configuration). Defaults to None.

Here are some example engines:

- Backend:
    * [vllm](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/hello_world/server_vllm.py)
    * [sglang](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/hello_world/server_sglang.py)
- Chat:
    * [sglang](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/hello_world/server_sglang_tok.py)

More fully-featured Python engines are in `components/backends`.

## Debugging

`dynamo-run` and `dynamo-runtime` support [tokio-console](https://github.com/tokio-rs/console). Build with the feature to enable:
```
cargo build --features cuda,tokio-console -p dynamo-run
```

The listener uses the default tokio console port, and all interfaces (0.0.0.0).