# Running Dynamo (`dynamo run`)

- [Running Dynamo (`dynamo run`)](#running-dynamo-dynamo-run)
  - [Quickstart with pip and vllm](#quickstart-with-pip-and-vllm)
    - [Use model from Hugging Face](#use-model-from-hugging-face)
    - [Run a model from local file](#run-a-model-from-local-file)
      - [Download model from Hugging Face](#download-model-from-hugging-face)
      - [Run model from local file](#run-model-from-local-file)
    - [Distributed System](#distributed-system)
    - [Network names](#network-names)
    - [KV-aware routing](#kv-aware-routing)
  - [Full usage details](#full-usage-details)
    - [Getting Started](#getting-started)
      - [Setup](#setup)
        - [Step 1: Install libraries](#step-1-install-libraries)
        - [Step 2: Install Rust](#step-2-install-rust)
        - [Step 3: Build](#step-3-build)
      - [Defaults](#defaults)
    - [Running Inference with Pre-built Engines](#running-inference-with-pre-built-engines)
      - [mistralrs](#mistralrs)
      - [llamacpp](#llamacpp)
      - [sglang](#sglang)
      - [vllm](#vllm)
      - [trtllm](#trtllm)
        - [Step 1: Build the environment](#step-1-build-the-environment)
        - [Step 2: Run the environment](#step-2-run-the-environment)
        - [Step 3: Execute `dynamo run` command](#step-3-execute-dynamo-run-command)
      - [Echo Engines](#echo-engines)
        - [echo\_core](#echo_core)
        - [echo\_full](#echo_full)
        - [Configuration](#configuration)
      - [Batch mode](#batch-mode)
    - [Extra engine arguments](#extra-engine-arguments)
    - [Writing your own engine in Python](#writing-your-own-engine-in-python)

This guide explains the `dynamo run` command.

`dynamo-run` is a CLI tool for exploring the Dynamo components. It's also an example of how to use components from Rust. If you use the Python wheel, it's available as `dynamo run`.

It supports these engines: mistralrs, llamacpp, sglang, vllm, and tensorrt-llm. `mistralrs` is the default.

Usage:
```
dynamo-run in=[http|text|dyn://<path>|batch:<folder>] out=echo_core|echo_full|mistralrs|llamacpp|sglang|vllm|dyn [--http-port 8080] [--model-path <path>] [--model-name <served-model-name>] [--model-config <hf-repo>] [--tensor-parallel-size=1] [--context-length=N] [--num-nodes=1] [--node-rank=0] [--leader-addr=127.0.0.1:9876] [--base-gpu-id=0] [--extra-engine-args=args.json] [--router-mode random|round-robin|kv] [--kv-overlap-score-weight=2.0] [--kv-gpu-cache-usage-weight=1.0] [--kv-waiting-requests-weight=1.0] [--verbosity (-v|-vv)]
```

Example: `dynamo run Qwen/Qwen3-0.6B`

Set the environment variable `DYN_LOG` to adjust the logging level; for example, `export DYN_LOG=debug`. It has the same syntax as `RUST_LOG`.

To adjust verbosity, use `-v` to enable debug logging or `-vv` to enable full trace logging. For example:

```bash
dynamo-run in=http out=mistralrs -v  # enables debug logging
dynamo-run in=text out=llamacpp -vv  # enables full trace logging
```

## Quickstart with pip and vllm

If you used `pip` to install `dynamo`, you have the `dynamo-run` binary pre-installed with the `vllm` engine. You must be in a virtual environment with vllm installed to use this engine. To compile from source, see [Full usage details](#full-usage-details) below.

The vllm and sglang engines require [etcd](https://etcd.io/) and [nats](https://nats.io/) with jetstream (`nats-server -js`). Mistralrs and llamacpp do not.

### Use model from Hugging Face

To automatically download Qwen3 4B from Hugging Face (16 GiB download) and start it in interactive text mode:
```
dynamo run out=vllm Qwen/Qwen3-4B
```

The general format for HF download follows this pattern:
```
dynamo run out=<engine> <HUGGING_FACE_ORGANIZATION/MODEL_NAME>
```

For gated models (such as meta-llama/Llama-3.2-3B-Instruct), you must set an `HF_TOKEN` environment variable.

The model parameter can be the ID of a Hugging Face repository (which will be downloaded), a GPT-Generated Unified Format (GGUF) file, or a folder containing safetensors, config.json, or similar (for example, a locally checked out Hugging Face repository).

### Run a model from local file

To run a model from a local file:
- Download the model from Hugging Face
- Run the model from local file

See the following sections for details.

#### Download model from Hugging Face

This model from Hugging Face should be high quality and fast on almost any machine: https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF
For example, try https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/blob/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf

To download the model file:
```
curl -L -o Llama-3.2-3B-Instruct-Q4_K_M.gguf "https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf?download=true"
```

#### Run model from local file

To run the model:

*Text interface*
```
dynamo run Llama-3.2-3B-Instruct-Q4_K_M.gguf # or path to a Hugging Face repo checkout instead of the GGUF file
```

*HTTP interface*
```
dynamo run in=http out=mistralrs Llama-3.2-3B-Instruct-Q4_K_M.gguf
```

You can also list models or send a request:

*List the models*
```
curl localhost:8080/v1/models
```

*Send a request*
```
curl -d '{"model": "Llama-3.2-3B-Instruct-Q4_K_M", "max_completion_tokens": 2049, "messages":[{"role":"user", "content": "What is the capital of South Africa?" }]}' -H 'Content-Type: application/json' http://localhost:8080/v1/chat/completions
```
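
The same request can also be built from Python. This is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send` and the default of 256 completion tokens are illustrative assumptions, and it presumes the HTTP server is listening on `localhost:8080`.

```python
import json
import urllib.request


def build_chat_request(model, prompt, max_completion_tokens=256,
                       base_url="http://localhost:8080"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "max_completion_tokens": max_completion_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `send(build_chat_request("Llama-3.2-3B-Instruct-Q4_K_M", "What is the capital of South Africa?"))` should return the completion as a dict.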

### Distributed System

You can run the ingress side (HTTP server and pre-processing) on one machine, for example a CPU node, and the worker on a different machine (a GPU node).

You will need [etcd](https://etcd.io/) and [nats](https://nats.io) with jetstream installed and accessible from both nodes.

**Node 1:** OpenAI-compliant HTTP server, optional pre-processing, worker discovery:

```
dynamo-run in=http out=dyn
```

**Node 2:** vllm engine. Receives and returns requests over the network:

```
dynamo-run in=dyn://llama3B.backend.generate out=vllm ~/llms/Llama-3.2-3B-Instruct
```

This uses etcd to auto-discover the model and NATS to talk to it. You can
run multiple instances on the same endpoint; it picks one based on the
`--router-mode` (round-robin by default if left unspecified).

Run `dynamo-run --help` for more options.

### Network names

The `in=dyn://` URLs have the format `dyn://namespace.component.endpoint`. For a quickstart, use any string, such as `dyn://test`; `dynamo-run` fills in defaults for any missing parts. The pieces matter in a larger system.

* *Namespace*: A pipeline. Usually a model, e.g. "llama_8b". Just a name.
* *Component*: A load balanced service needed to run that pipeline. "backend", "prefill", "decode", "preprocessor", "draft", etc. This typically has some configuration (which model to use, for example).
* *Endpoint*: Like a URL. "generate", "load_metrics".
* *Instance*: A process. Unique. Dynamo assigns each one a unique instance_id. The thing that is running is always an instance. Namespace/component/endpoint can refer to multiple instances.

If you run two models, that is two pipelines. An exception is speculative decoding, where the draft model is part of the bigger model's pipeline.

If you run two instances of the same model ("data parallel") they are the same namespace+component+endpoint but different instances. The router will spread traffic over all the instances of a namespace+component+endpoint. If you have four prefill workers in a pipeline, they all have the same namespace+component+endpoint and are automatically assigned unique instance_ids.
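
The naming scheme can be made concrete with a small parser. This is an illustrative sketch, not the parser `dynamo-run` uses; `DynEndpoint` and `parse_dyn_url` are hypothetical names, and missing parts are left as `None` because the defaults that `dynamo-run` fills in are not specified here.

```python
from typing import NamedTuple, Optional


class DynEndpoint(NamedTuple):
    namespace: str
    component: Optional[str]
    endpoint: Optional[str]


def parse_dyn_url(url: str) -> DynEndpoint:
    """Split dyn://namespace.component.endpoint into its parts."""
    prefix = "dyn://"
    if not url.startswith(prefix):
        raise ValueError(f"not a dyn:// URL: {url!r}")
    parts = url[len(prefix):].split(".")
    # Reject empty segments ("dyn://", "dyn://a..b") and more than three parts.
    if len(parts) > 3 or any(part == "" for part in parts):
        raise ValueError(f"expected dyn://namespace.component.endpoint, got {url!r}")
    # Pad missing component/endpoint with None (the real defaults are not modeled).
    padded = parts + [None] * (3 - len(parts))
    return DynEndpoint(*padded)
```

Two workers started with the same URL parse to the same `DynEndpoint`; they differ only in the instance ID that Dynamo assigns at runtime.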

Example 1: Data parallel, load balanced: one model, one pipeline, two instances.
```
Node 1: dynamo-run in=dyn://qwen3-32b.backend.generate out=sglang /data/Qwen3-32B --tensor-parallel-size 2 --base-gpu-id 0
Node 2: dynamo-run in=dyn://qwen3-32b.backend.generate out=sglang /data/Qwen3-32B --tensor-parallel-size 2 --base-gpu-id 2
```

Example 2: Two models, two pipelines.
```
Node 1: dynamo-run in=dyn://qwen3-32b.backend.generate out=vllm /data/Qwen3-32B
Node 2: dynamo-run in=dyn://llama3-1-8b.backend.generate out=vllm /data/Llama-3.1-8B-Instruct/
```

Example 3: Different endpoints.

The KV metrics publisher in vllm adds a `load_metrics` endpoint to the current component. If the `llama3-1-8b.backend` component above is using patched vllm, it will also expose `llama3-1-8b.backend.load_metrics`.

Example 4: Multiple components in a pipeline.

In the P/D disaggregated setup you would have `deepseek-distill-llama8b.prefill.generate` (possibly multiple instances of this) and `deepseek-distill-llama8b.decode.generate`.

For output, always use `out=dyn`. This tells Dynamo to auto-discover the instances, group them by model, and load balance appropriately (depending on the `--router-mode` flag). The old syntax of `dyn://...` is still accepted for backwards compatibility.

### KV-aware routing

**Setup**

Currently, only patched vllm supports KV-aware routing.

To set up KV-aware routing on patched vllm:

1. Ensure that `etcd` and `nats` (see [Quickstart with pip and vllm](#quickstart-with-pip-and-vllm)) are running and accessible from all nodes.
1. Create a virtualenv: `uv venv kvtest` and source its `activate`.
1. Use `pip` to **either**:
   1. Install Dynamo's vllm branch:
      ```
      uv pip install ai-dynamo-vllm
      ```
       **or**
   1. Install upstream vllm 0.8.4:
      ```
      uv pip install vllm==0.8.4
      ```
      And then patch it:
      ```
      cd kvtest/lib/python3.12/site-packages
      patch -p1 < $REPO_ROOT/container/deps/vllm/vllm_v0.8.4-dynamo-kv-disagg-patch.patch
      ```
1. Build the C bindings:
   ```
   cd $REPO_ROOT/lib/bindings/c
   cargo build
   ```
1. Put the library you just built on the library path:
   ```
   export LD_LIBRARY_PATH=$REPO_ROOT/target/debug/
   ```
If you patched locally (instead of installing `ai-dynamo-vllm`), edit vllm's `platforms/__init__.py` to undo a patch change:
```
    #vllm_version = version("ai_dynamo_vllm")
    vllm_version = version("vllm")
```

**Start the workers**

The workers are started normally:

```
dynamo-run in=dyn://dynamo.endpoint.generate out=vllm /data/llms/Qwen/Qwen3-4B
```

**Start the ingress node**

```
dynamo-run in=http out=dyn --router-mode kv
```

The only difference from the distributed system above is `--router-mode kv`. The patched vllm announces when a KV block is created or removed. The Dynamo router finds the worker with the best match for those KV blocks and directs the traffic to that node.

For performance testing, compare a typical workload with `--router-mode random|round-robin` to see if it can benefit from KV-aware routing.
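
The idea behind KV-aware routing can be illustrated with a toy sketch. This is not Dynamo's router: the function names and the block identifiers are hypothetical, and the sketch only counts matching prefix blocks. It ignores the load-related weights (`--kv-gpu-cache-usage-weight`, `--kv-waiting-requests-weight`) that the real router also takes as options.

```python
def count_matching_blocks(request_blocks, cached_blocks):
    """Count how many leading blocks of the request a worker already has cached."""
    matched = 0
    for block in request_blocks:
        if block not in cached_blocks:
            break  # a prefix match ends at the first missing block
        matched += 1
    return matched


def pick_worker(request_blocks, worker_caches):
    """Return the worker whose cache covers the longest prefix of the request.

    worker_caches maps a worker name to the set of block IDs it has announced.
    On a tie, the first worker in iteration order wins.
    """
    return max(
        worker_caches,
        key=lambda worker: count_matching_blocks(request_blocks, worker_caches[worker]),
    )
```

A request whose prompt shares a long prefix with earlier traffic lands on the worker that already holds those blocks, which is why workloads with repeated prefixes are the ones most likely to benefit.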

## Full usage details

`dynamo run` executes `dynamo-run`. `dynamo-run` is also an example of what can be built in Rust with the `dynamo-llm` and `dynamo-runtime` crates. The following guide shows how to build from source with all the features.

### Getting Started

#### Setup

##### Step 1: Install libraries
**Ubuntu:**
```
sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libssl-dev libclang-dev protobuf-compiler python3-dev cmake
```

**macOS:**
- [Homebrew](https://brew.sh/)
```
# if brew is not installed on your system, install it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```
- [Xcode](https://developer.apple.com/xcode/)

```
brew install cmake protobuf

## Check that Metal is accessible
xcrun -sdk macosx metal
```
If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly.

##### Step 2: Install Rust
```
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
```

##### Step 3: Build

- Linux with GPU and CUDA (tested on Ubuntu):
```
cargo build --features cuda
```

- macOS with Metal:
```
cargo build --features metal
```

- CPU only:
```
cargo build
```

Optionally you can run `cargo build` from any location with arguments:

```
--target-dir /path/to/target_directory # specify target_directory with write privileges
--manifest-path /path/to/project/Cargo.toml # if cargo build is run outside of `launch/` directory
```

The binary is called `dynamo-run` and is located in `target/debug`:
```
cd target/debug
```

Build with `--release` for a smaller binary and better performance, but longer build times. The binary will be in `target/release`.



#### Defaults

The input defaults to `in=text`. The output defaults to the `out=mistralrs` engine, unless it is disabled with `--no-default-features`, in which case vllm is used.

### Running Inference with Pre-built Engines

#### mistralrs

[mistral.rs](https://github.com/EricLBuehler/mistral.rs) is a pure Rust engine that is fast to run, fast to load, supports GGUF as well as safetensors, and runs well on CPU as well as GPU. For those reasons it is the default engine.

```
dynamo-run Qwen/Qwen3-4B
```

is equivalent to

```
dynamo-run in=text out=mistralrs Qwen/Qwen3-4B
```

If you have multiple GPUs, mistral.rs does automatic tensor parallelism. You do not need to pass any extra flags to dynamo-run to enable it.

#### llamacpp

[llama.cpp](https://github.com/ggml-org/llama.cpp) is built for CPU by default. For an optimized build, pass the appropriate feature flag (highly recommended):

```
cargo build --features cuda|metal|vulkan -p dynamo-run
```

For GNU OpenMP support add the `openmp` feature. On Ubuntu this requires `libgomp1` (part of `build-essential`) at build and runtime.

```
cargo build --features cuda,openmp -p dynamo-run
```

```
dynamo-run out=llamacpp ~/llms/gemma-3-1b-it-q4_0.gguf
dynamo-run out=llamacpp ~/llms/Qwen3-0.6B-Q8_0.gguf # From https://huggingface.co/ggml-org
```

Note that in some cases we are unable to extract the tokenizer from the GGUF, and so a Hugging Face checkout of a matching model must also be passed. Dynamo uses the weights from the GGUF and the pre-processor (`tokenizer.json`, etc) from the `--model-config`:
```
dynamo-run out=llamacpp ~/llms/Llama-4-Scout-17B-16E-Instruct-UD-IQ1_S.gguf --context-length 32768 --model-config ~/llms/Llama-4-Scout-17B-16E-Instruct
```

If you have multiple GPUs, llama.cpp does automatic tensor parallelism. You do not need to pass any extra flags to dynamo-run to enable it.

#### sglang

The [SGLang](https://docs.sglang.ai/index.html) engine requires [etcd](https://etcd.io/) and [nats](https://nats.io/) with jetstream (`nats-server -js`) to be running.

1. Set up the Python virtual env:

```
uv venv
source .venv/bin/activate
uv pip install pip
uv pip install sgl-kernel --force-reinstall --no-deps
uv pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
```

2. Run

Any example above using `out=sglang` can work, but our sglang backend also supports multiple GPUs.

```
cd target/debug
./dynamo-run in=http out=sglang --model-path ~/llms/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 8
```

To pass extra arguments to the sglang engine, see [Extra engine arguments](#extra-engine-arguments).

**Multi-GPU**

Pass `--tensor-parallel-size <NUM-GPUS>` to `dynamo-run`.

```
dynamo-run out=sglang ~/llms/Llama-4-Scout-17B-16E-Instruct/ --tensor-parallel-size 8
```

To specify which GPU to start from, pass `--base-gpu-id <num>`. For example, on a shared eight-GPU machine where GPUs 0-3 are already in use:
```
dynamo-run out=sglang <model> --tensor-parallel-size 4 --base-gpu-id 4
```

**Multinode:**

Dynamo only manages the leader node (node rank 0). The follower nodes are started in the [normal sglang way](https://docs.sglang.ai/references/deepseek.html#running-examples-on-multi-node).

Leader node:
```
dynamo-run out=sglang /data/models/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 16 --node-rank 0 --num-nodes 2 --leader-addr 10.217.98.122:5000
```

All follower nodes. Increment `node-rank` each time:
```
python3 -m sglang.launch_server --model-path /data/models/DeepSeek-R1-Distill-Llama-70B --tp 16 --dist-init-addr 10.217.98.122:5000 --nnodes 2 --node-rank 1 --trust-remote-code
```

- Parameters `--leader-addr` and `--dist-init-addr` must match and be the IP address of the leader node. All followers must be able to connect. SGLang uses [PyTorch Distributed](https://docs.pytorch.org/tutorials/beginner/dist_overview.html) for networking.
- Parameters `--tensor-parallel-size` and `--tp` must match and be the total number of GPUs across the cluster.
- `--node-rank` must be unique consecutive integers starting at 1. The leader, managed by Dynamo, is 0.
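
The matching rules above are easy to get wrong by hand, so it can help to generate all the commands from one set of values. This is a hedged sketch: `multinode_commands` is a hypothetical helper, and it only assembles the flags already shown in this section.

```python
import shlex


def multinode_commands(model_path, leader_addr, num_nodes, total_gpus):
    """Return the leader command followed by one command per follower node."""
    leader = [
        "dynamo-run", "out=sglang", model_path,
        "--tensor-parallel-size", str(total_gpus),
        "--node-rank", "0",
        "--num-nodes", str(num_nodes),
        "--leader-addr", leader_addr,
    ]
    followers = [
        [
            "python3", "-m", "sglang.launch_server",
            "--model-path", model_path,
            "--tp", str(total_gpus),            # must equal --tensor-parallel-size
            "--dist-init-addr", leader_addr,    # must equal --leader-addr
            "--nnodes", str(num_nodes),
            "--node-rank", str(rank),           # 1, 2, ... (the leader is 0)
            "--trust-remote-code",
        ]
        for rank in range(1, num_nodes)
    ]
    return [shlex.join(leader)] + [shlex.join(cmd) for cmd in followers]
```

Calling `multinode_commands("/data/models/DeepSeek-R1-Distill-Llama-70B", "10.217.98.122:5000", 2, 16)` yields one leader command and one follower command with consistent addresses, GPU counts, and ranks.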

#### vllm

Uses the [vllm](https://github.com/vllm-project/vllm) Python library. Slow startup, fast inference. Supports both safetensors from HF and GGUF files, but is very slow for GGUF; prefer llamacpp for those.

The vllm engine requires [etcd](https://etcd.io/) and [nats](https://nats.io/) with jetstream (`nats-server -js`) to be running.

We use [uv](https://docs.astral.sh/uv/) but any virtualenv manager should work.

1. Setup:
```
uv venv
source .venv/bin/activate
uv pip install pip
uv pip install vllm==0.8.4 setuptools
```

**Note: If you're on Ubuntu 22.04 or earlier, you must add `--python=python3.10` to your `uv venv` command**

2. Build:
```
cargo build
cd target/debug
```

3. Run

Inside that virtualenv:

**HF repo:**
```
./dynamo-run in=http out=vllm ~/llms/Llama-3.2-3B-Instruct/
```

To pass extra arguments to the vllm engine, see [Extra engine arguments](#extra-engine-arguments) below.

vllm attempts to allocate enough KV cache for the full context length at startup. If that does not fit in your available memory, pass `--context-length <value>`.

**Multi-GPU**

Pass `--tensor-parallel-size <NUM-GPUS>` to `dynamo-run`.

To specify which GPUs to use, set the environment variable `CUDA_VISIBLE_DEVICES`.

**Multinode:**

vllm uses [ray](https://docs.vllm.ai/en/latest/serving/distributed_serving.html#running-vllm-on-multiple-nodes) for pipeline parallel inference. Dynamo does not change or manage that.

Here is an example on two 8-GPU nodes:
- Leader node: `ray start --head --port=6379`
- Each follower node: `ray start --address='<HEAD_NODE_IP>:6379'`
- Leader node: `dynamo-run out=vllm ~/llms/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 16`

The `--tensor-parallel-size` parameter is the total number of GPUs in the cluster. This is often constrained by a model dimension such as being a divisor of the number of attention heads.
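
As a quick sanity check before launching, the divisor constraint can be tested in a few lines. This sketch covers only the attention-head rule mentioned above; `valid_tensor_parallel_sizes` is a hypothetical helper, and a real engine may impose further constraints that are not modeled here.

```python
def valid_tensor_parallel_sizes(num_attention_heads, total_gpus):
    """List tensor-parallel sizes up to total_gpus that evenly divide the head count."""
    return [
        size
        for size in range(1, total_gpus + 1)
        if num_attention_heads % size == 0
    ]
```

For a model with 64 attention heads on 16 GPUs this returns `[1, 2, 4, 8, 16]`, so `--tensor-parallel-size 16` passes this particular check. The head count itself comes from the model's configuration.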

Startup can be slow, so you may want to `export DYN_LOG=debug` to see progress.

Shutdown: `ray stop`

#### trtllm

Uses [TensorRT-LLM's LLM API](https://nvidia.github.io/TensorRT-LLM/llm-api/), a high-level Python API.

You can use `--extra-engine-args` to pass extra arguments to the LLM API engine.

The trtllm engine requires [etcd](https://etcd.io/) and [nats](https://nats.io/) with jetstream (`nats-server -js`) to be running.

##### Step 1: Build the environment

See instructions [here](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md#build-docker) to build the dynamo container with TensorRT-LLM.

##### Step 2: Run the environment

See instructions [here](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md#run-container) to run the built environment.

##### Step 3: Execute `dynamo run` command

Execute the following to load the TensorRT-LLM model specified in the configuration.
```
dynamo-run in=http out=trtllm TinyLlama/TinyLlama-1.1B-Chat-v1.0
```

#### Echo Engines

Dynamo includes two echo engines for testing and debugging purposes:

##### echo_core

The `echo_core` engine accepts pre-processed requests and echoes the tokens back as the response. This is useful for testing pre-processing functionality as the response includes the full prompt template.

```
dynamo-run in=http out=echo_core --model-path <hf-repo-checkout>
```

Note that to use it with `in=http` you need to tell the post processor to ignore stop tokens from the template by adding `nvext.ignore_eos` like this:
```
curl -N -d '{"nvext": {"ignore_eos": true}, "stream": true, "model": "Qwen2.5-3B-Instruct", "max_completion_tokens": 4096, "messages":[{"role":"user", "content": "Tell me a story" }]}' ...
```

The default `in=text` sets that for you.

##### echo_full

The `echo_full` engine accepts un-processed requests and echoes the prompt back as the response.

```
dynamo-run in=http out=echo_full --model-name my_model
```

##### Configuration

Both echo engines use a configurable delay between tokens to simulate generation speed. You can adjust this using the `DYN_TOKEN_ECHO_DELAY_MS` environment variable:

```
# Set token echo delay to 1ms (1000 tokens per second)
DYN_TOKEN_ECHO_DELAY_MS=1 dynamo-run in=http out=echo_full
```

The default delay is 10ms, which produces approximately 100 tokens per second.
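
The relationship between the delay and the simulated speed is a simple reciprocal. As a small sketch (the function name is illustrative, and the result is an upper bound because it ignores any per-token overhead):

```python
def tokens_per_second(delay_ms: float) -> float:
    """Convert a per-token delay in milliseconds to an approximate token rate."""
    if delay_ms <= 0:
        raise ValueError("delay_ms must be positive")
    return 1000.0 / delay_ms
```

A delay of 10 ms gives 100 tokens per second and a delay of 1 ms gives 1000, matching the figures quoted above.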

#### Batch mode

`dynamo-run` can take a jsonl file full of prompts and evaluate them all:

```
dynamo-run in=batch:prompts.jsonl out=llamacpp <model>
```

The input file should look like this:
```
{"text": "What is the capital of France?"}
{"text": "What is the capital of Spain?"}
```

Each one is passed as a prompt to the model. The output is written back to the same folder in `output.jsonl`. At the end of the run some statistics are printed.
The output looks like this:
```
{"text":"What is the capital of France?","response":"The capital of France is Paris.","tokens_in":7,"tokens_out":7,"elapsed_ms":1566}
{"text":"What is the capital of Spain?","response":".The capital of Spain is Madrid.","tokens_in":7,"tokens_out":7,"elapsed_ms":855}
```
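
The input and output files are plain JSON Lines, so they are easy to produce and inspect from a script. This is a minimal sketch assuming the field names shown above; `write_prompts` and `summarize_output` are hypothetical helpers, not part of Dynamo.

```python
import json
from pathlib import Path


def write_prompts(path, prompts):
    """Write one {"text": ...} object per line, the input format shown above."""
    with Path(path).open("w", encoding="utf-8") as f:
        for prompt in prompts:
            f.write(json.dumps({"text": prompt}) + "\n")


def summarize_output(path):
    """Total up request count, token counts, and elapsed time from output.jsonl."""
    text = Path(path).read_text(encoding="utf-8")
    rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    return {
        "requests": len(rows),
        "tokens_in": sum(row["tokens_in"] for row in rows),
        "tokens_out": sum(row["tokens_out"] for row in rows),
        "elapsed_ms": sum(row["elapsed_ms"] for row in rows),
    }
```

Note that the summed `elapsed_ms` is the total of per-request times. If requests run concurrently it will be larger than the wall-clock duration of the run.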

### Extra engine arguments
The vllm and sglang backends support passing any argument the engine accepts.
Put the arguments in a JSON file:
```
{
    "dtype": "half",
    "trust_remote_code": true
}
```
Pass it like this:
```
dynamo-run out=sglang ~/llms/Llama-3.2-3B-Instruct --extra-engine-args sglang_extra.json
```
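
If the arguments are computed by a launcher script, the JSON file can be generated rather than written by hand. This is an illustrative sketch; `write_engine_args` is a hypothetical helper, and the keys must be arguments the chosen engine actually accepts.

```python
import json
from pathlib import Path


def write_engine_args(path, **engine_args):
    """Write engine arguments as JSON and return the matching dynamo-run flags."""
    Path(path).write_text(json.dumps(engine_args, indent=4) + "\n", encoding="utf-8")
    return ["--extra-engine-args", str(path)]
```

For example, `write_engine_args("sglang_extra.json", dtype="half", trust_remote_code=True)` writes the file shown above and returns the two command-line tokens to append to the `dynamo-run` invocation.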

The trtllm backend also supports passing any argument the engine accepts. In this case, the config must be a YAML file:

```
backend: pytorch
kv_cache_config:
  event_buffer_max_size: 1024
```

Pass it like this:
```
dynamo-run in=http out=trtllm TinyLlama/TinyLlama-1.1B-Chat-v1.0 --extra-engine-args trtllm_extra.yaml
```

### Writing your own engine in Python

Note: This section replaces "bring-your-own-engine".

The [dynamo](https://pypi.org/project/ai-dynamo/) Python library allows you to build your own engine and attach it to Dynamo.

The Python file must do three things:
1. Decorate a function to get the runtime
2. Register on the network
3. Attach a request handler

```
import asyncio

import uvloop
from dynamo.llm import ModelType, register_llm
from dynamo.runtime import DistributedRuntime, dynamo_worker

# 1. Decorate a function to get the runtime
#
@dynamo_worker(static=False)
async def worker(runtime: DistributedRuntime):

    # 2. Register ourselves on the network
    #
    component = runtime.namespace("namespace").component("component")
    await component.create_service()
    model_path = "Qwen/Qwen3-0.6B" # or "/data/models/Qwen3-0.6B"
    model_type = ModelType.Backend
    endpoint = component.endpoint("endpoint")
    # Optional last param to register_llm is model_name. If not present derives it from model_path
    await register_llm(model_type, endpoint, model_path)

    # Initialize your engine here
    # engine = ...

    # 3. Attach request handler
    #
    await endpoint.serve_endpoint(RequestHandler(engine).generate)

class RequestHandler:

    def __init__(self, engine):
        ...

    async def generate(self, request):
        # Call the engine
        # yield result dict
        ...

if __name__ == "__main__":
    uvloop.install()
    asyncio.run(worker())
```


The `model_path` can be:
- A HuggingFace repo ID, optionally prefixed with `hf://`. It is downloaded and cached locally.
- The path to a checkout of a HuggingFace repo - any folder containing safetensors files as well as `config.json`, `tokenizer.json` and `tokenizer_config.json`.
- The path to a GGUF file, if your engine supports that.

The `model_type` can be:
- ModelType.Backend. Dynamo handles pre-processing. Your `generate` method receives a `request` dict containing a `token_ids` array of int. It must return a dict also containing a `token_ids` array and an optional `finish_reason` string.
- ModelType.Chat. Your `generate` method receives a `request` and must return a response dict of type [OpenAI Chat Completion](https://platform.openai.com/docs/api-reference/chat). Your engine handles pre-processing.
- ModelType.Completion. Your `generate` method receives a `request` and must return a response dict of the older [Completions](https://platform.openai.com/docs/api-reference/completions). Your engine handles pre-processing.
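
To make the `ModelType.Backend` contract concrete, here is a toy handler that echoes the prompt tokens back one at a time. It is a sketch under stated assumptions: `EchoHandler` and `collect` are hypothetical names, each yielded dict is assumed to carry only the newly produced tokens, and `"stop"` is an assumed `finish_reason` value. It does not import Dynamo, so it can be run on its own to check the shape of the output.

```python
import asyncio


class EchoHandler:
    """Toy Backend-style handler: returns the prompt tokens as the response."""

    async def generate(self, request):
        token_ids = request["token_ids"]
        last = len(token_ids) - 1
        for index, token in enumerate(token_ids):
            chunk = {"token_ids": [token]}
            if index == last:
                # Assumed value; mark the final chunk so the caller knows we are done.
                chunk["finish_reason"] = "stop"
            yield chunk


async def collect(handler, request):
    """Drain the async generator into a list, for local testing only."""
    return [chunk async for chunk in handler.generate(request)]
```

Running `asyncio.run(collect(EchoHandler(), {"token_ids": [1, 2, 3]}))` returns three chunks, the last of which includes the `finish_reason`. A real engine would replace the loop body with calls into the model.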

`register_llm` can also take the following kwargs:
- `model_name`: The name to call the model. The model name in incoming HTTP requests must match this. Defaults to the Hugging Face repo name, the folder name, or the GGUF file name.
- `context_length`: Max model length in tokens. Defaults to the model's set max. Only set this if you need to reduce KV cache allocation to fit into VRAM.
- `kv_cache_block_size`: Size of a KV block for the engine, in tokens. Defaults to 16.

Here are some example engines:

- Backend:
    * [vllm](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/hello_world/server_vllm.py)
    * [sglang](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/hello_world/server_sglang.py)
- Chat:
    * [sglang](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/hello_world/server_sglang_tok.py)

More fully-featured Backend engines (used by `dynamo-run`):
- [vllm](https://github.com/ai-dynamo/dynamo/blob/main/launch/dynamo-run/src/subprocess/vllm_inc.py)
- [sglang](https://github.com/ai-dynamo/dynamo/blob/main/launch/dynamo-run/src/subprocess/sglang_inc.py)