# Dynamo Run

* [Quickstart with pip and vllm](#quickstart-with-pip-and-vllm)
    * [Automatically download a model from Hugging Face](#use-model-from-hugging-face)
    * [Run a model from local file](#run-a-model-from-local-file)
    * [Distributed system](#distributed-system)
    * [Network names](#network-names)
    * [KV-aware routing](#kv-aware-routing)
* [Full usage details](#full-usage-details)
    * [Setup](#setup)
    * [mistral.rs](#mistralrs)
    * [llama.cpp](#llamacpp)
    * [Sglang](#sglang)
    * [Vllm](#vllm)
    * [TensorRT-LLM](#tensorrt-llm-engine)
    * [Echo Engines](#echo-engines)
    * [Write your own engine in Python](#write-your-own-engine-in-python)
* [Batch mode](#batch-mode)
* [Defaults](#defaults)
* [Extra engine arguments](#extra-engine-arguments)

`dynamo-run` is a CLI tool for exploring the Dynamo components, and an example of how to use them from Rust. It is also available as `dynamo run` if using the Python wheel.

It supports the following engines: mistralrs, llamacpp, sglang, vllm and tensorrt-llm. `mistralrs` is the default.

Usage:
```
dynamo-run in=[http|text|dyn://<path>|batch:<folder>] out=echo_core|echo_full|mistralrs|llamacpp|sglang|vllm|dyn [--http-port 8080] [--model-path <path>] [--model-name <served-model-name>] [--model-config <hf-repo>] [--tensor-parallel-size=1] [--base-gpu-id=0] [--extra-engine-args=args.json] [--router-mode random|round-robin|kv]
```

Example: `dynamo run Qwen/Qwen3-0.6B`

Set the environment variable `DYN_LOG` to adjust the logging level, e.g. `export DYN_LOG=debug`. It uses the same syntax as `RUST_LOG` (comma-separated `target=level` directives).

## Quickstart with pip and vllm

If you used `pip` to install `dynamo`, you should have the `dynamo-run` binary pre-installed with the `vllm` engine. You must be in a virtual env with vllm installed to use this. To compile from source, see [Full usage details](#full-usage-details) below.

The vllm and sglang engines require [etcd](https://etcd.io/) and [nats](https://nats.io/) with jetstream (`nats-server -js`). Mistralrs and llamacpp do not.

### Use model from Hugging Face

This will automatically download Qwen3 4B from Hugging Face (16 GiB download) and start it in interactive text mode:
```
dynamo run out=vllm Qwen/Qwen3-4B
```

General format for HF download:
```
dynamo run out=<engine> <HUGGING_FACE_ORGANIZATION/MODEL_NAME>
```

For gated models (e.g. meta-llama/Llama-3.2-3B-Instruct) you must set the `HF_TOKEN` environment variable.

The parameter can be the ID of a HuggingFace repository (it will be downloaded), a GGUF file, or a folder containing safetensors, config.json, etc (a locally checked out HuggingFace repository).

### Run a model from local file

#### Step 1: Download model from Hugging Face
One of these models should be high quality and fast on almost any machine: https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF
E.g. https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/blob/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf

Download model file:
```
curl -L -o Llama-3.2-3B-Instruct-Q4_K_M.gguf "https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf?download=true"
```

#### Run model from local file
**Text interface**
```
dynamo run Llama-3.2-3B-Instruct-Q4_K_M.gguf # or path to a Hugging Face repo checkout instead of the GGUF
```

**HTTP interface**
```
dynamo run in=http out=mistralrs Llama-3.2-3B-Instruct-Q4_K_M.gguf
```

**List the models**
```
curl localhost:8080/v1/models
```

**Send a request**
```
curl -d '{"model": "Llama-3.2-3B-Instruct-Q4_K_M", "max_completion_tokens": 2049, "messages":[{"role":"user", "content": "What is the capital of South Africa?" }]}' -H 'Content-Type: application/json' http://localhost:8080/v1/chat/completions
```
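
The same request can be built from Python with only the standard library. This is a minimal sketch: the helper names `build_chat_request` and `send` are made up for illustration, and the response shape read in `send` assumes the server returns a standard OpenAI-style chat completion object.

```python
import json
import urllib.request


def build_chat_request(prompt, model="Llama-3.2-3B-Instruct-Q4_K_M",
                       base_url="http://localhost:8080",
                       max_completion_tokens=2049):
    """Build (but do not send) the same request as the curl command above."""
    payload = {
        "model": model,
        "max_completion_tokens": max_completion_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request; needs a running `dynamo run in=http ...` server."""
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    # Assumption: OpenAI-style response with choices[0].message.content.
    return reply["choices"][0]["message"]["content"]
```

With the HTTP server from the previous step running, `send(build_chat_request("What is the capital of South Africa?"))` should return the model's answer as a string.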

### Distributed System

You can run the ingress side (HTTP server and pre-processing) on one machine, for example a CPU node, and the worker on a different machine (a GPU node).

You will need [etcd](https://etcd.io/) and [nats](https://nats.io) with jetstream installed and accessible from both nodes.

**Node 1:**

OpenAI-compatible HTTP server, optional pre-processing, and worker discovery.

```
dynamo-run in=http out=dyn
```

**Node 2:**

vllm engine. Receives requests and returns responses over the network.

```
dynamo-run in=dyn://llama3B.backend.generate out=vllm ~/llms/Llama-3.2-3B-Instruct
```

This will use etcd to auto-discover the model and NATS to talk to it. You can
run multiple instances on the same endpoint and it will pick one based on the
`--router-mode` (round-robin by default if left unspecified).
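
To make the difference between the `round-robin` and `random` modes concrete, here is a small illustrative sketch. It is not Dynamo's router code: `make_picker` and the worker names are hypothetical, and the real router works on instances discovered through etcd.

```python
import itertools
import random


def make_picker(instance_ids, mode="round-robin", seed=None):
    """Return a zero-argument function that picks the next instance id."""
    if not instance_ids:
        raise ValueError("no instances registered")
    if mode == "round-robin":
        # Walk the instances in order, wrapping around at the end.
        cycle = itertools.cycle(instance_ids)
        return lambda: next(cycle)
    if mode == "random":
        # Pick uniformly at random; `seed` only makes runs repeatable.
        rng = random.Random(seed)
        return lambda: rng.choice(instance_ids)
    raise ValueError(f"unsupported mode: {mode}")


pick = make_picker(["worker-a", "worker-b", "worker-c"])
print([pick() for _ in range(4)])
# ['worker-a', 'worker-b', 'worker-c', 'worker-a']
```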

Run `dynamo-run --help` for more options.

### Network names

The `in=dyn://` URLs have the format `dyn://namespace.component.endpoint`. For a quickstart, use any string such as `dyn://test`; `dynamo-run` fills in defaults for any missing parts. The individual pieces matter in a larger system.

* *Namespace*: A pipeline. Usually a model, e.g. "llama_8b". Just a name.
* *Component*: A load balanced service needed to run that pipeline. "backend", "prefill", "decode", "preprocessor", "draft", etc. This typically has some configuration (which model to use, for example).
* *Endpoint*: Like a URL. "generate", "load_metrics".
* *Instance*: A process. Unique. Dynamo assigns each one a unique instance_id. The thing that is running is always an instance. Namespace/component/endpoint can refer to multiple instances.

If you run two models, that is two pipelines. The exception is speculative decoding, where the draft model is part of the pipeline of the bigger model.

If you run two instances of the same model ("data parallel") they are the same namespace+component+endpoint but different instances. The router will spread traffic over all the instances of a namespace+component+endpoint. If you have four prefill workers in a pipeline, they all have the same namespace+component+endpoint and are automatically assigned unique instance_ids.
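
The address format can be illustrated with a small parser. This is only a sketch of the `dyn://namespace.component.endpoint` shape: `parse_dyn_url` and `DynAddress` are hypothetical names, parts are assumed to fill in from left to right, and missing parts are left as `None` because the defaults `dynamo-run` applies are not specified here.

```python
from typing import NamedTuple, Optional


class DynAddress(NamedTuple):
    namespace: Optional[str]
    component: Optional[str]
    endpoint: Optional[str]


def parse_dyn_url(url: str) -> DynAddress:
    """Split dyn://namespace.component.endpoint into its parts."""
    prefix = "dyn://"
    if not url.startswith(prefix):
        raise ValueError(f"expected a dyn:// URL, got {url!r}")
    parts = url[len(prefix):].split(".")
    if len(parts) > 3 or any(part == "" for part in parts):
        raise ValueError(f"malformed dyn:// URL: {url!r}")
    # Assumption: trailing parts may be omitted; mark them as None.
    padded = parts + [None] * (3 - len(parts))
    return DynAddress(*padded)


print(parse_dyn_url("dyn://qwen3-32b.backend.generate"))
print(parse_dyn_url("dyn://test"))
```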

Example 1: Data parallel load balanced, one model one pipeline two instances.
```
Node 1: dynamo-run in=dyn://qwen3-32b.backend.generate out=sglang /data/Qwen3-32B --tensor-parallel-size 2 --base-gpu-id 0
Node 2: dynamo-run in=dyn://qwen3-32b.backend.generate out=sglang /data/Qwen3-32B --tensor-parallel-size 2 --base-gpu-id 2
```

Example 2: Two models, two pipelines.
```
Node 1: dynamo-run in=dyn://qwen3-32b.backend.generate out=vllm /data/Qwen3-32B
Node 2: dynamo-run in=dyn://llama3-1-8b.backend.generate out=vllm /data/Llama-3.1-8B-Instruct/
```

Example 3: Different endpoints.

The KV metrics publisher in vllm adds a `load_metrics` endpoint to the current component. If the `llama3-1-8b.backend` component above is using patched vllm, it will also expose `llama3-1-8b.backend.load_metrics`.

Example 4: Multiple components in a pipeline.

In the P/D disaggregated setup you would have `deepseek-distill-llama8b.prefill.generate` (possibly multiple instances of this) and `deepseek-distill-llama8b.decode.generate`.

For output, always use `out=dyn`. This tells Dynamo to auto-discover the instances, group them by model, and load balance appropriately (depending on the `--router-mode` flag). The old syntax of `dyn://...` is still accepted for backwards compatibility.

### KV-aware routing

**Setup**

Only patched vllm currently supports KV-aware routing. Key setup steps:

1. `etcd` and `nats` (see earlier) must be running and accessible from all nodes.
1. Create a virtualenv with `uv venv kvtest`, then source its `activate` script.
1. EITHER install Dynamo's vllm branch: `uv pip install ai-dynamo-vllm`,
1. OR install upstream vllm 0.8.4 (`uv pip install vllm==0.8.4`) and patch it: `cd kvtest/lib/python3.12/site-packages`, `patch -p1 < $REPO_ROOT/container/deps/vllm/vllm_v0.8.4-dynamo-kv-disagg-patch.patch`.
1. Build the C bindings. `cd $REPO_ROOT/lib/bindings/c`. `cargo build`.
1. Put the library you just built on library path: `export LD_LIBRARY_PATH=$REPO_ROOT/target/debug/`.

If you patched locally (instead of installing `ai-dynamo-vllm`) you will need to edit vllm's `platforms/__init__.py` to undo a patch change:
```
    #vllm_version = version("ai_dynamo_vllm")
    vllm_version = version("vllm")
```

**Start the workers**

The workers are started normally.

```
dynamo-run in=dyn://dynamo.endpoint.generate out=vllm /data/llms/Qwen/Qwen3-4B
```

**Start the ingress node**

```
dynamo-run in=http out=dyn --router-mode kv
```

The only difference from the distributed system above is `--router-mode kv`. The patched vllm will announce when a KV block is created or removed. The Dynamo router will find the worker with the best match for those KV blocks and direct the traffic to that node.
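
The idea can be sketched in a few lines. This is an illustration only, not Dynamo's router algorithm: the event names, the block identifiers, and the scoring rule (longest run of leading request blocks already cached on a worker) are all assumptions made for the example.

```python
def apply_event(worker_caches, worker, kind, block):
    """Track a worker's announced KV blocks ("created" / "removed")."""
    cache = worker_caches.setdefault(worker, set())
    if kind == "created":
        cache.add(block)
    elif kind == "removed":
        cache.discard(block)
    else:
        raise ValueError(f"unknown event kind: {kind}")


def prefix_match_len(request_blocks, cached_blocks):
    """Count how many leading request blocks the worker already holds."""
    matched = 0
    for block in request_blocks:
        if block not in cached_blocks:
            break
        matched += 1
    return matched


def pick_worker(request_blocks, worker_caches):
    """Choose the worker with the longest cached prefix (first wins ties)."""
    if not worker_caches:
        raise ValueError("no workers known")
    return max(worker_caches,
               key=lambda w: prefix_match_len(request_blocks, worker_caches[w]))


caches = {}
events = [("worker-a", "created", "b1"), ("worker-b", "created", "b1"),
          ("worker-b", "created", "b2"), ("worker-b", "created", "b3"),
          ("worker-b", "removed", "b3")]
for worker, kind, block in events:
    apply_event(caches, worker, kind, block)
print(pick_worker(["b1", "b2", "b4"], caches))  # worker-b
```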

For performance testing compare a typical workload with `--router-mode random|round-robin` to see if it will benefit from KV-aware routing.

## Full usage details

`dynamo-run` is what `dynamo run` executes. It is also an example of what you can build in Rust with the `dynamo-llm` and `dynamo-runtime` crates. The following guide demonstrates how you can build from source with all the features.

### Setup

#### Step 1: Install libraries
**Ubuntu:**
```
sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libssl-dev libclang-dev protobuf-compiler python3-dev cmake
```

**macOS:**
- [Homebrew](https://brew.sh/)
```
# if brew is not installed on your system, install it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```
- [Xcode](https://developer.apple.com/xcode/)

```
brew install cmake protobuf

# Check that Metal is accessible
xcrun -sdk macosx metal
```
If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly.

#### Step 2: Install Rust
```
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
```

#### Step 3: Build

- Linux with GPU and CUDA (tested on Ubuntu):
```
cargo build --features cuda
```

- macOS with Metal:
```
cargo build --features metal
```

- CPU only:
```
cargo build
```

Optionally you can run `cargo build` from any location with arguments:

```
--target-dir /path/to/target_directory # specify target_directory with write privileges
--manifest-path /path/to/project/Cargo.toml # if cargo build is run outside of `launch/` directory
```

The binary will be called `dynamo-run` in `target/debug`:
```
cd target/debug
```

Build with `--release` for a smaller binary and better performance, but longer build times. The binary will be in `target/release`.

### mistralrs

[mistral.rs](https://github.com/EricLBuehler/mistral.rs) is a pure Rust engine that is fast to run, fast to load, supports GGUF as well as safetensors, and runs well on CPU as well as GPU. For those reasons it is the default engine.

```
dynamo-run Qwen/Qwen3-4B
```

is equivalent to

```
dynamo-run in=text out=mistralrs Qwen/Qwen3-4B
```

If you have multiple GPUs, mistral.rs does automatic tensor parallelism. You do not need to pass any extra flags to dynamo-run to enable it.

### llamacpp

Currently [llama.cpp](https://github.com/ggml-org/llama.cpp) is not included by default. Build it like this:

```
cargo build --features llamacpp[,cuda|metal|vulkan] -p dynamo-run
```

```
dynamo-run out=llamacpp ~/llms/gemma-3-1b-it-q4_0.gguf
dynamo-run out=llamacpp ~/llms/Qwen3-0.6B-Q8_0.gguf # From https://huggingface.co/ggml-org
```

Note that in some cases we are unable to extract the tokenizer from the GGUF, and so a Hugging Face checkout of a matching model must also be passed. Dynamo will use the weights from the GGUF and the pre-processor (`tokenizer.json`, etc) from the `--model-config`:
```
dynamo-run out=llamacpp ~/llms/Llama-4-Scout-17B-16E-Instruct-UD-IQ1_S.gguf --model-config ~/llms/Llama-4-Scout-17B-16E-Instruct
```

If you have multiple GPUs, llama.cpp does automatic tensor parallelism. You do not need to pass any extra flags to dynamo-run to enable it.

### sglang

The [SGLang](https://docs.sglang.ai/index.html) engine requires [etcd](https://etcd.io/) and [nats](https://nats.io/) with jetstream (`nats-server -js`) to be running.

1. Set up the Python virtual env:

```
uv venv
source .venv/bin/activate
uv pip install pip
uv pip install sgl-kernel --force-reinstall --no-deps
uv pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
```

2. Run

Any example above using `out=sglang` will work, but our sglang backend is also multi-gpu.

```
cd target/debug
./dynamo-run in=http out=sglang --model-path ~/llms/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 8
```

To pass extra arguments to the sglang engine, see [Extra engine arguments](#extra-engine-arguments) below.

**Multi-GPU**

Pass `--tensor-parallel-size <NUM-GPUS>` to `dynamo-run`.

```
dynamo-run out=sglang ~/llms/Llama-4-Scout-17B-16E-Instruct/ --tensor-parallel-size 8
```

To specify which GPU to start from pass `--base-gpu-id <num>`, for example on a shared eight GPU machine where GPUs 0-3 are already in use:
```
dynamo-run out=sglang <model> --tensor-parallel-size 4 --base-gpu-id 4
```
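
As a quick sanity check on how the two flags combine, the sketch below lists which GPU indices a worker would occupy. It assumes the engine takes `--tensor-parallel-size` consecutive GPUs starting at `--base-gpu-id`; the helper `gpu_ids` is hypothetical and only mirrors the example above.

```python
def gpu_ids(tensor_parallel_size, base_gpu_id=0, gpus_on_machine=8):
    """List the GPU indices used, assuming consecutive allocation."""
    ids = list(range(base_gpu_id, base_gpu_id + tensor_parallel_size))
    if ids and ids[-1] >= gpus_on_machine:
        raise ValueError(
            f"GPUs {ids} do not fit on a machine with {gpus_on_machine} GPUs")
    return ids


print(gpu_ids(4, base_gpu_id=4))  # [4, 5, 6, 7]
```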

**Multi-node:**

Dynamo only manages the leader node (node rank 0). The follower nodes are started in the [normal sglang way](https://docs.sglang.ai/references/deepseek.html#running-examples-on-multi-node).

Leader node:
```
dynamo-run out=sglang /data/models/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 16 --node-rank 0 --num-nodes 2 --leader-addr 10.217.98.122:5000
```

All follower nodes. Increment `node-rank` each time:
```
python3 -m sglang.launch_server --model-path /data/models/DeepSeek-R1-Distill-Llama-70B --tp 16 --dist-init-addr 10.217.98.122:5000 --nnodes 2 --node-rank 1 --trust-remote-code
```

- Parameters `--leader-addr` and `--dist-init-addr` must match and be the IP address of the leader node. All followers must be able to connect. SGLang uses [PyTorch Distributed](https://docs.pytorch.org/tutorials/beginner/dist_overview.html) for networking.
- Parameters `--tensor-parallel-size` and `--tp` must match and be the total number of GPUs across the cluster.
- `--node-rank` must be unique consecutive integers starting at 1. The leader, managed by Dynamo, is 0.
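
The rules above can be checked by generating the follower commands programmatically. This is a hedged sketch: `follower_commands` is a hypothetical helper that only reuses the flags from the follower command shown earlier, and it assumes every node has the same number of GPUs.

```python
import shlex


def follower_commands(model_path, leader_addr, num_nodes, gpus_per_node):
    """Build one sglang launch command per follower (ranks 1..num_nodes-1)."""
    # Tensor parallel size is the total GPU count across the cluster.
    tp = num_nodes * gpus_per_node
    commands = []
    for rank in range(1, num_nodes):  # rank 0 is the Dynamo-managed leader
        args = [
            "python3", "-m", "sglang.launch_server",
            "--model-path", model_path,
            "--tp", str(tp),
            "--dist-init-addr", leader_addr,
            "--nnodes", str(num_nodes),
            "--node-rank", str(rank),
            "--trust-remote-code",
        ]
        commands.append(shlex.join(args))
    return commands


for command in follower_commands(
        "/data/models/DeepSeek-R1-Distill-Llama-70B",
        "10.217.98.122:5000", num_nodes=2, gpus_per_node=8):
    print(command)
```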

### vllm

Using the [vllm](https://github.com/vllm-project/vllm) Python library. Slow startup, fast inference. Supports both safetensors from HF and GGUF files, but is very slow for GGUF - prefer llamacpp.

The vllm engine requires [etcd](https://etcd.io/) and [nats](https://nats.io/) with jetstream (`nats-server -js`) to be running.

We use [uv](https://docs.astral.sh/uv/) but any virtualenv manager should work.

1. Setup:
```
uv venv
source .venv/bin/activate
uv pip install pip
uv pip install vllm==0.8.4 setuptools
```

**Note: If you're on Ubuntu 22.04 or earlier, you will need to add `--python=python3.10` to your `uv venv` command**

2. Build:
```
cargo build
cd target/debug
```

3. Run
Inside that virtualenv:

**HF repo:**
```
./dynamo-run in=http out=vllm ~/llms/Llama-3.2-3B-Instruct/
```

To pass extra arguments to the vllm engine, see [Extra engine arguments](#extra-engine-arguments) below.

**Multi-GPU**

Pass `--tensor-parallel-size <NUM-GPUS>` to `dynamo-run`.

To specify which GPUs to use set environment variable `CUDA_VISIBLE_DEVICES`.

**Multi-node:**

vllm uses [ray](https://docs.vllm.ai/en/latest/serving/distributed_serving.html#running-vllm-on-multiple-nodes) for pipeline parallel inference. Dynamo does not change or manage that.

Here is an example on two 8-GPU nodes:
- Leader node: `ray start --head --port=6379`
- Each follower node: `ray start --address='<HEAD_NODE_IP>:6379'`
- Leader node: `dynamo-run out=vllm ~/llms/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 16`

The `--tensor-parallel-size` parameter is the total number of GPUs in the cluster. This is often constrained by a model dimension such as being a divisor of the number of attention heads.
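
The divisor constraint is easy to check before launching. The sketch below lists the candidate sizes for a given head count; `valid_tp_sizes` is a hypothetical helper, the head counts are example values, and real engines may impose further constraints beyond this one.

```python
def valid_tp_sizes(num_attention_heads, total_gpus):
    """Tensor-parallel sizes (up to total_gpus) that divide the head count."""
    return [size for size in range(1, total_gpus + 1)
            if num_attention_heads % size == 0]


print(valid_tp_sizes(64, 16))  # [1, 2, 4, 8, 16]
print(valid_tp_sizes(40, 16))  # [1, 2, 4, 5, 8, 10]
```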

Startup can be slow so you may want to `export DYN_LOG=debug` to see progress.

Shutdown: `ray stop`

### TensorRT-LLM engine

To run a TRT-LLM model with dynamo-run we have included a Python-based [async engine](/examples/tensorrt_llm/engines/agg_engine.py).
To configure the TensorRT-LLM async engine please see [llm_api_config.yaml](/examples/tensorrt_llm/configs/llm_api_config.yaml). The file defines the options that need to be passed to the LLM engine. Follow the steps below to serve trtllm on dynamo run.

#### Step 1: Build the environment

See instructions [here](/examples/tensorrt_llm/README.md#build-docker) to build the dynamo container with TensorRT-LLM.

#### Step 2: Run the environment

See instructions [here](/examples/tensorrt_llm/README.md#run-container) to run the built environment.

#### Step 3: Execute `dynamo run` command

Execute the following to load the TensorRT-LLM model specified in the configuration.
```
dynamo run out=pystr:/workspace/examples/tensorrt_llm/engines/trtllm_engine.py  -- --engine_args /workspace/examples/tensorrt_llm/configs/llm_api_config.yaml
```

### Echo Engines

Dynamo includes two echo engines for testing and debugging purposes:

#### echo_core

The `echo_core` engine accepts pre-processed requests and echoes the tokens back as the response. This is useful for testing pre-processing functionality as the response will include the full prompt template.

```
dynamo-run in=http out=echo_core --model-path <hf-repo-checkout>
```

Note that to use it with `in=http` you need to tell the post processor to ignore stop tokens from the template by adding `nvext.ignore_eos` like this:
```
curl -N -d '{"nvext": {"ignore_eos": true}, "stream": true, "model": "Qwen2.5-3B-Instruct", "max_completion_tokens": 4096, "messages":[{"role":"user", "content": "Tell me a story" }]}' ...
```

The default `in=text` sets that for you.

#### echo_full

The `echo_full` engine accepts un-processed requests and echoes the prompt back as the response.

```
dynamo-run in=http out=echo_full --model-name my_model
```

#### Configuration

Both echo engines use a configurable delay between tokens to simulate generation speed. You can adjust this using the `DYN_TOKEN_ECHO_DELAY_MS` environment variable:

```
# Set token echo delay to 1ms (1000 tokens per second)
DYN_TOKEN_ECHO_DELAY_MS=1 dynamo-run in=http out=echo_full
```

The default delay is 10ms, which produces approximately 100 tokens per second.
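
The relationship between the delay and the simulated throughput is simple arithmetic. The helper below is hypothetical and gives an upper bound, since it ignores any other per-token overhead.

```python
def tokens_per_second(delay_ms):
    """Upper bound on echo throughput for a given per-token delay."""
    if delay_ms <= 0:
        raise ValueError("delay_ms must be positive")
    return 1000.0 / delay_ms


print(tokens_per_second(10))  # 100.0
print(tokens_per_second(1))   # 1000.0
```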

### Batch mode

`dynamo-run` can take a jsonl file full of prompts and evaluate them all:

```
dynamo-run in=batch:prompts.jsonl out=llamacpp <model>
```

The input file should look like this:
```
{"text": "What is the capital of France?"}
{"text": "What is the capital of Spain?"}
```

Each one is passed as a prompt to the model. The output is written back to the same folder in `output.jsonl`. At the end of the run some statistics are printed.
The output looks like this:
```
{"text":"What is the capital of France?","response":"The capital of France is Paris.","tokens_in":7,"tokens_out":7,"elapsed_ms":1566}
{"text":"What is the capital of Spain?","response":".The capital of Spain is Madrid.","tokens_in":7,"tokens_out":7,"elapsed_ms":855}
```
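
Both files are plain JSON Lines, so they are easy to produce and post-process from Python. The sketch below uses only the field names shown above; the helpers `to_jsonl` and `summarize` are hypothetical and are not part of `dynamo-run`.

```python
import json


def to_jsonl(prompts):
    """Render prompts in the input format: one {"text": ...} object per line."""
    return "".join(json.dumps({"text": prompt}) + "\n" for prompt in prompts)


def summarize(output_text):
    """Aggregate token counts and elapsed time from output.jsonl content."""
    rows = [json.loads(line) for line in output_text.splitlines() if line.strip()]
    return {
        "requests": len(rows),
        "tokens_in": sum(row["tokens_in"] for row in rows),
        "tokens_out": sum(row["tokens_out"] for row in rows),
        "elapsed_ms": sum(row["elapsed_ms"] for row in rows),
    }


sample_output = (
    '{"text":"What is the capital of France?","response":"The capital of France is Paris.",'
    '"tokens_in":7,"tokens_out":7,"elapsed_ms":1566}\n'
    '{"text":"What is the capital of Spain?","response":".The capital of Spain is Madrid.",'
    '"tokens_in":7,"tokens_out":7,"elapsed_ms":855}\n'
)
print(to_jsonl(["What is the capital of France?"]), end="")
print(summarize(sample_output))
```

To use it, write the result of `to_jsonl(...)` to `prompts.jsonl` and pass the contents of `output.jsonl` to `summarize` after the run.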

### Write your own engine in Python

Note: This section replaces "bring-your-own-engine".

The [dynamo](https://pypi.org/project/ai-dynamo/) Python library allows you to build your own engine and attach it to Dynamo.

The Python file must do three things:
1. Decorate a function to get the runtime
2. Register on the network
3. Attach a request handler

```
import asyncio

import uvloop

from dynamo.llm import ModelType, register_llm
from dynamo.runtime import DistributedRuntime, dynamo_worker

# 1. Decorate a function to get the runtime
#
@dynamo_worker(static=False)
async def worker(runtime: DistributedRuntime):

    # 2. Register ourselves on the network
    #
    component = runtime.namespace("namespace").component("component")
    await component.create_service()
    model_path = "Qwen/Qwen3-0.6B" # or "/data/models/Qwen3-0.6B"
    model_type = ModelType.Backend
    endpoint = component.endpoint("endpoint")
    # Optional last param to register_llm is model_name. If not present derives it from model_path
    await register_llm(model_type, endpoint, model_path)

    # Initialize your engine here
    engine = ...  # placeholder: replace with your engine object

    # 3. Attach request handler
    #
    await endpoint.serve_endpoint(RequestHandler(engine).generate)

class RequestHandler:

    def __init__(self, engine):
        ...

    async def generate(self, request):
        # Call the engine
        # yield result dict
        ...

if __name__ == "__main__":
    uvloop.install()
    asyncio.run(worker())
```


The `model_path` can be:
- A HuggingFace repo ID, optionally prefixed with `hf://`. It will be downloaded and cached locally.
- The path to a checkout of a HuggingFace repo: any folder containing safetensors files as well as `config.json`, `tokenizer.json` and `tokenizer_config.json`.
- The path to a GGUF file, if your engine supports that.

The `model_type` can be:
- ModelType.Backend. Dynamo handles pre-processing. Your `generate` method receives a `request` dict containing a `token_ids` array of int. It must return a dict also containing a `token_ids` array and an optional `finish_reason` string.
- ModelType.Chat. Your `generate` method receives a `request` and must return a response dict of type [OpenAI Chat Completion](https://platform.openai.com/docs/api-reference/chat). Your engine handles pre-processing.
- ModelType.Completion. Your `generate` method receives a `request` and must return a response dict of the older [Completions](https://platform.openai.com/docs/api-reference/completions). Your engine handles pre-processing.
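
To make the `ModelType.Backend` contract concrete, here is a self-contained sketch of a `generate` method that runs without Dynamo installed. `EchoHandler` and `collect` are hypothetical names; yielding one token per chunk and using `"stop"` as the `finish_reason` value are assumptions made for the example.

```python
import asyncio


class EchoHandler:
    """Toy Backend-style handler: echoes the prompt tokens back one by one."""

    async def generate(self, request):
        token_ids = request["token_ids"]
        for index, token in enumerate(token_ids):
            chunk = {"token_ids": [token]}
            if index == len(token_ids) - 1:
                # Assumption: mark the final chunk with a finish reason.
                chunk["finish_reason"] = "stop"
            yield chunk


async def collect(handler, request):
    """Drain the async generator, standing in for the Dynamo runtime."""
    return [chunk async for chunk in handler.generate(request)]


chunks = asyncio.run(collect(EchoHandler(), {"token_ids": [1, 2, 3]}))
print(chunks)
```

A real engine would replace the loop body with calls into the model and yield the generated token ids instead of the prompt's.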

Here are some example engines:

- Backend:
    * [vllm](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/hello_world/server_vllm.py)
    * [sglang](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/hello_world/server_sglang.py)
- Chat:
    * [sglang](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/hello_world/server_sglang_tok.py)

More fully-featured Backend engines (used by `dynamo-run`):
- [vllm](https://github.com/ai-dynamo/dynamo/blob/main/launch/dynamo-run/src/subprocess/vllm_inc.py)
- [sglang](https://github.com/ai-dynamo/dynamo/blob/main/launch/dynamo-run/src/subprocess/sglang_inc.py)


### Defaults

The input defaults to `in=text`. The output will default to `out=mistralrs` engine, unless it is disabled with `--no-default-features` in which case vllm is used.

### Extra engine arguments

The vllm and sglang backends support passing any argument the engine accepts.

Put the arguments in a JSON file:
```
{
    "dtype": "half",
    "trust_remote_code": true
}
```
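
If the arguments are assembled in a script, generating the file with a JSON library avoids quoting mistakes (for example Python's `True` versus JSON's `true`). A minimal sketch, where `render_engine_args` is a hypothetical helper:

```python
import json


def render_engine_args(**engine_args):
    """Serialize engine arguments as the JSON text dynamo-run expects."""
    return json.dumps(engine_args, indent=4)


text = render_engine_args(dtype="half", trust_remote_code=True)
print(text)
```

Write `text` to a file such as `sglang_extra.json` before passing it on the command line.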

Pass it like this:
```
dynamo-run out=sglang ~/llms/Llama-3.2-3B-Instruct --extra-engine-args sglang_extra.json
```