# Dynamo Run

* [Quickstart with pip and vllm](#quickstart-with-pip-and-vllm)
    * [Automatically download a model from Hugging Face](#use-model-from-hugging-face)
    * [Run a model from local file](#run-a-model-from-local-file)
    * [Multi-node](#multi-node)
* [Compiling from Source](#compiling-from-source)
    * [Setup](#setup)
    * [Sglang](#sglang)
    * [llama.cpp](#llama_cpp)
    * [Vllm](#vllm)
    * [Python bring-your-own-engine](#python-bring-your-own-engine)
    * [TensorRT-LLM](#tensorrt-llm-engine)
    * [Echo Engines](#echo-engines)
* [Batch mode](#batch-mode)
* [Defaults](#defaults)
* [Extra engine arguments](#extra-engine-arguments)

`dynamo-run` is a CLI tool for exploring the Dynamo components, and an example of how to use them from Rust. It is also available as `dynamo run` if using the Python wheel.

## Quickstart with pip and vllm

If you used `pip` to install `dynamo`, the `dynamo-run` binary comes pre-installed with the `vllm` engine. You must be in a virtual env with vllm installed to use this. To compile from source, see [Compiling from Source](#compiling-from-source) below.

### Use model from Hugging Face

This will automatically download Qwen2.5 3B from Hugging Face (6 GiB download) and start it in interactive text mode:
```
dynamo run out=vllm Qwen/Qwen2.5-3B-Instruct
```

General format for HF download:
```
dynamo run out=<engine> <HUGGING_FACE_ORGANIZATION/MODEL_NAME>
```

For gated models (e.g. meta-llama/Llama-3.2-3B-Instruct) you must set the `HF_TOKEN` environment variable.
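
For example (the token value here is a placeholder):
```
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
dynamo run out=vllm meta-llama/Llama-3.2-3B-Instruct
```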

The parameter can be the ID of a Hugging Face repository (it will be downloaded), a GGUF file, or a folder containing safetensors, `config.json`, etc. (a locally checked-out Hugging Face repository).
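
All three forms, using models that appear elsewhere in this guide:
```
dynamo run out=vllm Qwen/Qwen2.5-3B-Instruct              # Hugging Face repo ID, downloaded automatically
dynamo run out=vllm Llama-3.2-3B-Instruct-Q4_K_M.gguf     # local GGUF file
dynamo run out=vllm ~/llm_models/Llama-3.2-3B-Instruct/   # local Hugging Face repo checkout
```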

### Run a model from local file

#### Step 1: Download model from Hugging Face
One of the models from https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF should be high quality and fast on almost any machine.
For example: https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/blob/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf

Download model file:
```
curl -L -o Llama-3.2-3B-Instruct-Q4_K_M.gguf "https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf?download=true"
```
#### Step 2: Run the model

**Text interface**
```
dynamo run out=vllm Llama-3.2-3B-Instruct-Q4_K_M.gguf # or path to a Hugging Face repo checkout instead of the GGUF
```

**HTTP interface**
```
dynamo run in=http out=vllm Llama-3.2-3B-Instruct-Q4_K_M.gguf
```

**List the models**
```
curl localhost:8080/v1/models
```

**Send a request**
```
curl -d '{"model": "Llama-3.2-3B-Instruct-Q4_K_M", "max_completion_tokens": 2049, "messages":[{"role":"user", "content": "What is the capital of South Africa?" }]}' -H 'Content-Type: application/json' http://localhost:8080/v1/chat/completions
```

### Multi-node

You will need [etcd](https://etcd.io/) and [NATS](https://nats.io) installed and accessible from both nodes.
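
One way to start both locally for testing (assuming the stock `etcd` and `nats-server` binaries are installed; nothing here is Dynamo-specific):
```
nats-server -js &
etcd &
```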

**Node 1:**
```
dynamo run in=http out=dyn://llama3B_pool
```

**Node 2:**
```
dynamo run in=dyn://llama3B_pool out=vllm ~/llm_models/Llama-3.2-3B-Instruct
```

This uses etcd to auto-discover the model and NATS to talk to it. You can run multiple workers on the same endpoint; one is picked at random for each request.

The `llama3B_pool` name is purely symbolic; pick anything as long as it matches the other node.
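
As in the single-node HTTP example, you can then list the models served through Node 1:
```
curl localhost:8080/v1/models
```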

Run `dynamo run --help` for more options.

## Compiling from Source

`dynamo-run` is what `dynamo run` executes. It is an example of what you can build in Rust with the `dynamo-llm` and `dynamo-runtime` crates. The following guide demonstrates how to build it from source with all features.

### Setup

#### Step 1: Install libraries
**Ubuntu:**
```
sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libssl-dev libclang-dev protobuf-compiler python3-dev cmake
```

**macOS:**
- [Homebrew](https://brew.sh/)
```
# if brew is not installed on your system, install it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```
- [Xcode](https://developer.apple.com/xcode/)

```
brew install cmake protobuf

# Check that Metal is accessible
xcrun -sdk macosx metal
```
If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly.

#### Step 2: Install Rust
```
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
```

#### Step 3: Build

Run `cargo build` to produce the `dynamo-run` binary in `target/debug`.

> **Optionally**, you can run `cargo build` from any location with arguments:
> ```
> --target-dir /path/to/target_directory        # specify a target directory with write privileges
> --manifest-path /path/to/project/Cargo.toml   # if cargo build is run outside of the launch/ directory
> ```

- Linux with GPU and CUDA (tested on Ubuntu):
```
cargo build --features cuda
```

- macOS with Metal:
```
cargo build --features metal
```

- CPU only:
```
cargo build
```

The binary will be called `dynamo-run` and will be in `target/debug`:
```
cd target/debug
```
> Note: Build with `--release` for a smaller binary and better performance, but longer build times. The binary will be in `target/release`.
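
For example, a release build with CUDA support (combining the flags shown above):
```
cargo build --release --features cuda
```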

To build for other engines, see the following sections.


### sglang

1. Set up the Python virtual env:

```
uv venv
source .venv/bin/activate
uv pip install pip
uv pip install sgl-kernel --force-reinstall --no-deps
uv pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
```

2. Build

```
cargo build --features sglang
```

3. Run

Any example above using `out=sglang` will work; our sglang backend is also multi-GPU and multi-node.

**Node 1:**
```
cd target/debug
./dynamo-run in=http out=sglang --model-path ~/llm_models/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 8 --num-nodes 2 --node-rank 0 --leader-addr 10.217.98.122:9876
```

**Node 2:**
```
cd target/debug
./dynamo-run in=none out=sglang --model-path ~/llm_models/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 8 --num-nodes 2 --node-rank 1 --leader-addr 10.217.98.122:9876
```

To pass extra arguments to the sglang engine, see [Extra engine arguments](#extra-engine-arguments) below.

### llama_cpp

```
cargo build --features llamacpp,cuda
cd target/debug
dynamo-run out=llamacpp ~/llm_models/Llama-3.2-3B-Instruct-Q6_K.gguf
```
If the build step also builds the llama_cpp libraries into the same folder as the binary (`libllama.so`, `libggml.so`, `libggml-base.so`, `libggml-cpu.so`, `libggml-cuda.so`), then `dynamo-run` will need to find them at runtime. Set `LD_LIBRARY_PATH`, and be sure to deploy the libraries alongside the `dynamo-run` binary.
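
A minimal sketch, assuming you run from the build directory where those libraries landed:
```
export LD_LIBRARY_PATH=$PWD:$LD_LIBRARY_PATH
./dynamo-run out=llamacpp ~/llm_models/Llama-3.2-3B-Instruct-Q6_K.gguf
```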

### vllm

Using the [vllm](https://github.com/vllm-project/vllm) Python library. We only use the back half of vllm, talking to it over `zmq`. Slow startup, fast inference. Supports both safetensors from HF and GGUF files.

We use [uv](https://docs.astral.sh/uv/) but any virtualenv manager should work.

1. Setup:
```
uv venv
source .venv/bin/activate
uv pip install pip
uv pip install vllm==0.8.4 setuptools
```

**Note: If you're on Ubuntu 22.04 or earlier, you will need to add `--python=python3.10` to your `uv venv` command**

2. Build:
```
cargo build
cd target/debug
```

3. Run (inside that virtualenv):

**HF repo:**
```
./dynamo-run in=http out=vllm ~/llm_models/Llama-3.2-3B-Instruct/
```

**GGUF:**
```
./dynamo-run in=http out=vllm ~/llm_models/Llama-3.2-3B-Instruct-Q6_K.gguf
```

**Multi-node:**

**Node 1:**
```
dynamo-run in=text out=vllm ~/llm_models/Llama-3.2-3B-Instruct/ --tensor-parallel-size 8 --num-nodes 2 --leader-addr 10.217.98.122:6539 --node-rank 0
```

**Node 2:**
```
dynamo-run in=none out=vllm ~/llm_models/Llama-3.2-3B-Instruct/ --num-nodes 2 --leader-addr 10.217.98.122:6539 --node-rank 1
```

To pass extra arguments to the vllm engine, see [Extra engine arguments](#extra-engine-arguments) below.

### Python bring-your-own-engine

You can provide your own engine in a Python file. The file must provide a generator with this signature:
```
async def generate(request):
```

Build: `cargo build --features python`

#### Python does the pre-processing

If the Python engine receives and returns strings (it does the prompt templating and tokenization itself), run it like this:

```
dynamo-run out=pystr:/home/user/my_python_engine.py
```

- The `request` parameter is a map, an OpenAI-compatible create chat completion request: https://platform.openai.com/docs/api-reference/chat/create
- The function must `yield` a series of maps conforming to create chat completion stream response (example below).
- If using an HTTP front-end, add the `--model-name` flag; this is the name the model is served under (example below).
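
For example, serving the string engine over HTTP (the path and model name are illustrative):
```
dynamo-run in=http out=pystr:/home/user/my_python_engine.py --model-name my_model
```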

The file is loaded once at startup and kept in memory.

**Example engine:**
```
import asyncio

async def generate(request):
    yield {"id":"1","choices":[{"index":0,"delta":{"content":"The","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-3B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" capital","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-3B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" of","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-3B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" France","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-3B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" is","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-3B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" Paris","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-3B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":".","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-3B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":"","role":"assistant"},"finish_reason":"stop"}],"created":1841762283,"model":"Llama-3.2-3B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
```

Command line arguments are passed to the Python engine like this:
```
dynamo-run out=pystr:my_python_engine.py -- -n 42 --custom-arg Orange --yes
```

The Python engine receives the arguments in `sys.argv`. The argument list will include some standard ones as well as anything after the `--`.

This input:
```
dynamo-run out=pystr:my_engine.py /opt/models/Llama-3.2-3B-Instruct/ --model-name llama_3.2 --tensor-parallel-size 4 -- -n 1
```

is read like this:
```
import sys

async def generate(request):
    .. as before ..

if __name__ == "__main__":
    print(f"MAIN: {sys.argv}")
```

and produces this output:
```
MAIN: ['my_engine.py', '--model-path', '/opt/models/Llama-3.2-3B-Instruct/', '--model-name', 'llama_3.2', '--http-port', '8080', '--tensor-parallel-size', '4', '--base-gpu-id', '0', '--num-nodes', '1', '--node-rank', '0', '-n', '1']
```

This allows quick iteration on the engine setup. Note how the `-n 1` is included. The flags `--leader-addr` and `--model-config` will also be added if they were provided to `dynamo-run`.

#### TensorRT-LLM engine

To run a TRT-LLM model with dynamo-run we have included a Python-based [async engine](/examples/tensorrt_llm/engines/agg_engine.py).
To configure the TensorRT-LLM async engine, see [llm_api_config.yaml](/examples/tensorrt_llm/configs/llm_api_config.yaml); the file defines the options that are passed to the LLM engine. Follow the steps below to serve a TensorRT-LLM model with `dynamo run`.

##### Step 1: Build the environment

See instructions [here](/examples/tensorrt_llm/README.md#build-docker) to build the dynamo container with TensorRT-LLM.

##### Step 2: Run the environment

See instructions [here](/examples/tensorrt_llm/README.md#run-container) to run the built environment.

##### Step 3: Execute `dynamo run` command

Execute the following to load the TensorRT-LLM model specified in the configuration.
```
dynamo run out=pystr:/workspace/examples/tensorrt_llm/engines/trtllm_engine.py -- --engine_args /workspace/examples/tensorrt_llm/configs/llm_api_config.yaml
```

#### Dynamo does the pre-processing

If the Python engine receives and returns tokens (the prompt templating and tokenization are already done), run it like this:
```
dynamo-run out=pytok:/home/user/my_python_engine.py --model-path <hf-repo-checkout>
```

- The `request` parameter is a map that looks like this:
```
{'token_ids': [128000, 128006, 9125, 128007, ... lots more ... ], 'stop_conditions': {'max_tokens': 8192, 'stop': None, 'stop_token_ids_hidden': [128001, 128008, 128009], 'min_tokens': None, 'ignore_eos': None}, 'sampling_options': {'n': None, 'best_of': None, 'presence_penalty': None, 'frequency_penalty': None, 'repetition_penalty': None, 'temperature': None, 'top_p': None, 'top_k': None, 'min_p': None, 'use_beam_search': None, 'length_penalty': None, 'seed': None}, 'eos_token_ids': [128001, 128008, 128009], 'mdc_sum': 'f1cd44546fdcbd664189863b7daece0f139a962b89778469e4cffc9be58ccc88', 'annotations': []}
```

- The `generate` function must `yield` a series of maps that look like this:
```
{"token_ids":[791],"tokens":None,"text":None,"cum_log_probs":None,"log_probs":None,"finish_reason":None}
```

- The command line flag `--model-path` must point to a Hugging Face repo checkout containing the `tokenizer.json`. The `--model-name` flag is optional; if not provided we use the HF repo name (directory name) as the model name.

**Example engine:**
```
import asyncio

async def generate(request):
    yield {"token_ids":[791]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[6864]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[315]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[9822]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[374]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[12366]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[13]}
```

`pytok` supports the same ways of passing command line arguments as `pystr`: `initialize` or `main` with `sys.argv`.

### Echo Engines

Dynamo includes two echo engines for testing and debugging purposes:

#### echo_core

The `echo_core` engine accepts pre-processed requests and echoes the tokens back as the response. This is useful for testing pre-processing functionality as the response will include the full prompt template.

```
dynamo-run in=http out=echo_core --model-path <hf-repo-checkout>
```

#### echo_full

The `echo_full` engine accepts un-processed requests and echoes the prompt back as the response.

```
dynamo-run in=http out=echo_full --model-name my_model
```

#### Configuration

Both echo engines use a configurable delay between tokens to simulate generation speed. You can adjust this using the `DYN_TOKEN_ECHO_DELAY_MS` environment variable:

```
# Set token echo delay to 1ms (1000 tokens per second)
DYN_TOKEN_ECHO_DELAY_MS=1 dynamo-run in=http out=echo_full
```

The default delay is 10ms, which produces approximately 100 tokens per second.

## Batch mode

`dynamo-run` can take a JSONL file full of prompts and evaluate them all:

```
dynamo-run in=batch:prompts.jsonl out=llamacpp <model>
```

The input file should look like this:
```
{"text": "What is the capital of France?"}
{"text": "What is the capital of Spain?"}
```

Each one is passed as a prompt to the model. The output is written back to the same folder in `output.jsonl`. At the end of the run some statistics are printed.
The output looks like this:
```
{"text":"What is the capital of France?","response":"The capital of France is Paris.","tokens_in":7,"tokens_out":7,"elapsed_ms":1566}
{"text":"What is the capital of Spain?","response":".The capital of Spain is Madrid.","tokens_in":7,"tokens_out":7,"elapsed_ms":855}
```

## Defaults

The input defaults to `in=text`. The output defaults to the `mistralrs` engine; if that is not available, it falls back to whichever engine you have compiled in (depending on `--features`).
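
So, assuming the default `mistralrs` engine was compiled in, these two commands should be equivalent:
```
dynamo run Qwen/Qwen2.5-3B-Instruct
dynamo run in=text out=mistralrs Qwen/Qwen2.5-3B-Instruct
```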

## Extra engine arguments

The vllm and sglang backends support passing any argument the engine accepts.

Put the arguments in a JSON file:
```
{
    "dtype": "half",
    "trust_remote_code": true
}
```

Pass it like this:
```
dynamo-run out=sglang ~/llm_models/Llama-3.2-3B-Instruct --extra-engine-args sglang_extra.json
```
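
The same flag works with the vllm backend, for example (the JSON file name is illustrative):
```
dynamo-run out=vllm ~/llm_models/Llama-3.2-3B-Instruct --extra-engine-args vllm_extra.json
```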