`dynamo-run` is a Rust binary that lets you easily run a model and explore the Dynamo components; it also demonstrates the Rust API. It supports the `mistral.rs` and `llama.cpp` engines. `mistralrs` is the default for safetensors, `llama.cpp` for GGUF files.
With the Dynamo CLI, you can chat with models quickly using `dynamo-run`.
It is primarily for development and rapid prototyping. For production use, we recommend the Python-wrapped components; see the main project README.
`dynamo-run` is a CLI tool for exploring the Dynamo components. It's also an example of how to use components from Rust. If you use the Python wheel, it's available as `dynamo-run`.
It supports these engines: mistralrs, llamacpp, sglang, vllm, and tensorrt-llm. `mistralrs` is the default.
dynamo-run in=text out=llamacpp -vv  # enables full trace logging
dynamo-run in=text out=llamacpp <model> -vv  # enables full trace logging
```
```
## Quickstart with pip and vllm
If you used `pip` to install `dynamo`, you have the `dynamo-run` binary pre-installed with the `vllm` engine. You must be in a virtual environment with vllm installed to use this engine. To compile from source, see [Full usage details](#full-usage-details) below.
The vllm and sglang engines require [etcd](https://etcd.io/) and [nats](https://nats.io/) with jetstream (`nats-server -js`). Mistralrs and llamacpp do not.
### Use model from Hugging Face
To automatically download Qwen3 4B from Hugging Face (16 GiB download) and to start it in interactive text mode:
```
```
dynamo-run out=vllm Qwen/Qwen3-4B
dynamo-run Qwen/Qwen3-4B
```
```
The general format for HF download follows this pattern:
curl -d '{"model": "Llama-3.2-3B-Instruct-Q4_K_M", "max_completion_tokens": 2049, "messages":[{"role":"user", "content": "What is the capital of South Africa?" }]}' -H 'Content-Type: application/json' http://localhost:8080/v1/chat/completions
```
```
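The same request can be built from Python. This is a minimal sketch using only the standard library; it assumes a `dynamo-run` HTTP server is listening at `localhost:8080` as in the curl command, and the helper name `build_chat_request` is hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt, model="Llama-3.2-3B-Instruct-Q4_K_M",
                       base_url="http://localhost:8080"):
    """Build (but do not send) the chat completion request from the curl example."""
    body = {
        "model": model,
        "max_completion_tokens": 2049,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("What is the capital of South Africa?")
# To send it while the server is running:
#     with urllib.request.urlopen(request) as response:
#         print(json.load(response))
```

Building the request separately from sending it makes the payload easy to inspect when a call fails.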
### Distributed System
## Distributed System
You can run the ingress side (HTTP server and pre-processing) on one machine, for example a CPU node, and the worker on a different machine (a GPU node).
You will need [etcd](https://etcd.io/) and [nats](https://nats.io) with jetstream installed and accessible from both nodes.
You will need [etcd](https://etcd.io/) and [nats](https://nats.io) with jetstream installed and accessible from both nodes. For development I run NATS like this: `nats-server -js --trace --store_dir $(mktemp -d)`.
The only difference from the distributed system above is `--router-mode kv`. The patched vllm announces when a KV block is created or removed. The Dynamo router finds the worker with the best match for those KV blocks and directs the traffic to that node.
The only difference from the distributed system above is `--router-mode kv`. vllm announces when a KV block is created or removed. The Dynamo router finds the worker with the best match for those KV blocks and directs the traffic to that node.
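To build intuition for what "best match" means, here is a toy sketch. It is not Dynamo's router: the block identifiers, the worker table, and the `pick_worker` function are all hypothetical, and the real router may weigh other factors such as load. The sketch only counts how many leading KV blocks of a request each worker already holds.

```python
def shared_prefix_len(request_blocks, cached_blocks):
    """Count the leading request blocks that are present in a worker's cache."""
    count = 0
    for block in request_blocks:
        if block not in cached_blocks:
            break
        count += 1
    return count


def pick_worker(request_blocks, workers):
    """Return the id of the worker whose cache covers the longest request prefix."""
    return max(workers, key=lambda w: shared_prefix_len(request_blocks, workers[w]))


# Hypothetical cache contents, keyed by worker id.
workers = {
    "worker-a": {"b1", "b2"},
    "worker-b": {"b1", "b2", "b3"},
    "worker-c": set(),
}
best = pick_worker(["b1", "b2", "b3", "b4"], workers)
print(best)
```

Here `worker-b` is chosen because it already holds the first three blocks of the request.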
For performance testing, compare a typical workload with `--router-mode random|round-robin` to see if it can benefit from KV-aware routing.
This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../architecture/request_migration.md) documentation for details on how this works.
## Full usage details
## Development
The `dynamo-run` is also an example of what can be built in Rust with the `dynamo-llm` and `dynamo-runtime` crates. The following guide shows how to build from source with all the features.
### Getting Started
#### Setup
`dynamo-run` is also an example of what can be built in Rust with the `dynamo-llm` and `dynamo-runtime` crates. The following guide shows how to build from source with all the features.
If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly.
##### Step 2: Install Rust
### Step 2: Install Rust
```
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
```
##### Step 3: Build
### Step 3: Build
- Linux with GPU and CUDA (tested on Ubuntu):
```
```
...
@@ -298,12 +242,11 @@ cd target/debug
Build with `--release` for a smaller binary and better performance, but longer build times. The binary will be in `target/release`.
## Engines
#### Defaults
The input defaults to `in=text`. The output defaults to the `out=mistralrs` engine, unless it is disabled with `--no-default-features`, in which case an engine that echoes back your input is used.
The input defaults to `in=text`. The output defaults to the `out=mistralrs` engine, unless it is disabled with `--no-default-features`, in which case vllm is used.
### Running Inference with Pre-built Engines
#### mistralrs
### mistralrs
[mistral.rs](https://github.com/EricLBuehler/mistral.rs) is a pure Rust engine that is fast to run, fast to load, supports GGUF as well as safetensors, and runs well on CPU as well as GPU. For those reasons it is the default engine.
...
@@ -317,9 +260,9 @@ is equivalent to
dynamo-run in=text out=mistralrs Qwen/Qwen3-4B
```
```
If you have multiple GPUs, mistral.rs does automatic tensor parallelism. You do not need to pass any extra flags to dynamo-run to enable it.
If you have multiple GPUs, `mistral.rs` does automatic tensor parallelism. You do not need to pass any extra flags to dynamo-run to enable it.
#### llamacpp
### llamacpp
[llama.cpp](https://github.com/ggml-org/llama.cpp) is built for CPU by default. For an optimized build pass the appropriate feature flag (highly recommended):
...
@@ -343,168 +286,41 @@ Note that in some cases we are unable to extract the tokenizer from the GGUF, an
If you have multiple GPUs, llama.cpp does automatic tensor parallelism. You do not need to pass any extra flags to dynamo-run to enable it.
If you have multiple GPUs, llama.cpp does automatic tensor parallelism. You do not need to pass any extra flags to `dynamo-run` to enable it.
#### sglang
The [SGLang](https://docs.sglang.ai/index.html) engine requires [etcd](https://etcd.io/) and [nats](https://nats.io/) with jetstream (`nats-server -js`) to be running.
Dynamo only manages the leader node (node rank 0). The follower nodes are started in the [normal sglang way](https://docs.sglang.ai/references/deepseek.html#running-examples-on-multi-node).
- Parameters `--leader-addr` and `--dist-init-addr` must match and be the IP address of the leader node. All followers must be able to connect. SGLang uses [PyTorch Distributed](https://docs.pytorch.org/tutorials/beginner/dist_overview.html) for networking.
- Parameters `--tensor-parallel-size` and `--tp` must match and be the total number of GPUs across the cluster.
- `--node-rank` values must be unique, consecutive integers starting at 1. The leader, managed by Dynamo, is 0.
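The constraints above can be checked before launch. The sketch below is illustrative only: `check_sglang_cluster` is a hypothetical helper, not part of Dynamo or SGLang, and the addresses and sizes are example values that you would otherwise pass on the command line.

```python
def check_sglang_cluster(leader_addr, dist_init_addr, tensor_parallel_size, tp,
                         follower_ranks):
    """Return a list of violations of the multi-node rules described above."""
    problems = []
    if leader_addr != dist_init_addr:
        problems.append("--leader-addr and --dist-init-addr must match")
    if tensor_parallel_size != tp:
        problems.append("--tensor-parallel-size and --tp must match")
    # Followers are numbered 1..N; rank 0 is the leader managed by Dynamo.
    expected = list(range(1, len(follower_ranks) + 1))
    if sorted(follower_ranks) != expected:
        problems.append("--node-rank values must be unique and consecutive, starting at 1")
    return problems


# One leader plus one follower, 16 GPUs in total: no violations.
ok = check_sglang_cluster("10.0.0.1", "10.0.0.1", 16, 16, [1])
# Mismatched addresses, mismatched sizes, and duplicate ranks.
bad = check_sglang_cluster("10.0.0.1", "10.0.0.2", 16, 8, [2, 2])
print(ok, bad)
```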
#### vllm
Using the [vllm](https://github.com/vllm-project/vllm) Python library. Slow startup, fast inference. Supports both safetensors from HF and GGUF files, but is very slow for GGUF - prefer llamacpp.
The vllm engine requires [etcd](https://etcd.io/) and [nats](https://nats.io/) with jetstream (`nats-server -js`) to be running.
We use [uv](https://docs.astral.sh/uv/) but any virtualenv manager should work.
1. Setup:
```
uv venv
source .venv/bin/activate
uv pip install pip
uv pip install vllm==0.8.4 setuptools
```
```{note}
If you're on Ubuntu 22.04 or earlier, you must add `--python=python3.10` to your `uv venv` command.
```
2. Build:
### Mocker engine
```
cargo build
cd target/debug
```
3. Run
The mocker engine is a mock vLLM implementation designed for testing and development purposes. It simulates realistic token generation timing without requiring actual model inference, making it useful for:
Inside that virtualenv:
**HF repo:**
- Testing distributed system components without GPU resources
```
- Benchmarking infrastructure and networking overhead
To pass extra arguments to the vllm engine see [Extra engine arguments](#extra-engine-arguments).
The `--model-path` is required but can point to any valid model path - the mocker doesn't actually load the model weights (but the pre-processor needs the tokenizer). The arguments `block_size`, `num_gpu_blocks`, `max_num_seqs`, `max_num_batched_tokens`, `enable_prefix_caching`, and `enable_chunked_prefill` are common arguments shared with the real VLLM engine.
vllm attempts to allocate enough KV cache for the full context length at startup. If that does not fit in your available memory pass `--context-length <value>`.
And below are arguments that are mocker-specific:
- `speedup_ratio`: Speed multiplier for token generation (default: 1.0). Higher values make the simulation engines run faster.
- `dp_size`: Number of data parallel workers to simulate (default: 1)
- `watermark`: KV cache watermark threshold as a fraction (default: 0.01). This argument also exists for the real VLLM engine but cannot be passed as an engine arg.
2025-06-28T00:32:32.507Z WARN dynamo_run::subprocess: File "/tmp/.tmpYeq5qA", line 29, in <module>
dynamo-run in=http out=auto --router-mode kv
2025-06-28T00:32:32.507Z WARN dynamo_run::subprocess: from dynamo.llm import ModelType, WorkerMetricsPublisher, register_llm
2025-06-28T00:32:32.507Z WARN dynamo_run::subprocess: ModuleNotFoundError: No module named 'dynamo'
```
Then run
```
uv pip install maturin
pip install patchelf
cd lib/bindings/python
maturin develop
```
```
This builds the Python-to-Rust bindings, which provide the missing `dynamo` module. Rerun `dynamo-run` and the problem should be resolved.
**Multi-GPU**
Pass `--tensor-parallel-size <NUM-GPUS>` to `dynamo-run`.
To specify which GPUs to use, set the environment variable `CUDA_VISIBLE_DEVICES`.
**Multinode:**
vllm uses [ray](https://docs.vllm.ai/en/latest/serving/distributed_serving.html#running-vllm-on-multiple-nodes) for pipeline parallel inference. Dynamo does not change or manage that.
Here is an example on two 8x nodes:
- Leader node: `ray start --head --port=6379`
- Each follower node: `ray start --address=<HEAD_NODE_IP>:6379`
The `--tensor-parallel-size` parameter is the total number of GPUs in the cluster. This is often constrained by a model dimension such as being a divisor of the number of attention heads.
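As an illustration of the divisor constraint, the hypothetical helper below lists the GPU counts that evenly divide a model's attention-head count. The head count used here (32) is an example value, not taken from any particular model, and real engines may impose further constraints.

```python
def valid_tensor_parallel_sizes(num_attention_heads, total_gpus):
    """List GPU counts up to total_gpus that divide the attention-head count."""
    return [n for n in range(1, total_gpus + 1) if num_attention_heads % n == 0]


# Two 8-GPU nodes give 16 GPUs in total.
sizes = valid_tensor_parallel_sizes(num_attention_heads=32, total_gpus=16)
print(sizes)  # [1, 2, 4, 8, 16]
```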
Startup can be slow so you may want to `export DYN_LOG=debug` to see progress.
Shutdown: `ray stop`
#### trtllm
### echo_full
Using [TensorRT-LLM's LLM API](https://nvidia.github.io/TensorRT-LLM/llm-api/), a high-level Python API.
The `echo_full` engine accepts un-processed requests and echoes the prompt back as the response.
You can use `--extra-engine-args` to pass extra arguments to the LLM API engine.
The trtllm engine requires [etcd](https://etcd.io/) and [nats](https://nats.io/) with jetstream (`nats-server -js`) to be running.
##### Step 1: Build the environment
See instructions [here](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md#build-docker) to build the dynamo container with TensorRT-LLM.
##### Step 2: Run the environment
See instructions [here](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md#run-container) to run the built environment.
##### Step 3: Execute `dynamo-run` command
Execute the following to load the TensorRT-LLM model specified in the configuration.
Dynamo includes two echo engines for testing and debugging purposes:
##### echo_core
The `echo_core` engine accepts pre-processed requests and echoes the tokens back as the response. This is useful for testing pre-processing functionality as the response includes the full prompt template.
Both echo engines use a configurable delay between tokens to simulate generation speed. You can adjust this using the `DYN_TOKEN_ECHO_DELAY_MS` environment variable:
The default delay is 10ms, which produces approximately 100 tokens per second.
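The relationship between delay and throughput is simple arithmetic: one token per delay interval. The function below is a small illustration of that calculation, not code from Dynamo; the name `tokens_per_second` is hypothetical.

```python
def tokens_per_second(delay_ms):
    """Convert a per-token delay in milliseconds to an approximate token rate."""
    if delay_ms <= 0:
        raise ValueError("delay_ms must be positive")
    return 1000.0 / delay_ms


print(tokens_per_second(10))  # 100.0, the default delay
print(tokens_per_second(1))   # 1000.0
```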
#### Batch mode
### Other engines, multi-node, production
`vllm`, `sglang` and `trtllm` production grade engines are available in `components/backends`. They run as Python components, using the Rust bindings. See the main README.
`dynamo-run` is an exploration, development and prototyping tool, as well as an example of using the Rust API. Multi-node and production setups should be using the main engine components.
## Batch mode
`dynamo-run` can take a jsonl file full of prompts and evaluate them all:
...
@@ -559,60 +373,9 @@ The output looks like this:
{"text":"What is the capital of Spain?","response":".The capital of Spain is Madrid.","tokens_in":7,"tokens_out":7,"elapsed_ms":855}
```
```
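Because each output line is a standalone JSON object, the results are easy to post-process. The following is a minimal sketch that assumes the field names shown in the sample line above; `summarize` is a hypothetical helper. It sums per-request elapsed times, so the rate only approximates throughput if requests ran concurrently.

```python
import json

# The sample output line from above, used as test data.
sample = (
    '{"text":"What is the capital of Spain?",'
    '"response":".The capital of Spain is Madrid.",'
    '"tokens_in":7,"tokens_out":7,"elapsed_ms":855}'
)


def summarize(lines):
    """Aggregate token counts and timing across batch output lines."""
    records = [json.loads(line) for line in lines if line.strip()]
    total_out = sum(r["tokens_out"] for r in records)
    total_ms = sum(r["elapsed_ms"] for r in records)
    rate = total_out / (total_ms / 1000) if total_ms else None
    return {"prompts": len(records), "tokens_out": total_out, "tokens_per_second": rate}


summary = summarize([sample])
print(summary)
```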
#### Mocker engine
## Writing your own engine in Python
The mocker engine is a mock vLLM implementation designed for testing and development purposes. It simulates realistic token generation timing without requiring actual model inference, making it useful for:
- Testing distributed system components without GPU resources
- Benchmarking infrastructure and networking overhead
- Developing and debugging Dynamo components
- Load testing and performance analysis
**Basic usage:**
The `--model-path` is required but can point to any valid model path - the mocker doesn't actually load the model weights (but the pre-processor needs the tokenizer). The arguments `block_size`, `num_gpu_blocks`, `max_num_seqs`, `max_num_batched_tokens`, `enable_prefix_caching`, and `enable_chunked_prefill` are common arguments shared with the real VLLM engine.
And below are arguments that are mocker-specific:
- `speedup_ratio`: Speed multiplier for token generation (default: 1.0). Higher values make the simulation engines run faster.
- `dp_size`: Number of data parallel workers to simulate (default: 1)
- `watermark`: KV cache watermark threshold as a fraction (default: 0.01). This argument also exists for the real VLLM engine but cannot be passed as an engine arg.
The [dynamo](https://pypi.org/project/ai-dynamo/) Python library allows you to build your own engine and attach it to Dynamo.
The [dynamo](https://pypi.org/project/ai-dynamo/) Python library allows you to build your own engine and attach it to Dynamo. All of the main backend components in `components/backends/` work like this.
The Python file must do three things:
1. Decorate a function to get the runtime
...
@@ -685,11 +448,9 @@ Here are some example engines: