Commit 16124e74 authored by cdgamarose-nv, committed by GitHub

docs: Updated dynamo run instructions (#555)



#### Overview:

Updated the dynamo run doc `docs/guides/dynamo_run.md`

#### Details:

- Updated the instructions to make it clear which binary to use for built backends
- Reformatted the doc to make it more readable
- Added missing cmake library for ubuntu

Signed-off-by: Chantal D Gama Rose <cdgamarose@nvidia.com>
parent 5045ada4
# Dynamo Run # Dynamo Run
* [Quickstart with pip and vllm](#quickstart-with-pip-and-vllm)
* [Use model from Hugging Face](#use-model-from-hugging-face)
* [Run a model from local file](#run-a-model-from-local-file)
* [Multi-node](#multi-node)
* [Compiling from Source](#compiling-from-source)
* [Setup](#setup)
* [sglang](#sglang)
* [llama_cpp](#llama_cpp)
* [vllm](#vllm)
* [Python bring-your-own-engine](#python-bring-your-own-engine)
* [trtllm](#trtllm)
* [Echo Engines](#echo-engines)
* [Batch mode](#batch-mode)
* [Defaults](#defaults)
* [Extra engine arguments](#extra-engine-arguments)
`dynamo-run` is a CLI tool for exploring the Dynamo components, and an example of how to use them from Rust. It is also available as `dynamo run` if using the Python wheel. `dynamo-run` is a CLI tool for exploring the Dynamo components, and an example of how to use them from Rust. It is also available as `dynamo run` if using the Python wheel.
## Quickstart with pip and vllm ## Quickstart with pip and vllm
If you used `pip` to install `dynamo` you should have the `dynamo-run` binary pre-installed with the `vllm` engine. You must be in a virtual env with vllm installed to use this. For more options see "Full documentation" below. If you used `pip` to install `dynamo` you should have the `dynamo-run` binary pre-installed with the `vllm` engine. You must be in a virtual env with vllm installed to use this. To compile from source, see "Compiling from Source" below.
### Automatically download a model from [Hugging Face](https://huggingface.co/models) ### Use model from Hugging Face
This will automatically download Qwen2.5 3B from Hugging Face (6 GiB download) and start it in interactive text mode: This will automatically download Qwen2.5 3B from Hugging Face (6 GiB download) and start it in interactive text mode:
``` ```
...@@ -22,8 +38,9 @@ For gated models (e.g. meta-llama/Llama-3.2-3B-Instruct) you have to have an `HF ...@@ -22,8 +38,9 @@ For gated models (e.g. meta-llama/Llama-3.2-3B-Instruct) you have to have an `HF
The parameter can be the ID of a HuggingFace repository (it will be downloaded), a GGUF file, or a folder containing safetensors, config.json, etc (a locally checked out HuggingFace repository). The parameter can be the ID of a HuggingFace repository (it will be downloaded), a GGUF file, or a folder containing safetensors, config.json, etc (a locally checked out HuggingFace repository).
## Manually download a model from Hugging Face ### Run a model from local file
#### Step 1: Download model from Hugging Face
One of these models should be high quality and fast on almost any machine: https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF One of these models should be high quality and fast on almost any machine: https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF
E.g. https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/blob/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf E.g. https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/blob/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf
...@@ -31,39 +48,37 @@ Download model file: ...@@ -31,39 +48,37 @@ Download model file:
``` ```
curl -L -o Llama-3.2-3B-Instruct-Q4_K_M.gguf "https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf?download=true" curl -L -o Llama-3.2-3B-Instruct-Q4_K_M.gguf "https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf?download=true"
``` ```
#### Run model from local file
## Run a model from local file **Text interface**
*Text interface*
``` ```
dynamo run out=vllm Llama-3.2-3B-Instruct-Q4_K_M.gguf # or path to a Hugging Face repo checkout instead of the GGUF dynamo run out=vllm Llama-3.2-3B-Instruct-Q4_K_M.gguf # or path to a Hugging Face repo checkout instead of the GGUF
``` ```
*HTTP interface* **HTTP interface**
``` ```
dynamo run in=http out=vllm Llama-3.2-3B-Instruct-Q4_K_M.gguf dynamo run in=http out=vllm Llama-3.2-3B-Instruct-Q4_K_M.gguf
``` ```
*List the models* **List the models**
``` ```
curl localhost:8080/v1/models curl localhost:8080/v1/models
``` ```
*Send a request* **Send a request**
``` ```
curl -d '{"model": "Llama-3.2-3B-Instruct-Q4_K_M", "max_completion_tokens": 2049, "messages":[{"role":"user", "content": "What is the capital of South Africa?" }]}' -H 'Content-Type: application/json' http://localhost:8080/v1/chat/completions curl -d '{"model": "Llama-3.2-3B-Instruct-Q4_K_M", "max_completion_tokens": 2049, "messages":[{"role":"user", "content": "What is the capital of South Africa?" }]}' -H 'Content-Type: application/json' http://localhost:8080/v1/chat/completions
``` ```
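The same request can also be sent from Python. The following is a minimal sketch using only the standard library. It assumes the HTTP interface started above is listening on `localhost:8080`; the helper names `build_chat_request` and `send` are illustrative and not part of Dynamo.

```python
import json
import urllib.request


def build_chat_request(model, prompt, max_completion_tokens=2049,
                       base_url="http://localhost:8080"):
    """Build a POST request mirroring the curl example (hypothetical helper)."""
    payload = {
        "model": model,
        "max_completion_tokens": max_completion_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `send(build_chat_request("Llama-3.2-3B-Instruct-Q4_K_M", "What is the capital of South Africa?"))` should return the parsed response.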
*Multi-node* ### Multi-node
You will need [etcd](https://etcd.io/) and [nats](https://nats.io) installed and accessible from both nodes. You will need [etcd](https://etcd.io/) and [nats](https://nats.io) installed and accessible from both nodes.
Node 1: **Node 1:**
``` ```
dynamo run in=http out=dyn://llama3B_pool dynamo run in=http out=dyn://llama3B_pool
``` ```
Node 2: **Node 2:**
``` ```
dynamo run in=dyn://llama3B_pool out=vllm ~/llm_models/Llama-3.2-3B-Instruct dynamo run in=dyn://llama3B_pool out=vllm ~/llm_models/Llama-3.2-3B-Instruct
``` ```
...@@ -74,18 +89,19 @@ The `llama3B_pool` name is purely symbolic, pick anything as long as it matches ...@@ -74,18 +89,19 @@ The `llama3B_pool` name is purely symbolic, pick anything as long as it matches
Run `dynamo run --help` for more options. Run `dynamo run --help` for more options.
# Full documentation ## Compiling from Source
`dynamo-run` is what `dynamo run` executes. It is an example of what you can build in Rust with the `dynamo-llm` and `dynamo-runtime`. Here is a list of how to build from source and all the features. `dynamo-run` is what `dynamo run` executes. It is an example of what you can build in Rust with the `dynamo-llm` and `dynamo-runtime`. The following guide demonstrates how you can build from source with all the features.
## Setup ### Setup
Libraries Ubuntu: #### Step 1: Install libraries
**Ubuntu:**
``` ```
apt install -y build-essential libhwloc-dev libudev-dev pkg-config libssl-dev libclang-dev protobuf-compiler python3-dev sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libssl-dev libclang-dev protobuf-compiler python3-dev cmake
``` ```
Libraries macOS: **macOS:**
- [Homebrew](https://brew.sh/) - [Homebrew](https://brew.sh/)
``` ```
# if brew is not installed on your system, install it # if brew is not installed on your system, install it
...@@ -101,23 +117,22 @@ xcrun -sdk macosx metal ...@@ -101,23 +117,22 @@ xcrun -sdk macosx metal
``` ```
If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly. If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly.
Install Rust: #### Step 2: Install Rust
``` ```
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env source $HOME/.cargo/env
``` ```
## Build #### Step 3: Build
Run `cargo build` to install the `dynamo-run` binary in `target/debug`.
> **Optionally**, you can run `cargo build` from any location with these arguments:
> ```
> --target-dir /path/to/target_directory       # a target directory with write privileges
> --manifest-path /path/to/project/Cargo.toml  # needed if cargo build is run outside the launch/ directory
> ```
Navigate to launch/ directory
```
cd launch/
```
Optionally can run `cargo build` from any location with arguments:
```
--target-dir /path/to/target_directory` specify target_directory with write privileges
--manifest-path /path/to/project/Cargo.toml` if cargo build is run outside of `launch/` directory
```
- Linux with GPU and CUDA (tested on Ubuntu): - Linux with GPU and CUDA (tested on Ubuntu):
``` ```
...@@ -138,10 +153,12 @@ The binary will be called `dynamo-run` in `target/debug` ...@@ -138,10 +153,12 @@ The binary will be called `dynamo-run` in `target/debug`
``` ```
cd target/debug cd target/debug
``` ```
> Note: Build with `--release` for a smaller binary and better performance, but longer build times. The binary will be in `target/release`.
Build with `--release` for a smaller binary and better performance, but longer build times. The binary will be in `target/release`. To build for other engines, see the following sections.
## sglang
### sglang
1. Set up the Python virtual env:
...@@ -163,33 +180,36 @@ cargo build --features sglang ...@@ -163,33 +180,36 @@ cargo build --features sglang
Any example above using `out=sglang` will work, but our sglang backend is also multi-gpu and multi-node. Any example above using `out=sglang` will work, but our sglang backend is also multi-gpu and multi-node.
Node 1: **Node 1:**
``` ```
dynamo-run in=http out=sglang --model-path ~/llm_models/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 8 --num-nodes 2 --node-rank 0 --leader-addr 10.217.98.122:9876 cd target/debug
./dynamo-run in=http out=sglang --model-path ~/llm_models/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 8 --num-nodes 2 --node-rank 0 --leader-addr 10.217.98.122:9876
``` ```
Node 2: **Node 2:**
``` ```
dynamo-run in=none out=sglang --model-path ~/llm_models/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 8 --num-nodes 2 --node-rank 1 --leader-addr 10.217.98.122:9876 cd target/debug
./dynamo-run in=none out=sglang --model-path ~/llm_models/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 8 --num-nodes 2 --node-rank 1 --leader-addr 10.217.98.122:9876
``` ```
To pass extra arguments to the sglang engine see *Extra engine arguments* below. To pass extra arguments to the sglang engine see *Extra engine arguments* below.
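The two sglang commands above differ only in the `in=` value and `--node-rank`. As a rough sketch of that pattern, the helper below generates one command line per node. It uses only the flags shown above; the function name is hypothetical, and extending the pattern beyond two nodes (rank 0 serves HTTP, every other rank uses `in=none`) is an assumption.

```python
def sglang_node_commands(model_path, leader_addr, num_nodes, tensor_parallel_size):
    """Return one dynamo-run command line per node (hypothetical helper)."""
    commands = []
    for rank in range(num_nodes):
        # Only the first node (rank 0) serves HTTP; the others take no input.
        frontend = "http" if rank == 0 else "none"
        commands.append(
            f"./dynamo-run in={frontend} out=sglang "
            f"--model-path {model_path} "
            f"--tensor-parallel-size {tensor_parallel_size} "
            f"--num-nodes {num_nodes} "
            f"--node-rank {rank} "
            f"--leader-addr {leader_addr}"
        )
    return commands


for command in sglang_node_commands(
    "~/llm_models/DeepSeek-R1-Distill-Llama-70B/", "10.217.98.122:9876", 2, 8
):
    print(command)
```

Every node receives the same `--leader-addr`, which is what lets the nodes find each other.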
## llama_cpp ### llama_cpp
- `cargo build --features llamacpp,cuda`
- `dynamo-run out=llama_cp ~/llm_models/Llama-3.2-3B-Instruct-Q6_K.gguf`
```
cargo build --features llamacpp,cuda
cd target/debug
./dynamo-run out=llamacpp ~/llm_models/Llama-3.2-3B-Instruct-Q6_K.gguf
```
If the build step also builds llama_cpp libraries into the same folder as the binary ("libllama.so", "libggml.so", "libggml-base.so", "libggml-cpu.so", "libggml-cuda.so"), then `dynamo-run` will need to find those at runtime. Set `LD_LIBRARY_PATH`, and be sure to deploy them alongside the `dynamo-run` binary. If the build step also builds llama_cpp libraries into the same folder as the binary ("libllama.so", "libggml.so", "libggml-base.so", "libggml-cpu.so", "libggml-cuda.so"), then `dynamo-run` will need to find those at runtime. Set `LD_LIBRARY_PATH`, and be sure to deploy them alongside the `dynamo-run` binary.
## vllm ### vllm
Using the [vllm](https://github.com/vllm-project/vllm) Python library. We only use the back half of vllm, talking to it over `zmq`. Slow startup, fast inference. Supports both safetensors from HF and GGUF files. Using the [vllm](https://github.com/vllm-project/vllm) Python library. We only use the back half of vllm, talking to it over `zmq`. Slow startup, fast inference. Supports both safetensors from HF and GGUF files.
We use [uv](https://docs.astral.sh/uv/) but any virtualenv manager should work. We use [uv](https://docs.astral.sh/uv/) but any virtualenv manager should work.
Setup: 1. Setup:
``` ```
uv venv uv venv
source .venv/bin/activate source .venv/bin/activate
...@@ -199,37 +219,40 @@ uv pip install vllm==0.7.3 setuptools ...@@ -199,37 +219,40 @@ uv pip install vllm==0.7.3 setuptools
**Note: If you're on Ubuntu 22.04 or earlier, you will need to add `--python=python3.10` to your `uv venv` command** **Note: If you're on Ubuntu 22.04 or earlier, you will need to add `--python=python3.10` to your `uv venv` command**
Build: 2. Build:
``` ```
cargo build --features vllm cargo build --features vllm
cd target/debug
``` ```
Run (still inside that virtualenv) - HF repo: 3. Run
Inside that virtualenv:
**HF repo:**
``` ```
./dynamo-run in=http out=vllm --model-path ~/llm_models/Llama-3.2-3B-Instruct/ ./dynamo-run in=http out=vllm --model-path ~/llm_models/Llama-3.2-3B-Instruct/
``` ```
Run (still inside that virtualenv) - GGUF: **GGUF:**
``` ```
./dynamo-run in=http out=vllm ~/llm_models/Llama-3.2-3B-Instruct-Q6_K.gguf ./dynamo-run in=http out=vllm ~/llm_models/Llama-3.2-3B-Instruct-Q6_K.gguf
``` ```
+ Multi-node: **Multi-node:**
**Node 1:**
Node 1:
``` ```
dynamo-run in=text out=vllm ~/llm_models/Llama-3.2-3B-Instruct/ --tensor-parallel-size 8 --num-nodes 2 --leader-addr 10.217.98.122:6539 --node-rank 0 dynamo-run in=text out=vllm ~/llm_models/Llama-3.2-3B-Instruct/ --tensor-parallel-size 8 --num-nodes 2 --leader-addr 10.217.98.122:6539 --node-rank 0
``` ```
Node 2: **Node 2:**
``` ```
dynamo-run in=none out=vllm ~/llm_models/Llama-3.2-3B-Instruct/ --num-nodes 2 --leader-addr 10.217.98.122:6539 --node-rank 1 dynamo-run in=none out=vllm ~/llm_models/Llama-3.2-3B-Instruct/ --num-nodes 2 --leader-addr 10.217.98.122:6539 --node-rank 1
``` ```
To pass extra arguments to the vllm engine see *Extra engine arguments* below. To pass extra arguments to the vllm engine see [Extra engine arguments](#extra-engine-arguments) below.
## Python bring-your-own-engine ### Python bring-your-own-engine
You can provide your own engine in a Python file. The file must provide a generator with this signature: You can provide your own engine in a Python file. The file must provide a generator with this signature:
``` ```
...@@ -238,7 +261,7 @@ async def generate(request): ...@@ -238,7 +261,7 @@ async def generate(request):
Build: `cargo build --features python` Build: `cargo build --features python`
### Python does the pre-processing #### Python does the pre-processing
If the Python engine wants to receive and return strings (it will do the prompt templating and tokenization itself), run it like this:
...@@ -252,7 +275,7 @@ dynamo-run out=pystr:/home/user/my_python_engine.py ...@@ -252,7 +275,7 @@ dynamo-run out=pystr:/home/user/my_python_engine.py
The file is loaded once at startup and kept in memory. The file is loaded once at startup and kept in memory.
Example engine: **Example engine:**
``` ```
import asyncio import asyncio
...@@ -302,7 +325,7 @@ MAIN: ['my_engine.py', '--model-path', '/opt/models/Llama-3.2-3B-Instruct/', '-- ...@@ -302,7 +325,7 @@ MAIN: ['my_engine.py', '--model-path', '/opt/models/Llama-3.2-3B-Instruct/', '--
This allows quick iteration on the engine setup. Note how the `-n` `1` is included. Flags `--leader-addr` and `--model-config` will also be added if provided to `dynamo-run`. This allows quick iteration on the engine setup. Note how the `-n` `1` is included. Flags `--leader-addr` and `--model-config` will also be added if provided to `dynamo-run`.
### Dynamo does the pre-processing #### Dynamo does the pre-processing
If the Python engine wants to receive and return tokens (the prompt templating and tokenization are already done), run it like this:
``` ```
...@@ -321,7 +344,7 @@ dynamo-run out=pytok:/home/user/my_python_engine.py --model-path <hf-repo-checko ...@@ -321,7 +344,7 @@ dynamo-run out=pytok:/home/user/my_python_engine.py --model-path <hf-repo-checko
- Command line flag `--model-path`, which must point to a Hugging Face repo checkout containing the `tokenizer.json`. The `--model-name` flag is optional. If not provided, the HF repo name (directory name) is used as the model name.
Example engine: **Example engine:**
``` ```
import asyncio import asyncio
...@@ -343,16 +366,16 @@ async def generate(request): ...@@ -343,16 +366,16 @@ async def generate(request):
`pytok` supports the same ways of passing command line arguments as `pystr` - `initialize` or `main` with `sys.argv`. `pytok` supports the same ways of passing command line arguments as `pystr` - `initialize` or `main` with `sys.argv`.
## trtllm ### trtllm
TensorRT-LLM. Requires `clang` and `libclang-dev`. TensorRT-LLM. Requires `clang` and `libclang-dev`.
Build: 1. Build:
``` ```
cargo build --features trtllm cargo build --features trtllm
``` ```
Run: 2. Run:
``` ```
dynamo-run in=text out=trtllm --model-path /app/trtllm_engine/ --model-config ~/llm_models/Llama-3.2-3B-Instruct/ dynamo-run in=text out=trtllm --model-path /app/trtllm_engine/ --model-config ~/llm_models/Llama-3.2-3B-Instruct/
``` ```
...@@ -361,11 +384,11 @@ Note that TRT-LLM uses its own `.engine` format for weights.
The `--model-path` you give to `dynamo-run` must contain the `config.json` (TRT-LLM's, not the model's) and `rank0.engine` (plus other ranks if relevant).
## Echo Engines ### Echo Engines
Dynamo includes two echo engines for testing and debugging purposes: Dynamo includes two echo engines for testing and debugging purposes:
### echo_core #### echo_core
The `echo_core` engine accepts pre-processed requests and echoes the tokens back as the response. This is useful for testing pre-processing functionality as the response will include the full prompt template. The `echo_core` engine accepts pre-processed requests and echoes the tokens back as the response. This is useful for testing pre-processing functionality as the response will include the full prompt template.
...@@ -373,7 +396,7 @@ The `echo_core` engine accepts pre-processed requests and echoes the tokens back ...@@ -373,7 +396,7 @@ The `echo_core` engine accepts pre-processed requests and echoes the tokens back
dynamo-run in=http out=echo_core --model-path <hf-repo-checkout> dynamo-run in=http out=echo_core --model-path <hf-repo-checkout>
``` ```
### echo_full #### echo_full
The `echo_full` engine accepts un-processed requests and echoes the prompt back as the response. The `echo_full` engine accepts un-processed requests and echoes the prompt back as the response.
...@@ -381,7 +404,7 @@ The `echo_full` engine accepts un-processed requests and echoes the prompt back ...@@ -381,7 +404,7 @@ The `echo_full` engine accepts un-processed requests and echoes the prompt back
dynamo-run in=http out=echo_full --model-name my_model dynamo-run in=http out=echo_full --model-name my_model
``` ```
### Configuration #### Configuration
Both echo engines use a configurable delay between tokens to simulate generation speed. You can adjust this using the `DYN_TOKEN_ECHO_DELAY_MS` environment variable: Both echo engines use a configurable delay between tokens to simulate generation speed. You can adjust this using the `DYN_TOKEN_ECHO_DELAY_MS` environment variable:
...@@ -392,9 +415,9 @@ DYN_TOKEN_ECHO_DELAY_MS=1 dynamo-run in=http out=echo_full ...@@ -392,9 +415,9 @@ DYN_TOKEN_ECHO_DELAY_MS=1 dynamo-run in=http out=echo_full
The default delay is 10ms, which produces approximately 100 tokens per second. The default delay is 10ms, which produces approximately 100 tokens per second.
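The relationship between the delay and the token rate is simple arithmetic. As a small sketch, assuming the delay is the only cost per token (real throughput will be somewhat lower because of request overhead), and with a purely illustrative function name:

```python
def echo_tokens_per_second(delay_ms):
    """Approximate echo-engine token rate for a given per-token delay in ms."""
    if delay_ms <= 0:
        raise ValueError("delay_ms must be positive")
    # One token is emitted every delay_ms milliseconds.
    return 1000 / delay_ms


print(echo_tokens_per_second(10))  # the default DYN_TOKEN_ECHO_DELAY_MS
print(echo_tokens_per_second(1))   # the value used in the example above
```

So the default of 10 ms gives roughly 100 tokens per second, and `DYN_TOKEN_ECHO_DELAY_MS=1` raises that to roughly 1000.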
## Batch mode ### Batch mode
dynamo-run can take a jsonl file full of prompts and evaluate them all: `dynamo-run` can take a jsonl file full of prompts and evaluate them all:
``` ```
dynamo-run in=batch:prompts.jsonl out=llamacpp <model> dynamo-run in=batch:prompts.jsonl out=llamacpp <model>
...@@ -413,11 +436,11 @@ The output looks like this: ...@@ -413,11 +436,11 @@ The output looks like this:
{"text":"What is the capital of Spain?","response":".The capital of Spain is Madrid.","tokens_in":7,"tokens_out":7,"elapsed_ms":855} {"text":"What is the capital of Spain?","response":".The capital of Spain is Madrid.","tokens_in":7,"tokens_out":7,"elapsed_ms":855}
``` ```
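Because each output line is a JSON object, the results are easy to post-process. The sketch below totals the token counts and derives an output token rate from the `tokens_in`, `tokens_out` and `elapsed_ms` fields shown above. The function name is hypothetical, and summing `elapsed_ms` assumes the prompts were evaluated one after another, which may not match how batch mode actually schedules them.

```python
import json


def summarize_batch_output(lines):
    """Aggregate token counts and timing from batch-mode output lines."""
    total_in = total_out = total_ms = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        total_in += record["tokens_in"]
        total_out += record["tokens_out"]
        total_ms += record["elapsed_ms"]
    # Output tokens per second over the summed elapsed time.
    rate = total_out / (total_ms / 1000) if total_ms else 0.0
    return {
        "tokens_in": total_in,
        "tokens_out": total_out,
        "elapsed_ms": total_ms,
        "tokens_per_second": rate,
    }


# The sample line from the output above.
SAMPLE = [
    '{"text":"What is the capital of Spain?","response":".The capital of Spain is Madrid.",'
    '"tokens_in":7,"tokens_out":7,"elapsed_ms":855}'
]
print(summarize_batch_output(SAMPLE))
```

To process a real run, pass an open file handle for the output file instead of `SAMPLE`.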
## Defaults ### Defaults
The input defaults to `in=text`. The output defaults to the `mistralrs` engine; if that is not available, it falls back to whatever engine you have compiled in (depending on `--features`).
## Extra engine arguments ### Extra engine arguments
The vllm and sglang backends support passing any argument the engine accepts. The vllm and sglang backends support passing any argument the engine accepts.
...@@ -433,4 +456,3 @@ Pass it like this: ...@@ -433,4 +456,3 @@ Pass it like this:
``` ```
dynamo-run out=sglang ~/llm_models/Llama-3.2-3B-Instruct --extra-engine-args sglang_extra.json dynamo-run out=sglang ~/llm_models/Llama-3.2-3B-Instruct --extra-engine-args sglang_extra.json
``` ```