"vscode:/vscode.git/clone" did not exist on "da482c2fafdd6ef14a6d8de1e470a564316eced6"
Commit 40c55a24 authored by Graham King, committed by GitHub

docs(dynamo-run): Move README into docs/guides/ , add Quickstart (#265)

parent 9f0181a8
@@ -17,8 +17,6 @@ limitations under the License.
# LLM Deployment Examples
This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations.
## Components
- workers: Prefill and decode workers handle the actual LLM inference
......
# Dynamo Run

`dynamo-run` is a CLI tool for exploring the Dynamo components, and an example of how to use them from Rust. It is also available as `dynamo run` if using the Python wheel.
## Quickstart with pip and vllm

If you used `pip` to install `dynamo`, the `dynamo-run` binary comes pre-installed with the `vllm` engine. You must be in a virtual env with vllm installed to use it. For more options, see "Full documentation" below.
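A minimal environment sketch (the package names here are assumptions; install `dynamo` from wherever you obtained the wheel):
```
python3 -m venv venv && source venv/bin/activate
pip install vllm      # the engine must be importable from this env
dynamo run --help     # provided by the dynamo wheel; confirms the CLI is on PATH
```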
### Automatically download a model from [Hugging Face](https://huggingface.co/models)

This will automatically download Qwen2.5 3B from Hugging Face (a 6 GiB download) and start it in interactive text mode:
```
dynamo run out=vllm Qwen/Qwen2.5-3B-Instruct
```
General format for HF download:
```
dynamo run out=<engine> <HUGGING_FACE_ORGANIZATION/MODEL_NAME>
```
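For example, the same download with the `mistralrs` engine (used elsewhere in this guide):
```
dynamo run out=mistralrs Qwen/Qwen2.5-3B-Instruct
```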
For gated models (e.g. meta-llama/Llama-3.2-3B-Instruct) you must have an `HF_TOKEN` environment variable set.
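For example (the token value is a placeholder; create one at https://huggingface.co/settings/tokens):
```
export HF_TOKEN=hf_xxxxxxxxxxxx
dynamo run out=vllm meta-llama/Llama-3.2-3B-Instruct
```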
The parameter can be the ID of a Hugging Face repository (it will be downloaded), a GGUF file, or a folder containing safetensors, config.json, etc. (a locally checked-out Hugging Face repository).
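All three forms use the same syntax; a sketch using the models from this guide:
```
dynamo run out=vllm Qwen/Qwen2.5-3B-Instruct              # Hugging Face repo ID, downloaded on first use
dynamo run out=vllm Llama-3.2-3B-Instruct-Q4_K_M.gguf     # a local GGUF file
dynamo run out=vllm ~/llm_models/Llama-3.2-3B-Instruct    # a local checkout with safetensors, config.json, etc.
```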
## Manually download a model from Hugging Face

One of these models should be high quality and fast on almost any machine: https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF

For example: https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/blob/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf

Download the model file:
```
curl -L -o Llama-3.2-3B-Instruct-Q4_K_M.gguf "https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf?download=true"
```
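Alternatively, if you have the `huggingface_hub` CLI installed (an optional tool, not required by this guide):
```
pip install huggingface_hub
huggingface-cli download bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf --local-dir .
```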
## Run a model from local file

*Text interface*
```
dynamo run out=vllm Llama-3.2-3B-Instruct-Q4_K_M.gguf # or path to a Hugging Face repo checkout instead of the GGUF
```
*HTTP interface*
```
dynamo run in=http out=vllm Llama-3.2-3B-Instruct-Q4_K_M.gguf
```
*List the models*
```
curl localhost:8080/v1/models
```
*Send a request*
```
curl -d '{"model": "Llama-3.2-3B-Instruct-Q4_K_M", "max_completion_tokens": 2049, "messages":[{"role":"user", "content": "What is the capital of South Africa?" }]}' -H 'Content-Type: application/json' http://localhost:8080/v1/chat/completions
```
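The endpoint is OpenAI-style, so if the frontend honours the standard `stream` flag (an assumption worth verifying), you can watch tokens arrive as they are generated:
```
curl -N -d '{"model": "Llama-3.2-3B-Instruct-Q4_K_M", "stream": true, "messages":[{"role":"user", "content": "What is the capital of South Africa?" }]}' -H 'Content-Type: application/json' http://localhost:8080/v1/chat/completions
```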
*Multi-node*

You will need [etcd](https://etcd.io/) and [nats](https://nats.io) installed and accessible from both nodes.
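One quick way to bring both up for a test is Docker; the images and flags below are common upstream defaults, not something this guide prescribes:
```
docker run -d -p 4222:4222 nats -js   # -js enables JetStream
docker run -d -p 2379:2379 quay.io/coreos/etcd:v3.5.9 etcd \
  --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379
```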
Node 1:
```
dynamo run in=http out=dyn://llama3B_pool
```

Node 2:
```
dynamo run in=dyn://llama3B_pool out=vllm ~/llm_models/Llama-3.2-3B-Instruct
```
This will use etcd to auto-discover the model and NATS to talk to it. You can run multiple workers on the same endpoint, and one will be picked at random for each request.

The `llama3B_pool` name is purely symbolic; pick anything, as long as it matches the other node.

Run `dynamo run --help` for more options.
# Full documentation

`dynamo-run` is what `dynamo run` executes. It is an example of what you can build in Rust with the `dynamo-llm` and `dynamo-runtime` crates. The rest of this guide covers how to build it from source and all of its features.
## Setup

Libraries (Ubuntu):
```
apt install -y build-essential libhwloc-dev libudev-dev pkg-config libssl-dev libclang-dev protobuf-compiler python3-dev
```
Libraries (macOS):
```
brew install cmake protobuf

# Install Xcode from the App Store and check that Metal is accessible:
xcrun -sdk macosx metal

# You may have to install the Xcode Command Line Tools:
xcode-select --install
```
Install Rust:
```
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
```
## Build

Navigate to the launch/ directory:
```
cd launch/
```
Optionally, `cargo build` can be run from any location with these arguments (see the combined example after the feature list below):
- `--target-dir /path/to/target_directory`: specify a target directory with write privileges
- `--manifest-path /path/to/project/Cargo.toml`: needed if `cargo build` is run outside of the `launch/` directory
- Linux with GPU and CUDA (tested on Ubuntu):
```
cargo build --release --features mistralrs,cuda
```
- macOS with Metal:
```
cargo build --release --features mistralrs,metal
```
- CPU only:
```
cargo build --release --features mistralrs
```
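Combining the optional arguments above with a feature set, a build run from outside `launch/` might look like this (paths are illustrative):
```
cargo build --release --features mistralrs,cuda --manifest-path launch/Cargo.toml --target-dir ~/dynamo-target
```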
The binary will be called `dynamo-run` in `target/release`:
```
cd target/release
```
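From there you can invoke the binary directly, e.g.:
```
./dynamo-run Qwen/Qwen2.5-3B-Instruct
```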
## sglang
1. Set up the Python virtual env:
@@ -416,7 +430,5 @@ The output looks like this:
## Defaults

The input defaults to `in=text`. The output defaults to the `mistralrs` engine; if that is not available, it falls back to whichever engine you have compiled in (depending on `--features`).
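Spelled out, these two invocations are therefore equivalent:
```
dynamo-run Llama-3.2-3B-Instruct-Q4_K_M.gguf
dynamo-run in=text out=mistralrs Llama-3.2-3B-Instruct-Q4_K_M.gguf
```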
See `docs/guides/dynamo_run.md`