Commit 40c55a24 authored by Graham King's avatar Graham King Committed by GitHub

docs(dynamo-run): Move README into docs/guides/ , add Quickstart (#265)

parent 9f0181a8
# LLM Deployment Examples
This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations.
## Components
- workers: Prefill and decode worker handles actual LLM inference
# Dynamo Run

`dynamo-run` is a CLI tool for exploring the Dynamo components, and an example of how to use them from Rust. It is also available as `dynamo run` if using the Python wheel.

## Quickstart with pip and vllm

If you installed `dynamo` with `pip`, the `dynamo-run` binary comes pre-installed with the `vllm` engine. You must be in a virtual env with vllm installed to use it. For more options see "Full documentation" below.
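As a sketch, setting up such an environment might look like this (the package names here are assumptions based on how the text refers to them; check the project's install docs for the exact names):

```shell
# Hypothetical setup; package names are assumptions, not verified
python3 -m venv venv
source venv/bin/activate
pip install vllm dynamo
```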
### Automatically download a model from [Hugging Face](https://huggingface.co/models)
This will automatically download Qwen2.5 3B from Hugging Face (6 GiB download) and start it in interactive text mode:
```
dynamo run out=vllm Qwen/Qwen2.5-3B-Instruct
```
General format for HF download:
```
dynamo run out=<engine> <HUGGING_FACE_ORGANIZATION/MODEL_NAME>
```

For gated models (e.g. meta-llama/Llama-3.2-3B-Instruct) you must have an `HF_TOKEN` environment variable set.

The parameter can be the ID of a HuggingFace repository (it will be downloaded), a GGUF file, or a folder containing safetensors, config.json, etc. (a locally checked out HuggingFace repository).
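For illustration, the three accepted forms might look like this (the local paths are hypothetical):

```shell
dynamo run out=vllm Qwen/Qwen2.5-3B-Instruct               # HuggingFace repo ID, downloaded on first use
dynamo run out=vllm ./Llama-3.2-3B-Instruct-Q4_K_M.gguf    # a single GGUF file
dynamo run out=vllm ~/llm_models/Llama-3.2-3B-Instruct     # a locally checked out repository folder
```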
## Manually download a model from Hugging Face
One of these models should be high quality and fast on almost any machine: https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF
E.g. https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/blob/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf
Download the model file:
```
curl -L -o Llama-3.2-3B-Instruct-Q4_K_M.gguf "https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf?download=true"
```
## Run a model from local file

*Text interface*
```
dynamo run out=vllm Llama-3.2-3B-Instruct-Q4_K_M.gguf # or path to a Hugging Face repo checkout instead of the GGUF
```
*HTTP interface*
```
dynamo run in=http out=vllm Llama-3.2-3B-Instruct-Q4_K_M.gguf
```
*List the models*
```
curl localhost:8080/v1/models
```
*Send a request*
```
curl -d '{"model": "Llama-3.2-3B-Instruct-Q4_K_M", "max_completion_tokens": 2049, "messages":[{"role":"user", "content": "What is the capital of South Africa?" }]}' -H 'Content-Type: application/json' http://localhost:8080/v1/chat/completions
```
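The response follows the OpenAI chat-completions schema, so (as a sketch, assuming `python3` is on your PATH) you can extract just the reply text:

```shell
curl -s -d '{"model": "Llama-3.2-3B-Instruct-Q4_K_M", "max_completion_tokens": 2049, "messages":[{"role":"user", "content": "What is the capital of South Africa?" }]}' \
  -H 'Content-Type: application/json' http://localhost:8080/v1/chat/completions \
  | python3 -c 'import sys, json; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
```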
*Multi-node*
You will need [etcd](https://etcd.io/) and [nats](https://nats.io) installed and accessible from both nodes.
Node 1:
```
dynamo run in=http out=dyn://llama3B_pool
```
Node 2:
```
dynamo run in=dyn://llama3B_pool out=vllm ~/llm_models/Llama-3.2-3B-Instruct
```
This will use etcd to auto-discover the model and NATS to talk to it. You can run multiple workers on the same endpoint and it will pick one at random each time.

The `llama3B_pool` name is purely symbolic; pick anything, as long as it matches the other node.
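For example, to add a second worker to the same pool, start another process on any node with the same endpoint name (the model path here is hypothetical):

```shell
# Additional worker joining the same endpoint; requests are spread at random
dynamo run in=dyn://llama3B_pool out=vllm ~/llm_models/Llama-3.2-3B-Instruct
```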
Run `dynamo run --help` for more options.
# Full documentation
`dynamo-run` is what `dynamo run` executes. It is an example of what you can build in Rust with the `dynamo-llm` and `dynamo-runtime` crates. Below is how to build from source, and a list of all the features.
## Setup
Libraries Ubuntu:
```
apt install -y build-essential libhwloc-dev libudev-dev pkg-config libssl-dev libclang-dev protobuf-compiler python3-dev
```
Libraries macOS:
```
brew install cmake protobuf
# install Xcode from App Store and check that Metal is accessible
xcrun -sdk macosx metal
# may have to install Xcode Command Line Tools:
xcode-select --install
```
Install Rust:
```
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
```
## Build

Navigate to the launch/ directory:
```
cd launch/
```
Optionally, you can run `cargo build` from any location with these arguments:
```
--target-dir /path/to/target_directory      # specify a target directory with write privileges
--manifest-path /path/to/project/Cargo.toml # if cargo build is run outside of the launch/ directory
```
- Linux with GPU and CUDA (tested on Ubuntu):
```
cargo build --release --features mistralrs,cuda
```
- macOS with Metal:
```
cargo build --release --features mistralrs,metal
```
- CPU only:
```
cargo build --release --features mistralrs
```
The binary will be called `dynamo-run` in `target/release`:
```
cd target/release
```
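From there, the freshly built binary can be invoked directly, e.g. to download and run a model in interactive text mode:

```shell
./dynamo-run Qwen/Qwen2.5-3B-Instruct
```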
## sglang
1. Setup the python virtual env:
## Defaults
The input defaults to `in=text`. The output defaults to the `mistralrs` engine; if that is not available, it falls back to whichever engine you have compiled in (depending on `--features`).
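In other words, these two invocations should be equivalent (assuming the `mistralrs` feature was compiled in):

```shell
dynamo-run Llama-3.2-3B-Instruct-Q4_K_M.gguf
dynamo-run in=text out=mistralrs Llama-3.2-3B-Instruct-Q4_K_M.gguf
```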
See `docs/guides/dynamo_run.md`