"vscode:/vscode.git/clone" did not exist on "da482c2fafdd6ef14a6d8de1e470a564316eced6"
Commit 40c55a24 authored by Graham King, committed by GitHub

docs(dynamo-run): Move README into docs/guides/ , add Quickstart (#265)

parent 9f0181a8
@@ -17,8 +17,6 @@ limitations under the License.
# LLM Deployment Examples
This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations.
## Components
- workers: Prefill and decode workers handle the actual LLM inference
......
# Dynamo Run

`dynamo-run` is a CLI tool for exploring the Dynamo components, and an example of how to use them from Rust. It is also available as `dynamo run` if using the Python wheel.
## Quickstart with pip and vllm

If you used `pip` to install `dynamo`, the `dynamo-run` binary comes pre-installed with the `vllm` engine. You must be in a virtual env with vllm installed to use it. For more options, see "Full documentation" below.
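A minimal environment sketch (the package names here are assumptions; install `dynamo` from wherever you obtained the wheel):
```
python3 -m venv venv && source venv/bin/activate
pip install vllm      # the engine must be importable from this env
dynamo run --help     # provided by the dynamo wheel; confirms the CLI is on PATH
```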
### Automatically download a model from [Hugging Face](https://huggingface.co/models)

This will automatically download Qwen2.5 3B from Hugging Face (a 6 GiB download) and start it in interactive text mode:
```
dynamo run out=vllm Qwen/Qwen2.5-3B-Instruct
```
General format for HF download:
```
dynamo run out=<engine> <HUGGING_FACE_ORGANIZATION/MODEL_NAME>
```
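For example, the same download with the `mistralrs` engine (used elsewhere in this guide):
```
dynamo run out=mistralrs Qwen/Qwen2.5-3B-Instruct
```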
For gated models (e.g. meta-llama/Llama-3.2-3B-Instruct) you must have an `HF_TOKEN` environment variable set.
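For example (the token value is a placeholder; create one at https://huggingface.co/settings/tokens):
```
export HF_TOKEN=hf_xxxxxxxxxxxx
dynamo run out=vllm meta-llama/Llama-3.2-3B-Instruct
```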
The parameter can be the ID of a Hugging Face repository (it will be downloaded), a GGUF file, or a folder containing safetensors, config.json, etc. (a locally checked-out Hugging Face repository).
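All three forms use the same syntax; a sketch using the models from this guide:
```
dynamo run out=vllm Qwen/Qwen2.5-3B-Instruct              # Hugging Face repo ID, downloaded on first use
dynamo run out=vllm Llama-3.2-3B-Instruct-Q4_K_M.gguf     # a local GGUF file
dynamo run out=vllm ~/llm_models/Llama-3.2-3B-Instruct    # a local checkout with safetensors, config.json, etc.
```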
## Manually download a model from Hugging Face

One of these models should be high quality and fast on almost any machine: https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF

For example: https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/blob/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf

Download the model file:
```
curl -L -o Llama-3.2-3B-Instruct-Q4_K_M.gguf "https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf?download=true"
```
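Alternatively, if you have the `huggingface_hub` CLI installed (an optional tool, not required by this guide):
```
pip install huggingface_hub
huggingface-cli download bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf --local-dir .
```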
## Run a model from local file

*Text interface*
```
dynamo run out=vllm Llama-3.2-3B-Instruct-Q4_K_M.gguf # or path to a Hugging Face repo checkout instead of the GGUF
```
*HTTP interface*
```
dynamo run in=http out=vllm Llama-3.2-3B-Instruct-Q4_K_M.gguf
```
*List the models*
```
curl localhost:8080/v1/models
```
*Send a request*
```
curl -d '{"model": "Llama-3.2-3B-Instruct-Q4_K_M", "max_completion_tokens": 2049, "messages":[{"role":"user", "content": "What is the capital of South Africa?" }]}' -H 'Content-Type: application/json' http://localhost:8080/v1/chat/completions
```
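The endpoint is OpenAI-style, so if the frontend honours the standard `stream` flag (an assumption worth verifying), you can watch tokens arrive as they are generated:
```
curl -N -d '{"model": "Llama-3.2-3B-Instruct-Q4_K_M", "stream": true, "messages":[{"role":"user", "content": "What is the capital of South Africa?" }]}' -H 'Content-Type: application/json' http://localhost:8080/v1/chat/completions
```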
*Multi-node*

You will need [etcd](https://etcd.io/) and [nats](https://nats.io) installed and accessible from both nodes.
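One quick way to bring both up for a test is Docker; the images and flags below are common upstream defaults, not something this guide prescribes:
```
docker run -d -p 4222:4222 nats -js   # -js enables JetStream
docker run -d -p 2379:2379 quay.io/coreos/etcd:v3.5.9 etcd \
  --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379
```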
Node 1:
```
dynamo run in=http out=dyn://llama3B_pool
```

Node 2:
```
dynamo run in=dyn://llama3B_pool out=vllm ~/llm_models/Llama-3.2-3B-Instruct
```
This will use etcd to auto-discover the model and NATS to talk to it. You can run multiple workers on the same endpoint, and one will be picked at random for each request.

The `llama3B_pool` name is purely symbolic; pick anything, as long as it matches the other node.

Run `dynamo run --help` for more options.
# Full documentation

`dynamo-run` is what `dynamo run` executes. It is an example of what you can build in Rust with the `dynamo-llm` and `dynamo-runtime` crates. The rest of this guide covers how to build it from source and all of its features.
## Setup

Libraries (Ubuntu):
```
apt install -y build-essential libhwloc-dev libudev-dev pkg-config libssl-dev libclang-dev protobuf-compiler python3-dev
```
Libraries (macOS):
```
brew install cmake protobuf

# Install Xcode from the App Store and check that Metal is accessible:
xcrun -sdk macosx metal

# You may have to install the Xcode Command Line Tools:
xcode-select --install
```
Install Rust:
```
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
```
## Build

Navigate to the launch/ directory:
```
cd launch/
```
Optionally, `cargo build` can be run from any location with these arguments (see the combined example after the feature list below):
- `--target-dir /path/to/target_directory`: specify a target directory with write privileges
- `--manifest-path /path/to/project/Cargo.toml`: needed if `cargo build` is run outside of the `launch/` directory
- Linux with GPU and CUDA (tested on Ubuntu):
```
cargo build --release --features mistralrs,cuda
```
- macOS with Metal:
```
cargo build --release --features mistralrs,metal
```
- CPU only:
```
cargo build --release --features mistralrs
```
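Combining the optional arguments above with a feature set, a build run from outside `launch/` might look like this (paths are illustrative):
```
cargo build --release --features mistralrs,cuda --manifest-path launch/Cargo.toml --target-dir ~/dynamo-target
```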
The binary will be called `dynamo-run` in `target/release`:
```
cd target/release
```
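From there you can invoke the binary directly, e.g.:
```
./dynamo-run Qwen/Qwen2.5-3B-Instruct
```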
## sglang
1. Set up the Python virtual env:
@@ -416,7 +430,5 @@ The output looks like this:
## Defaults

The input defaults to `in=text`. The output defaults to the `mistralrs` engine; if that is not available, it falls back to whichever engine you have compiled in (depending on `--features`).
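Spelled out, these two invocations are therefore equivalent:
```
dynamo-run Llama-3.2-3B-Instruct-Q4_K_M.gguf
dynamo-run in=text out=mistralrs Llama-3.2-3B-Instruct-Q4_K_M.gguf
```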
See `docs/guides/dynamo_run.md`