See the License for the specific language governing permissions and
limitations under the License.
-->
# vLLM Integration with Dynamo
This example demonstrates how to use Dynamo to serve large language models with the vLLM engine, enabling efficient model serving with both monolithic and disaggregated deployment options.
> **NOTE**: This example is based on an internal NVIDIA library that will soon be publicly released. The example won't work until the official release.
## Prerequisites
...
...
Start required services (etcd and NATS):
Option A: Using [Docker Compose](/deploy/docker-compose.yml) (Recommended)
```bash
docker compose -f deploy/docker-compose.yml up -d
```
Option B: Manual Setup
...
...
- [etcd](https://etcd.io) server
- follow instructions in [etcd installation](https://etcd.io/docs/v3.5/install/) to start an `etcd-server` locally
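For the manual route, a minimal sketch for starting both services locally (assuming the `nats-server` and `etcd` binaries are already installed; ports and the data directory are defaults and purely illustrative):
```bash
# NATS with JetStream enabled, on the default port 4222
nats-server -js &
# Single-node etcd with the default client port 2379
etcd --data-dir /tmp/etcd-data &
```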
## Building the Environment
The example is designed to run in a containerized environment using Dynamo, vLLM, and associated dependencies. To build the container:
```bash
# Build image
./container/build.sh --framework VLLM
```
## Launching the Environment
```bash
# Run image interactively
./container/run.sh --framework VLLM -it
```
## Deployment
### 1. HTTP Server
Run the HTTP server with debug-level logging enabled.
### Processor
The processor routes requests to the (decode) workers. Three scheduling strategies are supported: random, round-robin, and kv (see [KV Router](#kv-router)). The `--min-workers` option specifies the number of (decode) workers.
Alternatively, the processor can be bypassed by hitting the worker endpoints directly.
### Disaggregated Deployment
This deployment option splits model serving across prefill and decode workers, using separate GPUs for each, which enables more efficient resource utilization and improved performance. For more details on disaggregated serving, please refer to the [vLLM documentation](https://docs.vllm.ai/en/latest/features/disagg_prefill.html).
**Terminal 1 - Prefill Worker:**
```bash
# Activate virtual environment
source /opt/dynamo/venv/bin/activate
```
##### Example Output
```
INFO 03-02 05:59:44 worker.py:269] the current vLLM instance can use total_gpu_memory (47.54GiB) x gpu_memory_utilization (0.40) = 19.01GiB
INFO 03-02 05:59:44 worker.py:269] model weights take 14.99GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 1.19GiB; the rest of the memory reserved for KV Cache is 2.78GiB.
INFO 03-02 05:59:44 executor_base.py:110] # CUDA blocks: 1423, # CPU blocks: 2048
INFO 03-02 05:59:44 executor_base.py:115] Maximum concurrency for 10 tokens per request: 2276.80x
INFO 03-02 05:59:47 llm_engine.py:476] init engine (profile, create kv cache, warmup model) took 3.41 seconds
```
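Once the server, processor, and workers are up, you can send an OpenAI-style chat completion request. A minimal sketch, assuming the HTTP server listens on `localhost:8080` and the model is registered as `vllm` (adjust the host, port, and model name to match your deployment); a request like this produces the example output below:
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```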
##### Example Output
```json
{
  "id": "5b04e7b0-0dcd-4c45-baa0-1d03d924010c",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "The capital of France is Paris. Paris is a major city known for iconic landmarks like the Eiffel Tower and the Louvre Museum."
    },
    "index": 0,
    "finish_reason": "stop"
  }],
  "created": 1739548787,
  "model": "vllm",
  "object": "chat.completion",
  "usage": null,
  "system_fingerprint": null
}
```
### KV Router
The KV Router is a component that aggregates KV Events from all the workers and maintains a prefix tree of the cached tokens. It makes decisions on which worker to route requests to based on the length of the prefix match and the load on the workers.
You can choose only the prefix strategy for now:
- `prefix`: Route requests to the worker that has the longest prefix match.
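As a mental model, the router scores each worker by how much of the request it already has cached versus how busy it is. A minimal sketch of the idea (illustrative only; the class and field names are hypothetical, not the actual router API):
```python
# Illustrative sketch only: names are hypothetical, not the real router API.
from dataclasses import dataclass

LOAD_WEIGHT = 16  # hypothetical trade-off between cache reuse and load balancing

@dataclass
class WorkerState:
    worker_id: str
    cached_prefix_tokens: int  # longest prefix of this request already cached on the worker
    active_requests: int       # current load on the worker

def pick_worker(workers: list[WorkerState]) -> WorkerState:
    """Prefer the longest prefix match, discounted by the worker's load."""
    return max(workers, key=lambda w: w.cached_prefix_tokens - LOAD_WEIGHT * w.active_requests)
```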
#### Deploying using tmux
You can run the router and workers in separate terminal sessions, or use the `kv-router-run.sh` helper script to launch them all at once in their own tmux sessions.
### Disaggregated Router
The disaggregated router determines whether a request should be sent to a remote prefill engine or prefilled locally, based on the prefill length. If the kv router is enabled, the disaggregated router uses the absolute prefill length (the actual prefill length minus the prefix hit length) to make the decision.
When prefilling locally, the vLLM scheduler will prioritize the prefill request and pause any ongoing decode requests.
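A minimal sketch of that decision rule (illustrative only; the threshold value and names are hypothetical, not taken from this example):
```python
# Illustrative sketch only: the threshold and names are hypothetical.
PREFILL_LENGTH_THRESHOLD = 512  # tokens

def use_remote_prefill(prompt_tokens: int, prefix_hit_tokens: int, kv_router_enabled: bool) -> bool:
    # With the kv router enabled, only the tokens that still need prefilling count.
    effective_prefill = prompt_tokens - prefix_hit_tokens if kv_router_enabled else prompt_tokens
    return effective_prefill > PREFILL_LENGTH_THRESHOLD
```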
### 4. Multi-Node Deployment
The vLLM workers can be deployed across multiple nodes by configuring the NATS and etcd connection endpoints through environment variables. This enables distributed inference across a cluster.
Set the following environment variables on each node before running the workers:
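A sketch of that configuration, assuming the runtime reads `NATS_SERVER` and `ETCD_ENDPOINTS` (verify the exact variable names against your release), with `head-node` standing in for the machine that hosts NATS and etcd:
```bash
# Run on every node before starting workers; point them at the shared services.
export NATS_SERVER="nats://head-node:4222"
export ETCD_ENDPOINTS="http://head-node:2379"
```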
`genai-perf` is a tool for profiling and benchmarking LLM servers. It is already installed in the container. For more details, please refer to the [genai-perf README](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html).
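As an illustration, a basic profiling run against the chat endpoint might look like this (a sketch only; exact flags differ between genai-perf versions, and the URL and model name must match your deployment):
```bash
genai-perf profile \
  -m vllm \
  --endpoint-type chat \
  --url http://localhost:8080 \
  --streaming \
  --concurrency 8
```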
### 7. Known Issues and Limitations
- vLLM does not work well with the `fork` multiprocessing method when TP > 1. This is a known issue; the workaround is to use the `spawn` method instead (see the sketch after this list). See the [vLLM issue](https://github.com/vllm-project/vllm/issues/6152).
- The `kv_rank` of the `kv_producer` must be smaller than that of the `kv_consumer`.
- Instances with the same `kv_role` must have the same `--tensor-parallel-size`.
- Currently only `--pipeline-parallel-size 1` is supported for XpYd disaggregated deployment.
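A minimal sketch of the `spawn` workaround from the first item above, using vLLM's standard environment variable; export it in each terminal before launching a worker with TP > 1:
```bash
# Tell vLLM to use 'spawn' instead of 'fork' for its multiprocessing workers
export VLLM_WORKER_MULTIPROC_METHOD=spawn
```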