"git@developer.sourcefind.cn:OpenDAS/dynamo.git" did not exist on "4b1867c53ebbf98dea54623af24d2424ead56573"
Commit 194abde3 authored by ptarasiewiczNV, committed by GitHub

docs: Add instructions for multi-node disaggregated deployment


Signed-off-by: ptarasiewiczNV <104908264+ptarasiewiczNV@users.noreply.github.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>
parent a8c5637f
@@ -319,7 +319,24 @@ In the commands above, we used the FP8 variant `neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8`
```
## 7. Multi-node Deployment
To deploy the solution in a multi-node environment, refer to the [deploy_llama_8b_disaggregated_multinode.sh](examples/llm/vllm/deploy/deploy_llama_8b_disaggregated_multinode.sh) script. On the head node, start the NATS server, the API server, and the context worker with:
```
./examples/llm/vllm/deploy/deploy_llama_8b_disaggregated_multinode.sh context --head-url <head url>
```
On the second node, run the generate worker:
```
./examples/llm/vllm/deploy/deploy_llama_8b_disaggregated_multinode.sh generate --head-url <head url>
```
By default, the example script launches one context worker with TP 1 on the head node and one generate worker with TP 1 on the secondary node. Other configurations are possible; see the script for details and the sketch below for one example.
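For instance, to shard the context worker across two GPUs on the head node, the exported defaults at the top of the script can be edited before launching. This is a minimal sketch; the variable names come from the script below, and `CUDA_VISIBLE_DEVICES` in `start_context_worker` (hard-coded to GPU 0 in the script) would also need to expose both GPUs:
```
# Edits inside deploy_llama_8b_disaggregated_multinode.sh
export VLLM_CONTEXT_WORKERS=1   # still a single context worker
export VLLM_CONTEXT_TP_SIZE=2   # shard it across two GPUs on the head node
export VLLM_GENERATE_TP_SIZE=1  # generate worker on the second node unchanged

# ...and in start_context_worker, expose both GPUs to the worker:
# CUDA_VISIBLE_DEVICES=0,1 \
```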
## 8. Known Issues & Limitations
1. **Fixed Worker Count**
Currently, the number of prefill and decode workers must be fixed at the start of deployment. Dynamically adding or removing workers is not yet supported.
@@ -333,8 +350,14 @@ In the commands above, we used the FP8 variant `neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8`
4. **Experimental Patch**
The required vLLM patch is experimental and not yet merged into upstream vLLM. Future releases may remove the need for a custom patch.
5. **Single generate worker**
Only one generate worker can be used in a single deployment.
6. **Streaming**
When streaming is enabled, only two responses are returned in the stream: the first token and then the complete response (see the example request below).
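For illustration, the request below (a trimmed variant of the one issued at the end of the example script; `<head url>` and port `8005` are the script's defaults) enables streaming, so only those two chunks appear in the output:
```
curl -N <head url>:8005/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [
      {"role": "system", "content": "What is the capital of France?"}
    ],
    "max_tokens": 25,
    "stream": true
  }'
```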
## 9. References
[^1]: Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. Distserve: Disaggregating prefill and decoding for goodput-optimized large language
......
#!/bin/bash
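# Multi-node disaggregated deployment helper.
# Head node:   ./deploy_llama_8b_disaggregated_multinode.sh context  --head-url <head url>   -> NATS server, API server, context worker
# Second node: ./deploy_llama_8b_disaggregated_multinode.sh generate --head-url <head url>   -> generate worker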
export VLLM_ATTENTION_BACKEND=FLASHINFER
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_TORCH_HOST=""
export VLLM_TORCH_PORT=36183
export VLLM_BASELINE_WORKERS=0
export VLLM_CONTEXT_WORKERS=1
export VLLM_GENERATE_WORKERS=1
export VLLM_BASELINE_TP_SIZE=1
export VLLM_CONTEXT_TP_SIZE=1
export VLLM_GENERATE_TP_SIZE=1
export VLLM_LOGGING_LEVEL=INFO
export VLLM_DATA_PLANE_BACKEND=nccl
export PYTHONUNBUFFERED=1
export NATS_HOST=""
export NATS_PORT=4223
export NATS_STORE="$(mktemp -d)"
export API_SERVER_HOST=""
export API_SERVER_PORT=8005
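# Launch helpers: each takes the head node URL as its only argument and starts
# its component in the background.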
start_nats_server() {
local head_url=$1
export NATS_HOST="$head_url"
echo "Flushing NATS store: ${NATS_STORE}..."
rm -r "${NATS_STORE}"
echo "Starting NATS Server..."
nats-server -p ${NATS_PORT} --jetstream --store_dir "${NATS_STORE}" &
}
start_api_server() {
local head_url=$1
export VLLM_TORCH_HOST="$head_url"
echo "Starting LLM API Server..."
python3 -m llm.api_server \
--tokenizer neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
--request-plane-uri ${head_url}:${NATS_PORT} \
--api-server-host ${API_SERVER_HOST} \
--model-name llama \
--api-server-port ${API_SERVER_PORT} &
}
start_context_worker() {
local head_url=$1
export VLLM_TORCH_HOST="$head_url"
echo "Starting vLLM context workers..."
CUDA_VISIBLE_DEVICES=0 \
VLLM_WORKER_ID=0 \
python3 -m llm.vllm.deploy \
--context-worker-count ${VLLM_CONTEXT_WORKERS} \
--request-plane-uri ${head_url}:${NATS_PORT} \
--model-name neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
--kv-cache-dtype fp8 \
--dtype auto \
--worker-name llama \
--disable-async-output-proc \
--disable-log-stats \
--max-model-len 1000 \
--max-batch-size 10000 \
--gpu-memory-utilization 0.9 \
--context-tp-size ${VLLM_CONTEXT_TP_SIZE} \
--generate-tp-size ${VLLM_GENERATE_TP_SIZE} \
--log-dir "/tmp/vllm_logs" &
}
start_generate_worker() {
local head_url=$1
export VLLM_TORCH_HOST="$head_url"
echo "Starting vLLM generate workers..."
CUDA_VISIBLE_DEVICES=1 \
VLLM_WORKER_ID=1 \
python3 -m llm.vllm.deploy \
--generate-worker-count ${VLLM_GENERATE_WORKERS} \
--request-plane-uri ${head_url}:${NATS_PORT} \
--model-name neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
--kv-cache-dtype fp8 \
--dtype auto \
--worker-name llama \
--disable-async-output-proc \
--disable-log-stats \
--max-model-len 1000 \
--max-batch-size 10000 \
--gpu-memory-utilization 0.9 \
--context-tp-size ${VLLM_CONTEXT_TP_SIZE} \
--generate-tp-size ${VLLM_GENERATE_TP_SIZE} \
--log-dir "/tmp/vllm_logs" &
}
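# Dispatch on the mode argument: "context" brings up the head-node services,
# "generate" starts the decode-side worker on the secondary node.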
case "$1" in
context)
if [ "$2" != "--head-url" ] || [ -z "$3" ]; then
echo "Usage: $0 context --head-url <head url>"
exit 1
fi
head_url=$3
export API_SERVER_HOST="$head_url"
start_nats_server "$head_url"
start_api_server "$head_url"
start_context_worker "$head_url"
;;
generate)
if [ "$2" != "--head-url" ] || [ -z "$3" ]; then
echo "Usage: $0 generate --head-url <head url>"
exit 1
fi
head_url=$3
export API_SERVER_HOST="$head_url"
start_generate_worker "$head_url"
;;
*)
echo "Usage: $0 {context|generate} --head-url <head url>"
exit 1
;;
esac
echo "Waiting for deployment to finish startup..."
echo "Once you see all ranks connected to the server, it should be ready..."
echo "Example output:"
echo "\tRank 0 connected to the server"
echo "\t..."
echo "\tRank 1 connected to the server"
sleep 120  # give the workers time to load the model and connect before sending a request
echo "Sending chat completions request..."
curl ${API_SERVER_HOST}:${API_SERVER_PORT}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama",
"messages": [
{"role": "system", "content": "What is the capital of France?"}
],
"temperature": 0,
"top_p": 0.95,
"max_tokens": 25,
"stream": true,
"n": 1,
"frequency_penalty": 0.0,
"stop": []
}'