"git@developer.sourcefind.cn:yangql/googletest.git" did not exist on "88080ee943b2b769557488e9c60850da96ab839e"
Commit 04e50aba authored by ptarasiewiczNV, committed by GitHub

docs: Add Llama 70B benchmark reproduction and results


Signed-off-by: Neelay Shah <neelays@nvidia.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>
Co-authored-by: Piotr Marcinkiewicz <piotrm@nvidia.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
parent 993954a6
@@ -14,10 +14,20 @@
# See the License for the specific language governing permissions and
# limitations under the License.
TAG=
RUN_PREFIX=
PLATFORM=linux/amd64
VERSION=0.1.1
# Get short commit hash
commit_id=$(git rev-parse --short HEAD)
# Attempt to get current tag
current_tag=$(git describe --tags --exact-match 2>/dev/null) || true
# Use tag if available, otherwise use commit hash
VERSION=${current_tag:-$commit_id}
# Frameworks
#
@@ -537,7 +537,7 @@ diff -Naur v0.6.3.post1_vllm/worker/model_runner.py patched_vllm/worker/model_ru
+ def init_kv_cache_handler(self) -> None:
+ if envs.VLLM_DISAGG_STAGE is not None:
+ self._kv_cache_handler = get_kv_cache_handler()
+ torch.distributed.barrier()
+ # torch.distributed.barrier() # TODO ptarasiewicz check why this is raising NCCL errors
+
def _update_inputs_to_capture_for_enc_dec_model(self,
capture_inputs: Dict[str,
@@ -229,12 +229,14 @@ instead for each process ID replacing `<pid>` below:
kill -9 <pid>
```
## Y. Known Issues & Limitations
## Known Issues & Limitations
1. **Tensor Parallelism Constraints**
- Currently limited to TP=1 for both prefill and decode workers
## Z. References
2. Currently streaming is not supported and results are returned all at once.
## References
[^1]: Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao
Zhang. Distserve: Disaggregating prefill and decoding for goodput-optimized large language
@@ -43,7 +43,7 @@ def parse_args():
)
parser.add_argument(
"--log-level", type=int, default=3, help="log level applied to all workers"
"--log-level", type=int, default=1, help="log level applied to all workers"
)
parser.add_argument(
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Tuning and Benchmarking Disaggregated Serving
**Disaggregated Serving** [^1] enables developers and teams deploying
LLMs to tune their deployment based on input and output sequence
lengths to achieve a targeted SLA with the right mix of context and
generation workers. In particular, disaggregated serving lets teams
choose different parallelization strategies for each phase and balance
throughput (tokens / sec / gpu) against latency (tokens / sec / user).
## Example
### 50 tokens per sec SLA with Input (3000) / Output (150) Sequence Length Tuning
To determine the best mix of context and generate workers for a
targeted latency at given input and output sequence lengths, we
generally perform "sweeps" comparing different strategies to find the
best throughput within the SLA (see the sweep sketch after the
`genai-perf` command below).
For example, for input sequence length 3000 and output sequence length
150, after sweeping different tensor parallelism strategies on two
8 x H100 GPU nodes, we found that using 4 instances of TP 2 for
context (on one node) and 1 instance of TP 8 for generate (on the
second node) gives the best throughput at a latency target of 50
tokens per sec per user.
At that latency target, our early measurements show disaggregated
serving outperforming traditional aggregated LLM serving by more than
1.5x (with throughput normalized per GPU).
### Reproducing Results
To reproduce similar results on a two-node 8 x H100 GPU system, we
provide the following sample scripts.
### Launch Context Workers on First Node
On the first (head) node:
```
bash deploy_llama_70b_context_tp2dp4.sh --head-url <head url>
```
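Pass the same `<head url>` value on both nodes; the deployment scripts
use it as the host for the NATS server, the torch distributed setup,
and the API server.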
### Launch Generate Worker on Second Node
On the second node:
```
bash deploy_llama_70b_generate_tp8dp1.sh --head-url <head url>
```
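### Verify the Deployment (Optional)
Before benchmarking, you can check that the API server responds. This
is a minimal sketch, assuming the server exposes the OpenAI-compatible
`v1/chat/completions` endpoint used by `genai-perf` below and listens
on the `API_SERVER_PORT` (8005) set in the deployment scripts:
```
curl -s http://<head url>:8005/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'
```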
### Benchmark
The following `genai-perf` command simulates traffic with an input sequence length of 3000 and an output sequence length of 150.
```
genai-perf profile \
-m llama \
--url <api server url> \
--endpoint-type chat \
--streaming \
--num-dataset-entries 100 \
--service-kind openai \
--endpoint v1/chat/completions \
--warmup-request-count 10 \
--random-seed 123 \
--synthetic-input-tokens-stddev 0 \
--output-tokens-stddev 0 \
--tokenizer neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
--synthetic-input-tokens-mean 3000 \
--output-tokens-mean 150 \
--extra-inputs seed:100 \
--extra-inputs min_tokens:150 \
--extra-inputs max_tokens:150 \
--profile-export-file my_profile_export.json \
--artifact-dir artifacts/ \
--concurrency <N> \
--request-count <10 * N> \
-- -v \
--async
```
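To run the kind of sweep described above, you can wrap this command in
a loop over concurrency values. This is a minimal sketch that reuses a
subset of the flags from the full command above (the remaining flags
can be appended unchanged); artifact directories are split per
concurrency so the runs do not overwrite each other:
```
# Hypothetical concurrency sweep; append the remaining flags from the full command above.
for concurrency in 8 16 32 48 64; do
  genai-perf profile \
    -m llama \
    --url <api server url> \
    --endpoint-type chat \
    --streaming \
    --service-kind openai \
    --endpoint v1/chat/completions \
    --tokenizer neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
    --synthetic-input-tokens-mean 3000 \
    --output-tokens-mean 150 \
    --extra-inputs min_tokens:150 \
    --extra-inputs max_tokens:150 \
    --concurrency ${concurrency} \
    --request-count $((10 * concurrency)) \
    --profile-export-file profile_c${concurrency}.json \
    --artifact-dir artifacts/concurrency_${concurrency}/ \
    -- -v \
    --async
done
```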
### Example Results
The following results are provided only as an example; they are not
fully optimized and may not match the numbers you measure locally.
| label | configuration | concurrency | output_token_throughput_per_request (tokens/sec) | output_token_throughput_per_gpu (tokens/sec) | time_to_first_token (ms) | inter_token_latency (ms) |
|----------|--------------------------------|-------------|---------------------------------------------------|-----------------------------------------------|--------------------------|--------------------------|
| disagg | context_tp2dp4_generate_tp8dp1 | 48 | 49.18 | 87.56 | 1157.49 | 15.94 |
| baseline | baseline_tp4dp1 | 4 | 50.27 | 56.26 | 709.25 | 15.27 |
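Using the per-GPU throughput column above, the disaggregated
configuration delivers roughly 87.56 / 56.26 ≈ 1.56x the normalized
throughput of the aggregated baseline at a comparable per-request rate
(~49-50 tokens per sec per user).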
### Baseline Comparison
On a single node, you can run a comparison against aggregated serving.
With aggregated workers, we found the best throughput at the target SLA
and sequence lengths using 2 instances of tensor parallelism 4.
```
bash deploy_llama_70b_baseline_tp4dp2.sh --head-url <head url>
```
To see the results, use the same `genai-perf` command used to benchmark
the disaggregated setup.
### Stopping the Deployment
```
pkill -SIGINT -f python3
pkill -SIGINT -f nats
```
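To confirm that the workers and NATS server have exited, you can list
any remaining matching processes (an optional check, not part of the
deployment scripts):
```
pgrep -af "python3|nats-server"
```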
## Known Issue
Sometimes NATS errors occur during the first run. In that case, restart the deployment.
## References
[^1]: Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao
Zhang. Distserve: Disaggregating prefill and decoding for goodput-optimized large language
model serving. *arXiv:2401.09670v3 [cs.DC]*, 2024.
#!/bin/bash
set -e
set -x
export VLLM_ATTENTION_BACKEND=FLASHINFER
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_TORCH_PORT=36183
export VLLM_CONTEXT_WORKERS=4
export VLLM_CONTEXT_TP_SIZE=2
export VLLM_GENERATE_WORKERS=1
export VLLM_GENERATE_TP_SIZE=8
export VLLM_LOGGING_LEVEL=INFO
export VLLM_DATA_PLANE_BACKEND=nccl
export PYTHONUNBUFFERED=1
export NATS_PORT=4223
export NATS_STORE="$(mktemp -d)"
export API_SERVER_PORT=8005
if [ "$1" != "--head-url" ] || [ -z "$2" ]; then
echo "Usage: $0 --head-url <head url>"
exit 1
fi
head_url=$2
export NATS_HOST="$head_url"
export VLLM_TORCH_HOST="$head_url"
export API_SERVER_HOST="$head_url"
# Start NATS Server
echo "Flushing NATS store: ${NATS_STORE}..."
rm -r "${NATS_STORE}"
echo "Starting NATS Server..."
nats-server -p ${NATS_PORT} --jetstream --store_dir "${NATS_STORE}" &
# Start API Server
echo "Starting LLM API Server..."
python3 -m llm.api_server \
--tokenizer neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
--request-plane-uri ${NATS_HOST}:${NATS_PORT} \
--api-server-host ${API_SERVER_HOST} \
--model-name "llama" \
--api-server-port ${API_SERVER_PORT} &
# Empty --log-dir will dump logs to stdout
echo "Starting vLLM baseline workers..."
gpu_configs=(
"0,1"
"2,3"
"4,5"
"6,7"
)
for i in "${!gpu_configs[@]}"; do
CUDA_VISIBLE_DEVICES="${gpu_configs[$i]}" \
VLLM_WORKER_ID=$i \
python3 -m llm.vllm.deploy \
--context-worker-count 1 \
--context-tp-size ${VLLM_CONTEXT_TP_SIZE} \
--generate-tp-size ${VLLM_GENERATE_TP_SIZE} \
--request-plane-uri ${NATS_HOST}:${NATS_PORT} \
--model-name neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
--worker-name llama \
--kv-cache-dtype fp8 \
--dtype auto \
--disable-async-output-proc \
--disable-log-stats \
--max-model-len 3500 \
--max-batch-size 10000 \
--gpu-memory-utilization 0.5 &
done
#!/bin/bash
set -e
set -x
export VLLM_ATTENTION_BACKEND=FLASHINFER
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_TORCH_PORT=36183
export VLLM_CONTEXT_WORKERS=4
export VLLM_CONTEXT_TP_SIZE=2
export VLLM_GENERATE_WORKERS=1
export VLLM_GENERATE_TP_SIZE=8
export VLLM_LOGGING_LEVEL=INFO
export VLLM_DATA_PLANE_BACKEND=nccl
export PYTHONUNBUFFERED=1
export NATS_PORT=4223
export NATS_STORE="$(mktemp -d)"
export API_SERVER_PORT=8005
if [ "$1" != "--head-url" ] || [ -z "$2" ]; then
echo "Usage: $0 --head-url <head url>"
exit 1
fi
head_url=$2
export NATS_HOST="$head_url"
export VLLM_TORCH_HOST="$head_url"
export API_SERVER_HOST="$head_url"
# Empty --log-dir will dump logs to stdout
echo "Starting vLLM generate workers..."
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" \
VLLM_WORKER_ID=${VLLM_CONTEXT_WORKERS} \
python3 -m llm.vllm.deploy \
--generate-worker-count 1 \
--context-tp-size ${VLLM_CONTEXT_TP_SIZE} \
--generate-tp-size ${VLLM_GENERATE_TP_SIZE} \
--request-plane-uri ${NATS_HOST}:${NATS_PORT} \
--model-name neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
--worker-name llama \
--kv-cache-dtype fp8 \
--dtype auto \
--disable-async-output-proc \
--disable-log-stats \
--max-model-len 3500 \
--max-batch-size 10000 \
--gpu-memory-utilization 0.9 &
......@@ -25,9 +25,10 @@ import vllm.inputs.data
LOGGER = vllm.logger.init_logger(__name__)
# TODO ptarasiewicz remove after verifying streaming works efficiently
# FIXME currently streaming all the tokens is not efficient
# with RETURN_EVERY_N so large we return only first token and whole sequence at the end
RETURN_EVERY_N = 1000000
RETURN_EVERY_N = 1
class Stage(abc.ABC):