"git@developer.sourcefind.cn:yangql/googletest.git" did not exist on "88080ee943b2b769557488e9c60850da96ab839e"
Commit 04e50aba authored by ptarasiewiczNV, committed by GitHub

docs: Add Llama 70B benchmark reproduction and results


Signed-off-by: Neelay Shah <neelays@nvidia.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>
Co-authored-by: Piotr Marcinkiewicz <piotrm@nvidia.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
parent 993954a6
@@ -14,10 +14,20 @@
# See the License for the specific language governing permissions and
# limitations under the License.
TAG=
RUN_PREFIX=
PLATFORM=linux/amd64
VERSION=0.1.1
# Get short commit hash
commit_id=$(git rev-parse --short HEAD)
# Attempt to get current tag
current_tag=$(git describe --tags --exact-match 2>/dev/null) || true
# Use tag if available, otherwise use commit hash
VERSION=${current_tag:-$commit_id}
# Frameworks
#
@@ -537,7 +537,7 @@ diff -Naur v0.6.3.post1_vllm/worker/model_runner.py patched_vllm/worker/model_ru
+ def init_kv_cache_handler(self) -> None:
+ if envs.VLLM_DISAGG_STAGE is not None:
+ self._kv_cache_handler = get_kv_cache_handler()
+ torch.distributed.barrier()
+ # torch.distributed.barrier() # TODO ptarasiewicz check why this is raising NCCL errors
+
def _update_inputs_to_capture_for_enc_dec_model(self,
capture_inputs: Dict[str,
@@ -229,12 +229,14 @@ instead for each process ID replacing `<pid>` below:
kill -9 <pid>
```
## Y. Known Issues & Limitations
## Known Issues & Limitations
1. **Tensor Parallelism Constraints**
- Currently limited to TP=1 for both prefill and decode workers
## Z. References
2. Currently streaming is not supported and results are returned all at once.
## References
[^1]: Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao
Zhang. Distserve: Disaggregating prefill and decoding for goodput-optimized large language
@@ -43,7 +43,7 @@ def parse_args():
)
parser.add_argument(
"--log-level", type=int, default=3, help="log level applied to all workers"
"--log-level", type=int, default=1, help="log level applied to all workers"
)
parser.add_argument(
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Tuning and Benchmarking Disaggregated Serving
**Disaggregated Serving** [^1] enables developers and teams deploying
LLMs to tune their deployment based on input and output sequence
lengths to achieve a targeted SLA with the right mix of context and
generation workers. In particular, disaggregated serving lets teams
choose different parallelization strategies for each phase and balance
throughput (tokens / sec / gpu) against latency (tokens / sec / user).
## Example
### 50 tokens per sec SLA with Input (3000) / Output (150) Sequence Length Tuning
To determine the best mix of context and generate workers for a
targeted latency at given input and output sequence lengths, we
generally perform "sweeps" comparing different strategies to find the
best throughput within the SLA (see the sweep sketch after the
`genai-perf` command below).
For example, for input sequence length 3000 and output sequence length
150, after sweeping different tensor parallelism strategies on two
8 x H100 GPU nodes, we found that using 4 instances of TP 2 for
context (on one node) and 1 instance of TP 8 for generate (on the
second node) gives the best throughput at a latency target of 50
tokens per sec per user.
At that latency target, our early measurements show disaggregated
serving outperforming traditional aggregated LLM serving by more than
1.5x (with throughput normalized per GPU).
### Reproducing Results
To reproduce similar results on a two-node 8 x H100 GPU system, we
provide the following sample scripts.
### Launch Context Workers on First Node
On the first (head) node:
```
bash deploy_llama_70b_context_tp2dp4.sh --head-url <head url>
```
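Pass the same `<head url>` value on both nodes; the deployment scripts
use it as the host for the NATS server, the torch distributed setup,
and the API server.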
### Launch Generate Worker on Second Node
On the second node:
```
bash deploy_llama_70b_generate_tp8dp1.sh --head-url <head url>
```
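### Verify the Deployment (Optional)
Before benchmarking, you can check that the API server responds. This
is a minimal sketch, assuming the server exposes the OpenAI-compatible
`v1/chat/completions` endpoint used by `genai-perf` below and listens
on the `API_SERVER_PORT` (8005) set in the deployment scripts:
```
curl -s http://<head url>:8005/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'
```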
### Benchmark
The following `genai-perf` command simulates traffic with an input sequence length of 3000 and an output sequence length of 150.
```
genai-perf profile \
-m llama \
--url <api server url> \
--endpoint-type chat \
--streaming \
--num-dataset-entries 100 \
--service-kind openai \
--endpoint v1/chat/completions \
--warmup-request-count 10 \
--random-seed 123 \
--synthetic-input-tokens-stddev 0 \
--output-tokens-stddev 0 \
--tokenizer neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
--synthetic-input-tokens-mean 3000 \
--output-tokens-mean 150 \
--extra-inputs seed:100 \
--extra-inputs min_tokens:150 \
--extra-inputs max_tokens:150 \
--profile-export-file my_profile_export.json \
--artifact-dir artifacts/ \
--concurrency <N> \
--request-count <10 * N> \
-- -v \
--async
```
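To run the kind of sweep described above, you can wrap this command in
a loop over concurrency values. This is a minimal sketch that reuses a
subset of the flags from the full command above (the remaining flags
can be appended unchanged); artifact directories are split per
concurrency so the runs do not overwrite each other:
```
# Hypothetical concurrency sweep; append the remaining flags from the full command above.
for concurrency in 8 16 32 48 64; do
  genai-perf profile \
    -m llama \
    --url <api server url> \
    --endpoint-type chat \
    --streaming \
    --service-kind openai \
    --endpoint v1/chat/completions \
    --tokenizer neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
    --synthetic-input-tokens-mean 3000 \
    --output-tokens-mean 150 \
    --extra-inputs min_tokens:150 \
    --extra-inputs max_tokens:150 \
    --concurrency ${concurrency} \
    --request-count $((10 * concurrency)) \
    --profile-export-file profile_c${concurrency}.json \
    --artifact-dir artifacts/concurrency_${concurrency}/ \
    -- -v \
    --async
done
```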
### Example Results
The following results are provided only as an example; they are not
fully optimized and may not match the numbers you measure locally.
| label | configuration | concurrency | output_token_throughput_per_request (tokens/sec) | output_token_throughput_per_gpu (tokens/sec) | time_to_first_token (ms) | inter_token_latency (ms) |
|----------|--------------------------------|-------------|---------------------------------------------------|-----------------------------------------------|--------------------------|--------------------------|
| disagg | context_tp2dp4_generate_tp8dp1 | 48 | 49.18 | 87.56 | 1157.49 | 15.94 |
| baseline | baseline_tp4dp1 | 4 | 50.27 | 56.26 | 709.25 | 15.27 |
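Using the per-GPU throughput column above, the disaggregated
configuration delivers roughly 87.56 / 56.26 ≈ 1.56x the normalized
throughput of the aggregated baseline at a comparable per-request rate
(~49-50 tokens per sec per user).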
### Baseline Comparison
On a single node, you can run a comparison against aggregated serving.
With aggregated workers, we found the best throughput at the target SLA
and sequence lengths using 2 instances of tensor parallelism 4.
```
bash deploy_llama_70b_baseline_tp4dp2.sh --head-url <head url>
```
To see the results, use the same `genai-perf` command used to benchmark
the disaggregated setup.
### Stopping the Deployment
```
pkill -SIGINT -f python3
pkill -SIGINT -f nats
```
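To confirm that the workers and NATS server have exited, you can list
any remaining matching processes (an optional check, not part of the
deployment scripts):
```
pgrep -af "python3|nats-server"
```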
## Known Issue
Sometimes NATS errors occur during the first run. In that case, restart the deployment.
## References
[^1]: Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao
Zhang. Distserve: Disaggregating prefill and decoding for goodput-optimized large language
model serving. *arXiv:2401.09670v3 [cs.DC]*, 2024.
#!/bin/bash
set -e
set -x
export VLLM_ATTENTION_BACKEND=FLASHINFER
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_TORCH_PORT=36183
export VLLM_CONTEXT_WORKERS=4
export VLLM_CONTEXT_TP_SIZE=2
export VLLM_GENERATE_WORKERS=1
export VLLM_GENERATE_TP_SIZE=8
export VLLM_LOGGING_LEVEL=INFO
export VLLM_DATA_PLANE_BACKEND=nccl
export PYTHONUNBUFFERED=1
export NATS_PORT=4223
export NATS_STORE="$(mktemp -d)"
export API_SERVER_PORT=8005
if [ "$1" != "--head-url" ] || [ -z "$2" ]; then
echo "Usage: $0 --head-url <head url>"
exit 1
fi
head_url=$2
export NATS_HOST="$head_url"
export VLLM_TORCH_HOST="$head_url"
export API_SERVER_HOST="$head_url"
# Start NATS Server
echo "Flushing NATS store: ${NATS_STORE}..."
rm -r "${NATS_STORE}"
echo "Starting NATS Server..."
nats-server -p ${NATS_PORT} --jetstream --store_dir "${NATS_STORE}" &
# Start API Server
echo "Starting LLM API Server..."
python3 -m llm.api_server \
--tokenizer neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
--request-plane-uri ${NATS_HOST}:${NATS_PORT} \
--api-server-host ${API_SERVER_HOST} \
--model-name "llama" \
--api-server-port ${API_SERVER_PORT} &
# Empty --log-dir will dump logs to stdout
echo "Starting vLLM baseline workers..."
gpu_configs=(
"0,1"
"2,3"
"4,5"
"6,7"
)
for i in "${!gpu_configs[@]}"; do
CUDA_VISIBLE_DEVICES="${gpu_configs[$i]}" \
VLLM_WORKER_ID=$i \
python3 -m llm.vllm.deploy \
--context-worker-count 1 \
--context-tp-size ${VLLM_CONTEXT_TP_SIZE} \
--generate-tp-size ${VLLM_GENERATE_TP_SIZE} \
--request-plane-uri ${NATS_HOST}:${NATS_PORT} \
--model-name neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
--worker-name llama \
--kv-cache-dtype fp8 \
--dtype auto \
--disable-async-output-proc \
--disable-log-stats \
--max-model-len 3500 \
--max-batch-size 10000 \
--gpu-memory-utilization 0.5 &
done
#!/bin/bash
set -e
set -x
export VLLM_ATTENTION_BACKEND=FLASHINFER
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_TORCH_PORT=36183
export VLLM_CONTEXT_WORKERS=4
export VLLM_CONTEXT_TP_SIZE=2
export VLLM_GENERATE_WORKERS=1
export VLLM_GENERATE_TP_SIZE=8
export VLLM_LOGGING_LEVEL=INFO
export VLLM_DATA_PLANE_BACKEND=nccl
export PYTHONUNBUFFERED=1
export NATS_PORT=4223
export NATS_STORE="$(mktemp -d)"
export API_SERVER_PORT=8005
if [ "$1" != "--head-url" ] || [ -z "$2" ]; then
echo "Usage: $0 --head-url <head url>"
exit 1
fi
head_url=$2
export NATS_HOST="$head_url"
export VLLM_TORCH_HOST="$head_url"
export API_SERVER_HOST="$head_url"
# Empty --log-dir will dump logs to stdout
echo "Starting vLLM generate workers..."
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" \
VLLM_WORKER_ID=${VLLM_CONTEXT_WORKERS} \
python3 -m llm.vllm.deploy \
--generate-worker-count 1 \
--context-tp-size ${VLLM_CONTEXT_TP_SIZE} \
--generate-tp-size ${VLLM_GENERATE_TP_SIZE} \
--request-plane-uri ${NATS_HOST}:${NATS_PORT} \
--model-name neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
--worker-name llama \
--kv-cache-dtype fp8 \
--dtype auto \
--disable-async-output-proc \
--disable-log-stats \
--max-model-len 3500 \
--max-batch-size 10000 \
--gpu-memory-utilization 0.9 &
......@@ -25,9 +25,10 @@ import vllm.inputs.data
LOGGER = vllm.logger.init_logger(__name__)
# TODO ptarasiewicz remove after verifying streaming works efficiently
# FIXME currently streaming all the tokens is not efficient
# with RETURN_EVERY_N so large we return only first token and whole sequence at the end
RETURN_EVERY_N = 1000000
RETURN_EVERY_N = 1
class Stage(abc.ABC):