# Tuning and Benchmarking Disaggregated Serving
**Disaggregated Serving** [^1] lets developers and teams deploying
LLMs tune their deployment for specific input and output sequence
lengths, meeting a target SLA with the right mix of context and
generation workers. In particular, disaggregated serving lets teams
choose a different parallelization strategy for each phase and balance
throughput (tokens/sec/GPU) against latency (tokens/sec/user).
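These two metrics pull in opposite directions: serving more concurrent users raises aggregate (per-GPU) throughput but lowers each user's token rate. A minimal sketch of the relationship under a simple model — the per-user rate, concurrency, and GPU count below are illustrative, not measured values:

```python
# Simple model: aggregate throughput is the per-user token rate times the
# number of concurrent users; per-GPU throughput normalizes that by the
# number of GPUs in the deployment.
def per_gpu_throughput(tokens_per_sec_per_user: float,
                       concurrency: int,
                       num_gpus: int) -> float:
    aggregate = tokens_per_sec_per_user * concurrency
    return aggregate / num_gpus

# Hypothetical deployment: 50 tok/s/user SLA, 32 concurrent users, 16 GPUs.
print(per_gpu_throughput(50.0, 32, 16))  # 100.0 tokens/sec/GPU
```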
## Example
### 50 tokens per sec SLA with Input (3000) / Output (150) Sequence Length Tuning
To determine the best mix of context and generate workers for a
target latency and given input and output sequence lengths, we
generally perform "sweeps" that compare different strategies to find
the best throughput within the SLA.
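A sweep reduces to a simple selection problem: among the configurations whose per-user token rate meets the SLA, pick the one with the highest per-GPU throughput. A sketch with entirely hypothetical sweep results — the configuration labels and numbers below are illustrative, not measurements:

```python
SLA_TOKENS_PER_SEC_PER_USER = 50.0

# Hypothetical sweep results: (config label, tok/s/user, tok/s/GPU).
sweep = [
    ("context_tp2_generate_tp8", 55.0, 70.0),
    ("context_tp4_generate_tp8", 52.0, 88.0),
    ("context_tp8_generate_tp8", 61.0, 64.0),
    ("aggregated_tp4",           48.0, 95.0),  # misses the SLA despite high tok/s/GPU
]

# Keep only configurations that meet the per-user latency SLA ...
meets_sla = [r for r in sweep if r[1] >= SLA_TOKENS_PER_SEC_PER_USER]
# ... then take the best per-GPU throughput among them.
best = max(meets_sla, key=lambda r: r[2])
print(best[0])  # context_tp4_generate_tp8
```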
For example, for an input sequence length of 3000 and an output
sequence length of 150, after sweeping different tensor parallelism
strategies on two 8 x H100 GPU nodes, we found that using 2 instances
of TP 4 for context (on one node) and 1 instance of TP 8 for generate
(on the second node) gives the best throughput at a latency target of
50 tokens per sec per user.
At that latency target, our early measurements show disaggregated
serving outperforming traditional aggregated LLM serving by more than
1.5x (with throughput normalized per GPU).
### Reproducing Results
To reproduce similar results on a 2-node H100 x 8 GPU system, we
provide sample scripts.
### Launch Context Workers on the First Node
On the first (head) node:
```
bash deploy_llama_70b_context_tp2dp4.sh --head-url
```
### Launch the Generate Worker on the Second Node
On the second node:
```
bash deploy_llama_70b_generate_tp8dp1.sh --head-url
```
### Benchmark
The following `genai-perf` command simulates traffic with an input sequence length of 3000 and an output sequence length of 150.
```
genai-perf profile \
-m llama \
--url \
--endpoint-type chat \
--streaming \
--num-dataset-entries 100 \
--service-kind openai \
--endpoint v1/chat/completions \
--warmup-request-count 10 \
--random-seed 123 \
--synthetic-input-tokens-stddev 0 \
--output-tokens-stddev 0 \
--tokenizer neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
--synthetic-input-tokens-mean 3000 \
--output-tokens-mean 150 \
--extra-inputs seed:100 \
--extra-inputs min_tokens:150 \
--extra-inputs max_tokens:150 \
--profile-export-file my_profile_export.json \
--artifact-dir artifacts/ \
--concurrency <N> \
--request-count <10 * N> \
-- -v \
--async
```
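The `--concurrency` and `--request-count` placeholders are tied together: the command issues ten requests per concurrency slot. When sweeping several concurrency values, the matching request counts can be derived mechanically — the concurrency values below are just an example sweep, not a recommendation:

```python
# For each swept concurrency N, the benchmark above uses request-count = 10 * N.
concurrencies = [8, 16, 32, 48]
request_counts = {n: 10 * n for n in concurrencies}
print(request_counts)  # {8: 80, 16: 160, 32: 320, 48: 480}
```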
### Example Results
The following results are given as an example; they are not fully
optimized and do not indicate what you may see locally.
| label | configuration | concurrency | output_token_throughput_per_request (tokens/sec/user) | output_token_throughput_per_gpu (tokens/sec/GPU) | time_to_first_token (ms) | inter_token_latency (ms) |
|----------|--------------------------------|-------------|-------|-------|---------|-------|
| disagg | context_tp2dp4_generate_tp8dp1 | 48 | 49.18 | 87.56 | 1157.49 | 15.94 |
| baseline | baseline_tp4dp1 | 4 | 50.27 | 56.26 | 709.25 | 15.27 |
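The >1.5x figure quoted earlier follows from the per-GPU throughput column of the table above (values rounded to two decimals):

```python
# Per-GPU output token throughput, rounded from the table above.
disagg_tok_s_gpu = 87.56
baseline_tok_s_gpu = 56.26

speedup = disagg_tok_s_gpu / baseline_tok_s_gpu
print(round(speedup, 2))  # 1.56
```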
### Baseline Comparison
On a single node you can run a comparison. With aggregated workers, we
found the best throughput at the target SLA and sequence lengths using
2 instances of tensor parallelism 4.
```
bash deploy_llama_70b_baseline_tp4dp2.sh --head-url
```
To see the results, use the same `genai-perf` command used to
benchmark the disaggregated setup.
### Stopping the Deployment
```
pkill -SIGINT -f python3
pkill -SIGINT -f nats
```
## Known Issue
Sometimes NATS errors occur during the first run. In that case, just restart the deployment.
## References
[^1]: Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. *arXiv:2401.09670v3 [cs.DC]*, 2024.