fix: change the processor number to 5 to reduce the tokenization bottleneck (#865)

We observed a 40% performance drop relative to trtllm serve when benchmarking with isl=1000 and osl=200 at concurrency levels above 128. The bottleneck was the number of tokenization workers. After raising the number of tokenization processors (the Processor's `workers` setting) to 5, dynamo's benchmark performance matches that of trtllm serve.
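
As a rough illustration of why the worker count caps throughput, the sketch below models tokenization as CPU-bound work spread evenly across independent worker processes. This is a back-of-the-envelope model, not dynamo's implementation: the function names and the 10 ms per-request tokenization cost are hypothetical and were not measured in this benchmark.

```python
import math


def max_request_rate(workers: int, tokenize_ms: float) -> float:
    """Upper bound on requests/s the tokenization stage can sustain.

    Assumes each worker handles one request at a time and that
    workers scale linearly (no shared-resource contention).
    """
    return workers * 1000.0 / tokenize_ms


def workers_needed(target_rate: float, tokenize_ms: float) -> int:
    """Smallest worker count whose ceiling meets target_rate (requests/s)."""
    return math.ceil(target_rate * tokenize_ms / 1000.0)


if __name__ == "__main__":
    # Hypothetical cost: 10 ms of CPU time to tokenize one ~1000-token prompt.
    cost_ms = 10.0
    for n in (1, 5):
        print(f"workers={n}: ceiling ~{max_request_rate(n, cost_ms):.0f} req/s")
    print("workers needed for 450 req/s:", workers_needed(450, cost_ms))
```

Under this model, a single worker becomes the limit once the engine can absorb requests faster than one process can tokenize them, which is consistent with the drop appearing only at high concurrency.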

Commit 6630fa5c, authored Apr 28, 2025 by richardhuo-nv, committed by GitHub Apr 28, 2025
4 changed files
--- a/examples/tensorrt_llm/configs/agg.yaml
+++ b/examples/tensorrt_llm/configs/agg.yaml
@@ -21,6 +21,8 @@ Frontend:
 Processor:
  engine_args: "configs/llm_api_config.yaml"
  router: round-robin
+  ServiceArgs:
+    workers: 5 # to reduce the tokenization bottleneck at a high concurrency

 TensorRTLLMWorker:
  engine_args: "configs/llm_api_config.yaml"

--- a/examples/tensorrt_llm/configs/agg_router.yaml
+++ b/examples/tensorrt_llm/configs/agg_router.yaml
@@ -21,6 +21,8 @@ Frontend:
 Processor:
  engine_args: "configs/llm_api_config_router.yaml"
  router: kv
+  ServiceArgs:
+    workers: 5  # to reduce the tokenization bottleneck at a high concurrency

 Router:
  model-name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B

--- a/examples/tensorrt_llm/configs/disagg.yaml
+++ b/examples/tensorrt_llm/configs/disagg.yaml
@@ -22,6 +22,8 @@ Processor:
  engine_args: "configs/llm_api_config.yaml"
  router: round-robin
  remote-prefill: true
+  ServiceArgs:
+    workers: 5 # to reduce the tokenization bottleneck at a high concurrency

 TensorRTLLMWorker:
  engine_args: "configs/llm_api_config.yaml"

--- a/examples/tensorrt_llm/configs/disagg_router.yaml
+++ b/examples/tensorrt_llm/configs/disagg_router.yaml
@@ -22,6 +22,8 @@ Processor:
  engine_args: "configs/llm_api_config_disagg_router.yaml"
  router: "kv"
  remote-prefill: true
+  ServiceArgs:
+    workers: 5 # to reduce the tokenization bottleneck at a high concurrency

 Router:
  model-name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B