Unverified commit 6630fa5c, authored by richardhuo-nv, committed by GitHub

fix: change the processor number to 5 to reduce the tokenization bottleneck (#865)

We observed a roughly 40% performance drop compared with trtllm serve when benchmarking with isl=1000 and osl=200 at concurrency levels above 128.

The number of tokenization workers was the bottleneck. After bumping the number of tokenization workers to 5, dynamo's benchmark performance matched that of trtllm serve.
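To see why a single tokenization worker caps throughput, consider round-robin dispatch of in-flight requests across N workers. The sketch below is hypothetical (the function name and dispatch model are illustrative, not dynamo's actual code): it shows the worst-case per-worker queue depth at a given concurrency, which shrinks roughly by a factor of N as workers are added.

```python
# Hypothetical sketch, not dynamo's implementation: estimate the
# worst-case number of in-flight requests queued at each tokenization
# worker under round-robin dispatch.

def per_worker_queue_depth(concurrency: int, workers: int) -> int:
    """Ceiling of concurrency / workers: the deepest queue any one
    worker can accumulate when requests are spread round-robin."""
    return -(-concurrency // workers)  # ceiling division

# With one worker, all 128 concurrent requests serialize behind it;
# with five workers, each handles at most 26.
print(per_worker_queue_depth(128, 1))  # 128
print(per_worker_queue_depth(128, 5))  # 26
```

This is why the regression only appeared above concurrency 128: at low concurrency a single worker keeps up, but once requests arrive faster than one process can tokenize, queueing delay dominates end-to-end latency.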
parent 0f251c90
@@ -21,6 +21,8 @@ Frontend:
 Processor:
   engine_args: "configs/llm_api_config.yaml"
   router: round-robin
+  ServiceArgs:
+    workers: 5 # to reduce the tokenization bottleneck at a high concurrency
 TensorRTLLMWorker:
   engine_args: "configs/llm_api_config.yaml"
...
@@ -21,6 +21,8 @@ Frontend:
 Processor:
   engine_args: "configs/llm_api_config_router.yaml"
   router: kv
+  ServiceArgs:
+    workers: 5 # to reduce the tokenization bottleneck at a high concurrency
 Router:
   model-name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
...
@@ -22,6 +22,8 @@ Processor:
   engine_args: "configs/llm_api_config.yaml"
   router: round-robin
   remote-prefill: true
+  ServiceArgs:
+    workers: 5 # to reduce the tokenization bottleneck at a high concurrency
 TensorRTLLMWorker:
   engine_args: "configs/llm_api_config.yaml"
...
@@ -22,6 +22,8 @@ Processor:
   engine_args: "configs/llm_api_config_disagg_router.yaml"
   router: "kv"
   remote-prefill: true
+  ServiceArgs:
+    workers: 5 # to reduce the tokenization bottleneck at a high concurrency
 Router:
   model-name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
...