Commit 6630fa5c authored by richardhuo-nv, committed by GitHub

fix: change the processor number to 5 to reduce the tokenization bottleneck (#865)

We observed a 40% performance drop relative to trtllm serve when benchmarking with isl=1000 and osl=200 (input and output sequence lengths) at concurrency levels above 128.

The number of tokenization workers was the bottleneck. After raising the number of tokenization processors to 5, Dynamo's benchmark performance matches that of trtllm serve.
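
The reasoning behind the worker bump can be sketched as a rough capacity estimate. The snippet below is only an illustration, not part of the commit: the function name `min_tokenizer_workers` and the example figures (450 requests per second, 10 ms of tokenization per request) are hypothetical and were not measured in this benchmark.

```python
import math


def min_tokenizer_workers(requests_per_second: int, tokenize_ms: int) -> int:
    """Lower bound on workers needed to keep up with tokenization.

    Each worker can spend at most 1000 ms per second tokenizing, so the
    demand (requests/s * ms/request) divided by 1000 gives the minimum
    number of workers. Hypothetical helper, for illustration only.
    """
    if requests_per_second < 0 or tokenize_ms < 0:
        raise ValueError("inputs must be non-negative")
    busy_ms_per_second = requests_per_second * tokenize_ms
    return max(1, math.ceil(busy_ms_per_second / 1000))


# Hypothetical load: one worker is saturated well before this rate.
workers_needed = min_tokenizer_workers(requests_per_second=450, tokenize_ms=10)
print(workers_needed)
```

With fewer workers than this bound, requests queue up in front of the tokenizer regardless of how fast the engine is, which is consistent with the drop appearing only at high concurrency.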
parent 0f251c90
@@ -21,6 +21,8 @@ Frontend:
 Processor:
   engine_args: "configs/llm_api_config.yaml"
   router: round-robin
+  ServiceArgs:
+    workers: 5 # to reduce the tokenization bottleneck at a high concurrency
 TensorRTLLMWorker:
   engine_args: "configs/llm_api_config.yaml"
......
@@ -21,6 +21,8 @@ Frontend:
 Processor:
   engine_args: "configs/llm_api_config_router.yaml"
   router: kv
+  ServiceArgs:
+    workers: 5 # to reduce the tokenization bottleneck at a high concurrency
 Router:
   model-name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
......
@@ -22,6 +22,8 @@ Processor:
   engine_args: "configs/llm_api_config.yaml"
   router: round-robin
   remote-prefill: true
+  ServiceArgs:
+    workers: 5 # to reduce the tokenization bottleneck at a high concurrency
 TensorRTLLMWorker:
   engine_args: "configs/llm_api_config.yaml"
......
@@ -22,6 +22,8 @@ Processor:
   engine_args: "configs/llm_api_config_disagg_router.yaml"
   router: "kv"
   remote-prefill: true
+  ServiceArgs:
+    workers: 5 # to reduce the tokenization bottleneck at a high concurrency
 Router:
   model-name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
......
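
All four hunks apply the same override, `Processor.ServiceArgs.workers`. As a minimal sketch of that change on an already-parsed config, the helper below sets the key on a plain dictionary. The function name `with_processor_workers` is hypothetical and is not part of Dynamo; the dictionary mirrors only the keys visible in the first hunk.

```python
import copy


def with_processor_workers(config: dict, workers: int) -> dict:
    """Return a copy of `config` with Processor.ServiceArgs.workers set.

    Hypothetical helper for illustration; it leaves the input untouched
    and creates the intermediate mappings if they are missing.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    updated = copy.deepcopy(config)
    processor = updated.setdefault("Processor", {})
    processor.setdefault("ServiceArgs", {})["workers"] = workers
    return updated


# Keys taken from the first hunk of this diff, before the change.
base = {
    "Processor": {
        "engine_args": "configs/llm_api_config.yaml",
        "router": "round-robin",
    },
    "TensorRTLLMWorker": {"engine_args": "configs/llm_api_config.yaml"},
}

tuned = with_processor_workers(base, 5)
print(tuned["Processor"]["ServiceArgs"])
```

Returning a copy keeps the baseline config available, which makes it easy to benchmark several worker counts against the same starting point.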