@@ -136,8 +136,6 @@ dynamo serve graphs.agg:Frontend -f configs/deepseek_r1/mtp/mtp_agg.yaml
...
@@ -136,8 +136,6 @@ dynamo serve graphs.agg:Frontend -f configs/deepseek_r1/mtp/mtp_agg.yaml
```
```
Notes:
Notes:
- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
- Please keep the `cuda_graph_padding_enabled` setting as `false` in the model engine's configuration. There is a known bug, and the fix will be included in the next release of TensorRT-LLM.
- MTP support for Disaggregation in Dynamo + TensorRT-LLM is coming soon.
- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking
- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking
dynamo serve components.prefill_worker:TensorRTLLMPrefillWorker -f ./configs/disagg.yaml --service-name TensorRTLLMPrefillWorker &
dynamo serve components.prefill_worker:TensorRTLLMPrefillWorker -f ./configs/disagg.yaml --service-name TensorRTLLMPrefillWorker &
```
```
...
@@ -235,6 +233,44 @@ Notes:
...
@@ -235,6 +233,44 @@ Notes:
unset SLURM_JOBID SLURM_JOB_ID SLURM_NODELIST
unset SLURM_JOBID SLURM_JOB_ID SLURM_NODELIST
```
```
#### Multi-Node Disaggregated Serving with Multi-Token Prediction(MTP) and DeepSeek R1
Most of the steps remain the same as the above example, but this time we will have `dynamo serve` point to different config files that contains the MTP configurations