OpenDAS / dynamo
Commit f10aab3b
Authored Jul 31, 2025 by KrishnanPrash; committed by GitHub on Jul 31, 2025
fix: Migrating trtllm examples from `1.0.0rc0` to `1.0.4rc4` (#2217)
Parent: 97390ac0

Showing 18 changed files with 177 additions and 158 deletions.

components/backends/trtllm/engine_configs/agg.yaml  +4 -1
components/backends/trtllm/engine_configs/decode.yaml  +7 -2
components/backends/trtllm/engine_configs/deepseek_r1/mtp/mtp_agg.yaml  +14 -13
components/backends/trtllm/engine_configs/deepseek_r1/mtp/mtp_decode.yaml  +15 -14
components/backends/trtllm/engine_configs/deepseek_r1/mtp/mtp_prefill.yaml  +2 -1
components/backends/trtllm/engine_configs/deepseek_r1/simple/agg.yaml  +15 -13
components/backends/trtllm/engine_configs/deepseek_r1/simple/decode.yaml  +17 -15
components/backends/trtllm/engine_configs/deepseek_r1/simple/prefill.yaml  +2 -3
components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/dep16_agg.yaml  +15 -13
components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_agg.yaml  +26 -22
components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_decode.yaml  +21 -17
components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_prefill.yaml  +6 -6
components/backends/trtllm/engine_configs/llama4/eagle/eagle_agg.yaml  +7 -17
components/backends/trtllm/engine_configs/llama4/eagle/eagle_decode.yaml  +15 -14
components/backends/trtllm/engine_configs/llama4/eagle/eagle_prefill.yaml  +1 -1
components/backends/trtllm/engine_configs/prefill.yaml  +5 -3
components/backends/trtllm/src/dynamo/trtllm/main.py  +4 -2
container/build.sh  +1 -1

components/backends/trtllm/engine_configs/agg.yaml
@@ -28,4 +28,7 @@ kv_cache_config:
 # NOTE: overlap_scheduler enabled by default since this commit and changed
 # config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
 # https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
-use_cuda_graph: true
+cuda_graph_config:
+  max_batch_size: 16
\ No newline at end of file

components/backends/trtllm/engine_configs/decode.yaml
@@ -16,11 +16,16 @@ tensor_parallel_size: 1
 moe_expert_parallel_size: 1
 enable_attention_dp: false
 max_num_tokens: 8192
 max_batch_size: 16
 trust_remote_code: true
 backend: pytorch
 enable_chunked_prefill: true
 disable_overlap_scheduler: false
-use_cuda_graph: true
+cuda_graph_config:
+  max_batch_size: 16
 kv_cache_config:
   free_gpu_memory_fraction: 0.95
+cache_transceiver_config:
+  backend: default
\ No newline at end of file

components/backends/trtllm/engine_configs/deepseek_r1/mtp/mtp_agg.yaml
@@ -28,23 +28,24 @@ max_num_tokens: 8448
 max_seq_len: 8448
 kv_cache_config:
   free_gpu_memory_fraction: 0.30
+  dtype: fp8
 # Enable the MTP(Multi-Token Prediction) in the model engine
 speculative_config:
   decoding_type: MTP
   num_nextn_predict_layers: 1
-use_cuda_graph: true
-cuda_graph_padding_enabled: true
-cuda_graph_batch_sizes:
-  - 1
-  - 2
-  - 4
-  - 8
-  - 16
-  - 32
-  - 64
-  - 128
-  - 256
+cuda_graph_config:
+  enable_padding: true
+  batch_sizes:
+    - 1
+    - 2
+    - 4
+    - 8
+    - 16
+    - 32
+    - 64
+    - 128
+    - 256
 print_iter_log: true
-kv_cache_dtype: fp8
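
The hunk above replaces the three flat CUDA graph keys with a nested `cuda_graph_config` block and moves `kv_cache_dtype` to `kv_cache_config.dtype`. A minimal sketch of that rewrite on an already-parsed config dict could look like the following. The helper name `migrate_cuda_graph_keys` is hypothetical (it is not part of Dynamo or TensorRT-LLM), and the key mapping is inferred only from the changes in this commit. It covers the pattern in this file, where CUDA graphs are enabled with explicit padding and batch sizes.

```python
from copy import deepcopy


def migrate_cuda_graph_keys(old_cfg: dict) -> dict:
    """Rewrite flat rc0-style keys into the nested layout shown in this commit.

    Hypothetical helper: the mapping below is inferred from the diff only.
    """
    cfg = deepcopy(old_cfg)  # leave the caller's dict untouched

    use_graphs = cfg.pop("use_cuda_graph", False)
    padding = cfg.pop("cuda_graph_padding_enabled", None)
    batch_sizes = cfg.pop("cuda_graph_batch_sizes", None)

    if use_graphs:
        graph_cfg = {}
        if padding is not None:
            graph_cfg["enable_padding"] = padding
        if batch_sizes is not None:
            graph_cfg["batch_sizes"] = list(batch_sizes)
        cfg["cuda_graph_config"] = graph_cfg

    # kv_cache_dtype moves under kv_cache_config as `dtype`.
    if "kv_cache_dtype" in cfg:
        cfg.setdefault("kv_cache_config", {})["dtype"] = cfg.pop("kv_cache_dtype")

    return cfg


old = {
    "kv_cache_config": {"free_gpu_memory_fraction": 0.30},
    "use_cuda_graph": True,
    "cuda_graph_padding_enabled": True,
    "cuda_graph_batch_sizes": [1, 2, 4, 8],
    "print_iter_log": True,
    "kv_cache_dtype": "fp8",
}
new = migrate_cuda_graph_keys(old)
```

Working on a deep copy keeps the original mapping available for comparison, which helps when checking a migrated file against its previous version.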

components/backends/trtllm/engine_configs/deepseek_r1/mtp/mtp_decode.yaml
@@ -31,23 +31,24 @@ max_num_tokens: 512
 max_seq_len: 8704
 kv_cache_config:
   free_gpu_memory_fraction: 0.85
+  dtype: fp8
 # Enable the MTP(Multi-Token Prediction) in decode model engine
 speculative_config:
   decoding_type: MTP
   num_nextn_predict_layers: 1
-use_cuda_graph: true
-cuda_graph_padding_enabled: true
-cuda_graph_batch_sizes:
-  - 1
-  - 2
-  - 4
-  - 8
-  - 16
-  - 32
-  - 64
-  - 128
-  - 256
-print_iter_log: true
-kv_cache_dtype: fp8
+cuda_graph_config:
+  enable_padding: true
+  batch_sizes:
+    - 1
+    - 2
+    - 4
+    - 8
+    - 16
+    - 32
+    - 64
+    - 128
+    - 256
+print_iter_log: true
\ No newline at end of file

components/backends/trtllm/engine_configs/deepseek_r1/mtp/mtp_prefill.yaml
@@ -27,8 +27,9 @@ max_num_tokens: 8192
 max_seq_len: 8192
 kv_cache_config:
   free_gpu_memory_fraction: 0.75
+  dtype: fp8
 print_iter_log: true
-kv_cache_dtype: fp8
 disable_overlap_scheduler: true
 # Enable the MTP(Multi-Token Prediction) in the prefill model engine

components/backends/trtllm/engine_configs/deepseek_r1/simple/agg.yaml
@@ -31,24 +31,26 @@ kv_cache_config:
   # With dp attention enabled: large ISL at high concurrency may need
   # free_gpu_memory_fraction low to have enough available memory.
   # free_gpu_memory_fraction: 0.30
+  dtype: fp8
 # NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
 # NOTE: overlap_scheduler enabled by default since this commit and changed
 # config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
 # https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
-use_cuda_graph: true
-cuda_graph_padding_enabled: true
+cuda_graph_config:
+  enable_padding: true
 # NOTE: For larger max batch size, you may want to add larger cuda graph
 # batch sizes below to match.
-cuda_graph_batch_sizes:
-  - 1
-  - 2
-  - 4
-  - 8
-  - 16
-  - 32
-  - 64
-  - 128
-  - 256
+  batch_sizes:
+    - 1
+    - 2
+    - 4
+    - 8
+    - 16
+    - 32
+    - 64
+    - 128
+    - 256
 print_iter_log: true
-kv_cache_dtype: fp8

components/backends/trtllm/engine_configs/deepseek_r1/simple/decode.yaml
@@ -31,25 +31,27 @@ kv_cache_config:
   # With dp attention enabled: large ISL at high concurrency may need
   # free_gpu_memory_fraction low to have enough available memory.
   # free_gpu_memory_fraction: 0.30
+  dtype: fp8
 # NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
 # NOTE: overlap_scheduler enabled by default since this commit and changed
 # config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
 # https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
 disable_overlap_scheduler: false
-use_cuda_graph: true
-cuda_graph_padding_enabled: true
-# NOTE: For larger max batch size, you may want to add larger cuda graph
-# batch sizes below to match.
-cuda_graph_batch_sizes:
-  - 1
-  - 2
-  - 4
-  - 8
-  - 16
-  - 32
-  - 64
-  - 128
-  - 256
+cuda_graph_config:
+  enable_padding: true
+  # NOTE: For larger max batch size, you may want to
+  # add larger cuda graph batch sizes below to match.
+  batch_sizes:
+    - 1
+    - 2
+    - 4
+    - 8
+    - 16
+    - 32
+    - 64
+    - 128
+    - 256
 print_iter_log: true
-kv_cache_dtype: fp8

components/backends/trtllm/engine_configs/deepseek_r1/simple/prefill.yaml
@@ -26,12 +26,11 @@ max_seq_len: 8192
 kv_cache_config:
   free_gpu_memory_fraction: 0.75
+  dtype: fp8  # NOTE: This dtype must match in both prefill/decode configs
 # NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
 # NOTE: overlap_scheduler enabled by default since this commit and changed
 # config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
 # https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
 disable_overlap_scheduler: true
-print_iter_log: true
-# NOTE: This dtype must match in both prefill/decode configs
-kv_cache_dtype: fp8
+print_iter_log: true
\ No newline at end of file

components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/dep16_agg.yaml
@@ -10,18 +10,20 @@ enable_attention_dp: true
 max_batch_size: 256
 max_num_tokens: 256
 max_seq_len: 8448
 kv_cache_config:
   free_gpu_memory_fraction: 0.7
-use_cuda_graph: true
-cuda_graph_padding_enabled: true
-cuda_graph_batch_sizes:
-  - 1
-  - 2
-  - 4
-  - 8
-  - 16
-  - 32
-  - 64
-  - 128
-  - 256
-kv_cache_dtype: fp8
+  dtype: fp8
+cuda_graph_config:
+  enable_padding: true
+  batch_sizes:
+    - 1
+    - 2
+    - 4
+    - 8
+    - 16
+    - 32
+    - 64
+    - 128
+    - 256

components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_agg.yaml
@@ -3,14 +3,16 @@
 backend: pytorch
 # WideEP related settings
-moe_backend: WideEP
-# moe_max_num_tokens will default to max_num_tokens if left unspecified.
-#
-# If you want to set this value explicitly, one recommendation is below:
-# moe_max_num_tokens = max_batch_size * moe_expert_parallel_size
-# 4096 = 256 * 16
-# moe_max_num_tokens: 4096
-moe_load_balancer: /mnt/engine_configs/deepseek_r1/wide_ep/eplb.yaml
+moe_config:
+  backend: WIDEEP
+  # moe_max_num_tokens will default to max_num_tokens if left unspecified.
+  #
+  # If you want to set this value explicitly, one recommendation is below:
+  # moe_max_num_tokens = max_batch_size * moe_expert_parallel_size
+  # 4096 = 256 * 16
+  # moe_max_num_tokens: 4096
+  load_balancer: /mnt/engine_configs/deepseek_r1/wide_ep/eplb.yaml
 tensor_parallel_size: 16
 moe_expert_parallel_size: 16
@@ -18,18 +20,20 @@ enable_attention_dp: true
 max_batch_size: 256
 max_num_tokens: 256
 max_seq_len: 8448
 kv_cache_config:
-  free_gpu_memory_fraction: 0.7
-use_cuda_graph: true
-cuda_graph_padding_enabled: true
-cuda_graph_batch_sizes:
-  - 1
-  - 2
-  - 4
-  - 8
-  - 16
-  - 32
-  - 64
-  - 128
-  - 256
-kv_cache_dtype: fp8
+  free_gpu_memory_fraction: 0.3
+  dtype: fp8
+cuda_graph_config:
+  enable_padding: true
+  batch_sizes:
+    - 1
+    - 2
+    - 4
+    - 8
+    - 16
+    - 32
+    - 64
+    - 128
+    - 256
\ No newline at end of file
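
The first hunk of this file groups the flat `moe_backend` and `moe_load_balancer` keys under `moe_config`, and its comment gives a rule of thumb for `moe_max_num_tokens`. The sketch below illustrates both on a parsed config dict. The function names are hypothetical, and upper-casing the backend value is an assumption based solely on the `WideEP` to `WIDEEP` change shown here; other backend names may follow a different convention.

```python
def migrate_moe_keys(old_cfg: dict) -> dict:
    """Move flat MoE keys under `moe_config` (mapping inferred from this diff)."""
    cfg = dict(old_cfg)
    moe_cfg = dict(cfg.get("moe_config", {}))

    if "moe_backend" in cfg:
        # Assumption: the new schema spells the backend in upper case,
        # as in the WideEP -> WIDEEP change above.
        moe_cfg["backend"] = cfg.pop("moe_backend").upper()
    if "moe_load_balancer" in cfg:
        moe_cfg["load_balancer"] = cfg.pop("moe_load_balancer")

    if moe_cfg:
        cfg["moe_config"] = moe_cfg
    return cfg


def recommended_moe_max_num_tokens(max_batch_size: int,
                                   moe_expert_parallel_size: int) -> int:
    """Rule of thumb quoted in the config comment: batch size times EP size."""
    return max_batch_size * moe_expert_parallel_size


migrated = migrate_moe_keys({
    "backend": "pytorch",
    "moe_backend": "WideEP",
    "moe_load_balancer": "/mnt/engine_configs/deepseek_r1/wide_ep/eplb.yaml",
    "moe_expert_parallel_size": 16,
})
suggested = recommended_moe_max_num_tokens(256, 16)
```

With the values from this config (`max_batch_size` 256 and `moe_expert_parallel_size` 16), the rule of thumb reproduces the 4096 quoted in the comment.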

components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_decode.yaml
@@ -15,8 +15,9 @@
 backend: pytorch
 # WideEP related settings
-moe_backend: WideEP
-moe_load_balancer: /mnt/engine_configs/deepseek_r1/wide_ep/eplb.yaml
+moe_config:
+  backend: WIDEEP
+  load_balancer: /mnt/engine_configs/deepseek_r1/wide_ep/eplb.yaml
 # TP/EP/PP/DP
 tensor_parallel_size: 16
@@ -35,25 +36,28 @@ kv_cache_config:
   # With dp attention enabled: large ISL at high concurrency may need
   # free_gpu_memory_fraction low to have enough available memory.
   free_gpu_memory_fraction: 0.30
+  dtype: fp8
 # NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
 # NOTE: overlap_scheduler enabled by default since this commit and changed
 # config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
 # https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
 disable_overlap_scheduler: false
-use_cuda_graph: true
-cuda_graph_padding_enabled: true
-# NOTE: For larger max batch size, you may want to add larger cuda graph
-# batch sizes below to match.
-cuda_graph_batch_sizes:
-  - 1
-  - 2
-  - 4
-  - 8
-  - 16
-  - 32
-  - 64
-  - 128
-  - 256
+cuda_graph_config:
+  enable_padding: true
+  # NOTE: For larger max batch size, you may want to
+  # add larger cuda graph batch sizes below to match.
+  batch_sizes:
+    - 1
+    - 2
+    - 4
+    - 8
+    - 16
+    - 32
+    - 64
+    - 128
+    - 256
 print_iter_log: true
-kv_cache_dtype: fp8

components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_prefill.yaml
@@ -15,8 +15,9 @@
 backend: pytorch
 # WideEP related settings
-moe_backend: WideEP
-moe_load_balancer: /mnt/engine_configs/deepseek_r1/wide_ep/eplb.yaml
+moe_config:
+  backend: WIDEEP
+  load_balancer: /mnt/engine_configs/deepseek_r1/wide_ep/eplb.yaml
 # TP/EP/PP/DP
 tensor_parallel_size: 16
@@ -29,13 +30,12 @@ max_num_tokens: 8192
 max_seq_len: 8192
 kv_cache_config:
-  free_gpu_memory_fraction: 0.75
+  free_gpu_memory_fraction: 0.3
+  dtype: fp8  # NOTE: This dtype must match in both prefill/decode configs
 # NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
 # NOTE: overlap_scheduler enabled by default since this commit and changed
 # config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
 # https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
 disable_overlap_scheduler: true
-print_iter_log: true
-# NOTE: This dtype must match in both prefill/decode configs
-kv_cache_dtype: fp8
+print_iter_log: true
\ No newline at end of file

components/backends/trtllm/engine_configs/llama4/eagle/eagle_agg.yaml
@@ -21,31 +21,21 @@ max_batch_size: 256
 # Will be investigated in the future with TRTLLM team.
 max_num_tokens: 1024
 max_seq_len: 8448
-autotuner_enabled: false
+enable_autotuner: false
 disable_overlap_scheduler: true
 # Enable Speculative Decoding in the model engine
 speculative_config:
   decoding_type: Eagle
   max_draft_len: 1
-  pytorch_weights_path: nvidia/Llama-4-Maverick-17B-128E-Eagle3
-  eagle3_one_model: False
+  speculative_model_dir: nvidia/Llama-4-Maverick-17B-128E-Eagle3
+  eagle3_one_model: false
 kv_cache_config:
   free_gpu_memory_fraction: 0.5
   enable_block_reuse: false
-use_cuda_graph: true
-cuda_graph_padding_enabled: true
-cuda_graph_batch_sizes:
-  - 1
-  - 2
-  - 4
-  - 8
-  - 16
-  - 32
-  - 64
-  - 128
-  - 256
-print_iter_log: true
-kv_cache_dtype: fp8
+cuda_graph_config:
+  max_batch_size: 8
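
Several keys are renamed or relocated across the files in this commit. As a rough aid for auditing other configs, the sketch below walks a parsed config and reports any of the old keys it finds together with the replacement seen in this diff. The table `LEGACY_KEYS` and the function `find_legacy_keys` are hypothetical names, and the table lists only the changes visible in this commit; it is not a complete description of the TensorRT-LLM schema. Note that the `pytorch_weights_path` rename appears only in `eagle_agg.yaml` here.

```python
# Old key path -> replacement observed in this commit (dotted paths for nesting).
LEGACY_KEYS = {
    "use_cuda_graph": "cuda_graph_config",
    "cuda_graph_padding_enabled": "cuda_graph_config.enable_padding",
    "cuda_graph_batch_sizes": "cuda_graph_config.batch_sizes",
    "kv_cache_dtype": "kv_cache_config.dtype",
    "moe_backend": "moe_config.backend",
    "moe_load_balancer": "moe_config.load_balancer",
    "autotuner_enabled": "enable_autotuner",
    "speculative_config.pytorch_weights_path":
        "speculative_config.speculative_model_dir",
}


def find_legacy_keys(cfg: dict, prefix: str = "") -> list:
    """Return (old_path, suggested_path) pairs for legacy keys found in cfg."""
    hits = []
    for key, value in cfg.items():
        path = f"{prefix}{key}"
        if path in LEGACY_KEYS:
            hits.append((path, LEGACY_KEYS[path]))
        if isinstance(value, dict):
            # Recurse so nested keys such as speculative_config.* are checked.
            hits.extend(find_legacy_keys(value, prefix=f"{path}."))
    return hits


old_eagle = {
    "autotuner_enabled": False,
    "speculative_config": {
        "decoding_type": "Eagle",
        "pytorch_weights_path": "nvidia/Llama-4-Maverick-17B-128E-Eagle3",
    },
    "use_cuda_graph": True,
}
for old_path, new_path in find_legacy_keys(old_eagle):
    print(f"{old_path} -> {new_path}")
```

Reporting rather than rewriting keeps the decision with the person editing the config, which matters for cases such as `use_cuda_graph`, where this commit substitutes different `cuda_graph_config` contents in different files.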

components/backends/trtllm/engine_configs/llama4/eagle/eagle_decode.yaml
@@ -28,23 +28,24 @@ speculative_config:
   decoding_type: Eagle
   max_draft_len: 1
   pytorch_weights_path: nvidia/Llama-4-Maverick-17B-128E-Eagle3
-  eagle3_one_model: False
+  eagle3_one_model: false
 kv_cache_config:
   free_gpu_memory_fraction: 0.5
   enable_block_reuse: false
+  dtype: fp8
+cuda_graph_config:
+  enable_padding: true
+  batch_sizes:
+    - 1
+    - 2
+    - 4
+    - 8
+    - 16
+    - 32
+    - 64
+    - 128
+    - 256
-use_cuda_graph: true
-cuda_graph_padding_enabled: true
-cuda_graph_batch_sizes:
-  - 1
-  - 2
-  - 4
-  - 8
-  - 16
-  - 32
-  - 64
-  - 128
-  - 256
 print_iter_log: true
-kv_cache_dtype: fp8

components/backends/trtllm/engine_configs/llama4/eagle/eagle_prefill.yaml
@@ -29,7 +29,7 @@ speculative_config:
   decoding_type: Eagle
   max_draft_len: 1
   pytorch_weights_path: nvidia/Llama-4-Maverick-17B-128E-Eagle3
-  eagle3_one_model: False
+  eagle3_one_model: false
 kv_cache_config:
   free_gpu_memory_fraction: 0.5

components/backends/trtllm/engine_configs/prefill.yaml
@@ -16,13 +16,15 @@ tensor_parallel_size: 1
 moe_expert_parallel_size: 1
 enable_attention_dp: false
 max_num_tokens: 8192
 max_batch_size: 16
 trust_remote_code: true
 backend: pytorch
 enable_chunked_prefill: true
 # Overlap scheduler not currently supported in prefill only workers.
 disable_overlap_scheduler: true
-use_cuda_graph: false
+cuda_graph_config:
+  max_batch_size: 16
 kv_cache_config:
   free_gpu_memory_fraction: 0.95
+cache_transceiver_config:
+  backend: default
\ No newline at end of file

components/backends/trtllm/src/dynamo/trtllm/main.py
@@ -101,8 +101,10 @@ async def init(runtime: DistributedRuntime, config: Config):
         kv_cache_config["event_buffer_max_size"] = DEFAULT_KV_EVENT_BUFFER_MAX_SIZE
     else:
         kv_cache_config = arg_map["kv_cache_config"]
-        if not kv_cache_config.event_buffer_max_size:
-            kv_cache_config.event_buffer_max_size = DEFAULT_KV_EVENT_BUFFER_MAX_SIZE
+        if "event_buffer_max_size" not in kv_cache_config:
+            kv_cache_config["event_buffer_max_size"] = DEFAULT_KV_EVENT_BUFFER_MAX_SIZE
+        arg_map["kv_cache_config"] = kv_cache_config
     # Only pytorch backend is supported for now to publish events and metrics.
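
The `main.py` change switches from attribute access to key access on `kv_cache_config`, which suggests the value arrives as a plain dict (for example, parsed from one of the YAML files above); that reading is an inference from the diff, not something the commit states. The sketch below shows the defaulting pattern in isolation. The function name `ensure_event_buffer` is hypothetical, and the constant's value is a placeholder chosen for the example.

```python
DEFAULT_KV_EVENT_BUFFER_MAX_SIZE = 1024  # placeholder value, assumed for this example


def ensure_event_buffer(arg_map: dict) -> dict:
    """Make sure arg_map["kv_cache_config"]["event_buffer_max_size"] is set."""
    kv_cache_config = arg_map.get("kv_cache_config", {})
    if "event_buffer_max_size" not in kv_cache_config:
        kv_cache_config["event_buffer_max_size"] = DEFAULT_KV_EVENT_BUFFER_MAX_SIZE
    # Write back so the default survives even when the key was missing entirely.
    arg_map["kv_cache_config"] = kv_cache_config
    return arg_map


from_yaml = {"kv_cache_config": {"free_gpu_memory_fraction": 0.95}}
ensure_event_buffer(from_yaml)

# Attribute access, as in the removed lines, does not work on a plain dict.
try:
    from_yaml["kv_cache_config"].event_buffer_max_size
    attribute_access_works = True
except AttributeError:
    attribute_access_works = False
```

A value the user already supplied is left untouched, since the default is applied only when the key is absent.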

container/build.sh
@@ -96,7 +96,7 @@ TRTLLM_USE_NIXL_KVCACHE_EXPERIMENTAL="0"
 TENSORRTLLM_INDEX_URL="https://pypi.python.org/simple"
 # TODO: Remove the version specification from here and use the ai-dynamo[trtllm] package.
 # Need to update the Dockerfile.tensorrt_llm to use the ai-dynamo[trtllm] package.
-DEFAULT_TENSORRTLLM_PIP_WHEEL="tensorrt-llm==1.0.0rc0"
+DEFAULT_TENSORRTLLM_PIP_WHEEL="tensorrt-llm==1.0.0rc4"
 TENSORRTLLM_PIP_WHEEL=""