feat: add trtllm example with config (#2895)

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>

Unverified Commit 0d9c8994 authored Sep 05, 2025 by julienmancuso Committed by GitHub Sep 05, 2025
3 changed files
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -57,6 +57,7 @@ repos:
  - id: check-toml
  - id: check-yaml
    exclude: ^.*/templates/.*\.yaml$ #ignore all yaml files in helm chart templates
+    args: ['--allow-multiple-documents']
  - id: check-shebang-scripts-are-executable
  - id: end-of-file-fixer
    types_or: [c, c++, cuda, proto, textproto, java, python]
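Note on the hunk above: without `--allow-multiple-documents`, `check-yaml` rejects files that use the multi-document syntax, and the `agg-with-config.yaml` file added in this commit holds two documents separated by `---`. The following is a rough, stdlib-only sketch of that structure. It is not how `check-yaml` parses files (the hook uses a real YAML parser); `split_yaml_documents` and `SAMPLE` are hypothetical names introduced only for illustration.

```python
def split_yaml_documents(text: str) -> list[str]:
    """Split text on lines consisting solely of '---'.

    Rough approximation for illustration only; this is not a YAML parser.
    """
    chunks, current = [], []
    for line in text.splitlines():
        if line.rstrip() == "---":
            chunks.append("\n".join(current))
            current = []
        else:
            current.append(line)
    chunks.append("\n".join(current))
    # Keep only chunks that contain something other than blanks and comments.
    return [
        chunk
        for chunk in chunks
        if any(l.strip() and not l.lstrip().startswith("#") for l in chunk.splitlines())
    ]


# Abbreviated stand-in for agg-with-config.yaml (only the top-level keys).
SAMPLE = """\
apiVersion: v1
kind: ConfigMap
---
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
"""

documents = split_yaml_documents(SAMPLE)
kinds = [
    line.split(":", 1)[1].strip()
    for doc in documents
    for line in doc.splitlines()
    if line.startswith("kind:")
]
print(len(documents), kinds)
```

A file with this shape needs the flag added in the hunk above to pass the hook.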

--- a/components/backends/trtllm/deploy/README.md
+++ b/components/backends/trtllm/deploy/README.md
@@ -34,6 +34,14 @@ Advanced disaggregated deployment with KV cache routing capabilities.
 - `TRTLLMDecodeWorker`: Specialized decode-only worker
 - `TRTLLMPrefillWorker`: Specialized prefill-only worker (2 replicas for load balancing)

+### 5. **Aggregated Deployment with Config** (`agg-with-config.yaml`)
+Aggregated deployment that loads a custom TensorRT-LLM engine configuration from a ConfigMap.
+
+**Architecture:**
+- `nvidia-config`: ConfigMap containing a custom TensorRT-LLM engine configuration (`agg.yaml`)
+- `Frontend`: OpenAI-compatible API server (with KV router mode disabled)
+- `TRTLLMWorker`: Single worker handling both prefill and decode, using the custom configuration mounted from the ConfigMap
+
 ## CRD Structure

 All templates use the **DynamoGraphDeployment** CRD:

--- a/components/backends/trtllm/deploy/agg-with-config.yaml
+++ b/components/backends/trtllm/deploy/agg-with-config.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+# ConfigMap that contains the custom TensorRT-LLM engine configuration
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: nvidia-config
+data:
+  agg.yaml: |
+    tensor_parallel_size: 1
+    moe_expert_parallel_size: 1
+    enable_attention_dp: false
+    max_num_tokens: 8192
+    max_batch_size: 16
+    trust_remote_code: true
+    backend: pytorch
+    enable_chunked_prefill: true
+    disable_overlap_scheduler: true
+    kv_cache_config:
+      free_gpu_memory_fraction: 0.95
+    cuda_graph_config:
+      max_batch_size: 16
+---
+# DynamoGraphDeployment that uses the custom configuration stored in the ConfigMap
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: trtllm-agg
+spec:
+  services:
+    Frontend:
+      dynamoNamespace: trtllm-agg
+      componentType: frontend
+      replicas: 1
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.4.1
+    TRTLLMWorker:
+      envFromSecret: hf-token-secret
+      dynamoNamespace: trtllm-agg
+      componentType: worker
+      replicas: 1
+      resources:
+        limits:
+          gpu: "1"
+      extraPodSpec:
+        # declare the ConfigMap as a pod volume
+        volumes:
+        - name: nvidia-config
+          configMap:
+            name: nvidia-config
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.4.1
+          workingDir: /workspace/components/backends/trtllm
+          # mount the ConfigMap volume so that agg.yaml appears under engine_configs/
+          volumeMounts:
+          - name: nvidia-config
+            mountPath: /workspace/components/backends/trtllm/engine_configs
+            readOnly: true
+          command:
+            - /bin/sh
+            - -c
+          args:
+            - >-
+              python3 -m dynamo.trtllm
+              --model-path Qwen/Qwen3-0.6B
+              --served-model-name Qwen/Qwen3-0.6B
+              --extra-engine-args engine_configs/agg.yaml
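
Note on the manifest above: the worker is started with the relative path `engine_configs/agg.yaml`, which only resolves because `workingDir` is the parent of the ConfigMap `mountPath` and a ConfigMap volume exposes each key as a file named after that key. The sketch below checks that relationship using the values from this commit. The helper names `resolve_in_container` and `configmap_key_for` are hypothetical and are not part of Dynamo or Kubernetes.

```python
import posixpath

# Values copied from agg-with-config.yaml in this commit.
WORKING_DIR = "/workspace/components/backends/trtllm"
MOUNT_PATH = "/workspace/components/backends/trtllm/engine_configs"
CONFIGMAP_KEYS = {"agg.yaml"}
EXTRA_ENGINE_ARGS = "engine_configs/agg.yaml"


def resolve_in_container(working_dir: str, path: str) -> str:
    """Resolve a possibly relative path against the container working directory."""
    return posixpath.normpath(posixpath.join(working_dir, path))


def configmap_key_for(path: str, mount_path: str):
    """Return the ConfigMap key backing `path`, or None if it lies outside the mount."""
    mount = mount_path.rstrip("/")
    if not path.startswith(mount + "/"):
        return None
    return path[len(mount) + 1:]


resolved = resolve_in_container(WORKING_DIR, EXTRA_ENGINE_ARGS)
key = configmap_key_for(resolved, MOUNT_PATH)

print(resolved)
print(key in CONFIGMAP_KEYS)
```

If `workingDir`, `mountPath`, or the key name under `data:` is changed, the same check shows whether `--extra-engine-args` still points at a file that exists. Because the volume is mounted over `engine_configs`, any files the image already ships at that path are hidden while the mount is in place.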