refactor: Refactor the TRTLLM examples remove dynamo SDK (#1884)

Commit 901715b5 (unverified), authored Jul 14, 2025 by Tanmay Verma; committed by GitHub Jul 15, 2025
20 changed files
--- a/examples/tensorrt_llm/README.md
+++ b/examples/tensorrt_llm/README.md
@@ -19,6 +19,25 @@ limitations under the License.

 This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.

+# User Documentation
+
+- [Deployment Architectures](#deployment-architectures)
+- [Getting Started](#getting-started)
+  - [Prerequisites](#prerequisites)
+  - [Build docker](#build-docker)
+  - [Run container](#run-container)
+  - [Run deployment](#run-deployment)
+    - [Single Node deployment](#single-node-deployments)
+    - [Multinode deployment](#multinode-deployment)
+  - [Client](#client)
+  - [Benchmarking](#benchmarking)
+- [Disaggregation Strategy](#disaggregation-strategy)
+- [KV Cache Transfer](#kv-cache-transfer-in-disaggregated-serving)
+- [More Example Architectures](#more-example-architectures)
+  - [Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4_plus_eagle.md)
+
+# Quick Start
+
 ## Use the Latest Release

 We recommend using the latest stable release of dynamo to avoid breaking changes:
@@ -34,9 +53,8 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 ## Deployment Architectures

 See [deployment architectures](../llm/README.md#deployment-architectures) to learn about the general idea of the architecture.
-Note that this TensorRT-LLM version does not support all the options yet.

-Note: TensorRT-LLM disaggregation does not support conditional disaggregation yet. You can only configure the deployment to always use aggregate or disaggregated serving.
+Note: TensorRT-LLM disaggregation does not support conditional disaggregation yet. You can configure the deployment to always use either aggregate or disaggregated serving.

 ## Getting Started

@@ -83,8 +101,8 @@ This figure shows an overview of the major components to deploy:
 ```

 +------+      +-----------+      +------------------+             +---------------+
-| HTTP |----->| processor |----->|      Worker      |------------>|     Prefill   |
-|      |<-----|           |<-----|                  |<------------|     Worker    |
+| HTTP |----->| processor |----->|      Worker1     |------------>|    Worker2    |
+|      |<-----|           |<-----|                  |<------------|               |
 +------+      +-----------+      +------------------+             +---------------+
                  |    ^                  |
       query best |    | return           | publish kv events
@@ -96,201 +114,73 @@ This figure shows an overview of the major components to deploy:

 ```

-Note: The above architecture illustrates all the components. The final components
-that get spawned depend upon the chosen graph.
+**Note:** The diagram above shows all possible components in a deployment. Depending on the chosen disaggregation strategy, you can configure whether Worker1 handles prefill and Worker2 handles decode, or vice versa. For more information on how to select and configure these strategies, see the [Disaggregation Strategy](#disaggregation-strategy) section below.

-### Example architectures
+### Single-Node Deployments

-#### Aggregated serving
-```bash
-cd /workspace/examples/tensorrt_llm
-dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml
-```
-
-#### Aggregated serving with KV Routing
-```bash
-cd /workspace/examples/tensorrt_llm
-dynamo serve graphs.agg:Frontend -f ./configs/agg_router.yaml
-```
-
-#### Disaggregated serving
-```bash
-cd /workspace/examples/tensorrt_llm
-dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml
-```
+> [!IMPORTANT]
+> The shell scripts below run the components for each configuration. Each script runs `dynamo-run` to start the ingress and uses `python3` to start the workers. You can also take each command and run it in a separate terminal.

-#### Disaggregated serving with KV Routing
+#### Aggregated
 ```bash
-cd /workspace/examples/tensorrt_llm
-dynamo serve graphs.disagg:Frontend -f ./configs/disagg_router.yaml
+cd $DYNAMO_ROOT/examples/tensorrt_llm
+./launch/agg.sh
 ```

-#### Aggregated serving with Multi-Token Prediction (MTP) and DeepSeek R1
+#### Aggregated with KV Routing
 ```bash
-cd /workspace/examples/tensorrt_llm
-dynamo serve graphs.agg:Frontend -f configs/deepseek_r1/mtp/mtp_agg.yaml
+cd $DYNAMO_ROOT/examples/tensorrt_llm
+./launch/agg_router.sh
 ```

-Notes:
- MTP is only available within the container built with the experimental TensorRT-LLM commit. Please add --use-default-experimental-tensorrtllm-commit to the arguments of the build.sh script.
-
-  Example: `./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit`
-
- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
-
-#### Multi-Node Disaggregated Serving
-
-In the following example, we will demonstrate how to run a Disaggregated Serving
-deployment across multiple nodes. For simplicity, we will demonstrate how to
-deploy a single Decode worker on one node, and a single Prefill worker on the other node.
-However, the instance counts, TP sizes, other configs, and responsibilities of each node
-can be customized and deployed in similar ways.
+#### Disaggregated

-For example, to deploy Deepseek R1, you could replace the referenced example
-configs (`configs/agg.yaml`, `configs/disagg.yaml`) with corresponding Deepseek R1
-example configs (`configs/deepseek_r1/agg.yaml`, `configs/deepseek_r1/disagg.yaml`).
-You can find the example Deepseek R1 configs for GB200
-[here](configs/deepseek_r1), but the config settings can be customized for testing
-other hardware configurations or parallelism strategies.
+> [!IMPORTANT]
+> Disaggregated serving supports two strategies for request flow: `"prefill_first"` and `"decode_first"`. By default, the script below uses the `"decode_first"` strategy, which can reduce response latency by minimizing extra hops in the return path. You can switch strategies by setting the `DISAGGREGATION_STRATEGY` environment variable.

-This "multi-node" example demonstrates how to generally connect dynamo workers from
-different nodes, but for simplicity, each worker individually fits on a single node.
-For details on how to launch a worker that spans multiple nodes due to sheer model
-size, or for features like large scale expert parallelism, see the
-[multinode worker example](configs/deepseek_r1/multinode).
-
-##### Head Node
-
-Start nats/etcd:
 ```bash
-# NATS data persisted to /tmp/nats/jetstream by default
-nats-server -js &
-
-# Persist data to /tmp/etcd, otherwise defaults to ${PWD}/default.etcd if left unspecified
-etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd &
-
-# NOTE: Clearing out the etcd and nats jetstream data directories across runs
-#       helps guarantee clean and reproducible results.
+cd $DYNAMO_ROOT/examples/tensorrt_llm
+./launch/disagg.sh
 ```

-Launch graph of Frontend and TensorRTLLMWorker (decode) on head node:
+#### Disaggregated with KV Routing

-```bash
-cd /workspace/examples/tensorrt_llm
-dynamo serve graphs.agg:Frontend -f ./configs/disagg.yaml &
-```
-
-Notes:
- The aggregated graph (`graphs.agg`) is chosen here because it also describes
-  our desired deployment settings for the head node: launching the utility components
-  (Frontend, Processor), and only the decode worker (TensorRTLLMWorker configured with
-  `remote-prefill` enabled). We plan to launch the `TensorRTLLMPrefillWorker`
-  independently on a separate node in the next step of this demonstration.
-  You are free to customize the graph and configuration of components launched on
-  each node.
- The disaggregated config `configs/disagg.yaml` is intentionally chosen here as a
-  single source of truth to be used for deployments on all of our nodes, describing
-  the configurations for all of our components, including both decode and prefill
-  workers, but can be customized based on your deployment needs.
-
-##### Worker Node(s)
-
-Set environment variables pointing at the etcd/nats endpoints on the head node
-so the Dynamo Distributed Runtime can orchestrate communication and
-discoverability between the head node and worker nodes:
-```bash
-# if not head node
-export HEAD_NODE_IP="<head-node-ip>"
-export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"
-export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
-```
+> [!IMPORTANT]
+> Disaggregated serving with KV routing uses a "prefill first" workflow by default. Currently, Dynamo supports KV routing to only one endpoint per model. In a disaggregated workflow, it is generally more effective to route requests to the prefill worker. If you wish to use a "decode first" workflow instead, set the `DISAGGREGATION_STRATEGY` environment variable accordingly.

-Deploy a Prefill worker:
 ```bash
-cd /workspace/examples/tensorrt_llm
-dynamo serve components.prefill_worker:TensorRTLLMPrefillWorker -f ./configs/disagg.yaml --service-name TensorRTLLMPrefillWorker &
+cd $DYNAMO_ROOT/examples/tensorrt_llm
+./launch/disagg_router.sh
 ```

-Now you have a 2-node deployment with 1 Decode worker on the head node, and 1 Prefill worker on a worker node!
-
-##### Additional Notes for Multi-Node Deployments
-
-Notes:
- To include a router in this deployment, change the graph to one that includes the router, such as `graphs.agg_router`,
-  and change the config to one that includes the router, such as `configs/disagg_router.yaml`
- This step assumes you are using disaggregated serving and plan to launch prefill workers on separate nodes.
-  However, for an aggregated deployment with additional aggregated worker replicas on other nodes, this step
-  remains mostly the same. The primary difference between aggregation and disaggregation for this step is
-  whether the `TensorRTLLMWorker` is configured to do `remote-prefill` in the config file
-  (ex: `configs/disagg.yaml` vs `configs/agg.yaml`).
- To apply the same concept for launching additional decode workers on worker nodes, you can
-  directly start them, similar to the prefill worker step above:
-  ```bash
-  # Example: deploy decode worker only
-  cd /workspace/examples/tensorrt_llm
-  dynamo serve components.worker:TensorRTLLMWorker -f ./configs/disagg.yaml --service-name TensorRTLLMWorker &
-  ```
- If you see an error about MPI Spawn failing during TRTLLM Worker initialization on a Slurm-based cluster,
-  try unsetting the following environment variables before launching the TRTLLM worker. If you intend to
-  run other slurm-based commands or processes on the same node after deploying the TRTLLM worker, you may
-  want to save these values into temporary variables and then restore them afterwards.
-  ```bash
-  # Workaround for error: `mpi4py.MPI.Exception: MPI_ERR_SPAWN: could not spawn processes`
-  unset SLURM_JOBID SLURM_JOB_ID SLURM_NODELIST
-  ```
-
-#### Multi-Node Disaggregated Serving with Multi-Token Prediction (MTP) and DeepSeek R1
-
-Most of the steps remain the same as the above example, but this time we will have `dynamo serve` point to different config files that contain the MTP configurations.
-
-##### Head Node
-
-Start nats/etcd
+#### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1
 ```bash
-nats-server -js &
-etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd &
-```
+cd $DYNAMO_ROOT/examples/tensorrt_llm

-Launch graph of Frontend and TensorRTLLMWorker (decode) on head node:
-
-```bash
-cd /workspace/examples/tensorrt_llm
-dynamo serve graphs.agg:Frontend -f configs/deepseek_r1/mtp/mtp_disagg.yaml  &
-```
-
-##### Worker Node(s)
-
-Set environment variables pointing at the etcd/nats endpoints on the head node.
-```bash
-export HEAD_NODE_IP="<head-node-ip>"
-export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"
-export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
-```
-
-Deploy a Prefill worker:
-```bash
-cd /workspace/examples/tensorrt_llm
-dynamo serve components.prefill_worker:TensorRTLLMPrefillWorker -f configs/deepseek_r1/mtp/mtp_disagg.yaml --service-name TensorRTLLMPrefillWorker &
+export AGG_ENGINE_ARGS=./engine_configs/deepseek_r1/mtp/mtp_agg.yaml
+export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
+# nvidia/DeepSeek-R1-FP4 is a large model
+export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
+./launch/agg.sh
 ```

 Notes:
 - MTP is only available within the container built with the experimental TensorRT-LLM commit. Please add --use-default-experimental-tensorrtllm-commit to the arguments of the build.sh script.

  Example: `./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit`
+
 - There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
 - MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.

+### Multinode Deployment
+
+For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can adapt the process for any supported model by updating the relevant configuration files. See the [Llama4+eagle](./llama4_plus_eagle.md) guide to learn how to use these scripts when a single worker fits on a single node.

 ### Client

 See the [client](../llm/README.md#client) section to learn how to send a request to the deployment.

-NOTE: To send a request to a multi-node deployment, target the node which deployed the `Frontend` component.
-
-### Close deployment
-
-See [close deployment](../../docs/guides/dynamo_serve.md#close-deployment) section to learn about how to close the deployment.
+NOTE: To send a request to a multi-node deployment, target the node which is running `dynamo-run in=http`.

 ### Benchmarking

@@ -298,103 +188,23 @@ To benchmark your deployment with GenAI-Perf, see this utility script, configuri
 `model` name and `host` based on your deployment: [perf.sh](../../benchmarks/llm/perf.sh)


-### KV Cache Transfer for Disaggregated Serving
-
-In disaggregated serving architectures, KV cache must be transferred between prefill and decode nodes. TensorRT-LLM supports two methods for this transfer:
-
-#### Default Method: UCX
-By default, TensorRT-LLM uses UCX (Unified Communication X) for KV cache transfer between prefill and decode nodes. UCX provides high-performance communication optimized for GPU-to-GPU transfers.
-
-#### Experimental Method: NIXL
-TensorRT-LLM also provides experimental support for using **NIXL** (NVIDIA Inference Xfer Library) for KV cache transfer. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.
-
-**Note:** NIXL support in TensorRT-LLM is experimental and is not suitable for production environments yet.
-
-#### Using NIXL for KV Cache Transfer
-
-**Note:** NIXL backend for TensorRT-LLM is currently only supported on AMD64 (x86_64) architecture. If you're running on ARM64, you'll need to use the default UCX method for KV cache transfer.
-
-To enable NIXL for KV cache transfer in disaggregated serving:
-
-1. **Build the container with NIXL support:**
-   The TensorRT-LLM wheel must be built from source with NIXL support. The `./container/build.sh` script caches previously built TensorRT-LLM wheels to reduce build time. If you have previously built a TensorRT-LLM wheel without NIXL support, you must delete the cached wheel to force a rebuild with NIXL support.
-
-   **Remove cached TensorRT-LLM wheel (only if previously built without NIXL support):**
-   ```bash
-   rm -rf /tmp/trtllm_wheel
-   ```
+## Disaggregation Strategy

-   **Build the container with NIXL support:**
-   ```bash
-   ./container/build.sh --framework tensorrtllm \
-     --use-default-experimental-tensorrtllm-commit \
-     --trtllm-use-nixl-kvcache-experimental
-   ```
+The disaggregation strategy controls how requests are distributed between the prefill and decode workers in a disaggregated deployment.

-   **Note:** Both `--use-default-experimental-tensorrtllm-commit` and `--trtllm-use-nixl-kvcache-experimental` flags are required to enable NIXL support.
+By default, Dynamo uses a `decode first` strategy: incoming requests are initially routed to the decode worker, which then forwards them to the prefill worker in round-robin fashion. The prefill worker processes the request and returns results to the decode worker for any remaining decode operations.

-2. **Run the containerized environment:**
-   See [run container](#run-container) section to learn how to start the container image built in previous step.
+When using KV routing, however, Dynamo switches to a `prefill first` strategy. In this mode, requests are routed directly to the prefill worker, which can help maximize KV cache reuse and improve overall efficiency for certain workloads. Choosing the appropriate strategy can have a significant impact on performance, depending on your use case.

-3. **Start the disaggregated service:**
-   See [disaggregated serving](#disaggregated-serving) to see how to start the deployment.
-
-4. **Send the request:**
-   See [client](#client) section to learn how to send the request to deployment.
-
-**Important:** Ensure that ETCD and NATS services are running before starting the service.
-
-The container will automatically configure the appropriate environment variables (`TRTLLM_USE_NIXL_KVCACHE=1`) when built with the NIXL flag. The same container image can be used to use UCX for KV cache transfer.
+The disaggregation strategy can be set using the `DISAGGREGATION_STRATEGY` environment variable. You can set the strategy before launching your deployment, for example:
 ```bash
-unset TRTLLM_USE_NIXL_KVCACHE
-export TRTLLM_USE_UCX_KVCACHE=1
+DISAGGREGATION_STRATEGY="prefill_first" ./launch/disagg.sh
 ```
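
The defaults described in the Disaggregation Strategy section can be summarized in a small sketch. The helper below is hypothetical: the name `resolve_disaggregation_strategy` and its `kv_routing` parameter are assumptions for illustration, and it does not reflect how the launch scripts or workers actually read `DISAGGREGATION_STRATEGY`. It only encodes the documented rule: `decode_first` by default, `prefill_first` by default when KV routing is used, and an explicit environment variable overriding either.

```python
import os

# The two strategy values named in the README.
VALID_STRATEGIES = ("decode_first", "prefill_first")


def resolve_disaggregation_strategy(env=None, kv_routing=False):
    """Return the strategy to use (hypothetical helper, not part of Dynamo).

    env: mapping to read from; defaults to os.environ.
    kv_routing: whether the deployment uses KV routing.
    """
    env = os.environ if env is None else env
    # Documented defaults: decode first, or prefill first with KV routing.
    default = "prefill_first" if kv_routing else "decode_first"
    strategy = env.get("DISAGGREGATION_STRATEGY", default)
    if strategy not in VALID_STRATEGIES:
        raise ValueError(
            f"Unknown DISAGGREGATION_STRATEGY {strategy!r}; "
            f"expected one of {VALID_STRATEGIES}"
        )
    return strategy
```

Validating the value up front is a design choice of this sketch; it surfaces a misspelled strategy at launch time instead of letting it silently fall back to a default.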

+## KV Cache Transfer in Disaggregated Serving

-### Example architectures for Llama 4 Maverick Instruct + Eagle Speculative Decoding
+Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV cache transfer guide](./kv-cache-tranfer.md).

-#### Notes
-* Testing for the current example used:
-  * One GB200x4 node for aggregate serving
-  * Two GB200x4 nodes for disaggregate serving
-* To run Eagle Speculative Decoding with Llama 4, ensure the container meets the following criteria:
-  * Built with a version of TensorRT-LLM based on the 0.21 release [Link](https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.21)
-  * The TensorRT-LLM build includes the changes from this PR [Link](https://github.com/NVIDIA/TensorRT-LLM/pull/5975)
-* If you need to download model weights off huggingface, make sure you run the command `huggingface-cli login` and have access to the necessary gated models.
-
-##### Aggregated Serving
-```bash
-cd /workspace/examples/tensorrt_llm
-dynamo serve graphs.disagg:Frontend -f configs/llama4/eagle/eagle_agg.yaml
-```
-* Known Issue: In Aggregated Serving, setting `max_num_tokens` to higher values (e.g. `max_num_tokens: 8448`) can lead to Out of Memory (OOM) errors. This is being investigated by the TRTLLM team.
-
-##### Disaggregated Serving
-
-###### Head Node
-Start nats/etcd
-``` bash
-nats-server -js &
-etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd &
-```
+## More Example Architectures

-Launch graph of Frontend and TensorRTLLMWorker (decode) on head node:
-
-```bash
-cd /workspace/examples/tensorrt_llm
-dynamo serve graphs.agg:Frontend -f configs/llama4/eagle/eagle_disagg.yaml  &
-```
-
-###### Worker Node(s)
-Set environment variables pointing at the etcd/nats endpoints on the head node.
-```bash
-export HEAD_NODE_IP="<head-node-ip>"
-export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"
-export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
-```
-
-Deploy a Prefill worker:
-```bash
-cd /workspace/examples/tensorrt_llm
-dynamo serve components.prefill_worker:TensorRTLLMPrefillWorker -f configs/llama4/eagle/eagle_disagg.yaml --service-name TensorRTLLMPrefillWorker &
-```
\ No newline at end of file
+- [Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4_plus_eagle.md)
--- a/examples/tensorrt_llm/__init__.py
+++ b/examples/tensorrt_llm/__init__.py
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
--- a/examples/tensorrt_llm/common/__init__.py
+++ b/examples/tensorrt_llm/common/__init__.py
--- a/examples/tensorrt_llm/common/base_engine.py
+++ b/examples/tensorrt_llm/common/base_engine.py
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-import logging
-from dataclasses import dataclass
-from typing import Any, Optional
-
-from common.protocol import DisaggregatedTypeConverter, TRTLLMWorkerRequest
-from tensorrt_llm import SamplingParams
-from tensorrt_llm.llmapi.llm_utils import update_llm_args_with_extra_options
-from tensorrt_llm.llmapi.tokenizer import tokenizer_factory
-from tensorrt_llm.serve.openai_protocol import (
-    DisaggregatedParams as OAIDisaggregatedParams,
-)
-
-from dynamo.llm import get_tensorrtllm_engine, get_tensorrtllm_publisher
-from dynamo.runtime import DistributedRuntime
-
-logger = logging.getLogger(__name__)
-
-logger.setLevel(logging.DEBUG)
-
-# Default buffer size for kv cache events.
-DEFAULT_KV_EVENT_BUFFER_MAX_SIZE = 1024
-
-
-def parse_endpoint(endpoint: str) -> tuple[str, str, str]:
-    endpoint_str = endpoint.replace("dyn://", "", 1)
-    endpoint_parts = endpoint_str.split(".")
-    if len(endpoint_parts) != 3:
-        raise ValueError(
-            f"Invalid endpoint format: '{endpoint}'. "
-            "Expected 'dyn://namespace.component.endpoint' or 'namespace.component.endpoint'."
-        )
-
-    return (endpoint_parts[0], endpoint_parts[1], endpoint_parts[2])
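
For reference, the removed `parse_endpoint` helper can be exercised on its own. The block below restates the function from the deleted file without the diff markers and adds a few example calls; the endpoint strings (`dynamo.prefill.generate` and so on) are made-up values for illustration.

```python
def parse_endpoint(endpoint: str) -> tuple[str, str, str]:
    # Strip an optional "dyn://" scheme, then split into exactly three parts.
    endpoint_str = endpoint.replace("dyn://", "", 1)
    endpoint_parts = endpoint_str.split(".")
    if len(endpoint_parts) != 3:
        raise ValueError(
            f"Invalid endpoint format: '{endpoint}'. "
            "Expected 'dyn://namespace.component.endpoint' or 'namespace.component.endpoint'."
        )
    return (endpoint_parts[0], endpoint_parts[1], endpoint_parts[2])


# Both forms parse to the same (namespace, component, endpoint) triple.
with_scheme = parse_endpoint("dyn://dynamo.prefill.generate")
without_scheme = parse_endpoint("dynamo.prefill.generate")

# Anything that does not split into three dot-separated parts is rejected.
try:
    parse_endpoint("dynamo.prefill")
    rejected = False
except ValueError:
    rejected = True
```

Note that the split is purely on `.`, so a namespace or component name that itself contains a dot would be rejected by this function.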
-
-
-@dataclass
-class BaseEngineConfig:
-    """Base engine configuration"""
-
-    namespace: str
-    component: str
-    endpoint: str
-    model_path: str
-    served_model_name: Optional[str] = None
-    kv_block_size: int = 32
-    extra_engine_args: str = ""
-    publish_events_and_metrics: bool = False
-    disaggregation_mode: str = "prefill_and_decode"
-    remote_prefill_endpoint: Optional[str] = None
-    lease_id: int = 0
-
-    def __str__(self) -> str:
-        return (
-            f"Config(namespace={self.namespace}, "
-            f"component={self.component}, "
-            f"endpoint={self.endpoint}, "
-            f"model_path={self.model_path}, "
-            f"served_model_name={self.served_model_name}, "
-            f"kv_block_size={self.kv_block_size}, "
-            f"extra_engine_args={self.extra_engine_args}, "
-            f"publish_events_and_metrics={self.publish_events_and_metrics}, "
-            f"disaggregation_mode={self.disaggregation_mode}, "
-            f"remote_prefill_endpoint={self.remote_prefill_endpoint}, "
-            f"lease_id={self.lease_id})"
-        )
-
-
-class BaseTensorrtLLMEngine:
-    def __init__(
-        self,
-        config: BaseEngineConfig,
-    ):
-        self._config = config
-        self._prefill_client = None
-        self._llm_engine = None
-        self._llm_engine_context = None
-        self._llm_publisher = None
-        self._llm_publisher_context = None
-        self._runtime = None
-        self._first_generation = True
-        # Initialize default sampling params
-        self.default_sampling_params = SamplingParams()
-
-    async def initialize(self, runtime: DistributedRuntime):
-        """Initialize the engine and prefill client if needed"""
-        self._runtime = runtime
-
-        # Normalize the configured model path to a string
-        model_path = str(self._config.model_path)
-
-        # Initialize the LLM engine
-        engine_args: dict[str, Any] = {
-            "model": model_path,
-            "tensor_parallel_size": 1,
-            "backend": "pytorch",
-            "skip_tokenizer_init": True,
-        }
-
-        if self._config.extra_engine_args:
-            # TODO: Support extra engine args from json file as well.
-            engine_args = update_llm_args_with_extra_options(
-                engine_args, self._config.extra_engine_args
-            )
-        # Update the model path in the config to the model path used by the engine.
-        self._config.model_path = str(engine_args["model"])
-        if not self._config.model_path:
-            raise ValueError(
-                "Model specification is required. Present neither in the config nor in the extra engine args."
-            )
-
-        # Populate default sampling params from the model
-        tokenizer = tokenizer_factory(self._config.model_path)
-        self.default_sampling_params = SamplingParams()
-        self.default_sampling_params._setup(tokenizer)
-        self.default_sampling_params.stop = None
-
-        if self._config.publish_events_and_metrics:
-            # 'event_buffer_max_size' is required to enable TRTLLM to publish kv cache events.
-            kv_cache_config: dict[str, Any] | Any = None
-            if "kv_cache_config" not in engine_args:
-                kv_cache_config = {}
-                kv_cache_config[
-                    "event_buffer_max_size"
-                ] = DEFAULT_KV_EVENT_BUFFER_MAX_SIZE
-            else:
-                kv_cache_config = engine_args["kv_cache_config"]
-                if (
-                    hasattr(kv_cache_config, "event_buffer_max_size")
-                    and not kv_cache_config.event_buffer_max_size
-                ):
-                    kv_cache_config.event_buffer_max_size = (
-                        DEFAULT_KV_EVENT_BUFFER_MAX_SIZE
-                    )
-                elif (
-                    isinstance(kv_cache_config, dict)
-                    and "event_buffer_max_size" not in kv_cache_config
-                ):
-                    kv_cache_config[
-                        "event_buffer_max_size"
-                    ] = DEFAULT_KV_EVENT_BUFFER_MAX_SIZE
-                engine_args["kv_cache_config"] = kv_cache_config
-
-            # Enable iter perf stats by default if we are publishing events and metrics.
-            if not engine_args.get("enable_iter_perf_stats"):
-                engine_args["enable_iter_perf_stats"] = True
-
-            # Only pytorch backend is supported for now to publish events and metrics.
-            if engine_args.get("backend") != "pytorch":
-                logging.error(
-                    "Only pytorch backend is supported for now to publish events and metrics."
-                )
-                raise RuntimeError(
-                    "Only pytorch backend is supported for now to publish events and metrics. Hence, KV router is not supported."
-                )
-
-        logging.info(f"TRTLLM engine args: {engine_args}")
-
-        # Get the engine using the asynccontextmanager
-        self._llm_engine_context = get_tensorrtllm_engine(engine_args)
-        if self._llm_engine_context is not None:
-            self._llm_engine = await self._llm_engine_context.__aenter__()
-        else:
-            raise RuntimeError("Failed to create LLM engine context")
-
-        if (
-            self._config.publish_events_and_metrics
-            and self._config.disaggregation_mode != "prefill"
-        ):
-            kv_listener = runtime.namespace(self._config.namespace).component(
-                self._config.component
-            )
-            self._llm_publisher_context = get_tensorrtllm_publisher(
-                kv_listener,
-                self._llm_engine,
-                kv_listener,
-                self._config.lease_id,
-                self._config.kv_block_size,
-            )
-            if self._llm_publisher_context is not None:
-                self._llm_publisher = await self._llm_publisher_context.__aenter__()
-            else:
-                raise RuntimeError("Failed to create LLM publisher context")
-
-        # Initialize prefill client if in decode mode
-        if self._config.disaggregation_mode == "decode":
-            if self._config.remote_prefill_endpoint is None:
-                raise ValueError("remote_prefill_endpoint is required for decode mode")
-            logging.info(
-                f"Initializing remote prefill client for endpoint: {self._config.remote_prefill_endpoint}"
-            )
-            (
-                parsed_namespace,
-                parsed_component_name,
-                parsed_endpoint_name,
-            ) = parse_endpoint(self._config.remote_prefill_endpoint)
-            if self._runtime is not None:
-                self._prefill_client = (
-                    await self._runtime.namespace(parsed_namespace)
-                    .component(parsed_component_name)
-                    .endpoint(parsed_endpoint_name)
-                    .client()
-                )
-            else:
-                raise RuntimeError("Runtime not initialized")
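
The `event_buffer_max_size` defaulting in `initialize` above is easier to follow in isolation. The sketch below is a simplified restatement, not the removed code: `ensure_event_buffer` is a hypothetical name, and a `SimpleNamespace` stands in for whatever config object TensorRT-LLM may place under `kv_cache_config`. It keeps the same rules: create the config if absent, fill the key if a dict lacks it, and fill the attribute if an object has it but it is unset.

```python
from types import SimpleNamespace
from typing import Any

# Same default as the removed module.
DEFAULT_KV_EVENT_BUFFER_MAX_SIZE = 1024


def ensure_event_buffer(engine_args: dict[str, Any]) -> dict[str, Any]:
    """Make sure kv_cache_config carries an event buffer size (sketch)."""
    kv_cache_config = engine_args.get("kv_cache_config")
    if kv_cache_config is None:
        # No KV cache config at all: start from an empty dict.
        kv_cache_config = {}
    if isinstance(kv_cache_config, dict):
        # Dict config: only fill in the key when it is missing.
        kv_cache_config.setdefault(
            "event_buffer_max_size", DEFAULT_KV_EVENT_BUFFER_MAX_SIZE
        )
    elif (
        hasattr(kv_cache_config, "event_buffer_max_size")
        and not kv_cache_config.event_buffer_max_size
    ):
        # Object config: fill in the attribute when it exists but is unset.
        kv_cache_config.event_buffer_max_size = DEFAULT_KV_EVENT_BUFFER_MAX_SIZE
    engine_args["kv_cache_config"] = kv_cache_config
    return engine_args


# Usage: a missing config, an explicit dict value, and an object stand-in.
missing = ensure_event_buffer({"model": "some-model"})
explicit = ensure_event_buffer({"kv_cache_config": {"event_buffer_max_size": 64}})
obj_cfg = SimpleNamespace(event_buffer_max_size=0)
ensure_event_buffer({"kv_cache_config": obj_cfg})
```

An explicitly configured size is left untouched in both the dict and the object case; only a missing or falsy value is replaced with the default.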
-
-    async def cleanup(self):
-        """Cleanup resources"""
-        if self._llm_publisher_context:
-            try:
-                await self._llm_publisher_context.__aexit__(None, None, None)
-            except Exception as e:
-                logging.error(f"Error during publisher cleanup: {e}")
-            finally:
-                self._llm_publisher = None
-                self._llm_publisher_context = None
-
-        if self._llm_engine_context:
-            try:
-                await self._llm_engine_context.__aexit__(None, None, None)
-            except Exception as e:
-                logging.error(f"Error during engine cleanup: {e}")
-            finally:
-                self._llm_engine = None
-                self._llm_engine_context = None
-
-        self._prefill_client = None
-
-    async def remote_prefill(self, request: TRTLLMWorkerRequest):
-        """
-        Send a prefill request to the remote prefill worker.
-
-        Args:
-            request: The original request to be sent for prefill
-
-        Returns:
-            The response from the remote prefill worker
-
-        Raises:
-            ValueError: If prefill client is not initialized or multiple responses received
-        """
-        prefill_request = request.model_copy(deep=True)
-        # TRTLLM requires max_tokens to be set for prefill requests.
-        prefill_request.stop_conditions.max_tokens = 1
-        prefill_request.disaggregated_params = OAIDisaggregatedParams(
-            request_type="context_only"
-        )
-
-        if self._prefill_client is None:
-            raise ValueError("Prefill client not initialized")
-        try:
-            # TODO: Use smart KV router to determine which prefill worker to use. This would also require supporting publishing events for prefill workers.
-            remote_prefill_responses = [
-                remote_prefill_response
-                async for remote_prefill_response in await self._prefill_client.round_robin(
-                    prefill_request.model_dump_json()
-                )
-            ]
-        except Exception as e:
-            raise ValueError(f"Error in remote prefill: {e}")
-
-        if len(remote_prefill_responses) > 1:
-            raise ValueError(
-                "Prefill worker returned more than one response. This is currently not supported in remote prefill mode."
-            )
-
-        if len(remote_prefill_responses) == 0:
-            raise ValueError("No response received from remote prefill worker")
-
-        remote_prefill_response = remote_prefill_responses[0]
-        return remote_prefill_response
-
-    async def generate(self, request: TRTLLMWorkerRequest):
-        if self._llm_engine is None:
-            raise RuntimeError("Engine not initialized")
-
-        if self._llm_publisher:
-            publishers_error = self._llm_publisher.check_error_queue()
-            if publishers_error:
-                raise publishers_error
-
-        inputs = request.token_ids
-
-        # Decode the disaggregated params from the request
-        disaggregated_params = DisaggregatedTypeConverter.to_llm_disaggregated_params(
-            request.disaggregated_params
-        )
-        num_output_tokens_so_far = 0
-
-        if self._config.disaggregation_mode == "decode":
-            # Run prefill/context phase remotely if disaggregation mode is decode.
-            try:
-                prefill_result = await self.remote_prefill(request)
-            except Exception as e:
-                raise ValueError(f"Error in remote prefill: {e}")
-
-            remote_prefill_response = prefill_result.data()
-            if (
-                remote_prefill_response["finish_reason"] == "stop"
-                or remote_prefill_response["finish_reason"] == "error"
-            ):
-                yield remote_prefill_response
-                return
-
-            # Decode the disaggregated params from the remote prefill response
-            disaggregated_params = (
-                DisaggregatedTypeConverter.to_llm_disaggregated_params(
-                    OAIDisaggregatedParams(
-                        **remote_prefill_response["disaggregated_params"]
-                    )
-                )
-            )
-
-            # Set the disaggregated params to generation_only for the rest of the generation
-            disaggregated_params.request_type = "generation_only"
-
-        sampling_params = self.default_sampling_params
-        for key, value in request.sampling_options.model_dump().items():
-            if not value:
-                continue
-            if hasattr(sampling_params, key):
-                setattr(sampling_params, key, value)
-
-        max_tokens = request.stop_conditions.max_tokens
-        if max_tokens:
-            sampling_params.max_tokens = max_tokens
-
-        ignore_eos = request.stop_conditions.ignore_eos
-        if ignore_eos:
-            sampling_params.ignore_eos = ignore_eos
-
-        # TODO: Disable streaming for context only requests when adding disagg support
-        async for res in self._llm_engine.llm.generate_async(
-            inputs=inputs,
-            sampling_params=sampling_params,
-            disaggregated_params=disaggregated_params,
-            streaming=(self._config.disaggregation_mode != "prefill"),
-        ):
-            # TRTLLM engine needs to start generating tokens first before stats
-            # can be retrieved.
-            if self._first_generation and self._llm_publisher:
-                self._llm_publisher.start()
-                self._first_generation = False
-
-            if res.finished and self._config.disaggregation_mode != "prefill":
-                yield {"finish_reason": "stop", "token_ids": []}
-                break
-
-            if not res.outputs:
-                yield {"finish_reason": "error", "token_ids": []}
-                break
-
-            output = res.outputs[0]
-            next_total_toks = len(output.token_ids)
-            out = {"token_ids": output.token_ids[num_output_tokens_so_far:]}
-            if output.finish_reason:
-                out["finish_reason"] = output.finish_reason
-            if output.stop_reason:
-                out["stop_reason"] = output.stop_reason
-            if self._config.disaggregation_mode == "prefill":
-                # Return the disaggregated params only when operating in prefill mode.
-                out[
-                    "disaggregated_params"
-                ] = DisaggregatedTypeConverter.to_oai_disaggregated_params(
-                    output.disaggregated_params
-                ).model_dump()
-
-            yield out
-            num_output_tokens_so_far = next_total_toks
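The generation loop above emits only the tokens produced since the previous chunk: it slices the cumulative `token_ids` with `num_output_tokens_so_far`, then advances that counter. A minimal sketch of the same bookkeeping follows, with a plain generator standing in for the TRT-LLM engine. The `delta_chunks` name and the sample data are illustrative assumptions, not part of this PR.

```python
from typing import Iterable, Iterator, List


def delta_chunks(cumulative_outputs: Iterable[List[int]]) -> Iterator[List[int]]:
    """Yield only the tokens added since the previous cumulative output."""
    num_output_tokens_so_far = 0
    for token_ids in cumulative_outputs:
        # Slice off the tokens that were already emitted in earlier chunks.
        yield token_ids[num_output_tokens_so_far:]
        num_output_tokens_so_far = len(token_ids)


# Hypothetical cumulative token lists, as a streaming engine might report them.
stream = [[11], [11, 22, 33], [11, 22, 33, 44]]
chunks = list(delta_chunks(stream))
print(chunks)
```

Each yielded chunk contains only new tokens, so a client that concatenates the chunks recovers the full sequence without duplicates.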
--- a/examples/tensorrt_llm/common/parser.py
+++ b/examples/tensorrt_llm/common/parser.py
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import argparse
-
-
-def parse_tensorrt_llm_args(
-    config_args,
-) -> argparse.Namespace:
-    parser = argparse.ArgumentParser(description="A TensorRT-LLM Worker parser")
-    parser.add_argument(
-        "--extra-engine-args",
-        type=str,
-        default="",
-        help="Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine.",
-    )
-    parser.add_argument(
-        "--model-path",
-        type=str,
-        default=None,
-        help="Path to disk model or HuggingFace model identifier to load.",
-    )
-    parser.add_argument(
-        "--served_model_name",
-        type=str,
-        help="Name to serve the model under.",
-    )
-    parser.add_argument(
-        "--router",
-        type=str,
-        choices=["random", "round-robin", "kv"],
-        default="random",
-        help="Router type to use for scheduling requests to workers",
-    )
-
-    parser.add_argument(
-        "--kv-block-size",
-        type=int,
-        default=32,
-        help="Number of tokens per KV block in TRTLLM worker. Default is 32 for pytorch backend.",
-    )
-
-    parser.add_argument(
-        "--enable-disagg",
-        action="store_true",
-        help="Enable remote prefill for the worker",
-    )
-
-    args = parser.parse_args(config_args)
-    return args
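As a rough illustration of how the parser above behaves, the sketch below reproduces three of its flags and parses a hand-written argument list. `argparse` maps dashed flag names to underscored attributes, so `--kv-block-size` becomes `args.kv_block_size`. The `build_parser` helper and the argument list are assumptions made for this example; in the removed code the arguments came from the service config.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # A trimmed-down copy of the parser above; only three flags are reproduced.
    parser = argparse.ArgumentParser(description="A TensorRT-LLM Worker parser")
    parser.add_argument("--model-path", type=str, default=None)
    parser.add_argument("--kv-block-size", type=int, default=32)
    parser.add_argument("--enable-disagg", action="store_true")
    return parser


# Hypothetical argument list, standing in for values derived from the YAML config.
args = build_parser().parse_args(
    ["--model-path", "deepseek-ai/DeepSeek-R1-Distill-Llama-8B", "--enable-disagg"]
)
print(args.model_path, args.kv_block_size, args.enable_disagg)
```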
--- a/examples/tensorrt_llm/common/protocol.py
+++ b/examples/tensorrt_llm/common/protocol.py
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-import base64
-from typing import List, Optional
-
-from pydantic import BaseModel, Field
-from tensorrt_llm.llmapi import DisaggregatedParams as LlmDisaggregatedParams
-from tensorrt_llm.serve.openai_protocol import DisaggregatedParams
-
-
-class Tokens(BaseModel):
-    tokens: list[int]
-
-
-TokenIdType = int
-
-
-class DisaggregatedTypeConverter:
-    @staticmethod
-    def to_llm_disaggregated_params(
-        disaggregated_params: DisaggregatedParams,
-    ) -> LlmDisaggregatedParams:
-        if disaggregated_params is None:
-            return None
-        else:
-            opaque_state = (
-                base64.b64decode(disaggregated_params.encoded_opaque_state)
-                if disaggregated_params.encoded_opaque_state is not None
-                else None
-            )
-
-            return LlmDisaggregatedParams(
-                request_type=disaggregated_params.request_type,
-                first_gen_tokens=disaggregated_params.first_gen_tokens,
-                ctx_request_id=disaggregated_params.ctx_request_id,
-                opaque_state=opaque_state,
-            )
-
-    @staticmethod
-    def to_oai_disaggregated_params(
-        tllm_disagg_params: LlmDisaggregatedParams,
-    ) -> DisaggregatedParams:
-        if tllm_disagg_params is None:
-            return None
-        else:
-            encoded_opaque_state = (
-                base64.b64encode(tllm_disagg_params.opaque_state).decode("utf-8")
-                if tllm_disagg_params.opaque_state is not None
-                else None
-            )
-            return DisaggregatedParams(
-                request_type=tllm_disagg_params.request_type,
-                first_gen_tokens=tllm_disagg_params.first_gen_tokens,
-                ctx_request_id=tllm_disagg_params.ctx_request_id,
-                encoded_opaque_state=encoded_opaque_state,
-            )
-
-
-# TODO: move these to common for all LLMs once we adopt dynamo-run
-# derived from lib/llm/src/protocols/common/preprocessor.rs
-class StopConditions(BaseModel):
-    max_tokens: Optional[int] = None
-    stop: Optional[List[str]] = None
-    stop_token_ids_hidden: Optional[List[TokenIdType]] = None
-    min_tokens: Optional[int] = None
-    ignore_eos: Optional[bool] = None
-
-
-class SamplingOptions(BaseModel):
-    n: Optional[int] = None
-    best_of: Optional[int] = None
-    presence_penalty: Optional[float] = None
-    frequency_penalty: Optional[float] = None
-    repetition_penalty: Optional[float] = None
-    temperature: Optional[float] = None
-    top_p: Optional[float] = None
-    top_k: Optional[int] = None
-    min_p: Optional[float] = None
-    use_beam_search: Optional[bool] = None
-    length_penalty: Optional[float] = None
-    seed: Optional[int] = None
-
-
-class TRTLLMWorkerRequest(BaseModel):
-    token_ids: List[TokenIdType]
-    stop_conditions: StopConditions
-    sampling_options: SamplingOptions
-    eos_token_ids: List[TokenIdType] = Field(default_factory=list)
-    mdc_sum: Optional[str] = None
-    annotations: List[str] = Field(default_factory=list)
-    estimated_prefix_hit_num_blocks: Optional[int] = None
-    disaggregated_params: Optional[DisaggregatedParams] = Field(default=None)
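`DisaggregatedTypeConverter` above exists because the engine-side params carry `opaque_state` as raw bytes, while the wire-side params carry it as base64 text that survives JSON serialization. The sketch below shows that round trip with two small dataclasses standing in for the TensorRT-LLM types. `EngineParams`, `WireParams`, `to_wire`, and `to_engine` are hypothetical names used only for illustration.

```python
import base64
from dataclasses import dataclass
from typing import Optional


@dataclass
class EngineParams:
    """Stand-in for the engine-side params, which carry raw bytes."""

    request_type: str
    opaque_state: Optional[bytes] = None


@dataclass
class WireParams:
    """Stand-in for the JSON-friendly params, which carry base64 text."""

    request_type: str
    encoded_opaque_state: Optional[str] = None


def to_wire(params: EngineParams) -> WireParams:
    # Encode bytes as base64 text so the state can be embedded in JSON.
    encoded = (
        base64.b64encode(params.opaque_state).decode("utf-8")
        if params.opaque_state is not None
        else None
    )
    return WireParams(params.request_type, encoded)


def to_engine(params: WireParams) -> EngineParams:
    # Decode the base64 text back into the original bytes.
    decoded = (
        base64.b64decode(params.encoded_opaque_state)
        if params.encoded_opaque_state is not None
        else None
    )
    return EngineParams(params.request_type, decoded)


original = EngineParams("context_only", opaque_state=b"\x00\x01binary-kv-handle")
round_tripped = to_engine(to_wire(original))
print(round_tripped == original)
```

Passing `None` through unchanged in both directions mirrors the converter above, which leaves the state unset when there is nothing to transfer.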
--- a/examples/tensorrt_llm/components/frontend.py
+++ b/examples/tensorrt_llm/components/frontend.py
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import logging
-import subprocess
-from pathlib import Path
-
-from components.worker import TensorRTLLMWorker
-from fastapi import FastAPI
-from pydantic import BaseModel
-
-from dynamo import sdk
-from dynamo.sdk import depends, service
-from dynamo.sdk.lib.config import ServiceConfig
-from dynamo.sdk.lib.image import DYNAMO_IMAGE
-
-logger = logging.getLogger(__name__)
-
-
-def get_dynamo_run_binary():
-    """Find the dynamo-run binary path in SDK or fallback to 'dynamo-run' command."""
-    sdk_path = Path(sdk.__file__)
-    binary_path = sdk_path.parent / "cli/bin/dynamo-run"
-    if not binary_path.exists():
-        return "dynamo-run"
-    else:
-        return str(binary_path)
-
-
-class FrontendConfig(BaseModel):
-    """Configuration for the Frontend service including model and HTTP server settings."""
-
-    served_model_name: str
-    endpoint: str
-    port: int = 8000
-    router: str = "round-robin"
-    block_size: int = 32
-
-
-# TODO: This should be called ApiServer.
-@service(
-    dynamo={
-        "namespace": "dynamo",
-    },
-    workers=1,
-    image=DYNAMO_IMAGE,
-    app=FastAPI(title="TensorRT-LLM Example"),
-)
-class Frontend:
-    worker = depends(TensorRTLLMWorker)
-
-    def __init__(self):
-        """Initialize Frontend service with HTTP server and model configuration."""
-        self.frontend_config = FrontendConfig(
-            **ServiceConfig.get_parsed_config("Frontend")
-        )
-        self.process = None
-
-        logger.warning(f"Frontend config: {self.frontend_config}")
-
-        self.start_ingress_and_processor()
-
-    def start_ingress_and_processor(self):
-        """Starting dynamo-run based ingress and processor"""
-        logger.info(
-            f"Starting HTTP server and processor on port {self.frontend_config.port}"
-        )
-        dynamo_run_binary = get_dynamo_run_binary()
-
-        cmd = [
-            dynamo_run_binary,
-            "in=http",
-            "out=dyn",
-            "--http-port",
-            str(self.frontend_config.port),
-            "--router-mode",
-            self.frontend_config.router,
-        ]
-
-        logger.info(f"Frontend cmd: {cmd}")
-
-        self.process = subprocess.Popen(
-            cmd,
-            stdout=None,
-            stderr=None,
-        )
-
-    def close(self):
-        """Clean up resources by terminating the subprocess."""
-        if self.process is not None:
-            try:
-                logger.info("Terminating subprocess...")
-                self.process.terminate()
-                # Wait for process to terminate with a timeout
-                self.process.wait(timeout=5)
-            except subprocess.TimeoutExpired:
-                logger.warning("Subprocess did not terminate gracefully, forcing kill")
-                self.process.kill()
-                self.process.wait()
-            except Exception as e:
-                logger.error(f"Error while terminating subprocess: {e}")
-            finally:
-                self.process = None
-
-    def __del__(self):
-        """Destructor to ensure subprocess is cleaned up."""
-        self.close()
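The `close()` method above follows a common shutdown pattern: ask the child process to terminate, wait for a bounded time, and fall back to a forced kill. A self-contained sketch of that pattern is shown below. The `stop_process` helper and the sleeping child process are assumptions for illustration; the removed code managed a `dynamo-run` subprocess instead.

```python
import subprocess
import sys


def stop_process(process: subprocess.Popen, timeout: float = 5.0) -> int:
    """Terminate a child process, escalating to kill if it does not exit in time."""
    process.terminate()
    try:
        return process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # The child outlived the termination request; force it down.
        process.kill()
        return process.wait()


# Hypothetical long-running child, standing in for the dynamo-run subprocess.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_process(child)
print(child.poll() is not None)
```

Bounding the wait keeps shutdown from hanging on a stuck child, and the final `wait()` after `kill()` reaps the process so it does not linger as a zombie.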
--- a/examples/tensorrt_llm/components/prefill_worker.py
+++ b/examples/tensorrt_llm/components/prefill_worker.py
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-import logging
-
-from common.base_engine import BaseEngineConfig, BaseTensorrtLLMEngine
-from common.parser import parse_tensorrt_llm_args
-from common.protocol import TRTLLMWorkerRequest
-
-from dynamo.sdk import async_on_start, dynamo_context, endpoint, on_shutdown, service
-from dynamo.sdk.lib.config import ServiceConfig
-
-logger = logging.getLogger(__name__)
-
-
-@service(
-    dynamo={
-        "namespace": "dynamo",
-    },
-    resources={"gpu": 1, "cpu": "10", "memory": "20Gi"},
-    workers=1,
-)
-class TensorRTLLMPrefillWorker(BaseTensorrtLLMEngine):
-    def __init__(self):
-        logger.info("Initializing TensorRT-LLM Prefill Worker")
-        class_name = self.__class__.__name__
-        config = ServiceConfig.get_instance()
-        config_args = config.as_args(class_name, prefix="")
-        args = parse_tensorrt_llm_args(config_args)
-        lease_id = dynamo_context["endpoints"][0].lease_id()
-        namespace, _ = TensorRTLLMPrefillWorker.dynamo_address()  # type: ignore
-
-        engine_config = BaseEngineConfig(
-            namespace=namespace,
-            component=class_name,
-            endpoint="generate",
-            model_path=args.model_path,
-            served_model_name=args.served_model_name,
-            kv_block_size=args.kv_block_size,
-            extra_engine_args=args.extra_engine_args,
-            publish_events_and_metrics=False,
-            disaggregation_mode="prefill",
-            remote_prefill_endpoint=None,
-            lease_id=lease_id,
-        )
-
-        super().__init__(config=engine_config)
-
-    @async_on_start
-    async def async_init(self):
-        runtime = dynamo_context["runtime"]
-        await self.initialize(runtime)
-        logger.info("TensorRT-LLM Prefill Worker initialized")
-
-    @on_shutdown
-    async def async_cleanup(self):
-        logger.info("Cleaning up TensorRT-LLM Prefill Worker")
-        await self.cleanup()
-        logger.info("TensorRT-LLM Prefill Worker cleanup completed")
-
-    @endpoint()
-    async def generate(self, request: TRTLLMWorkerRequest):
-        async for response in super().generate(request):
-            yield response
--- a/examples/tensorrt_llm/components/worker.py
+++ b/examples/tensorrt_llm/components/worker.py
 # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
+
+import asyncio
 import logging
+import os
+import signal
+import sys
+from typing import TYPE_CHECKING
+
+import uvloop
+from tensorrt_llm import SamplingParams
+from tensorrt_llm.llmapi.llm_utils import update_llm_args_with_extra_options
+from tensorrt_llm.llmapi.tokenizer import tokenizer_factory

-from common.base_engine import BaseEngineConfig, BaseTensorrtLLMEngine
-from common.parser import parse_tensorrt_llm_args
-from common.protocol import TRTLLMWorkerRequest
-from components.prefill_worker import TensorRTLLMPrefillWorker
-
-from dynamo.llm import ModelType, register_llm
-from dynamo.sdk import (
-    async_on_start,
-    depends,
-    dynamo_context,
-    endpoint,
-    on_shutdown,
-    service,
+from dynamo.llm import (
+    ModelType,
+    get_tensorrtllm_engine,
+    get_tensorrtllm_publisher,
+    register_llm,
 )
-from dynamo.sdk.lib.config import ServiceConfig
+from dynamo.runtime import DistributedRuntime, dynamo_worker
+from dynamo.runtime.logging import configure_dynamo_logging

-logger = logging.getLogger(__name__)
+if TYPE_CHECKING:
+    from utils.trtllm_utils import Config


-@service(
-    dynamo={
-        "namespace": "dynamo",
-    },
-    resources={"gpu": 1, "cpu": "10", "memory": "20Gi"},
-    workers=1,
-)
-class TensorRTLLMWorker(BaseTensorrtLLMEngine):
-    prefill_worker = depends(TensorRTLLMPrefillWorker)
-
-    def __init__(self):
-        logger.info("Initializing TensorRT-LLM Worker")
-        class_name = self.__class__.__name__
-        config = ServiceConfig.get_instance()
-        config_args = config.as_args(class_name, prefix="")
-        args = parse_tensorrt_llm_args(config_args)
-        lease_id = dynamo_context["endpoints"][0].lease_id()
-        namespace, _ = TensorRTLLMWorker.dynamo_address()  # type: ignore
-        endpoint_name = "generate"
-        publish_events_and_metrics = args.router == "kv"
-        prefill_class_name = "TensorRTLLMPrefillWorker"
-
-        if args.enable_disagg:
-            disaggregation_mode = "decode"
-        else:
-            disaggregation_mode = "prefill_and_decode"
-
-        engine_config = BaseEngineConfig(
-            namespace=namespace,
-            component=class_name,
-            endpoint=endpoint_name,
-            model_path=args.model_path,
-            served_model_name=args.served_model_name,
-            kv_block_size=args.kv_block_size,
-            extra_engine_args=args.extra_engine_args,
-            publish_events_and_metrics=publish_events_and_metrics,
-            disaggregation_mode=disaggregation_mode,
-            remote_prefill_endpoint=f"dyn://{namespace}.{prefill_class_name}.generate",
-            lease_id=lease_id,
-        )
+def _setup_path_and_imports():
+    """Set up the import path and import the utils modules."""
+    # Add the parent directory to the Python path so we can import utils
+    parent_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
+    if parent_dir not in sys.path:
+        sys.path.insert(0, parent_dir)
+
+    from utils.request_handlers.handlers import (
+        RequestHandlerConfig,
+        RequestHandlerFactory,
+    )
+    from utils.trtllm_utils import (
+        Config,
+        cmd_line_args,
+        is_first_worker,
+        parse_endpoint,
+    )
+
+    return (
+        RequestHandlerConfig,
+        RequestHandlerFactory,
+        Config,
+        cmd_line_args,
+        is_first_worker,
+        parse_endpoint,
+    )
+
+
+# Import utils modules
+(
+    RequestHandlerConfig,
+    RequestHandlerFactory,
+    Config,
+    cmd_line_args,
+    is_first_worker,
+    parse_endpoint,
+) = _setup_path_and_imports()
+
+# Default buffer size for kv cache events.
+DEFAULT_KV_EVENT_BUFFER_MAX_SIZE = 1024
+
+configure_dynamo_logging()
+
+
+async def graceful_shutdown(runtime):
+    logging.info("Received shutdown signal, shutting down DistributedRuntime")
+    runtime.shutdown()
+    logging.info("DistributedRuntime shutdown complete")
+
+
+@dynamo_worker(static=False)
+async def worker(runtime: DistributedRuntime):
+    # Set up signal handler for graceful shutdown
+    loop = asyncio.get_running_loop()
+
+    def signal_handler():
+        # Schedule the shutdown coroutine instead of calling it directly
+        asyncio.create_task(graceful_shutdown(runtime))
+
+    for sig in (signal.SIGTERM, signal.SIGINT):
+        loop.add_signal_handler(sig, signal_handler)
+
+    logging.info("Signal handlers set up for graceful shutdown")
+
+    config = cmd_line_args()
+    await init(runtime, config)

-        super().__init__(config=engine_config)

-    @async_on_start
-    async def async_init(self):
-        runtime = dynamo_context["runtime"]
-        await self.initialize(runtime)
+async def init(runtime: DistributedRuntime, config: Config):
+    """
+    Instantiate the TensorRT-LLM engine and serve requests on the configured endpoint.
+    """
+    logging.info(f"Initializing the worker with config: {config}")

-        logger.info("Registering LLM for discovery")
-        endpoint = (
-            runtime.namespace(self._config.namespace)
-            .component(self._config.component)
-            .endpoint(self._config.endpoint)
+    next_client = None
+    if config.next_endpoint:
+        logging.info(
+            f"Initializing next worker client for endpoint: {config.next_endpoint}"
        )
+        parsed_namespace, parsed_component_name, parsed_endpoint_name = parse_endpoint(
+            config.next_endpoint
+        )
+        next_client = (
+            await runtime.namespace(parsed_namespace)
+            .component(parsed_component_name)
+            .endpoint(parsed_endpoint_name)
+            .client()
+        )
+
+    component = runtime.namespace(config.namespace).component(config.component)
+    await component.create_service()
+
+    # Normalize the model path to a string; it may be a local path or a HuggingFace model identifier.
+    model_path = str(config.model_path)
+
+    arg_map = {
+        "model": model_path,
+        "tensor_parallel_size": config.tensor_parallel_size,
+        "backend": "pytorch",
+        "skip_tokenizer_init": True,
+    }
+    if config.extra_engine_args != "":
+        # TODO: Support extra engine args from json file as well.
+        arg_map = update_llm_args_with_extra_options(arg_map, config.extra_engine_args)
+    if config.publish_events_and_metrics:
+        # 'event_buffer_max_size' is required to enable TRTLLM to publish kv cache events.
+        kv_cache_config = None
+        if "kv_cache_config" not in arg_map:
+            kv_cache_config = {}
+            kv_cache_config["event_buffer_max_size"] = DEFAULT_KV_EVENT_BUFFER_MAX_SIZE
+        else:
+            kv_cache_config = arg_map["kv_cache_config"]
+            if not kv_cache_config.event_buffer_max_size:
+                kv_cache_config.event_buffer_max_size = DEFAULT_KV_EVENT_BUFFER_MAX_SIZE
+        arg_map["kv_cache_config"] = kv_cache_config
+
+        # For now, events and metrics can be published only with the pytorch backend.
+        if "backend" not in arg_map:
+            arg_map["backend"] = "pytorch"
+        elif arg_map["backend"] != "pytorch":
+            logging.error(
+                "Only pytorch backend is supported for now to publish events and metrics."
+            )
+            sys.exit(1)
+
+    logging.info(f"TensorRT-LLM engine args: {arg_map}")
+    engine_args = arg_map

-        try:
+    # Populate default sampling params from the model
+    tokenizer = tokenizer_factory(arg_map["model"])
+    default_sampling_params = SamplingParams()
+    default_sampling_params._setup(tokenizer)
+    default_sampling_params.stop = None
+
+    async with get_tensorrtllm_engine(engine_args) as engine:
+        endpoint = component.endpoint(config.endpoint)
+
+        if is_first_worker(config):
+            # Register the model with the endpoint only if the worker is first in the disaggregation chain.
            await register_llm(
                ModelType.Backend,
                endpoint,
-                self._config.model_path,
-                self._config.served_model_name,
-                kv_cache_block_size=self._config.kv_block_size,
+                config.model_path,
+                config.served_model_name,
+                kv_cache_block_size=config.kv_block_size,
            )
-            logger.info("Successfully registered LLM for discovery")
-        except Exception as e:
-            logger.error(f"Failed to register LLM for discovery: {e}")
-            raise
-
-        logger.info("TensorRT-LLM Worker initialized")
-
-    @on_shutdown
-    async def async_cleanup(self):
-        logger.info("Cleaning up TensorRT-LLM Worker")
-        await self.cleanup()
-        logger.info("TensorRT-LLM Worker cleanup completed")
-
-    @endpoint()
-    async def generate(self, request: TRTLLMWorkerRequest):
-        async for response in super().generate(request):
-            yield response
+
+        # publisher will be set later if publishing is enabled.
+        handler_config = RequestHandlerConfig(
+            component=component,
+            engine=engine,
+            default_sampling_params=default_sampling_params,
+            publisher=None,
+            disaggregation_mode=config.disaggregation_mode,
+            disaggregation_strategy=config.disaggregation_strategy,
+            next_client=next_client,
+        )
+
+        if config.publish_events_and_metrics and is_first_worker(config):
+            # Initialize and pass in the publisher to the request handler to
+            # publish events and metrics.
+            kv_listener = runtime.namespace(config.namespace).component(
+                config.component
+            )
+            async with get_tensorrtllm_publisher(
+                component,
+                engine,
+                kv_listener,
+                int(endpoint.lease_id()),
+                config.kv_block_size,
+            ) as publisher:
+                handler_config.publisher = publisher
+                handler = RequestHandlerFactory().get_request_handler(handler_config)
+                await endpoint.serve_endpoint(handler.generate)
+        else:
+            handler = RequestHandlerFactory().get_request_handler(handler_config)
+            await endpoint.serve_endpoint(handler.generate)
+
+
+if __name__ == "__main__":
+    uvloop.install()
+    asyncio.run(worker())
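The worker above resolves `config.next_endpoint` into a namespace, component, and endpoint name through `parse_endpoint`, whose implementation lives in `utils.trtllm_utils` and is not shown in this diff. The sketch below is only a guess at what such a helper might do, based on the `dyn://{namespace}.{component}.generate` string built in the removed worker code. The `parse_endpoint_sketch` name and its error handling are assumptions, not the PR's implementation.

```python
from typing import Tuple

DYN_SCHEME = "dyn://"


def parse_endpoint_sketch(endpoint: str) -> Tuple[str, str, str]:
    """Split '[dyn://]namespace.component.endpoint' into its three parts."""
    # Treat the scheme prefix as optional (an assumption for this sketch).
    path = endpoint[len(DYN_SCHEME):] if endpoint.startswith(DYN_SCHEME) else endpoint
    parts = path.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"Expected 'namespace.component.endpoint', got: {endpoint!r}"
        )
    namespace, component, name = parts
    return namespace, component, name


parsed = parse_endpoint_sketch("dyn://dynamo.TensorRTLLMPrefillWorker.generate")
print(parsed)
```

The three parts map directly onto the `runtime.namespace(...).component(...).endpoint(...)` chain used to create the client.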
--- a/examples/tensorrt_llm/configs/agg.yaml
+++ b/examples/tensorrt_llm/configs/agg.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-Frontend:
-  served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-  endpoint: dynamo.TensorRTLLMWorker.generate
-  port: 8000
-  router: round-robin
-
-TensorRTLLMWorker:
-  # Path to disk model or HuggingFace model identifier to load
-  model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-  # Name to serve the model under
-  served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-  # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine.
-  # The fields in `extra-engine-args` take priority over the TRTLLM engine fields above.
-  extra-engine-args: "configs/engine_configs/agg_config.yaml"
-  router: round-robin
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 1
\ No newline at end of file
--- a/examples/tensorrt_llm/configs/agg_router.yaml
+++ b/examples/tensorrt_llm/configs/agg_router.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-Frontend:
-  served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-  endpoint: dynamo.TensorRTLLMWorker.generate
-  port: 8000
-  router: kv
-
-TensorRTLLMWorker:
-  # Path to disk model or HuggingFace model identifier to load
-  model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-  # Name to serve the model under
-  served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-  # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine.
-  # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields.
-  extra-engine-args: "configs/engine_configs/agg_config.yaml"
-  router: kv
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 1
\ No newline at end of file
--- a/examples/tensorrt_llm/configs/deepseek_r1/agg.yaml
+++ b/examples/tensorrt_llm/configs/deepseek_r1/agg.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-Frontend:
-  # This is the client-facing model name, you can set this to anything you'd like.
-  served_model_name: "nvidia/DeepSeek-R1-FP4"
-  endpoint: dynamo.TensorRTLLMWorker.generate
-  port: 8000
-  router: round-robin
-
-TensorRTLLMWorker:
-  served_model_name: "nvidia/DeepSeek-R1-FP4"
-  # NOTE: FP4 only supported starting with Blackwell GPUs.
-  # https://huggingface.co/nvidia/DeepSeek-R1-FP4
-  # You can also specify the full path to locally downloaded weights
-  # instead of a HuggingFace ID here.
-  model-path: "nvidia/DeepSeek-R1-FP4"
-  extra-engine-args: "configs/deepseek_r1/engine_configs/agg_config.yaml"
-  router: round-robin
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 4
--- a/examples/tensorrt_llm/configs/deepseek_r1/disagg.yaml
+++ b/examples/tensorrt_llm/configs/deepseek_r1/disagg.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-Frontend:
-  # This is the client-facing model name, you can set this to anything you'd like.
-  served_model_name: "nvidia/DeepSeek-R1-FP4"
-  endpoint: dynamo.TensorRTLLMWorker.generate
-  port: 8000
-  router: round-robin
-
-TensorRTLLMWorker:
-  served_model_name: "nvidia/DeepSeek-R1-FP4"
-  # NOTE: FP4 only supported starting with Blackwell GPUs.
-  # https://huggingface.co/nvidia/DeepSeek-R1-FP4
-  # You can also specify the full path to locally downloaded weights
-  # instead of a HuggingFace ID here.
-  model-path: "nvidia/DeepSeek-R1-FP4"
-  extra-engine-args: "configs/deepseek_r1/engine_configs/decode_config.yaml"
-  enable-disagg: true
-  router: round-robin
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 4
-
-TensorRTLLMPrefillWorker:
-  # NOTE: FP4 only supported starting with Blackwell GPUs.
-  # https://huggingface.co/nvidia/DeepSeek-R1-FP4
-  # You can also specify the full path to locally downloaded weights
-  # instead of a HuggingFace ID here.
-  model-path: "nvidia/DeepSeek-R1-FP4"
-  extra-engine-args: "configs/deepseek_r1/engine_configs/prefill_config.yaml"
-  router: round-robin
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 4
--- a/examples/tensorrt_llm/configs/deepseek_r1/mtp/mtp_agg.yaml
+++ b/examples/tensorrt_llm/configs/deepseek_r1/mtp/mtp_agg.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-Frontend:
-  served_model_name: "nvidia/DeepSeek-R1-FP4"
-  endpoint: dynamo.TensorRTLLMWorker.generate
-  port: 8000
-  router: round-robin
-
-TensorRTLLMWorker:
-  served_model_name: "nvidia/DeepSeek-R1-FP4"
-  # NOTE: FP4 only supported starting with Blackwell GPUs.
-  # https://huggingface.co/nvidia/DeepSeek-R1-FP4
-  # You can also specify the full path to locally downloaded weights
-  # instead of a HuggingFace ID here.
-  model-path: "nvidia/DeepSeek-R1-FP4"
-  # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine.
-  # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields.
-  extra-engine-args: "configs/deepseek_r1/mtp/engine_configs/agg_config.yaml"
-  router: round-robin
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 4
--- a/examples/tensorrt_llm/configs/deepseek_r1/mtp/mtp_disagg.yaml
+++ b/examples/tensorrt_llm/configs/deepseek_r1/mtp/mtp_disagg.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-Frontend:
-  served_model_name: "nvidia/DeepSeek-R1-FP4"
-  endpoint: dynamo.TensorRTLLMWorker.generate
-  port: 8000
-  router: round-robin
-
-TensorRTLLMWorker:
-  served_model_name: "nvidia/DeepSeek-R1-FP4"
-  # NOTE: FP4 only supported starting with Blackwell GPUs.
-  # https://huggingface.co/nvidia/DeepSeek-R1-FP4
-  # You can also specify the full path to locally downloaded weights
-  # instead of a HuggingFace ID here.
-  model-path: "nvidia/DeepSeek-R1-FP4"
-  # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine.
-  # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields.
-  extra-engine-args: "configs/deepseek_r1/mtp/engine_configs/decode_config.yaml"
-  router: round-robin
-  enable-disagg: true
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 4
-
-TensorRTLLMPrefillWorker:
-  # NOTE: FP4 only supported starting with Blackwell GPUs.
-  # https://huggingface.co/nvidia/DeepSeek-R1-FP4
-  # You can also specify the full path to locally downloaded weights
-  # instead of a HuggingFace ID here.
-  model-path: "nvidia/DeepSeek-R1-FP4"
-  # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine.
-  # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields.
-  extra-engine-args: "configs/deepseek_r1/mtp/engine_configs/prefill_config.yaml"
-  router: round-robin
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 4
--- a/examples/tensorrt_llm/configs/disagg.yaml
+++ b/examples/tensorrt_llm/configs/disagg.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-Frontend:
-  served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-  endpoint: dynamo.TensorRTLLMWorker.generate
-  port: 8000
-  router: round-robin
-
-TensorRTLLMWorker:
-  # Path to disk model or HuggingFace model identifier to load
-  model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-  # Name to serve the model under
-  served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-  # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine.
-  # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields.
-  extra-engine-args: "configs/engine_configs/decode_config.yaml"
-  enable-disagg: true
-  router: round-robin
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 1
-
-TensorRTLLMPrefillWorker:
-  # Path to disk model or HuggingFace model identifier to load
-  model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-  # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine.
-  # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields.
-  extra-engine-args: "configs/engine_configs/prefill_config.yaml"
-  router: round-robin
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 1
-
--- a/examples/tensorrt_llm/configs/disagg_router.yaml
+++ b/examples/tensorrt_llm/configs/disagg_router.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-Frontend:
-  served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-  endpoint: dynamo.TensorRTLLMWorker.generate
-  port: 8000
-  router: kv
-
-TensorRTLLMWorker:
-  # Path to disk model or HuggingFace model identifier to load
-  model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-  # Name to serve the model under
-  served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-  # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine.
-  # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields.
-  extra-engine-args: "configs/engine_configs/decode_config.yaml"
-  enable-disagg: true
-  router: kv
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 1
-
-TensorRTLLMPrefillWorker:
-  # Path to disk model or HuggingFace model identifier to load
-  model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-  # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine.
-  # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields.
-  extra-engine-args: "configs/engine_configs/prefill_config.yaml"
-  router: round-robin
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 1
\ No newline at end of file
--- a/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml
+++ b/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-Frontend:
-  # This is the client-facing model name, you can set this to anything you'd like.
-  served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8"
-  endpoint: dynamo.TensorRTLLMWorker.generate
-  port: 8000
-  router: round-robin
-
-TensorRTLLMWorker:
-  served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8"
-  model-path: "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8"
-  extra-engine-args: "configs/llama4/eagle/engine_configs/agg_config.yaml"
-  router: round-robin
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 4
--- a/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml
+++ b/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-Frontend:
-  served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8"
-  endpoint: dynamo.TensorRTLLMWorker.generate
-  port: 8000
-  router: round-robin
-
-TensorRTLLMWorker:
-  served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8"
-  model-path: "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8"
-  # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine.
-  # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields.
-  extra-engine-args: "configs/llama4/eagle/engine_configs/decode_config.yaml"
-  router: round-robin
-  enable-disagg: true
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 4
-
-TensorRTLLMPrefillWorker:
-  model-path: "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8"
-  # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine.
-  # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields.
-  extra-engine-args: "configs/llama4/eagle/engine_configs/prefill_config.yaml"
-  router: round-robin
-  ServiceArgs:
-    workers: 1
-    resources:
-      gpu: 4
--- a/examples/tensorrt_llm/configs/engine_configs/agg_config.yaml
+++ b/examples/tensorrt_llm/configs/engine_configs/agg_config.yaml