Commit 0bfd9a76 authored by Neelay Shah, committed by GitHub

refactor: remove python native runtime

parent 8f741f14
...@@ -109,37 +109,6 @@ in the Hello World example. In this example, we demonstrate
[Disaggregated Serving](https://arxiv.org/abs/2401.09670) as an
application of the components defined in Triton Distributed.
## Python Native Distributed Runtime
Triton Distributed has a Python-native distributed runtime whose
implementation is under development. Please note that the APIs are
subject to change.
### Hello World
[Hello World](./examples/python/hello_world)
A basic example demonstrating the new interfaces and concepts of
Triton Distributed. In the Hello World example, you can deploy a set
of simple workers to load balance requests from a local work queue.
### LLM
[TENSORRTLLM](./examples/python/llm/tensorrtllm)
An intermediate example expanding further on the concepts introduced
in the Hello World example. In this example, we demonstrate
[Disaggregated Serving](https://arxiv.org/abs/2401.09670)
as an application of the components defined in Triton Distributed.
[VLLM](./examples/python/llm/vllm)
An intermediate example expanding further on the concepts introduced
in the Hello World example. In this example, we demonstrate
[Disaggregated Serving](https://arxiv.org/abs/2401.09670)
as an application of the components defined in Triton Distributed.
# Disclaimers
> [!NOTE]
...
...@@ -131,9 +131,6 @@ RUN apt-get install tmux -y
# Working directory
WORKDIR /workspace
COPY icp /workspace/icp
RUN /workspace/icp/protos/gen_python.sh
COPY runtime /workspace/runtime
RUN cd runtime/rust && \
    cargo build --release --locked && cargo doc --no-deps
...@@ -173,24 +170,10 @@ RUN mkdir -p /opt/triton/llm_binding/wheels && \
# TODO: In future, we may use a virtualenv for everything and remove this.
RUN pip install /opt/triton/llm_binding/wheels/triton_distributed_rs*cp312*.whl
# Install python packages
ARG PYTHON_PACKAGE_VERSION=0.0.1.dev+unknown
# SETUPTOOLS_SCM_PRETEND_VERSION allows dynamically setting the package versions during build/install.
# This allows having versioned packages during development between releases, such as commit IDs.
#
# Normally SCM version is taken directly from .git but this is not available in the Dockerfile
# and so we pass in via a buildarg
RUN SETUPTOOLS_SCM_PRETEND_VERSION_FOR_TRITON_DISTRIBUTED_ICP=${PYTHON_PACKAGE_VERSION} pip install -e /workspace/icp/python
RUN SETUPTOOLS_SCM_PRETEND_VERSION_FOR_TRITON_DISTRIBUTED_RUNTIME=${PYTHON_PACKAGE_VERSION} pip install -e /workspace/runtime/python
# Copy everything in after install steps to avoid re-running build/install
# commands on unrelated changes in other dirs.
COPY . /workspace
# Sets pythonpath for python modules
ENV PYTHONPATH="${PYTHONPATH}:/workspace/examples/python:/opt/tritonserver/python/openai/openai_frontend"
# Enable system UCX
ENV RAPIDS_LIBUCX_PREFER_SYSTEM_LIBRARY=true
...
...@@ -217,9 +217,6 @@ get_options() {
ENVIRONMENT_VARIABLES+=" -e HF_TOKEN"
if [ ! -d "${SOURCE_DIR}/icp/src/python/tdist/icp/protos" ]; then
$RUN_PREFIX docker run --rm -t -v ${SOURCE_DIR}/..:/workspace -w /workspace $IMAGE /workspace/icp/protos/gen_python.sh > /dev/null 2>&1
fi
INTERACTIVE=" -it "
fi
...
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Hello World
A basic example demonstrating the new interfaces and concepts of
Triton Distributed. In the Hello World example, you can deploy a set
of simple workers to load-balance requests from a local work queue.
The example demonstrates:
1. How to incorporate an existing Triton Core model into a Triton Distributed worker.
2. How to incorporate a standalone Python class into a Triton Distributed worker.
3. How to deploy a set of workers.
4. How to send requests to the Triton Distributed deployment.
5. How requests travel over the Request Plane and data moves over the
Data Plane.
## Building the Hello World Environment
The Hello World example is designed to be deployed in a containerized
environment and works with and without GPU support.
To get started, build the "STANDARD" Triton Distributed development
environment.
Note: "STANDARD" is the default framework.
```
./container/build.sh
```
## Starting the Deployment
```
./container/run.sh -it -- python3 -m hello_world.deploy --initialize-request-plane
```
#### Expected Output
```
Starting Workers
17:17:09 deployment.py:115[triton_distributed.worker.deployment] INFO:
Starting Worker:
Config:
WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
data_plane=<function UcpDataPlane at 0x7f477eb5d580>,
request_plane_args=([], {}),
data_plane_args=([], {}),
log_level=1,
operators=[OperatorConfig(name='encoder',
implementation=<class 'triton_distributed.worker.triton_core_operator.TritonCoreOperator'>,
repository='/workspace/examples/hello_world/operators/triton_core_models',
version=1,
max_inflight_requests=1,
parameters={'config': {'instance_group': [{'count': 1,
'kind': 'KIND_CPU'}],
'parameters': {'delay': {'string_value': '0'},
'input_copies': {'string_value': '1'}}}},
log_level=None)],
triton_log_path=None,
name='encoder.0',
log_dir='/workspace/examples/hello_world/logs',
metrics_port=50000)
<SpawnProcess name='encoder.0' parent=1 initial>
17:17:09 deployment.py:115[triton_distributed.worker.deployment] INFO:
Starting Worker:
Config:
WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
data_plane=<function UcpDataPlane at 0x7f477eb5d580>,
request_plane_args=([], {}),
data_plane_args=([], {}),
log_level=1,
operators=[OperatorConfig(name='decoder',
implementation=<class 'triton_distributed.worker.triton_core_operator.TritonCoreOperator'>,
repository='/workspace/examples/hello_world/operators/triton_core_models',
version=1,
max_inflight_requests=1,
parameters={'config': {'instance_group': [{'count': 1,
'kind': 'KIND_CPU'}],
'parameters': {'delay': {'string_value': '0'},
'input_copies': {'string_value': '1'}}}},
log_level=None)],
triton_log_path=None,
name='decoder.0',
log_dir='/workspace/examples/hello_world/logs',
metrics_port=50001)
<SpawnProcess name='decoder.0' parent=1 initial>
17:17:09 deployment.py:115[triton_distributed.worker.deployment] INFO:
Starting Worker:
Config:
WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
data_plane=<function UcpDataPlane at 0x7f477eb5d580>,
request_plane_args=([], {}),
data_plane_args=([], {}),
log_level=1,
operators=[OperatorConfig(name='encoder_decoder',
implementation='EncodeDecodeOperator',
repository='/workspace/examples/hello_world/operators',
version=1,
max_inflight_requests=1,
parameters={},
log_level=None)],
triton_log_path=None,
name='encoder_decoder.0',
log_dir='/workspace/examples/hello_world/logs',
metrics_port=50002)
<SpawnProcess name='encoder_decoder.0' parent=1 initial>
Workers started ... press Ctrl-C to Exit
```
## Sending Requests
From a separate terminal, run the sample client.
```
./container/run.sh -it -- python3 -m hello_world.client
```
#### Expected Output
```
Client: 0 Received Response: 42 From: 39491f06-d4f7-11ef-be96-047bcba9020e Error: None: 43%|███████▋ | 43/100 [00:04<00:05, 9.83request/s]
Throughput: 9.10294484748811 Total Time: 10.985455989837646
Clients Stopped Exit Code 0
```
## Behind the Scenes
The Hello World example is designed to let you experiment with
different mixtures of compute and memory load, and with different
numbers of workers for each part of the workflow.
### Hello World Workflow
The Hello World workflow is a simple two-stage pipeline with an
encoding stage and a decoding stage, plus an encoder-decoder stage
that orchestrates the overall workflow.
```
client <-> encoder_decoder <-> encoder
                 |
                 -----<-> decoder
```
#### Encoder
The encoder follows a simple procedure (see the sketch below):
1. Copy the input x times (x is configurable via the `input_copies` parameter).
2. Invert the result.
3. Sleep for `delay` seconds per element of the output.
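A minimal sketch of the encoder transform, assuming plain numpy (the
actual implementation is the Triton Python backend model included
later in this commit):
```
import time

import numpy


def encode(x, input_copies=1, delay=0.0):
    # Copy the input `input_copies` times, then invert it elementwise.
    output = numpy.invert(numpy.tile(x, input_copies))
    # Simulated compute cost scales with the output size.
    time.sleep(len(output) * delay)
    return output
```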
#### Decoder
The decoder follows a simple procedure (see the sketch below):
1. Remove the extra copies.
2. Invert the result.
3. Sleep for `delay` seconds per element of the output.
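The matching decoder transform, as the same kind of hedged numpy
sketch. With the default `input_copies=1`, `decode(encode(x))`
returns `x`, which is what the sample client asserts:
```
def decode(y, input_copies=1, delay=0.0):
    # Invert back, then drop the extra copies by striding.
    output = numpy.invert(y)[::input_copies]
    time.sleep(len(output) * delay)
    return output
```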
#### Encoder-Decoder
The encoder-decoder operator controls the overall workflow. It first
sends a request to the encoder. Once it receives the response, it
sends the encoder's output as an input to the decoder. Note that in
this step memory is transferred directly between the encoder and
decoder workers and does not pass through the encoder-decoder.
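A condensed sketch of that flow; the names mirror the full
`EncodeDecodeOperator` included later in this commit, whose
constructor takes more arguments than shown here:
```
from triton_distributed.runtime import Operator, RemoteOperator


class EncodeDecodeSketch(Operator):
    def __init__(self, request_plane, data_plane):
        # Handles for remote inference against the other workers.
        self._encoder = RemoteOperator("encoder", request_plane, data_plane)
        self._decoder = RemoteOperator("decoder", request_plane, data_plane)

    async def execute(self, requests):
        for request in requests:
            # Stage 1: remote inference on the encoder.
            encoded = await self._encoder.async_infer(
                inputs={"input": request.inputs["input"]}
            )
            async for encoded_response in encoded:
                # Stage 2: forward the encoder's output descriptor to the
                # decoder; the tensor moves directly between those workers.
                decoded = await self._decoder.async_infer(
                    inputs={"input": encoded_response.outputs["output"]}
                )
                async for decoded_response in decoded:
                    await request.response_sender().send(
                        final=True,
                        outputs={"output": decoded_response.outputs["output"]},
                    )
```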
### Operators
Operators are responsible for doing the actual work and responding to
requests. Operators come in two main flavors, both hosted by a common
Worker class.
#### Triton Core Operator
The Triton Core operator makes a Triton model (following the
[standard model
repository](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_repository.md)
and backend structure of Triton Inference Server) available on the
request plane. Both the encoder and decoder are implemented as Triton
Python backend models.
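Exposing a Triton Core model on the request plane is a matter of
configuration. This sketch mirrors the `OperatorConfig` used in
`deploy.py` later in this commit; the repository path is illustrative:
```
from triton_distributed.runtime import OperatorConfig, TritonCoreOperator

encoder_op = OperatorConfig(
    name="encoder",
    repository="operators/triton_core_models",  # a standard Triton model repo
    implementation=TritonCoreOperator,
    max_inflight_requests=1,
    parameters={
        "config": {
            "instance_group": [{"count": 1, "kind": "KIND_CPU"}],
            "parameters": {"delay": {"string_value": "0"}},
        }
    },
)
```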
#### Custom Operator
The encoder-decoder operator is a Python class that implements the
`Operator` interface. Internally, it makes remote requests to other
workers. Generally, an operator can make use of other operators for
its work but isn't required to.
### Workers
Workers host one or more operators: they pull requests for those
operators from the request plane and forward them to the local
operator.
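A worker is configured with the operators it should host; this sketch
reuses `encoder_op` from the sketch above:
```
from triton_distributed.runtime import WorkerConfig

encoder_worker = WorkerConfig(
    operators=[encoder_op],  # operators served by this worker
    name="encoder",
)
```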
### Request Plane
The current Triton Distributed framework leverages a distributed work
queue for its request plane implementation. The request plane ensures
that each request for an operator is forwarded to and serviced by a
single worker.
### Data Plane
The Triton Distributed framework leverages point-to-point data
transfers using the UCX library to provide optimized primitives for
device-to-device transfers.
Data sent over the data plane is pulled only by the worker that needs
to perform work on it. Requests themselves contain data descriptors
that can be referenced and shared with other workers.
Note: there is also a provision for sending data inline in the
request contents when the message is small enough that a UCX transfer
is not needed.
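A hedged sketch of how a client attaches to both planes and targets an
operator by name, condensed from `client.py` below (the connect calls
must run inside an async context):
```
from triton_distributed.icp import NatsRequestPlane, UcpDataPlane
from triton_distributed.runtime import RemoteOperator

request_plane = NatsRequestPlane("nats://localhost:4223")
data_plane = UcpDataPlane()
await request_plane.connect()
data_plane.connect()

# Requests travel over the request plane; tensor contents move over
# the data plane.
encoder_decoder = RemoteOperator("encoder_decoder", request_plane, data_plane)
```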
### Components
Any process that communicates with one or more of the request or data
planes is considered a "component". While this example only uses
"Workers", future examples will also include API servers, routers,
and other types of components.
### Deployment
The final piece is a deployment: a set of components deployed across
a cluster. Components may be added to and removed from a deployment.
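Putting the pieces together, a deployment maps worker configs to
instance counts. This condensed sketch follows `deploy.py` and
`single_file.py` later in this commit and assumes `decoder_worker`
and `encoder_decoder_worker` are defined like `encoder_worker` above:
```
from triton_distributed.runtime import Deployment

deployment = Deployment(
    [
        # (worker_config, instance_count)
        (encoder_worker, 1),
        (decoder_worker, 1),
        (encoder_decoder_worker, 1),
    ],
    initialize_request_plane=True,  # only one component should do this
)
deployment.start()
```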
## Limitations and Caveats
The example is a rapidly evolving prototype and shouldn't be used in
production. Only limited testing has been done; the example is meant
to help flesh out the Triton Distributed concepts, architecture, and
interfaces.
1. No multi-node testing or support has been done.
2. No performance tuning or measurement has been done.
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import multiprocessing
import signal
import sys
import time
from typing import Optional
from .client import _start_client
from .parser import parse_args
processes: Optional[list[multiprocessing.context.SpawnProcess]] = None
def handler(signum, frame):
exit_code = 0
if processes:
print("Stopping Clients")
for process in processes:
process.terminate()
process.kill()
process.join()
if process.exitcode is not None:
exit_code += process.exitcode
print(f"Clients Stopped Exit Code {exit_code}")
sys.exit(exit_code)
signals = (signal.SIGHUP, signal.SIGTERM, signal.SIGINT)
for sig in signals:
try:
signal.signal(sig, handler)
except Exception:
pass
def main(args):
global processes
process_context = multiprocessing.get_context("spawn")
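    # Share one lock across client processes so tqdm progress bars
    # don't clobber each other (client.py calls tqdm.set_lock with it).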
args.lock = process_context.Lock()
processes = []
start_time = time.time()
for index in range(args.clients):
processes.append(
process_context.Process(target=_start_client, args=(index, args))
)
processes[-1].start()
for process in processes:
process.join()
end_time = time.time()
print(
f"Throughput: {(args.requests_per_client*args.clients)/(end_time-start_time)} Total Time: {end_time-start_time}"
)
exit_code = 0
for process in processes:
if process.exitcode is not None:
exit_code += process.exitcode
print(f"Clients Stopped Exit Code {exit_code}")
return exit_code
if __name__ == "__main__":
args = parse_args()
sys.exit(main(args))
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import asyncio
import sys
import cupy
import numpy
from tqdm import tqdm
from tritonserver import MemoryType
from triton_distributed.icp import NatsRequestPlane, UcpDataPlane
from triton_distributed.runtime import RemoteOperator
def _get_input_sizes(args):
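    # Sample per-request input sizes from a normal distribution,
    # clipped below at zero.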
return numpy.maximum(
0,
numpy.round(
numpy.random.normal(
loc=args.input_size_mean,
scale=args.input_size_stdev,
size=args.requests_per_client,
)
),
).astype(int)
def _start_client(client_index, args):
sys.exit(asyncio.run(client(client_index, args)))
async def client(client_index, args):
request_count = args.requests_per_client
try:
request_plane = NatsRequestPlane(args.request_plane_uri)
data_plane = UcpDataPlane()
await request_plane.connect()
data_plane.connect()
remote_operator: RemoteOperator = RemoteOperator(
args.operator, request_plane, data_plane
)
input_sizes = _get_input_sizes(args)
inputs = [
numpy.array(numpy.random.randint(0, 100, input_sizes[index]))
for index in range(request_count)
]
tqdm.set_lock(args.lock)
with tqdm(
total=args.requests_per_client,
desc=f"Client: {client_index}",
unit="request",
position=client_index,
leave=False,
) as pbar:
requests = [
await remote_operator.async_infer(
inputs={"input": inputs[index]}, request_id=str(index)
)
for index in range(request_count)
]
for request in requests:
async for response in request:
for output_name, output_value in response.outputs.items():
if output_value.memory_type == MemoryType.CPU:
output = numpy.from_dlpack(output_value)
numpy.testing.assert_array_equal(
output, inputs[int(response.request_id)]
)
else:
output = cupy.from_dlpack(output_value)
cupy.testing.assert_array_equal(
output, inputs[int(response.request_id)]
)
del output_value
pbar.set_description(
f"Client: {client_index} Received Response: {response.request_id} From: {response.component_id} Error: {response.error}"
)
pbar.update(1)
del response
await request_plane.close()
data_plane.close()
except Exception as e:
print(f"Exception: {e}")
return 1
else:
return 0
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
def parse_args(args=None):
parser = argparse.ArgumentParser(description="Hello World Client")
parser.add_argument(
"--request-plane-uri",
type=str,
default="nats://localhost:4223",
help="URI of request plane",
)
parser.add_argument(
"--requests-per-client",
type=int,
default=100,
help="number of requests to send per client",
)
parser.add_argument(
"--operator",
type=str,
choices=["encoder_decoder", "encoder", "decoder"],
default="encoder_decoder",
help="operator to send requests to. In this example all operators have the same input and output names.",
)
parser.add_argument(
"--input-size-mean",
type=int,
default=1000,
help="average input size for requests",
)
parser.add_argument(
"--input-size-stdev",
type=float,
default=0,
help="standard deviation for input size",
)
parser.add_argument(
"--clients", type=int, default=1, help="number of concurrent clients to launch."
)
args = parser.parse_args(args)
return args
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import asyncio
import shutil
import signal
import sys
import time
from pathlib import Path
from triton_distributed.runtime import (
Deployment,
OperatorConfig,
TritonCoreOperator,
WorkerConfig,
)
from .parser import parse_args
deployment = None
def handler(signum, frame):
exit_code = 0
if deployment:
print("Stopping Workers")
exit_code = deployment.stop()
print(f"Workers Stopped Exit Code {exit_code}")
sys.exit(exit_code)
signals = (signal.SIGHUP, signal.SIGTERM, signal.SIGINT)
for sig in signals:
try:
signal.signal(sig, handler)
except Exception:
pass
def _create_encoder_decoder_op(name, max_inflight_requests, args):
return OperatorConfig(
name=name,
implementation="EncodeDecodeOperator",
max_inflight_requests=int(max_inflight_requests),
repository=args.operator_repository,
)
def _create_triton_core_op(
name,
max_inflight_requests,
instances_per_worker,
kind,
delay_per_token,
input_copies,
args,
):
return OperatorConfig(
name=name,
repository=args.triton_core_models,
implementation=TritonCoreOperator,
max_inflight_requests=int(max_inflight_requests),
parameters={
"config": {
"instance_group": [
{"count": int(instances_per_worker), "kind": f"KIND_{kind}"}
],
"parameters": {
"delay": {"string_value": f"{delay_per_token}"},
"input_copies": {"string_value": f"{input_copies}"},
},
}
},
)
async def main(args):
global deployment
log_dir = Path(args.log_dir)
if args.clear_logs:
shutil.rmtree(log_dir)
log_dir.mkdir(exist_ok=True)
(
encoder_worker_instances,
encoder_max_inflight_requests,
encoder_instances_per_worker,
encoder_device_kind,
) = args.encoders
(
decoder_worker_instances,
decoder_max_inflight_requests,
decoder_instances_per_worker,
decoder_device_kind,
) = args.decoders
(
encoder_decoder_worker_instances,
encoder_decoder_max_inflight_requests,
) = args.encoder_decoders
encoder_op = _create_triton_core_op(
name="encoder",
max_inflight_requests=encoder_max_inflight_requests,
instances_per_worker=encoder_instances_per_worker,
kind=encoder_device_kind,
delay_per_token=args.encoder_delay_per_token,
input_copies=args.encoder_input_copies,
args=args,
)
encoder = WorkerConfig(
operators=[encoder_op],
name="encoder",
)
decoder_op = _create_triton_core_op(
name="decoder",
max_inflight_requests=decoder_max_inflight_requests,
instances_per_worker=decoder_instances_per_worker,
kind=decoder_device_kind,
delay_per_token=args.decoder_delay_per_token,
input_copies=args.encoder_input_copies,
args=args,
)
decoder = WorkerConfig(
operators=[decoder_op],
name="decoder",
)
encoder_decoder_op = _create_encoder_decoder_op(
name="encoder_decoder",
max_inflight_requests=encoder_decoder_max_inflight_requests,
args=args,
)
encoder_decoder = WorkerConfig(
operators=[encoder_decoder_op],
name="encoder_decoder",
)
print("Starting Workers")
deployment = Deployment(
[
# (worker_config, repeat_count )
(encoder, int(encoder_worker_instances)),
(decoder, int(decoder_worker_instances)),
(encoder_decoder, int(encoder_decoder_worker_instances)),
],
initialize_request_plane=args.initialize_request_plane,
log_dir=args.log_dir,
log_level=args.log_level,
starting_metrics_port=args.starting_metrics_port,
)
deployment.start()
print("Workers started ... press Ctrl-C to Exit")
while True:
time.sleep(10)
if __name__ == "__main__":
args = parse_args()
asyncio.run(main(args))
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
from pathlib import Path
def parse_args(args=None):
example_dir = Path(__file__).parent.absolute().parent.absolute()
default_log_dir = example_dir.joinpath("logs")
default_operator_repository = example_dir.joinpath("operators")
default_triton_core_models = default_operator_repository.joinpath(
"triton_core_models"
)
parser = argparse.ArgumentParser(description="Hello World Deployment")
parser.add_argument(
"--initialize-request-plane",
default=False,
action="store_true",
help="Initialize the request plane, should only be done once per deployment",
)
parser.add_argument(
"--log-dir",
type=str,
default=str(default_log_dir),
help="log dir folder",
)
parser.add_argument(
"--clear-logs", default=False, action="store_true", help="clear log directory"
)
parser.add_argument(
"--log-level", type=int, default=1, help="log level applied to all workers"
)
parser.add_argument(
"--request-plane-uri",
type=str,
default="nats://localhost:4223",
help="URI of request plane",
)
parser.add_argument(
"--starting-metrics-port",
type=int,
default=50000,
help="Metrics port for first worker. Each worker will expose metrics on subsequent ports, ex. worker 1: 50000, worker 2: 50001, worker 3: 50002",
)
parser.add_argument(
"--operator-repository",
type=str,
default=str(default_operator_repository),
help="operator repository",
)
parser.add_argument(
"--triton-core-models",
type=str,
default=str(default_triton_core_models),
help="model repository for triton core models.",
)
parser.add_argument(
"--encoder-delay-per-token",
type=float,
default=0,
help="Delay per input token. In this toy example can be used to vary the simulated compute load for encoding stage.",
)
parser.add_argument(
"--encoder-input-copies",
type=int,
default=1,
help="Number of copies of input to create during encoding. In this toy example can be used to vary the memory transferred between encoding and decoding stages.",
)
parser.add_argument(
"--encoders",
type=str,
nargs=4,
default=["1", "1", "1", "CPU"],
help="Number of encoding workers to deploy. Specified as #Workers, #MaxInflightRequests, #ModelInstancesPerWorker, CPU || GPU",
)
parser.add_argument(
"--decoders",
type=str,
nargs=4,
default=["1", "1", "1", "CPU"],
help="Number of decoding workers to deploy. Specified as #Workers, #MaxInflightRequests,#ModelInstancesPerWorker, CPU || GPU",
)
parser.add_argument(
"--decoder-delay-per-token",
type=float,
default=0,
help="Delay per input token. In this toy example can be used to vary the simulated compute load for decoding stage.",
)
parser.add_argument(
"--encoder-decoders",
type=str,
nargs=2,
default=["1", "1"],
help="Number of encode-decode workers to deploy. Specified as #Worker, #MaxInflightRequests",
)
args = parser.parse_args(args)
return args
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from hello_world.operators.encoder_decoder import (
EncodeDecodeOperator as EncodeDecodeOperator,
)
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy
from triton_distributed.runtime import Operator, RemoteInferenceRequest, RemoteOperator
class EncodeDecodeOperator(Operator):
def __init__(
self,
name,
version,
request_plane,
data_plane,
parameters,
repository,
logger,
triton_core,
):
self._encoder = RemoteOperator("encoder", request_plane, data_plane)
self._decoder = RemoteOperator("decoder", request_plane, data_plane)
self._logger = logger
async def execute(self, requests: list[RemoteInferenceRequest]):
self._logger.info("got request!")
for request in requests:
encoded_responses = await self._encoder.async_infer(
inputs={"input": request.inputs["input"]}
)
async for encoded_response in encoded_responses:
input_copies = int(
numpy.from_dlpack(encoded_response.outputs["input_copies"])
)
decoded_responses = await self._decoder.async_infer(
inputs={"input": encoded_response.outputs["output"]},
parameters={"input_copies": input_copies},
)
async for decoded_response in decoded_responses:
await request.response_sender().send(
final=True,
outputs={"output": decoded_response.outputs["output"]},
)
del decoded_response
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import time
import numpy
import triton_python_backend_utils as pb_utils
try:
import cupy
except Exception:
cupy = None
class TritonPythonModel:
@staticmethod
def auto_complete_config(auto_complete_model_config):
"""Auto Complets Model Config
Model has one input and one output
both of type int64
Parameters
----------
auto_complete_model_config : config
Enables reading and updating config.pbtxt
"""
input_config = {
"name": "input",
"data_type": "TYPE_INT64",
"dims": [-1],
"optional": False,
}
output_config = {
"name": "output",
"data_type": "TYPE_INT64",
"dims": [-1],
}
auto_complete_model_config.add_input(input_config)
auto_complete_model_config.add_output(output_config)
auto_complete_model_config.set_max_batch_size(0)
auto_complete_model_config.set_model_transaction_policy({"decoupled": False})
return auto_complete_model_config
def initialize(self, args):
self._model_config = json.loads(args["model_config"])
self._model_instance_kind = args["model_instance_kind"]
self._model_instance_device_id = int(args["model_instance_device_id"])
self._config_parameters = self._model_config.get("parameters", {})
self._input_copies = int(
self._config_parameters.get("input_copies", {"string_value": "5"})[
"string_value"
]
)
self._delay = float(
self._config_parameters.get("delay", {"string_value": "0"})["string_value"]
)
if self._model_instance_kind == "GPU" and cupy is None:
raise RuntimeError("GPU Device set but cupy not installed")
def execute(self, requests):
responses = []
input_copies = self._input_copies
delay = self._delay
for request in requests:
output_tensors = []
parameters = json.loads(request.parameters())
if parameters:
input_copies = int(parameters.get("input_copies", self._input_copies))
delay = float(parameters.get("delay", self._delay))
for input_tensor in request.inputs():
input_value = input_tensor.as_numpy()
output_value = []
if self._model_instance_kind == "GPU" and cupy is not None:
with cupy.cuda.Device(self._model_instance_device_id):
input_value = cupy.array(input_value)
output_value = cupy.invert(input_value)
output_value = output_value[::input_copies]
output_tensor = pb_utils.Tensor.from_dlpack(
"output", output_value
)
else:
output_value = numpy.invert(input_value)
output_value = output_value[::input_copies]
output_tensor = pb_utils.Tensor("output", output_value)
output_tensors.append(output_tensor)
time.sleep(len(output_value) * delay)
responses.append(pb_utils.InferenceResponse(output_tensors=output_tensors))
return responses
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
## Model Instance and Kind are filled in by configuration when launched
## All other values are filled in by auto_complete in model.py
backend: "python"
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import time
import numpy
import triton_python_backend_utils as pb_utils
try:
import cupy
except Exception:
cupy = None
class TritonPythonModel:
@staticmethod
def auto_complete_config(auto_complete_model_config):
"""Auto Complets Model Config
Model has one input and one output
both of type int64
Parameters
----------
auto_complete_model_config : config
Enables reading and updating config.pbtxt
"""
input_config = {
"name": "input",
"data_type": "TYPE_INT64",
"dims": [-1],
"optional": False,
}
output_config = {
"name": "output",
"data_type": "TYPE_INT64",
"dims": [-1],
}
copies_config = {
"name": "input_copies",
"data_type": "TYPE_INT64",
"dims": [1],
}
auto_complete_model_config.add_input(input_config)
auto_complete_model_config.add_output(output_config)
auto_complete_model_config.add_output(copies_config)
auto_complete_model_config.set_max_batch_size(0)
auto_complete_model_config.set_model_transaction_policy({"decoupled": False})
return auto_complete_model_config
def initialize(self, args):
self._model_config = json.loads(args["model_config"])
self._model_instance_kind = args["model_instance_kind"]
self._model_instance_device_id = int(args["model_instance_device_id"])
self._config_parameters = self._model_config.get("parameters", {})
self._input_copies = int(
self._config_parameters.get("input_copies", {"string_value": "5"})[
"string_value"
]
)
self._delay = float(
self._config_parameters.get("delay", {"string_value": "0"})["string_value"]
)
if self._model_instance_kind == "GPU" and cupy is None:
raise RuntimeError("GPU Device set but cupy not installed")
def execute(self, requests):
responses = []
input_copies = self._input_copies
delay = self._delay
for request in requests:
output_tensors = []
parameters = json.loads(request.parameters())
if parameters:
input_copies = int(parameters.get("input_copies", self._input_copies))
delay = float(parameters.get("delay", self._delay))
for input_tensor in request.inputs():
input_value = input_tensor.as_numpy()
output_value = []
if self._model_instance_kind == "GPU" and cupy is not None:
with cupy.cuda.Device(self._model_instance_device_id):
input_value = cupy.array(input_value)
output_value = cupy.tile(input_value, input_copies)
output_value = cupy.invert(output_value)
output_tensor = pb_utils.Tensor.from_dlpack(
"output", output_value
)
else:
output_value = numpy.tile(input_value, input_copies)
output_value = numpy.invert(output_value)
output_tensor = pb_utils.Tensor("output", output_value)
output_tensors.append(output_tensor)
output_tensors.append(
pb_utils.Tensor(
"input_copies", numpy.array(input_copies).astype("int64")
)
)
time.sleep(len(output_value) * delay)
responses.append(pb_utils.InferenceResponse(output_tensors=output_tensors))
return responses
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
## Model Instance and Kind are filled in by configuration when launched
## All other values are filled in by auto_complete in model.py
backend: "python"
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import asyncio
import shutil
import sys
from pathlib import Path
import cupy
import numpy
from tqdm import tqdm
from tritonserver import MemoryType
from triton_distributed.icp.nats_request_plane import NatsRequestPlane
from triton_distributed.icp.ucp_data_plane import UcpDataPlane
from triton_distributed.runtime import (
Deployment,
Operator,
OperatorConfig,
RemoteInferenceRequest,
RemoteOperator,
TritonCoreOperator,
WorkerConfig,
)
class EncodeDecodeOperator(Operator):
def __init__(
self,
name,
version,
request_plane,
data_plane,
parameters,
repository,
logger,
triton_core,
):
self._encoder = RemoteOperator("encoder", request_plane, data_plane)
self._decoder = RemoteOperator("decoder", request_plane, data_plane)
self._logger = logger
async def execute(self, requests: list[RemoteInferenceRequest]):
for request in requests:
self._logger.info("got request!")
encoded_responses = await self._encoder.async_infer(
inputs={"input": request.inputs["input"]}
)
async for encoded_response in encoded_responses:
input_copies = int(
numpy.from_dlpack(encoded_response.outputs["input_copies"])
)
decoded_responses = await self._decoder.async_infer(
inputs={"input": encoded_response.outputs["output"]},
parameters={"input_copies": input_copies},
)
async for decoded_response in decoded_responses:
await request.response_sender().send(
final=True,
outputs={"output": decoded_response.outputs["output"]},
)
del decoded_response
async def send_requests(nats_server_url, request_count=10):
request_plane = NatsRequestPlane(nats_server_url)
data_plane = UcpDataPlane()
await request_plane.connect()
data_plane.connect()
remote_operator: RemoteOperator = RemoteOperator(
"encoder_decoder", request_plane, data_plane
)
inputs = [
numpy.array(numpy.random.randint(0, 100, 10000)).astype("int64")
for _ in range(request_count)
]
with tqdm(total=request_count, desc="Sending Requests", unit="request") as pbar:
requests = [
await remote_operator.async_infer(
inputs={"input": inputs[index]}, request_id=str(index)
)
for index in range(request_count)
]
for request in requests:
async for response in request:
for output_name, output_value in response.outputs.items():
if output_value.memory_type == MemoryType.CPU:
output = numpy.from_dlpack(output_value)
numpy.testing.assert_array_equal(
output, inputs[int(response.request_id)]
)
else:
output = cupy.from_dlpack(output_value)
cupy.testing.assert_array_equal(
output, inputs[int(response.request_id)]
)
del output_value
pbar.set_description(
f"Finished Request: {response.request_id} Response From: {response.component_id} Error: {response.error}"
)
pbar.update(1)
del response
await request_plane.close()
data_plane.close()
async def main():
module_dir = Path(__file__).parent.absolute()
log_dir = module_dir.joinpath("logs")
if log_dir.is_dir():
shutil.rmtree(log_dir)
log_dir.mkdir(exist_ok=True)
triton_core_models_dir = module_dir.joinpath("operators", "triton_core_models")
encoder_op = OperatorConfig(
name="encoder",
repository=str(triton_core_models_dir),
implementation=TritonCoreOperator,
max_inflight_requests=1,
parameters={
"config": {
"instance_group": [{"count": 1, "kind": "KIND_CPU"}],
"parameters": {"delay": {"string_value": "0"}},
}
},
)
decoder_op = OperatorConfig(
name="decoder",
repository=str(triton_core_models_dir),
implementation=TritonCoreOperator,
max_inflight_requests=1,
parameters={
"config": {
"instance_group": [{"count": 1, "kind": "KIND_GPU"}],
"parameters": {"delay": {"string_value": "0"}},
}
},
)
encoder_decoder_op = OperatorConfig(
name="encoder_decoder",
implementation=EncodeDecodeOperator,
max_inflight_requests=100,
)
encoder = WorkerConfig(
operators=[encoder_op],
name="encoder",
)
decoder = WorkerConfig(
operators=[decoder_op],
name="decoder",
)
encoder_decoder = WorkerConfig(
operators=[encoder_decoder_op],
name="encoder_decoder",
)
print("Starting Workers")
# You can configure the number of instances of each
# type of worker in a deployment
num_instances = 1
deployment = Deployment(
[
(encoder, num_instances),
(decoder, num_instances),
(encoder_decoder, num_instances),
],
initialize_request_plane=True,
log_dir=str(log_dir),
log_level=1,
)
deployment.start()
print("Sending Requests")
await send_requests(deployment.request_plane_server.url)
print("Stopping Workers")
sys.exit(deployment.stop())
if __name__ == "__main__":
asyncio.run(main())
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import subprocess
import pytest
# TODO
# Decide if this should be
# pre merge, nightly, or weekly
pytestmark = pytest.mark.pre_merge
@pytest.mark.skip("interactions with sanity test")
def test_single_file():
command = [
"python3",
"examples/hello_world/single_file.py",
]
process = subprocess.Popen(
command,
stdin=subprocess.DEVNULL,
)
try:
process.wait(60)
except subprocess.TimeoutExpired:
print("single file timed out!")
process.terminate()
process.kill()
assert process.returncode == 0, "Error in single file!"
def test_sanity():
deployment_command = [
"python3",
"-m",
"hello_world.deploy",
"--initialize-request-plane",
]
deployment_process = subprocess.Popen(
deployment_command,
stdin=subprocess.DEVNULL,
)
client_command = [
"python3",
"-m",
"hello_world.client",
"--requests-per-client",
"10",
]
client_process = subprocess.Popen(
client_command,
stdin=subprocess.DEVNULL,
)
try:
client_process.wait(timeout=60)
except subprocess.TimeoutExpired:
print("Client timed out!")
client_process.terminate()
client_process.wait()
client_process.terminate()
client_process.kill()
client_process.wait()
deployment_process.terminate()
deployment_process.wait()
assert client_process.returncode == 0, "Error in clients!"
assert deployment_process.returncode == 0, "Error starting deployment!"
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.