Commit d2e7ae02 authored by Ryan McCormick, committed by GitHub

refactor: Modify rust vllm example to use container


Co-authored-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>
parent 2cfb1b6d
@@ -25,8 +25,10 @@ USER root
# TODO: separate dev from runtime dependencies
# Rust build/dev dependencies
RUN apt-get update; apt-get install -y gdb protobuf-compiler
RUN curl https://sh.rustup.rs -sSf | bash -s -- -y
RUN pip install maturin
ENV PATH="/root/.cargo/bin:${PATH}"
# Install OpenAI-compatible frontend and its dependencies from triton server
@@ -117,7 +119,6 @@ ENV VLLM_GENERATE_WORKERS=${VLLM_FRAMEWORK:+1}
ENV VLLM_BASELINE_TP_SIZE=${VLLM_FRAMEWORK:+1}
ENV VLLM_CONTEXT_TP_SIZE=${VLLM_FRAMEWORK:+1}
ENV VLLM_GENERATE_TP_SIZE=${VLLM_FRAMEWORK:+1}
ENV VLLM_LOGGING_LEVEL=${VLLM_FRAMEWORK:+INFO}
ENV PYTHONUNBUFFERED=1
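The `${VLLM_FRAMEWORK:+1}` expansions in the `ENV` lines above use the shell's alternate-value form: `${VAR:+word}` yields `word` only when `VAR` is set and non-empty, so these defaults take effect only when a framework was selected at build time. A quick demonstration:

```shell
# ${VAR:+word} expands to "word" when VAR is set and non-empty, else to "".
VLLM_FRAMEWORK=VLLM
echo "framework set:   '${VLLM_FRAMEWORK:+1}'"   # framework set:   '1'
unset VLLM_FRAMEWORK
echo "framework unset: '${VLLM_FRAMEWORK:+1}'"   # framework unset: ''
```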
# Install NATS - pointing toward NATS github instead of binaries.nats.dev due to server instability
@@ -152,7 +153,17 @@ WORKDIR /workspace
# TODO: Exclude container directory
COPY . /workspace
RUN cd runtime/rust && cargo build --release --locked && cargo doc --no-deps
RUN cd runtime/rust && \
cargo build --release --locked && cargo doc --no-deps
# Create virtualenv
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
RUN mkdir /opt/triton && \
uv venv /opt/triton/venv --python 3.12 && \
source /opt/triton/venv/bin/activate && \
cd runtime/rust/python-wheel && \
maturin develop
RUN /workspace/icp/protos/gen_python.sh
# Install python packages
@@ -21,14 +21,7 @@ This example demonstrates how to use Triton Distributed to serve large language
## Prerequisites
1. Follow the setup instructions in the Python bindings [README](/runtime/rust/python-wheel/README.md) to prepare your environment
2. Install vLLM:
```bash
uv pip install vllm==0.7.2
```
3. Start required services (etcd and NATS):
Start required services (etcd and NATS):
Option A: Using [Docker Compose](/runtime/rust/docker-compose.yml) (Recommended)
```bash
@@ -42,6 +35,26 @@ This example demonstrates how to use Triton Distributed to serve large language
- [etcd](https://etcd.io) server
- follow instructions in [etcd installation](https://etcd.io/docs/v3.5/install/) to start an `etcd-server` locally
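For reference, the linked compose file runs both services together; a minimal sketch of such a file might look like the following (the images, tags, and port mappings here are assumptions, not the repository's exact contents):

```yaml
services:
  nats-server:
    image: nats          # official NATS image (tag assumed)
    command: ["-js"]     # enable JetStream
    ports:
      - "4222:4222"
  etcd-server:
    image: bitnami/etcd  # community etcd image (assumption)
    environment:
      - ALLOW_NONE_AUTHENTICATION=yes   # dev-only: disable auth
    ports:
      - "2379:2379"
```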
## Building the Environment
The example is designed to run in a containerized environment using Triton Distributed, vLLM, and associated dependencies. To build the container:
```bash
# Build image
./container/build.sh
```
## Launching the Environment
```bash
# Run image interactively
./container/run.sh -it
# Add vllm into the python virtual environment
source /opt/triton/venv/bin/activate
uv pip install vllm==0.7.2
```
## Deployment Options
### 1. Monolithic Deployment
@@ -50,6 +63,11 @@ Run the server and client components in separate terminal sessions:
**Terminal 1 - Server:**
```bash
# Activate virtual environment
source /opt/triton/venv/bin/activate
# Launch worker
cd /workspace/examples/python_rs/llm/vllm
python3 -m monolith.worker \
--model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--max-model-len 100 \
@@ -58,6 +76,11 @@ python3 -m monolith.worker \
**Terminal 2 - Client:**
```bash
# Activate virtual environment
source /opt/triton/venv/bin/activate
# Run client
cd /workspace/examples/python_rs/llm/vllm
python3 -m common.client \
--prompt "what is the capital of france?" \
--max-tokens 10 \
@@ -85,6 +108,11 @@ This deployment option splits the model serving across prefill and decode workers
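As a toy illustration of why the two phases can live in separate workers (this is not vLLM's implementation; all names below are invented for the sketch), prefill processes the whole prompt once to build per-token state, and decode then generates tokens step by step from that state:

```python
# Toy sketch of disaggregated serving: prefill builds per-token state
# (the KV cache in a real engine) in one pass over the prompt; decode
# then runs the autoregressive loop against that state, so the phases
# can be placed on separate workers/GPUs.

def prefill(prompt_tokens):
    """One pass over the full prompt, producing a stand-in 'KV cache'."""
    return [len(tok) % 997 for tok in prompt_tokens]

def decode(state, max_tokens):
    """Autoregressive loop: each step reads the state and extends it."""
    generated = []
    for _ in range(max_tokens):
        nxt = sum(state) % 997  # stand-in for the model's next-token step
        generated.append(nxt)
        state = state + [nxt]
    return generated

state = prefill(["what", "is", "the", "capital", "of", "france?"])
print(decode(state, max_tokens=3))  # -> [25, 50, 100]
```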
**Terminal 1 - Prefill Worker:**
```bash
# Activate virtual environment
source /opt/triton/venv/bin/activate
# Launch prefill worker
cd /workspace/examples/python_rs/llm/vllm
CUDA_VISIBLE_DEVICES=0 python3 -m disaggregated.prefill_worker \
--model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--max-model-len 100 \
@@ -96,6 +124,11 @@ CUDA_VISIBLE_DEVICES=0 python3 -m disaggregated.prefill_worker \
**Terminal 2 - Decode Worker:**
```bash
# Activate virtual environment
source /opt/triton/venv/bin/activate
# Launch decode worker
cd /workspace/examples/python_rs/llm/vllm
CUDA_VISIBLE_DEVICES=1 python3 -m disaggregated.decode_worker \
--model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--max-model-len 100 \
@@ -107,6 +140,11 @@ CUDA_VISIBLE_DEVICES=1 python3 -m disaggregated.decode_worker \
**Terminal 3 - Client:**
```bash
# Activate virtual environment
source /opt/triton/venv/bin/activate
# Run client
cd /workspace/examples/python_rs/llm/vllm
python3 -m common.client \
--prompt "what is the capital of france?" \
--max-tokens 10 \