Commit d2e7ae02 authored by Ryan McCormick, committed by GitHub

refactor: Modify rust vllm example to use container


Co-authored-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>
parent 2cfb1b6d
@@ -25,8 +25,10 @@ USER root
# TODO: separate dev from runtime dependencies
# Rust build/dev dependencies
RUN apt-get update; apt-get install -y gdb protobuf-compiler
RUN curl https://sh.rustup.rs -sSf | bash -s -- -y
RUN pip install maturin
ENV PATH="/root/.cargo/bin:${PATH}"
# Install OpenAI-compatible frontend and its dependencies from triton server
@@ -117,7 +119,6 @@ ENV VLLM_GENERATE_WORKERS=${VLLM_FRAMEWORK:+1}
ENV VLLM_BASELINE_TP_SIZE=${VLLM_FRAMEWORK:+1}
ENV VLLM_CONTEXT_TP_SIZE=${VLLM_FRAMEWORK:+1}
ENV VLLM_GENERATE_TP_SIZE=${VLLM_FRAMEWORK:+1}
ENV VLLM_LOGGING_LEVEL=${VLLM_FRAMEWORK:+INFO}
ENV PYTHONUNBUFFERED=1
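The `${VLLM_FRAMEWORK:+1}` expansions in the `ENV` lines above use the shell's alternate-value form: `${VAR:+word}` yields `word` only when `VAR` is set and non-empty, so these defaults take effect only when a framework was selected at build time. A quick demonstration:

```shell
# ${VAR:+word} expands to "word" when VAR is set and non-empty, else to "".
VLLM_FRAMEWORK=VLLM
echo "framework set:   '${VLLM_FRAMEWORK:+1}'"   # framework set:   '1'
unset VLLM_FRAMEWORK
echo "framework unset: '${VLLM_FRAMEWORK:+1}'"   # framework unset: ''
```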
# Install NATS - pointing toward NATS github instead of binaries.nats.dev due to server instability
@@ -152,7 +153,17 @@ WORKDIR /workspace
# TODO: Exclude container directory
COPY . /workspace
RUN cd runtime/rust && cargo build --release --locked && cargo doc --no-deps
RUN cd runtime/rust && \
cargo build --release --locked && cargo doc --no-deps
# Create virtualenv
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
RUN mkdir /opt/triton && \
uv venv /opt/triton/venv --python 3.12 && \
source /opt/triton/venv/bin/activate && \
cd runtime/rust/python-wheel && \
maturin develop
RUN /workspace/icp/protos/gen_python.sh
# Install python packages
@@ -21,14 +21,7 @@ This example demonstrates how to use Triton Distributed to serve large language
## Prerequisites
1. Follow the setup instructions in the Python bindings [README](/runtime/rust/python-wheel/README.md) to prepare your environment
2. Install vLLM:
```bash
uv pip install vllm==0.7.2
```
3. Start required services (etcd and NATS):
Start required services (etcd and NATS):
Option A: Using [Docker Compose](/runtime/rust/docker-compose.yml) (Recommended)
```bash
@@ -42,6 +35,26 @@ This example demonstrates how to use Triton Distributed to serve large language
- [etcd](https://etcd.io) server
- follow instructions in [etcd installation](https://etcd.io/docs/v3.5/install/) to start an `etcd-server` locally
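For reference, the linked compose file runs both services together; a minimal sketch of such a file might look like the following (the images, tags, and port mappings here are assumptions, not the repository's exact contents):

```yaml
services:
  nats-server:
    image: nats          # official NATS image (tag assumed)
    command: ["-js"]     # enable JetStream
    ports:
      - "4222:4222"
  etcd-server:
    image: bitnami/etcd  # community etcd image (assumption)
    environment:
      - ALLOW_NONE_AUTHENTICATION=yes   # dev-only: disable auth
    ports:
      - "2379:2379"
```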
## Building the Environment
The example is designed to run in a containerized environment using Triton Distributed, vLLM, and associated dependencies. To build the container:
```bash
# Build image
./container/build.sh
```
## Launching the Environment
```bash
# Run image interactively
./container/run.sh -it
# Add vllm into the python virtual environment
source /opt/triton/venv/bin/activate
uv pip install vllm==0.7.2
```
## Deployment Options
### 1. Monolithic Deployment
@@ -50,6 +63,11 @@ Run the server and client components in separate terminal sessions:
**Terminal 1 - Server:**
```bash
# Activate virtual environment
source /opt/triton/venv/bin/activate
# Launch worker
cd /workspace/examples/python_rs/llm/vllm
python3 -m monolith.worker \
--model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--max-model-len 100 \
@@ -58,6 +76,11 @@ python3 -m monolith.worker \
**Terminal 2 - Client:**
```bash
# Activate virtual environment
source /opt/triton/venv/bin/activate
# Run client
cd /workspace/examples/python_rs/llm/vllm
python3 -m common.client \
--prompt "what is the capital of france?" \
--max-tokens 10 \
@@ -85,6 +108,11 @@ This deployment option splits the model serving across prefill and decode workers
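As a toy illustration of why the two phases can live in separate workers (this is not vLLM's implementation; all names below are invented for the sketch), prefill processes the whole prompt once to build per-token state, and decode then generates tokens step by step from that state:

```python
# Toy sketch of disaggregated serving: prefill builds per-token state
# (the KV cache in a real engine) in one pass over the prompt; decode
# then runs the autoregressive loop against that state, so the phases
# can be placed on separate workers/GPUs.

def prefill(prompt_tokens):
    """One pass over the full prompt, producing a stand-in 'KV cache'."""
    return [len(tok) % 997 for tok in prompt_tokens]

def decode(state, max_tokens):
    """Autoregressive loop: each step reads the state and extends it."""
    generated = []
    for _ in range(max_tokens):
        nxt = sum(state) % 997  # stand-in for the model's next-token step
        generated.append(nxt)
        state = state + [nxt]
    return generated

state = prefill(["what", "is", "the", "capital", "of", "france?"])
print(decode(state, max_tokens=3))  # -> [25, 50, 100]
```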
**Terminal 1 - Prefill Worker:**
```bash
# Activate virtual environment
source /opt/triton/venv/bin/activate
# Launch prefill worker
cd /workspace/examples/python_rs/llm/vllm
CUDA_VISIBLE_DEVICES=0 python3 -m disaggregated.prefill_worker \
--model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--max-model-len 100 \
@@ -96,6 +124,11 @@ CUDA_VISIBLE_DEVICES=0 python3 -m disaggregated.prefill_worker \
**Terminal 2 - Decode Worker:**
```bash
# Activate virtual environment
source /opt/triton/venv/bin/activate
# Launch decode worker
cd /workspace/examples/python_rs/llm/vllm
CUDA_VISIBLE_DEVICES=1 python3 -m disaggregated.decode_worker \
--model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--max-model-len 100 \
@@ -107,6 +140,11 @@ CUDA_VISIBLE_DEVICES=1 python3 -m disaggregated.decode_worker \
**Terminal 3 - Client:**
```bash
# Activate virtual environment
source /opt/triton/venv/bin/activate
# Run client
cd /workspace/examples/python_rs/llm/vllm
python3 -m common.client \
--prompt "what is the capital of france?" \
--max-tokens 10 \