Commit d2e7ae02 authored by Ryan McCormick, committed by GitHub

refactor: Modify rust vllm example to use container


Co-authored-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>
parent 2cfb1b6d
@@ -25,8 +25,10 @@ USER root
 # TODO: separate dev from runtime dependencies
+# Rust build/dev dependencies
 RUN apt-get update; apt-get install -y gdb protobuf-compiler
 RUN curl https://sh.rustup.rs -sSf | bash -s -- -y
+RUN pip install maturin
 ENV PATH="/root/.cargo/bin:${PATH}"
 # Install OpenAI-compatible frontend and its dependencies from triton server
@@ -117,7 +119,6 @@ ENV VLLM_GENERATE_WORKERS=${VLLM_FRAMEWORK:+1}
 ENV VLLM_BASELINE_TP_SIZE=${VLLM_FRAMEWORK:+1}
 ENV VLLM_CONTEXT_TP_SIZE=${VLLM_FRAMEWORK:+1}
 ENV VLLM_GENERATE_TP_SIZE=${VLLM_FRAMEWORK:+1}
-ENV VLLM_LOGGING_LEVEL=${VLLM_FRAMEWORK:+INFO}
 ENV PYTHONUNBUFFERED=1
 # Install NATS - pointing toward NATS github instead of binaries.nats.dev due to server instability
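The `${VLLM_FRAMEWORK:+1}` defaults in the hunk above rely on POSIX shell parameter expansion; a small illustrative sketch of the `:+` form (variable names here are only for demonstration):

```shell
# ${VAR:+word} expands to "word" only when VAR is set and non-empty,
# so these ENV values take effect only when a framework is selected.
VLLM_FRAMEWORK=1
echo "level=${VLLM_FRAMEWORK:+INFO}"   # prints: level=INFO

unset VLLM_FRAMEWORK
echo "level=${VLLM_FRAMEWORK:+INFO}"   # prints: level=
```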
@@ -152,7 +153,17 @@ WORKDIR /workspace
 #TODO Exclude container directory
 COPY . /workspace
-RUN cd runtime/rust && cargo build --release --locked && cargo doc --no-deps
+RUN cd runtime/rust && \
+    cargo build --release --locked && cargo doc --no-deps
+
+# Create virtualenv
+COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
+RUN mkdir /opt/triton && \
+    uv venv /opt/triton/venv --python 3.12 && \
+    source /opt/triton/venv/bin/activate && \
+    cd runtime/rust/python-wheel && \
+    maturin develop
 RUN /workspace/icp/protos/gen_python.sh
 # Install python packages
......
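The new virtualenv step above activates the environment before running `maturin develop`. A minimal sketch of what activation does, assuming a local `python3` with the stdlib `venv` module and using `/tmp/demo-venv` as a stand-in path:

```shell
# Create a throwaway venv (--without-pip keeps it dependency-free)
# and activate it: activation prepends the venv's bin/ directory to
# PATH, so later python/maturin invocations resolve inside the venv.
python3 -m venv --without-pip /tmp/demo-venv
. /tmp/demo-venv/bin/activate
case "$(command -v python)" in
  /tmp/demo-venv/bin/python) echo "venv active" ;;
  *) echo "venv not active" ;;
esac
```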
@@ -21,14 +21,7 @@ This example demonstrates how to use Triton Distributed to serve large language
 ## Prerequisites
-1. Follow the setup instructions in the Python bindings [README](/runtime/rust/python-wheel/README.md) to prepare your environment
-2. Install vLLM:
-   ```bash
-   uv pip install vllm==0.7.2
-   ```
-3. Start required services (etcd and NATS):
+Start required services (etcd and NATS):
 Option A: Using [Docker Compose](/runtime/rust/docker-compose.yml) (Recommended)
 ```bash
@@ -42,6 +35,26 @@ This example demonstrates how to use Triton Distributed to serve large language
 - [etcd](https://etcd.io) server
   - follow instructions in [etcd installation](https://etcd.io/docs/v3.5/install/) to start an `etcd-server` locally
+
+## Building the Environment
+
+The example is designed to run in a containerized environment with Triton Distributed, vLLM, and their dependencies. To build the container:
+
+```bash
+# Build image
+./container/build.sh
+```
+
+## Launching the Environment
+
+```bash
+# Run image interactively
+./container/run.sh -it
+
+# Add vLLM to the Python virtual environment
+source /opt/triton/venv/bin/activate
+uv pip install vllm==0.7.2
+```
+
 ## Deployment Options
### 1. Monolithic Deployment

@@ -50,6 +63,11 @@ Run the server and client components in separate terminal sessions:
 **Terminal 1 - Server:**
 ```bash
+# Activate virtual environment
+source /opt/triton/venv/bin/activate
+
+# Launch worker
+cd /workspace/examples/python_rs/llm/vllm
 python3 -m monolith.worker \
     --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
     --max-model-len 100 \
@@ -58,6 +76,11 @@ python3 -m monolith.worker \
 **Terminal 2 - Client:**
 ```bash
+# Activate virtual environment
+source /opt/triton/venv/bin/activate
+
+# Run client
+cd /workspace/examples/python_rs/llm/vllm
 python3 -m common.client \
     --prompt "what is the capital of france?" \
     --max-tokens 10 \
@@ -85,6 +108,11 @@ This deployment option splits the model serving across prefill and decode workers
 **Terminal 1 - Prefill Worker:**
 ```bash
+# Activate virtual environment
+source /opt/triton/venv/bin/activate
+
+# Launch prefill worker
+cd /workspace/examples/python_rs/llm/vllm
 CUDA_VISIBLE_DEVICES=0 python3 -m disaggregated.prefill_worker \
     --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
     --max-model-len 100 \
@@ -96,6 +124,11 @@ CUDA_VISIBLE_DEVICES=0 python3 -m disaggregated.prefill_worker \
 **Terminal 2 - Decode Worker:**
 ```bash
+# Activate virtual environment
+source /opt/triton/venv/bin/activate
+
+# Launch decode worker
+cd /workspace/examples/python_rs/llm/vllm
 CUDA_VISIBLE_DEVICES=1 python3 -m disaggregated.decode_worker \
     --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
     --max-model-len 100 \
@@ -107,6 +140,11 @@ CUDA_VISIBLE_DEVICES=1 python3 -m disaggregated.decode_worker \
 **Terminal 3 - Client:**
 ```bash
+# Activate virtual environment
+source /opt/triton/venv/bin/activate
+
+# Run client
+cd /workspace/examples/python_rs/llm/vllm
 python3 -m common.client \
     --prompt "what is the capital of france?" \
     --max-tokens 10 \
......
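The disaggregated hunks above pin the prefill and decode workers to separate GPUs by setting `CUDA_VISIBLE_DEVICES` as a per-command prefix; a minimal sketch of that per-process environment behavior:

```shell
# Prefixing a command with VAR=value sets the variable only in that
# child process's environment, which is how each worker sees one GPU.
CUDA_VISIBLE_DEVICES=0 sh -c 'echo "prefill sees GPU(s): $CUDA_VISIBLE_DEVICES"'
CUDA_VISIBLE_DEVICES=1 sh -c 'echo "decode sees GPU(s): $CUDA_VISIBLE_DEVICES"'

# The parent shell's environment is left untouched by the prefixes.
echo "parent shell: ${CUDA_VISIBLE_DEVICES:-unset}"
```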