- For FP8 usage, GPUs with **Compute Capability >= 8.9** are required (a quick way to check this is shown after this list).
- If you have older GPUs, consider BF16/FP16 precision variants instead of `FP8`. (See [below](#model-precision-variants).)
5. **HuggingFace**
- You need a HuggingFace account to download the model, and you must set the `HF_TOKEN` environment variable.
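To verify the compute-capability requirement above, you can query it directly with `nvidia-smi`; note that this is only a convenience check, and the `compute_cap` query field requires a reasonably recent NVIDIA driver.

```bash
# Print the name and compute capability of each visible GPU.
# Older drivers may not support the compute_cap field.
nvidia-smi --query-gpu=name,compute_cap --format=csv
```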
---
## 2. Building the Environment
...
The example is designed to run in a containerized environment using Triton Distributed, vLLM, and associated dependencies. To build the container:
```bash
./container/build.sh --framework vllm
```
This command pulls necessary dependencies and patches vLLM in the container image.
...
Below is a minimal example of how to start each component of a disaggregated serving setup. The typical sequence is:
1. **Start the API Server** (handles incoming requests and coordinates workers)
2. **Start the Context (prefill) Worker(s) and the Request Plane**
3. **Start the Generate (decode) Worker(s)**

All components must be able to connect to the same request plane (a NATS server) to coordinate.
### 3.1 HuggingFace Token

Export your HuggingFace token so the model can be downloaded:

```bash
export HF_TOKEN=<YOUR TOKEN>
```
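As an optional sanity check, you can confirm the token is valid before the (large) model download begins; this assumes the `huggingface_hub` CLI is installed on the host.

```bash
# Verifies that the token in HF_TOKEN grants access to your HuggingFace account.
huggingface-cli whoami
```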
### 3.2 Launch Interactive Environment

```bash
./container/run.sh --framework vllm -it
```

Note: all subsequent commands will be run in the same container for simplicity.

Note: by default this command makes all GPU devices visible. Use the `--gpus` flag to selectively make GPU devices visible.

### 3.3 API Server

The API server in a vLLM-disaggregated setup listens for OpenAI-compatible requests on a chosen port (default 8005). Below is an example command:

```bash
python3 -m examples.api_server \
    --nats-url nats://localhost:4223 \
    --log-level INFO \
    --port 8005
```
### 3.4 Launch Context Worker and Request Plane

The context (prefill) stage encodes incoming prompts. By default, vLLM uses GPU resources to tokenize and prepare the model's key-value (KV) caches.

Within the container, start the context worker and the request plane. Once the model has loaded, the worker logs output similar to:

```
INFO 01-24 09:21:22 model_runner.py:1406] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-24 09:21:22 model_runner.py:1410] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
```
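With the API server and workers running, you can exercise the endpoint directly. Below is a minimal sketch of a streaming request; it assumes the default port 8005, the standard OpenAI `/v1/chat/completions` route, and the model name that appears in the sample output below.

```bash
# Hypothetical streaming chat completion request; adjust the port, route,
# and model name to match your deployment.
curl -N http://localhost:8005/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "stream": true
      }'
```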
A successful streaming request returns chunks such as:

```
data: {"id":"052eabe0-fc54-4f7c-9be8-4926523b26fc","choices":[{"delta":{"content":"The capital of France is Paris.","role":"assistant"},"logprobs":null,"finish_reason":null,"index":0}],"created":1737711185,"model":"llama","system_fingerprint":"052eabe0-fc54-4f7c-9be8-4926523b26fc","object":"chat.completion.chunk"}
```
You can benchmark this setup using [**GenAI-Perf**](https://github.com/triton-inference-server/perf_analyzer/blob/main/genai-perf/README.md), which supports OpenAI endpoints for chat or completion requests.
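For a quick load test, an invocation along the following lines targets the chat endpoint. The flags shown are a sketch and vary between GenAI-Perf releases, so check `genai-perf --help` or the linked README for your version.

```bash
# Hypothetical GenAI-Perf run against the OpenAI-compatible chat endpoint.
genai-perf profile \
  -m llama \
  --service-kind openai \
  --endpoint-type chat \
  --url localhost:8005 \
  --streaming \
  --concurrency 4
```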
...