Unverified Commit 32aa2059 authored by Rafael Vasquez's avatar Rafael Vasquez Committed by GitHub
Browse files

[Docs] Convert rST to MyST (Markdown) (#11145)


Signed-off-by: default avatarRafael Vasquez <rafvasq21@gmail.com>
parent 94d545a1
(int8)=
# INT8 W8A8
vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
This quantization method is particularly useful for reducing model size while maintaining good performance.
Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415).
```{note}
INT8 computation is supported on NVIDIA GPUs with compute capability > 7.5 (Turing, Ampere, Ada Lovelace, Hopper).
```
## Prerequisites
To use INT8 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
```console
$ pip install llmcompressor
```
## Quantization Process
The quantization process involves four main steps:
1. Loading the model
2. Preparing calibration data
3. Applying quantization
4. Evaluating accuracy in vLLM
### 1. Loading the Model
Use `SparseAutoModelForCausalLM`, which wraps `AutoModelForCausalLM`, for saving and loading quantized models:
```python
from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
model = SparseAutoModelForCausalLM.from_pretrained(
MODEL_ID, device_map="auto", torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
```
### 2. Preparing Calibration Data
When quantizing activations to INT8, you need sample data to estimate the activation scales.
It's best to use calibration data that closely matches your deployment data.
For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:
```python
from datasets import load_dataset
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048
# Load and preprocess the dataset
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
def preprocess(example):
return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
ds = ds.map(preprocess)
def tokenize(sample):
return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
ds = ds.map(tokenize, remove_columns=ds.column_names)
```
### 3. Applying Quantization
Now, apply the quantization algorithms:
```python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
# Configure the quantization algorithms
recipe = [
SmoothQuantModifier(smoothing_strength=0.8),
GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]
# Apply quantization
oneshot(
model=model,
dataset=ds,
recipe=recipe,
max_seq_length=MAX_SEQUENCE_LENGTH,
num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
# Save the compressed model
SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
This process creates a W8A8 model with weights and activations quantized to 8-bit integers.
### 4. Evaluating Accuracy
After quantization, you can load and run the model in vLLM:
```python
from vllm import LLM
model = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token")
```
To evaluate accuracy, you can use `lm_eval`:
```console
$ lm_eval --model vllm \
--model_args pretrained="./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token",add_bos_token=true \
--tasks gsm8k \
--num_fewshot 5 \
--limit 250 \
--batch_size 'auto'
```
```{note}
Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
```
## Best Practices
- Start with 512 samples for calibration data (increase if accuracy drops)
- Use a sequence length of 2048 as a starting point
- Employ the chat template or instruction template that the model was trained with
- If you've fine-tuned a model, consider using a sample of your training data for calibration
## Troubleshooting and Support
If you encounter any issues or have feature requests, please open an issue on the `vllm-project/llm-compressor` GitHub repository.
.. _int8:
INT8 W8A8
==================
vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
This quantization method is particularly useful for reducing model size while maintaining good performance.
Please visit the HF collection of `quantized INT8 checkpoints of popular LLMs ready to use with vLLM <https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415>`_.
.. note::
INT8 computation is supported on NVIDIA GPUs with compute capability > 7.5 (Turing, Ampere, Ada Lovelace, Hopper).
Prerequisites
-------------
To use INT8 quantization with vLLM, you'll need to install the `llm-compressor <https://github.com/vllm-project/llm-compressor/>`_ library:
.. code-block:: console
$ pip install llmcompressor
Quantization Process
--------------------
The quantization process involves four main steps:
1. Loading the model
2. Preparing calibration data
3. Applying quantization
4. Evaluating accuracy in vLLM
1. Loading the Model
^^^^^^^^^^^^^^^^^^^^
Use ``SparseAutoModelForCausalLM``, which wraps ``AutoModelForCausalLM``, for saving and loading quantized models:
.. code-block:: python
from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
model = SparseAutoModelForCausalLM.from_pretrained(
MODEL_ID, device_map="auto", torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
2. Preparing Calibration Data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When quantizing activations to INT8, you need sample data to estimate the activation scales.
It's best to use calibration data that closely matches your deployment data.
For a general-purpose instruction-tuned model, you can use a dataset like ``ultrachat``:
.. code-block:: python
from datasets import load_dataset
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048
# Load and preprocess the dataset
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
def preprocess(example):
return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
ds = ds.map(preprocess)
def tokenize(sample):
return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
ds = ds.map(tokenize, remove_columns=ds.column_names)
3. Applying Quantization
^^^^^^^^^^^^^^^^^^^^^^^^
Now, apply the quantization algorithms:
.. code-block:: python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
# Configure the quantization algorithms
recipe = [
SmoothQuantModifier(smoothing_strength=0.8),
GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]
# Apply quantization
oneshot(
model=model,
dataset=ds,
recipe=recipe,
max_seq_length=MAX_SEQUENCE_LENGTH,
num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
# Save the compressed model
SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
This process creates a W8A8 model with weights and activations quantized to 8-bit integers.
4. Evaluating Accuracy
^^^^^^^^^^^^^^^^^^^^^^
After quantization, you can load and run the model in vLLM:
.. code-block:: python
from vllm import LLM
model = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token")
To evaluate accuracy, you can use ``lm_eval``:
.. code-block:: console
$ lm_eval --model vllm \
--model_args pretrained="./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token",add_bos_token=true \
--tasks gsm8k \
--num_fewshot 5 \
--limit 250 \
--batch_size 'auto'
.. note::
Quantized models can be sensitive to the presence of the ``bos`` token. Make sure to include the ``add_bos_token=True`` argument when running evaluations.
Best Practices
--------------
- Start with 512 samples for calibration data (increase if accuracy drops)
- Use a sequence length of 2048 as a starting point
- Employ the chat template or instruction template that the model was trained with
- If you've fine-tuned a model, consider using a sample of your training data for calibration
Troubleshooting and Support
---------------------------
If you encounter any issues or have feature requests, please open an issue on the ``vllm-project/llm-compressor`` GitHub repository.
.. _supported_hardware_for_quantization:
Supported Hardware for Quantization Kernels
===========================================
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
.. list-table::
:header-rows: 1
:widths: 20 8 8 8 8 8 8 8 8 8 8
* - Implementation
- Volta
- Turing
- Ampere
- Ada
- Hopper
- AMD GPU
- Intel GPU
- x86 CPU
- AWS Inferentia
- Google TPU
* - AWQ
- ✗
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✅︎
- ✅︎
- ✗
- ✗
* - GPTQ
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✅︎
- ✅︎
- ✗
- ✗
* - Marlin (GPTQ/AWQ/FP8)
- ✗
- ✗
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
- ✗
* - INT8 (W8A8)
- ✗
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✅︎
- ✗
- ✗
* - FP8 (W8A8)
- ✗
- ✗
- ✗
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
* - AQLM
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
- ✗
* - bitsandbytes
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
- ✗
* - DeepSpeedFP
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
- ✗
* - GGUF
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
- ✗
Notes:
^^^^^^
- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- "✅︎" indicates that the quantization method is supported on the specified hardware.
- "✗" indicates that the quantization method is not supported on the specified hardware.
Please note that this compatibility chart may be subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
For the most up-to-date information on hardware support and quantization methods, please check the `quantization directory <https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/layers/quantization>`_ or consult with the vLLM development team.
(supported-hardware-for-quantization)=
# Supported Hardware for Quantization Kernels
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
```{eval-rst}
.. list-table::
:header-rows: 1
:widths: 20 8 8 8 8 8 8 8 8 8 8
* - Implementation
- Volta
- Turing
- Ampere
- Ada
- Hopper
- AMD GPU
- Intel GPU
- x86 CPU
- AWS Inferentia
- Google TPU
* - AWQ
- ✗
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✅︎
- ✅︎
- ✗
- ✗
* - GPTQ
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✅︎
- ✅︎
- ✗
- ✗
* - Marlin (GPTQ/AWQ/FP8)
- ✗
- ✗
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
- ✗
* - INT8 (W8A8)
- ✗
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✅︎
- ✗
- ✗
* - FP8 (W8A8)
- ✗
- ✗
- ✗
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
* - AQLM
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
- ✗
* - bitsandbytes
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
- ✗
* - DeepSpeedFP
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
- ✗
* - GGUF
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
- ✗
```
## Notes:
- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- "✅︎" indicates that the quantization method is supported on the specified hardware.
- "✗" indicates that the quantization method is not supported on the specified hardware.
Please note that this compatibility chart may be subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
For the most up-to-date information on hardware support and quantization methods, please check the [quantization directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/layers/quantization) or consult with the vLLM development team.
(deploying-with-bentoml)=
# Deploying with BentoML
[BentoML](https://github.com/bentoml/BentoML) allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-complicant image and deploy it on Kubernetes.
For details, see the tutorial [vLLM inference in the BentoML documentation](https://docs.bentoml.com/en/latest/use-cases/large-language-models/vllm.html).
.. _deploying_with_bentoml:
Deploying with BentoML
======================
`BentoML <https://github.com/bentoml/BentoML>`_ allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-complicant image and deploy it on Kubernetes.
For details, see the tutorial `vLLM inference in the BentoML documentation <https://docs.bentoml.com/en/latest/use-cases/large-language-models/vllm.html>`_.
\ No newline at end of file
(deploying-with-cerebrium)=
# Deploying with Cerebrium
```{raw} html
<p align="center">
<img src="https://i.ibb.co/hHcScTT/Screenshot-2024-06-13-at-10-14-54.png" alt="vLLM_plus_cerebrium"/>
</p>
```
vLLM can be run on a cloud based GPU machine with [Cerebrium](https://www.cerebrium.ai/), a serverless AI infrastructure platform that makes it easier for companies to build and deploy AI based applications.
To install the Cerebrium client, run:
```console
$ pip install cerebrium
$ cerebrium login
```
Next, create your Cerebrium project, run:
```console
$ cerebrium init vllm-project
```
Next, to install the required packages, add the following to your cerebrium.toml:
```toml
[cerebrium.deployment]
docker_base_image_url = "nvidia/cuda:12.1.1-runtime-ubuntu22.04"
[cerebrium.dependencies.pip]
vllm = "latest"
```
Next, let us add our code to handle inference for the LLM of your choice(`mistralai/Mistral-7B-Instruct-v0.1` for this example), add the following code to your main.py\`:
```python
from vllm import LLM, SamplingParams
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")
def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
results = []
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
results.append({"prompt": prompt, "generated_text": generated_text})
return {"results": results}
```
Then, run the following code to deploy it to the cloud
```console
$ cerebrium deploy
```
If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case /run)
```python
curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
-H 'Content-Type: application/json' \
-H 'Authorization: <JWT TOKEN>' \
--data '{
"prompts": [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is"
]
}'
```
You should get a response like:
```python
{
"run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
"result": {
"result": [
{
"prompt": "Hello, my name is",
"generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of"
},
{
"prompt": "The president of the United States is",
"generated_text": " elected every four years. This is a democratic system.\n\n5. What"
},
{
"prompt": "The capital of France is",
"generated_text": " Paris.\n"
},
{
"prompt": "The future of AI is",
"generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective."
}
]
},
"run_time_ms": 152.53663063049316
}
```
You now have an autoscaling endpoint where you only pay for the compute you use!
.. _deploying_with_cerebrium:
Deploying with Cerebrium
============================
.. raw:: html
<p align="center">
<img src="https://i.ibb.co/hHcScTT/Screenshot-2024-06-13-at-10-14-54.png" alt="vLLM_plus_cerebrium"/>
</p>
vLLM can be run on a cloud based GPU machine with `Cerebrium <https://www.cerebrium.ai/>`__, a serverless AI infrastructure platform that makes it easier for companies to build and deploy AI based applications.
To install the Cerebrium client, run:
.. code-block:: console
$ pip install cerebrium
$ cerebrium login
Next, create your Cerebrium project, run:
.. code-block:: console
$ cerebrium init vllm-project
Next, to install the required packages, add the following to your cerebrium.toml:
.. code-block:: toml
[cerebrium.deployment]
docker_base_image_url = "nvidia/cuda:12.1.1-runtime-ubuntu22.04"
[cerebrium.dependencies.pip]
vllm = "latest"
Next, let us add our code to handle inference for the LLM of your choice(`mistralai/Mistral-7B-Instruct-v0.1` for this example), add the following code to your main.py`:
.. code-block:: python
from vllm import LLM, SamplingParams
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")
def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
results = []
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
results.append({"prompt": prompt, "generated_text": generated_text})
return {"results": results}
Then, run the following code to deploy it to the cloud
.. code-block:: console
$ cerebrium deploy
If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case /run)
.. code-block:: python
curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
-H 'Content-Type: application/json' \
-H 'Authorization: <JWT TOKEN>' \
--data '{
"prompts": [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is"
]
}'
You should get a response like:
.. code-block:: python
{
"run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
"result": {
"result": [
{
"prompt": "Hello, my name is",
"generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of"
},
{
"prompt": "The president of the United States is",
"generated_text": " elected every four years. This is a democratic system.\n\n5. What"
},
{
"prompt": "The capital of France is",
"generated_text": " Paris.\n"
},
{
"prompt": "The future of AI is",
"generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective."
}
]
},
"run_time_ms": 152.53663063049316
}
You now have an autoscaling endpoint where you only pay for the compute you use!
(deploying-with-docker)=
# Deploying with Docker
## Use vLLM's Official Docker Image
vLLM offers an official Docker image for deployment.
The image can be used to run OpenAI compatible server and is available on Docker Hub as [vllm/vllm-openai](https://hub.docker.com/r/vllm/vllm-openai/tags).
```console
$ docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model mistralai/Mistral-7B-v0.1
```
```{note}
You can either use the `ipc=host` flag or `--shm-size` flag to allow the
container to access the host's shared memory. vLLM uses PyTorch, which uses shared
memory to share data between processes under the hood, particularly for tensor parallel inference.
```
## Building vLLM's Docker Image from Source
You can build and run vLLM from source via the provided [Dockerfile](https://github.com/vllm-project/vllm/blob/main/Dockerfile). To build vLLM:
```console
$ # optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2
$ DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai
```
```{note}
By default vLLM will build for all GPU types for widest distribution. If you are just building for the
current GPU type the machine is running on, you can add the argument `--build-arg torch_cuda_arch_list=""`
for vLLM to find the current GPU type and build for that.
```
## Building for Arm64/aarch64
A docker container can be built for aarch64 systems such as the Nvidia Grace-Hopper. At time of this writing, this requires the use
of PyTorch Nightly and should be considered **experimental**. Using the flag `--platform "linux/arm64"` will attempt to build for arm64.
```{note}
Multiple modules must be compiled, so this process can take a while. Recommend using `--build-arg max_jobs=` & `--build-arg nvcc_threads=`
flags to speed up build process. However, ensure your `max_jobs` is substantially larger than `nvcc_threads` to get the most benefits.
Keep an eye on memory usage with parallel jobs as it can be substantial (see example below).
```
```console
# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
$ python3 use_existing_torch.py
$ DOCKER_BUILDKIT=1 docker build . \
--target vllm-openai \
--platform "linux/arm64" \
-t vllm/vllm-gh200-openai:latest \
--build-arg max_jobs=66 \
--build-arg nvcc_threads=2 \
--build-arg torch_cuda_arch_list="9.0+PTX" \
--build-arg vllm_fa_cmake_gpu_arches="90-real"
```
## Use the custom-built vLLM Docker image
To run vLLM with the custom-built Docker image:
```console
$ docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
vllm/vllm-openai <args...>
```
The argument `vllm/vllm-openai` specifies the image to run, and should be replaced with the name of the custom-built image (the `-t` tag from the build command).
```{note}
**For version 0.4.1 and 0.4.2 only** - the vLLM docker images under these versions are supposed to be run under the root user since a library under the root user's home directory, i.e. `/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1` is required to be loaded during runtime. If you are running the container under a different user, you may need to first change the permissions of the library (and all the parent directories) to allow the user to access it, then run vLLM with environment variable `VLLM_NCCL_SO_PATH=/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1` .
```
.. _deploying_with_docker:
Deploying with Docker
============================
Use vLLM's Official Docker Image
--------------------------------
vLLM offers an official Docker image for deployment.
The image can be used to run OpenAI compatible server and is available on Docker Hub as `vllm/vllm-openai <https://hub.docker.com/r/vllm/vllm-openai/tags>`_.
.. code-block:: console
$ docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model mistralai/Mistral-7B-v0.1
.. note::
You can either use the ``ipc=host`` flag or ``--shm-size`` flag to allow the
container to access the host's shared memory. vLLM uses PyTorch, which uses shared
memory to share data between processes under the hood, particularly for tensor parallel inference.
Building vLLM's Docker Image from Source
----------------------------------------
You can build and run vLLM from source via the provided `Dockerfile <https://github.com/vllm-project/vllm/blob/main/Dockerfile>`_. To build vLLM:
.. code-block:: console
$ # optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2
$ DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai
.. note::
By default vLLM will build for all GPU types for widest distribution. If you are just building for the
current GPU type the machine is running on, you can add the argument ``--build-arg torch_cuda_arch_list=""``
for vLLM to find the current GPU type and build for that.
Building for Arm64/aarch64
--------------------------
A docker container can be built for aarch64 systems such as the Nvidia Grace-Hopper. At time of this writing, this requires the use
of PyTorch Nightly and should be considered **experimental**. Using the flag ``--platform "linux/arm64"`` will attempt to build for arm64.
.. note::
Multiple modules must be compiled, so this process can take a while. Recommend using ``--build-arg max_jobs=`` & ``--build-arg nvcc_threads=``
flags to speed up build process. However, ensure your ``max_jobs`` is substantially larger than ``nvcc_threads`` to get the most benefits.
Keep an eye on memory usage with parallel jobs as it can be substantial (see example below).
.. code-block:: console
# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
$ python3 use_existing_torch.py
$ DOCKER_BUILDKIT=1 docker build . \
--target vllm-openai \
--platform "linux/arm64" \
-t vllm/vllm-gh200-openai:latest \
--build-arg max_jobs=66 \
--build-arg nvcc_threads=2 \
--build-arg torch_cuda_arch_list="9.0+PTX" \
--build-arg vllm_fa_cmake_gpu_arches="90-real"
Use the custom-built vLLM Docker image
--------------------------------------
To run vLLM with the custom-built Docker image:
.. code-block:: console
$ docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
vllm/vllm-openai <args...>
The argument ``vllm/vllm-openai`` specifies the image to run, and should be replaced with the name of the custom-built image (the ``-t`` tag from the build command).
.. note::
**For version 0.4.1 and 0.4.2 only** - the vLLM docker images under these versions are supposed to be run under the root user since a library under the root user's home directory, i.e. ``/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1`` is required to be loaded during runtime. If you are running the container under a different user, you may need to first change the permissions of the library (and all the parent directories) to allow the user to access it, then run vLLM with environment variable ``VLLM_NCCL_SO_PATH=/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1`` .
(deploying-with-dstack)=
# Deploying with dstack
```{raw} html
<p align="center">
<img src="https://i.ibb.co/71kx6hW/vllm-dstack.png" alt="vLLM_plus_dstack"/>
</p>
```
vLLM can be run on a cloud based GPU machine with [dstack](https://dstack.ai/), an open-source framework for running LLMs on any cloud. This tutorial assumes that you have already configured credentials, gateway, and GPU quotas on your cloud environment.
To install dstack client, run:
```console
$ pip install "dstack[all]
$ dstack server
```
Next, to configure your dstack project, run:
```console
$ mkdir -p vllm-dstack
$ cd vllm-dstack
$ dstack init
```
Next, to provision a VM instance with LLM of your choice(`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:
```yaml
type: service
python: "3.11"
env:
- MODEL=NousResearch/Llama-2-7b-chat-hf
port: 8000
resources:
gpu: 24GB
commands:
- pip install vllm
- vllm serve $MODEL --port 8000
model:
format: openai
type: chat
name: NousResearch/Llama-2-7b-chat-hf
```
Then, run the following CLI for provisioning:
```console
$ dstack run . -f serve.dstack.yml
⠸ Getting run plan...
Configuration serve.dstack.yml
Project deep-diver-main
User deep-diver
Min resources 2..xCPU, 8GB.., 1xGPU (24GB)
Max price -
Max duration -
Spot policy auto
Retry policy no
# BACKEND REGION INSTANCE RESOURCES SPOT PRICE
1 gcp us-central1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804
2 gcp us-east1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804
3 gcp us-west1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804
...
Shown 3 of 193 offers, $5.876 max
Continue? [y/n]: y
⠙ Submitting run...
⠏ Launching spicy-treefrog-1 (pulling)
spicy-treefrog-1 provisioning completed (running)
Service is published at ...
```
After the provisioning, you can interact with the model by using the OpenAI SDK:
```python
from openai import OpenAI
client = OpenAI(
base_url="https://gateway.<gateway domain>",
api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>"
)
completion = client.chat.completions.create(
model="NousResearch/Llama-2-7b-chat-hf",
messages=[
{
"role": "user",
"content": "Compose a poem that explains the concept of recursion in programming.",
}
]
)
print(completion.choices[0].message.content)
```
```{note}
dstack automatically handles authentication on the gateway using dstack's tokens. Meanwhile, if you don't want to configure a gateway, you can provision dstack `Task` instead of `Service`. The `Task` is for development purpose only. If you want to know more about hands-on materials how to serve vLLM using dstack, check out [this repository](https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm)
```
.. _deploying_with_dstack:
Deploying with dstack
============================
.. raw:: html
<p align="center">
<img src="https://i.ibb.co/71kx6hW/vllm-dstack.png" alt="vLLM_plus_dstack"/>
</p>
vLLM can be run on a cloud based GPU machine with `dstack <https://dstack.ai/>`__, an open-source framework for running LLMs on any cloud. This tutorial assumes that you have already configured credentials, gateway, and GPU quotas on your cloud environment.
To install dstack client, run:
.. code-block:: console
$ pip install "dstack[all]
$ dstack server
Next, to configure your dstack project, run:
.. code-block:: console
$ mkdir -p vllm-dstack
$ cd vllm-dstack
$ dstack init
Next, to provision a VM instance with LLM of your choice(`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:
.. code-block:: yaml
type: service
python: "3.11"
env:
- MODEL=NousResearch/Llama-2-7b-chat-hf
port: 8000
resources:
gpu: 24GB
commands:
- pip install vllm
- vllm serve $MODEL --port 8000
model:
format: openai
type: chat
name: NousResearch/Llama-2-7b-chat-hf
Then, run the following CLI for provisioning:
.. code-block:: console
$ dstack run . -f serve.dstack.yml
⠸ Getting run plan...
Configuration serve.dstack.yml
Project deep-diver-main
User deep-diver
Min resources 2..xCPU, 8GB.., 1xGPU (24GB)
Max price -
Max duration -
Spot policy auto
Retry policy no
# BACKEND REGION INSTANCE RESOURCES SPOT PRICE
1 gcp us-central1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804
2 gcp us-east1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804
3 gcp us-west1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804
...
Shown 3 of 193 offers, $5.876 max
Continue? [y/n]: y
⠙ Submitting run...
⠏ Launching spicy-treefrog-1 (pulling)
spicy-treefrog-1 provisioning completed (running)
Service is published at ...
After the provisioning, you can interact with the model by using the OpenAI SDK:
.. code-block:: python
from openai import OpenAI
client = OpenAI(
base_url="https://gateway.<gateway domain>",
api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>"
)
completion = client.chat.completions.create(
model="NousResearch/Llama-2-7b-chat-hf",
messages=[
{
"role": "user",
"content": "Compose a poem that explains the concept of recursion in programming.",
}
]
)
print(completion.choices[0].message.content)
.. note::
dstack automatically handles authentication on the gateway using dstack's tokens. Meanwhile, if you don't want to configure a gateway, you can provision dstack `Task` instead of `Service`. The `Task` is for development purpose only. If you want to know more about hands-on materials how to serve vLLM using dstack, check out `this repository <https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm>`__
.. _deploying_with_helm:
(deploying-with-helm)=
Deploying with Helm
===================
# Deploying with Helm
A Helm chart to deploy vLLM for Kubernetes
......@@ -9,44 +8,42 @@ Helm is a package manager for Kubernetes. It will help you to deploy vLLM on k8s
This guide will walk you through the process of deploying vLLM with Helm, including the necessary prerequisites, steps for helm install and documentation on architecture and values file.
Prerequisites
-------------
## Prerequisites
Before you begin, ensure that you have the following:
- A running Kubernetes cluster
- NVIDIA Kubernetes Device Plugin (``k8s-device-plugin``): This can be found at `https://github.com/NVIDIA/k8s-device-plugin <https://github.com/NVIDIA/k8s-device-plugin>`__
- NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at [https://github.com/NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin)
- Available GPU resources in your cluster
- S3 with the model which will be deployed
Installing the chart
--------------------
To install the chart with the release name ``test-vllm``:
.. code-block:: console
## Installing the chart
helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . -f values.yaml --set secrets.s3endpoint=$ACCESS_POINT --set secrets.s3bucketname=$BUCKET --set secrets.s3accesskeyid=$ACCESS_KEY --set secrets.s3accesskey=$SECRET_KEY
To install the chart with the release name `test-vllm`:
Uninstalling the Chart
----------------------
```console
helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . -f values.yaml --set secrets.s3endpoint=$ACCESS_POINT --set secrets.s3bucketname=$BUCKET --set secrets.s3accesskeyid=$ACCESS_KEY --set secrets.s3accesskey=$SECRET_KEY
```
To uninstall the ``test-vllm`` deployment:
## Uninstalling the Chart
.. code-block:: console
To uninstall the `test-vllm` deployment:
helm uninstall test-vllm --namespace=ns-vllm
```console
helm uninstall test-vllm --namespace=ns-vllm
```
The command removes all the Kubernetes components associated with the
chart **including persistent volumes** and deletes the release.
Architecture
------------
## Architecture
.. image:: architecture_helm_deployment.png
```{image} architecture_helm_deployment.png
```
Values
------
## Values
```{eval-rst}
.. list-table:: Values
:widths: 25 25 25 25
:header-rows: 1
......@@ -251,3 +248,4 @@ Values
- string
- test
- Release name
```
.. _deploying_with_k8s:
(deploying-with-k8s)=
Deploying with Kubernetes
==========================
# Deploying with Kubernetes
Using Kubernetes to deploy vLLM is a scalable and efficient way to serve machine learning models. This guide will walk you through the process of deploying vLLM with Kubernetes, including the necessary prerequisites, steps for deployment, and testing.
Prerequisites
-------------
## Prerequisites
Before you begin, ensure that you have the following:
- A running Kubernetes cluster
- NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at `https://github.com/NVIDIA/k8s-device-plugin/`
- Available GPU resources in your cluster
Deployment Steps
----------------
1. **Create a PVC , Secret and Deployment for vLLM**
## Deployment Steps
1. **Create a PVC , Secret and Deployment for vLLM**
PVC is used to store the model cache and it is optional, you can use hostPath or other storage options
.. code-block:: yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: mistral-7b
namespace: default
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 50Gi
storageClassName: default
volumeMode: Filesystem
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: mistral-7b
namespace: default
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 50Gi
storageClassName: default
volumeMode: Filesystem
```
Secret is optional and only required for accessing gated models, you can skip this step if you are not using gated models
.. code-block:: yaml
apiVersion: v1
kind: Secret
metadata:
name: hf-token-secret
namespace: default
type: Opaque
data:
token: "REPLACE_WITH_TOKEN"
```yaml
apiVersion: v1
kind: Secret
metadata:
name: hf-token-secret
namespace: default
type: Opaque
data:
token: "REPLACE_WITH_TOKEN"
```
Create a deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model:
.. code-block:: yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: mistral-7b
namespace: default
labels:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: mistral-7b
namespace: default
labels:
app: mistral-7b
spec:
replicas: 1
selector:
matchLabels:
app: mistral-7b
spec:
replicas: 1
selector:
matchLabels:
template:
metadata:
labels:
app: mistral-7b
template:
metadata:
labels:
app: mistral-7b
spec:
volumes:
- name: cache-volume
persistentVolumeClaim:
claimName: mistral-7b
# vLLM needs to access the host's shared memory for tensor parallel inference.
spec:
volumes:
- name: cache-volume
persistentVolumeClaim:
claimName: mistral-7b
# vLLM needs to access the host's shared memory for tensor parallel inference.
- name: shm
emptyDir:
medium: Memory
sizeLimit: "2Gi"
containers:
- name: mistral-7b
image: vllm/vllm-openai:latest
command: ["/bin/sh", "-c"]
args: [
"vllm serve mistralai/Mistral-7B-Instruct-v0.3 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
]
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
ports:
- containerPort: 8000
resources:
limits:
cpu: "10"
memory: 20G
nvidia.com/gpu: "1"
requests:
cpu: "2"
memory: 6G
nvidia.com/gpu: "1"
volumeMounts:
- mountPath: /root/.cache/huggingface
name: cache-volume
- name: shm
emptyDir:
medium: Memory
sizeLimit: "2Gi"
containers:
- name: mistral-7b
image: vllm/vllm-openai:latest
command: ["/bin/sh", "-c"]
args: [
"vllm serve mistralai/Mistral-7B-Instruct-v0.3 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
]
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
ports:
- containerPort: 8000
resources:
limits:
cpu: "10"
memory: 20G
nvidia.com/gpu: "1"
requests:
cpu: "2"
memory: 6G
nvidia.com/gpu: "1"
volumeMounts:
- mountPath: /root/.cache/huggingface
name: cache-volume
- name: shm
mountPath: /dev/shm
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 5
mountPath: /dev/shm
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 5
```
2. **Create a Kubernetes Service for vLLM**
Next, create a Kubernetes Service file to expose the `mistral-7b` deployment:
.. code-block:: yaml
apiVersion: v1
kind: Service
metadata:
name: mistral-7b
namespace: default
spec:
ports:
- name: http-mistral-7b
port: 80
protocol: TCP
targetPort: 8000
# The label selector should match the deployment labels & it is useful for prefix caching feature
selector:
app: mistral-7b
sessionAffinity: None
type: ClusterIP
```yaml
apiVersion: v1
kind: Service
metadata:
name: mistral-7b
namespace: default
spec:
ports:
- name: http-mistral-7b
port: 80
protocol: TCP
targetPort: 8000
# The label selector should match the deployment labels & it is useful for prefix caching feature
selector:
app: mistral-7b
sessionAffinity: None
type: ClusterIP
```
3. **Deploy and Test**
Apply the deployment and service configurations using ``kubectl apply -f <filename>``:
Apply the deployment and service configurations using `kubectl apply -f <filename>`:
.. code-block:: console
```console
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
```
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
To test the deployment, run the following `curl` command:
To test the deployment, run the following ``curl`` command:
.. code-block:: console
curl http://mistral-7b.default.svc.cluster.local/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
```console
curl http://mistral-7b.default.svc.cluster.local/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
```
If the service is correctly deployed, you should receive a response from the vLLM model.
Conclusion
----------
## Conclusion
Deploying vLLM with Kubernetes allows for efficient scaling and management of ML models leveraging GPU resources. By following the steps outlined above, you should be able to set up and test a vLLM deployment within your Kubernetes cluster. If you encounter any issues or have suggestions, please feel free to contribute to the documentation.
(deploying-with-kserve)=
# Deploying with KServe
vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving.
Please see [this guide](https://kserve.github.io/website/latest/modelserving/v1beta1/llm/huggingface/) for more details on using vLLM with KServe.
.. _deploying_with_kserve:
Deploying with KServe
============================
vLLM can be deployed with `KServe <https://github.com/kserve/kserve>`_ on Kubernetes for highly scalable distributed model serving.
Please see `this guide <https://kserve.github.io/website/latest/modelserving/v1beta1/llm/huggingface/>`_ for more details on using vLLM with KServe.
(deploying-with-kubeai)=
# Deploying with KubeAI
[KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.
Please see the Installation Guides for environment specific instructions:
- [Any Kubernetes Cluster](https://www.kubeai.org/installation/any/)
- [EKS](https://www.kubeai.org/installation/eks/)
- [GKE](https://www.kubeai.org/installation/gke/)
Once you have KubeAI installed, you can
[configure text generation models](https://www.kubeai.org/how-to/configure-text-generation-models/)
using vLLM.
.. _deploying_with_kubeai:
Deploying with KubeAI
=====================
`KubeAI <https://github.com/substratusai/kubeai>`_ is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.
Please see the Installation Guides for environment specific instructions:
* `Any Kubernetes Cluster <https://www.kubeai.org/installation/any/>`_
* `EKS <https://www.kubeai.org/installation/eks/>`_
* `GKE <https://www.kubeai.org/installation/gke/>`_
Once you have KubeAI installed, you can
`configure text generation models <https://www.kubeai.org/how-to/configure-text-generation-models/>`_
using vLLM.
\ No newline at end of file
.. _deploying_with_lws:
(deploying-with-lws)=
Deploying with LWS
============================
# Deploying with LWS
LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads.
A major use case is for multi-host/multi-node distributed inference.
vLLM can be deployed with `LWS <https://github.com/kubernetes-sigs/lws>`_ on Kubernetes for distributed model serving.
vLLM can be deployed with [LWS](https://github.com/kubernetes-sigs/lws) on Kubernetes for distributed model serving.
Please see `this guide <https://github.com/kubernetes-sigs/lws/tree/main/docs/examples/vllm>`_ for more details on
Please see [this guide](https://github.com/kubernetes-sigs/lws/tree/main/docs/examples/vllm) for more details on
deploying vLLM on Kubernetes using LWS.
(nginxloadbalancer)=
# Deploying with Nginx Loadbalancer
This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.
Table of contents:
1. [Build Nginx Container](#nginxloadbalancer-nginx-build)
2. [Create Simple Nginx Config file](#nginxloadbalancer-nginx-conf)
3. [Build vLLM Container](#nginxloadbalancer-nginx-vllm-container)
4. [Create Docker Network](#nginxloadbalancer-nginx-docker-network)
5. [Launch vLLM Containers](#nginxloadbalancer-nginx-launch-container)
6. [Launch Nginx](#nginxloadbalancer-nginx-launch-nginx)
7. [Verify That vLLM Servers Are Ready](#nginxloadbalancer-nginx-verify-nginx)
(nginxloadbalancer-nginx-build)=
## Build Nginx Container
This guide assumes that you have just cloned the vLLM project and you're currently in the vllm root directory.
```console
export vllm_root=`pwd`
```
Create a file named `Dockerfile.nginx`:
```console
FROM nginx:latest
RUN rm /etc/nginx/conf.d/default.conf
EXPOSE 80
CMD ["nginx", "-g", "daemon off;"]
```
Build the container:
```console
docker build . -f Dockerfile.nginx --tag nginx-lb
```
(nginxloadbalancer-nginx-conf)=
## Create Simple Nginx Config file
Create a file named `nginx_conf/nginx.conf`. Note that you can add as many servers as you'd like. In the below example we'll start with two. To add more, add another `server vllmN:8000 max_fails=3 fail_timeout=10000s;` entry to `upstream backend`.
```console
upstream backend {
least_conn;
server vllm0:8000 max_fails=3 fail_timeout=10000s;
server vllm1:8000 max_fails=3 fail_timeout=10000s;
}
server {
listen 80;
location / {
proxy_pass http://backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
```
(nginxloadbalancer-nginx-vllm-container)=
## Build vLLM Container
```console
cd $vllm_root
docker build -f Dockerfile . --tag vllm
```
If you are behind proxy, you can pass the proxy settings to the docker build command as shown below:
```console
cd $vllm_root
docker build -f Dockerfile . --tag vllm --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy
```
(nginxloadbalancer-nginx-docker-network)=
## Create Docker Network
```console
docker network create vllm_nginx
```
(nginxloadbalancer-nginx-launch-container)=
## Launch vLLM Containers
Notes:
- If you have your HuggingFace models cached somewhere else, update `hf_cache_dir` below.
- If you don't have an existing HuggingFace cache you will want to start `vllm0` and wait for the model to complete downloading and the server to be ready. This will ensure that `vllm1` can leverage the model you just downloaded and it won't have to be downloaded again.
- The below example assumes GPU backend used. If you are using CPU backend, remove `--gpus all`, add `VLLM_CPU_KVCACHE_SPACE` and `VLLM_CPU_OMP_THREADS_BIND` environment variables to the docker run command.
- Adjust the model name that you want to use in your vLLM servers if you don't want to use `Llama-2-7b-chat-hf`.
```console
mkdir -p ~/.cache/huggingface/hub/
hf_cache_dir=~/.cache/huggingface/
docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8081:8000 --name vllm0 vllm --model meta-llama/Llama-2-7b-chat-hf
docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf
```
```{note}
If you are behind proxy, you can pass the proxy settings to the docker run command via `-e http_proxy=$http_proxy -e https_proxy=$https_proxy`.
```
(nginxloadbalancer-nginx-launch-nginx)=
## Launch Nginx
```console
docker run -itd -p 8000:80 --network vllm_nginx -v ./nginx_conf/:/etc/nginx/conf.d/ --name nginx-lb nginx-lb:latest
```
(nginxloadbalancer-nginx-verify-nginx)=
## Verify That vLLM Servers Are Ready
```console
docker logs vllm0 | grep Uvicorn
docker logs vllm1 | grep Uvicorn
```
Both outputs should look like this:
```console
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
.. _nginxloadbalancer:
Deploying with Nginx Loadbalancer
=================================
This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.
Table of contents:
#. :ref:`Build Nginx Container <nginxloadbalancer_nginx_build>`
#. :ref:`Create Simple Nginx Config file <nginxloadbalancer_nginx_conf>`
#. :ref:`Build vLLM Container <nginxloadbalancer_nginx_vllm_container>`
#. :ref:`Create Docker Network <nginxloadbalancer_nginx_docker_network>`
#. :ref:`Launch vLLM Containers <nginxloadbalancer_nginx_launch_container>`
#. :ref:`Launch Nginx <nginxloadbalancer_nginx_launch_nginx>`
#. :ref:`Verify That vLLM Servers Are Ready <nginxloadbalancer_nginx_verify_nginx>`
.. _nginxloadbalancer_nginx_build:
Build Nginx Container
---------------------
This guide assumes that you have just cloned the vLLM project and you're currently in the vllm root directory.
.. code-block:: console
export vllm_root=`pwd`
Create a file named ``Dockerfile.nginx``:
.. code-block:: console
FROM nginx:latest
RUN rm /etc/nginx/conf.d/default.conf
EXPOSE 80
CMD ["nginx", "-g", "daemon off;"]
Build the container:
.. code-block:: console
docker build . -f Dockerfile.nginx --tag nginx-lb
.. _nginxloadbalancer_nginx_conf:
Create Simple Nginx Config file
-------------------------------
Create a file named ``nginx_conf/nginx.conf``. Note that you can add as many servers as you'd like. In the below example we'll start with two. To add more, add another ``server vllmN:8000 max_fails=3 fail_timeout=10000s;`` entry to ``upstream backend``.
.. code-block:: console
upstream backend {
least_conn;
server vllm0:8000 max_fails=3 fail_timeout=10000s;
server vllm1:8000 max_fails=3 fail_timeout=10000s;
}
server {
listen 80;
location / {
proxy_pass http://backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
.. _nginxloadbalancer_nginx_vllm_container:
Build vLLM Container
--------------------
.. code-block:: console
cd $vllm_root
docker build -f Dockerfile . --tag vllm
If you are behind proxy, you can pass the proxy settings to the docker build command as shown below:
.. code-block:: console
cd $vllm_root
docker build -f Dockerfile . --tag vllm --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy
.. _nginxloadbalancer_nginx_docker_network:
Create Docker Network
---------------------
.. code-block:: console
docker network create vllm_nginx
.. _nginxloadbalancer_nginx_launch_container:
Launch vLLM Containers
----------------------
Notes:
* If you have your HuggingFace models cached somewhere else, update ``hf_cache_dir`` below.
* If you don't have an existing HuggingFace cache you will want to start ``vllm0`` and wait for the model to complete downloading and the server to be ready. This will ensure that ``vllm1`` can leverage the model you just downloaded and it won't have to be downloaded again.
* The below example assumes GPU backend used. If you are using CPU backend, remove ``--gpus all``, add ``VLLM_CPU_KVCACHE_SPACE`` and ``VLLM_CPU_OMP_THREADS_BIND`` environment variables to the docker run command.
* Adjust the model name that you want to use in your vLLM servers if you don't want to use ``Llama-2-7b-chat-hf``.
.. code-block:: console
mkdir -p ~/.cache/huggingface/hub/
hf_cache_dir=~/.cache/huggingface/
docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8081:8000 --name vllm0 vllm --model meta-llama/Llama-2-7b-chat-hf
docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf
.. note::
If you are behind proxy, you can pass the proxy settings to the docker run command via ``-e http_proxy=$http_proxy -e https_proxy=$https_proxy``.
.. _nginxloadbalancer_nginx_launch_nginx:
Launch Nginx
------------
.. code-block:: console
docker run -itd -p 8000:80 --network vllm_nginx -v ./nginx_conf/:/etc/nginx/conf.d/ --name nginx-lb nginx-lb:latest
.. _nginxloadbalancer_nginx_verify_nginx:
Verify That vLLM Servers Are Ready
----------------------------------
.. code-block:: console
docker logs vllm0 | grep Uvicorn
docker logs vllm1 | grep Uvicorn
Both outputs should look like this:
.. code-block:: console
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment