vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
This quantization method is particularly useful for reducing model size while maintaining good performance.
Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415).
```{note}
INT8 computation is supported on NVIDIA GPUs with compute capability > 7.5 (Turing, Ampere, Ada Lovelace, Hopper).
```
## Prerequisites
To use INT8 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
```console
$pip install llmcompressor
```
## Quantization Process
The quantization process involves four main steps:
1. Loading the model
2. Preparing calibration data
3. Applying quantization
4. Evaluating accuracy in vLLM
### 1. Loading the Model
Use `SparseAutoModelForCausalLM`, which wraps `AutoModelForCausalLM`, for saving and loading quantized models:
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
```{eval-rst}
.. list-table::
.. list-table::
:header-rows: 1
:header-rows: 1
:widths: 20 8 8 8 8 8 8 8 8 8 8
:widths: 20 8 8 8 8 8 8 8 8 8 8
* - Implementation
* - Implementation
- Volta
- Volta
- Turing
- Turing
- Ampere
- Ampere
- Ada
- Ada
- Hopper
- Hopper
- AMD GPU
- AMD GPU
- Intel GPU
- Intel GPU
- x86 CPU
- x86 CPU
- AWS Inferentia
- AWS Inferentia
- Google TPU
- Google TPU
* - AWQ
* - AWQ
- ✗
- ✗
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
* - GPTQ
* - GPTQ
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
* - Marlin (GPTQ/AWQ/FP8)
* - Marlin (GPTQ/AWQ/FP8)
- ✗
- ✗
- ✗
- ✗
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
* - INT8 (W8A8)
* - INT8 (W8A8)
- ✗
- ✗
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
* - FP8 (W8A8)
* - FP8 (W8A8)
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
* - AQLM
* - AQLM
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
* - bitsandbytes
* - bitsandbytes
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
* - DeepSpeedFP
* - DeepSpeedFP
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
* - GGUF
* - GGUF
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
```
Notes:
^^^^^^
## Notes:
- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- "✅︎" indicates that the quantization method is supported on the specified hardware.
- "✅︎" indicates that the quantization method is supported on the specified hardware.
- "✗" indicates that the quantization method is not supported on the specified hardware.
- "✗" indicates that the quantization method is not supported on the specified hardware.
Please note that this compatibility chart may be subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
Please note that this compatibility chart may be subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
For the most up-to-date information on hardware support and quantization methods, please check the `quantization directory <https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/layers/quantization>`_ or consult with the vLLM development team.
For the most up-to-date information on hardware support and quantization methods, please check the [quantization directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/layers/quantization) or consult with the vLLM development team.
[BentoML](https://github.com/bentoml/BentoML) allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-complicant image and deploy it on Kubernetes.
For details, see the tutorial [vLLM inference in the BentoML documentation](https://docs.bentoml.com/en/latest/use-cases/large-language-models/vllm.html).
`BentoML <https://github.com/bentoml/BentoML>`_ allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-complicant image and deploy it on Kubernetes.
For details, see the tutorial `vLLM inference in the BentoML documentation <https://docs.bentoml.com/en/latest/use-cases/large-language-models/vllm.html>`_.
vLLM can be run on a cloud based GPU machine with [Cerebrium](https://www.cerebrium.ai/), a serverless AI infrastructure platform that makes it easier for companies to build and deploy AI based applications.
To install the Cerebrium client, run:
```console
$pip install cerebrium
$cerebrium login
```
Next, create your Cerebrium project, run:
```console
$cerebrium init vllm-project
```
Next, to install the required packages, add the following to your cerebrium.toml:
Next, let us add our code to handle inference for the LLM of your choice(`mistralai/Mistral-7B-Instruct-v0.1` for this example), add the following code to your main.py\`:
Then, run the following code to deploy it to the cloud
```console
$cerebrium deploy
```
If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case /run)
vLLM can be run on a cloud based GPU machine with `Cerebrium <https://www.cerebrium.ai/>`__, a serverless AI infrastructure platform that makes it easier for companies to build and deploy AI based applications.
To install the Cerebrium client, run:
.. code-block:: console
$ pip install cerebrium
$ cerebrium login
Next, create your Cerebrium project, run:
.. code-block:: console
$ cerebrium init vllm-project
Next, to install the required packages, add the following to your cerebrium.toml:
Next, let us add our code to handle inference for the LLM of your choice(`mistralai/Mistral-7B-Instruct-v0.1` for this example), add the following code to your main.py`:
Then, run the following code to deploy it to the cloud
.. code-block:: console
$ cerebrium deploy
If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case /run)
.. code-block:: python
curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
-H 'Content-Type: application/json' \
-H 'Authorization: <JWT TOKEN>' \
--data '{
"prompts": [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is"
]
}'
You should get a response like:
.. code-block:: python
{
"run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
"result": {
"result": [
{
"prompt": "Hello, my name is",
"generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of"
},
{
"prompt": "The president of the United States is",
"generated_text": " elected every four years. This is a democratic system.\n\n5. What"
},
{
"prompt": "The capital of France is",
"generated_text": " Paris.\n"
},
{
"prompt": "The future of AI is",
"generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective."
}
]
},
"run_time_ms": 152.53663063049316
}
You now have an autoscaling endpoint where you only pay for the compute you use!
vLLM offers an official Docker image for deployment.
The image can be used to run OpenAI compatible server and is available on Docker Hub as [vllm/vllm-openai](https://hub.docker.com/r/vllm/vllm-openai/tags).
The argument `vllm/vllm-openai` specifies the image to run, and should be replaced with the name of the custom-built image (the `-t` tag from the build command).
```{note}
**For version 0.4.1 and 0.4.2 only** - the vLLM docker images under these versions are supposed to be run under the root user since a library under the root user's home directory, i.e. `/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1` is required to be loaded during runtime. If you are running the container under a different user, you may need to first change the permissions of the library (and all the parent directories) to allow the user to access it, then run vLLM with environment variable `VLLM_NCCL_SO_PATH=/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1` .
vLLM offers an official Docker image for deployment.
The image can be used to run OpenAI compatible server and is available on Docker Hub as `vllm/vllm-openai <https://hub.docker.com/r/vllm/vllm-openai/tags>`_.
The argument ``vllm/vllm-openai`` specifies the image to run, and should be replaced with the name of the custom-built image (the ``-t`` tag from the build command).
.. note::
**For version 0.4.1 and 0.4.2 only** - the vLLM docker images under these versions are supposed to be run under the root user since a library under the root user's home directory, i.e. ``/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1`` is required to be loaded during runtime. If you are running the container under a different user, you may need to first change the permissions of the library (and all the parent directories) to allow the user to access it, then run vLLM with environment variable ``VLLM_NCCL_SO_PATH=/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1`` .
vLLM can be run on a cloud based GPU machine with [dstack](https://dstack.ai/), an open-source framework for running LLMs on any cloud. This tutorial assumes that you have already configured credentials, gateway, and GPU quotas on your cloud environment.
To install dstack client, run:
```console
$pip install"dstack[all]
$dstack server
```
Next, to configure your dstack project, run:
```console
$mkdir-p vllm-dstack
$cd vllm-dstack
$dstack init
```
Next, to provision a VM instance with LLM of your choice(`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:
After the provisioning, you can interact with the model by using the OpenAI SDK:
```python
fromopenaiimportOpenAI
client=OpenAI(
base_url="https://gateway.<gateway domain>",
api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>"
)
completion=client.chat.completions.create(
model="NousResearch/Llama-2-7b-chat-hf",
messages=[
{
"role":"user",
"content":"Compose a poem that explains the concept of recursion in programming.",
}
]
)
print(completion.choices[0].message.content)
```
```{note}
dstack automatically handles authentication on the gateway using dstack's tokens. Meanwhile, if you don't want to configure a gateway, you can provision dstack `Task` instead of `Service`. The `Task` is for development purpose only. If you want to know more about hands-on materials how to serve vLLM using dstack, check out [this repository](https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm)
vLLM can be run on a cloud based GPU machine with `dstack <https://dstack.ai/>`__, an open-source framework for running LLMs on any cloud. This tutorial assumes that you have already configured credentials, gateway, and GPU quotas on your cloud environment.
To install dstack client, run:
.. code-block:: console
$ pip install "dstack[all]
$ dstack server
Next, to configure your dstack project, run:
.. code-block:: console
$ mkdir -p vllm-dstack
$ cd vllm-dstack
$ dstack init
Next, to provision a VM instance with LLM of your choice(`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:
After the provisioning, you can interact with the model by using the OpenAI SDK:
.. code-block:: python
from openai import OpenAI
client = OpenAI(
base_url="https://gateway.<gateway domain>",
api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>"
)
completion = client.chat.completions.create(
model="NousResearch/Llama-2-7b-chat-hf",
messages=[
{
"role": "user",
"content": "Compose a poem that explains the concept of recursion in programming.",
}
]
)
print(completion.choices[0].message.content)
.. note::
dstack automatically handles authentication on the gateway using dstack's tokens. Meanwhile, if you don't want to configure a gateway, you can provision dstack `Task` instead of `Service`. The `Task` is for development purpose only. If you want to know more about hands-on materials how to serve vLLM using dstack, check out `this repository <https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm>`__
@@ -9,44 +8,42 @@ Helm is a package manager for Kubernetes. It will help you to deploy vLLM on k8s
...
@@ -9,44 +8,42 @@ Helm is a package manager for Kubernetes. It will help you to deploy vLLM on k8s
This guide will walk you through the process of deploying vLLM with Helm, including the necessary prerequisites, steps for helm install and documentation on architecture and values file.
This guide will walk you through the process of deploying vLLM with Helm, including the necessary prerequisites, steps for helm install and documentation on architecture and values file.
Prerequisites
## Prerequisites
-------------
Before you begin, ensure that you have the following:
Before you begin, ensure that you have the following:
- A running Kubernetes cluster
- A running Kubernetes cluster
- NVIDIA Kubernetes Device Plugin (``k8s-device-plugin``): This can be found at `https://github.com/NVIDIA/k8s-device-plugin <https://github.com/NVIDIA/k8s-device-plugin>`__
- NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at [https://github.com/NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin)
- Available GPU resources in your cluster
- Available GPU resources in your cluster
- S3 with the model which will be deployed
- S3 with the model which will be deployed
Installing the chart
## Installing the chart
--------------------
To install the chart with the release name ``test-vllm``:
Using Kubernetes to deploy vLLM is a scalable and efficient way to serve machine learning models. This guide will walk you through the process of deploying vLLM with Kubernetes, including the necessary prerequisites, steps for deployment, and testing.
Using Kubernetes to deploy vLLM is a scalable and efficient way to serve machine learning models. This guide will walk you through the process of deploying vLLM with Kubernetes, including the necessary prerequisites, steps for deployment, and testing.
Prerequisites
## Prerequisites
-------------
Before you begin, ensure that you have the following:
Before you begin, ensure that you have the following:
- A running Kubernetes cluster
- A running Kubernetes cluster
- NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at `https://github.com/NVIDIA/k8s-device-plugin/`
- NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at `https://github.com/NVIDIA/k8s-device-plugin/`
- Available GPU resources in your cluster
- Available GPU resources in your cluster
Deployment Steps
## Deployment Steps
----------------
1. **Create a PVC , Secret and Deployment for vLLM**
1.**Create a PVC , Secret and Deployment for vLLM**
PVC is used to store the model cache and it is optional, you can use hostPath or other storage options
PVC is used to store the model cache and it is optional, you can use hostPath or other storage options
.. code-block:: yaml
```yaml
apiVersion:v1
apiVersion: v1
kind:PersistentVolumeClaim
kind: PersistentVolumeClaim
metadata:
metadata:
name:mistral-7b
name: mistral-7b
namespace:default
namespace: default
spec:
spec:
accessModes:
accessModes:
-ReadWriteOnce
- ReadWriteOnce
resources:
resources:
requests:
requests:
storage:50Gi
storage: 50Gi
storageClassName:default
storageClassName: default
volumeMode:Filesystem
volumeMode: Filesystem
```
Secret is optional and only required for accessing gated models, you can skip this step if you are not using gated models
Secret is optional and only required for accessing gated models, you can skip this step if you are not using gated models
.. code-block:: yaml
```yaml
apiVersion:v1
apiVersion: v1
kind:Secret
kind: Secret
metadata:
metadata:
name:hf-token-secret
name: hf-token-secret
namespace:default
namespace: default
type:Opaque
type: Opaque
data:
data:
token:"REPLACE_WITH_TOKEN"
token: "REPLACE_WITH_TOKEN"
```
Create a deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model:
Create a deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model:
.. code-block:: yaml
```yaml
apiVersion:apps/v1
apiVersion: apps/v1
kind:Deployment
kind: Deployment
metadata:
metadata:
name:mistral-7b
name: mistral-7b
namespace:default
namespace: default
labels:
labels:
app:mistral-7b
spec:
replicas:1
selector:
matchLabels:
app:mistral-7b
app:mistral-7b
spec:
template:
replicas: 1
metadata:
selector:
labels:
matchLabels:
app:mistral-7b
app:mistral-7b
template:
spec:
metadata:
volumes:
labels:
-name:cache-volume
app: mistral-7b
persistentVolumeClaim:
spec:
claimName:mistral-7b
volumes:
# vLLM needs to access the host's shared memory for tensor parallel inference.
- name: cache-volume
-name:shm
persistentVolumeClaim:
emptyDir:
claimName: mistral-7b
medium:Memory
# vLLM needs to access the host's shared memory for tensor parallel inference.
If the service is correctly deployed, you should receive a response from the vLLM model.
If the service is correctly deployed, you should receive a response from the vLLM model.
Conclusion
## Conclusion
----------
Deploying vLLM with Kubernetes allows for efficient scaling and management of ML models leveraging GPU resources. By following the steps outlined above, you should be able to set up and test a vLLM deployment within your Kubernetes cluster. If you encounter any issues or have suggestions, please feel free to contribute to the documentation.
Deploying vLLM with Kubernetes allows for efficient scaling and management of ML models leveraging GPU resources. By following the steps outlined above, you should be able to set up and test a vLLM deployment within your Kubernetes cluster. If you encounter any issues or have suggestions, please feel free to contribute to the documentation.
[KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.
Please see the Installation Guides for environment specific instructions:
`KubeAI <https://github.com/substratusai/kubeai>`_ is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.
Please see the Installation Guides for environment specific instructions:
7.[Verify That vLLM Servers Are Ready](#nginxloadbalancer-nginx-verify-nginx)
(nginxloadbalancer-nginx-build)=
## Build Nginx Container
This guide assumes that you have just cloned the vLLM project and you're currently in the vllm root directory.
```console
export vllm_root=`pwd`
```
Create a file named `Dockerfile.nginx`:
```console
FROM nginx:latest
RUN rm /etc/nginx/conf.d/default.conf
EXPOSE 80
CMD ["nginx", "-g", "daemon off;"]
```
Build the container:
```console
docker build . -f Dockerfile.nginx --tag nginx-lb
```
(nginxloadbalancer-nginx-conf)=
## Create Simple Nginx Config file
Create a file named `nginx_conf/nginx.conf`. Note that you can add as many servers as you'd like. In the below example we'll start with two. To add more, add another `server vllmN:8000 max_fails=3 fail_timeout=10000s;` entry to `upstream backend`.
```console
upstream backend {
least_conn;
server vllm0:8000 max_fails=3 fail_timeout=10000s;
server vllm1:8000 max_fails=3 fail_timeout=10000s;
- If you have your HuggingFace models cached somewhere else, update `hf_cache_dir` below.
- If you don't have an existing HuggingFace cache you will want to start `vllm0` and wait for the model to complete downloading and the server to be ready. This will ensure that `vllm1` can leverage the model you just downloaded and it won't have to be downloaded again.
- The below example assumes GPU backend used. If you are using CPU backend, remove `--gpus all`, add `VLLM_CPU_KVCACHE_SPACE` and `VLLM_CPU_OMP_THREADS_BIND` environment variables to the docker run command.
- Adjust the model name that you want to use in your vLLM servers if you don't want to use `Llama-2-7b-chat-hf`.
```console
mkdir -p ~/.cache/huggingface/hub/
hf_cache_dir=~/.cache/huggingface/
docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8081:8000 --name vllm0 vllm --model meta-llama/Llama-2-7b-chat-hf
docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf
```
```{note}
If you are behind proxy, you can pass the proxy settings to the docker run command via `-e http_proxy=$http_proxy -e https_proxy=$https_proxy`.
#. :ref:`Verify That vLLM Servers Are Ready <nginxloadbalancer_nginx_verify_nginx>`
.. _nginxloadbalancer_nginx_build:
Build Nginx Container
---------------------
This guide assumes that you have just cloned the vLLM project and you're currently in the vllm root directory.
.. code-block:: console
export vllm_root=`pwd`
Create a file named ``Dockerfile.nginx``:
.. code-block:: console
FROM nginx:latest
RUN rm /etc/nginx/conf.d/default.conf
EXPOSE 80
CMD ["nginx", "-g", "daemon off;"]
Build the container:
.. code-block:: console
docker build . -f Dockerfile.nginx --tag nginx-lb
.. _nginxloadbalancer_nginx_conf:
Create Simple Nginx Config file
-------------------------------
Create a file named ``nginx_conf/nginx.conf``. Note that you can add as many servers as you'd like. In the below example we'll start with two. To add more, add another ``server vllmN:8000 max_fails=3 fail_timeout=10000s;`` entry to ``upstream backend``.
.. code-block:: console
upstream backend {
least_conn;
server vllm0:8000 max_fails=3 fail_timeout=10000s;
server vllm1:8000 max_fails=3 fail_timeout=10000s;
* If you have your HuggingFace models cached somewhere else, update ``hf_cache_dir`` below.
* If you don't have an existing HuggingFace cache you will want to start ``vllm0`` and wait for the model to complete downloading and the server to be ready. This will ensure that ``vllm1`` can leverage the model you just downloaded and it won't have to be downloaded again.
* The below example assumes GPU backend used. If you are using CPU backend, remove ``--gpus all``, add ``VLLM_CPU_KVCACHE_SPACE`` and ``VLLM_CPU_OMP_THREADS_BIND`` environment variables to the docker run command.
* Adjust the model name that you want to use in your vLLM servers if you don't want to use ``Llama-2-7b-chat-hf``.
.. code-block:: console
mkdir -p ~/.cache/huggingface/hub/
hf_cache_dir=~/.cache/huggingface/
docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8081:8000 --name vllm0 vllm --model meta-llama/Llama-2-7b-chat-hf
docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf
.. note::
If you are behind proxy, you can pass the proxy settings to the docker run command via ``-e http_proxy=$http_proxy -e https_proxy=$https_proxy``.