vLLM initially supports basic model inferencing and serving on x86 CPU platform, with data types FP32 and BF16.
vLLM initially supports basic model inferencing and serving on x86 CPU platform, with data types FP32 and BF16. vLLM CPU backend supports the following vLLM features:
- Tensor Parallel (``-tp = N``)
- Quantization (``INT8 W8A8, AWQ``)
.. note::
FP16 data type and more advanced features on `chunked-prefill`, `prefix-caching` and `FP8 KV cache` are under development and will be available soon.
Table of contents:
...
...
@@ -141,5 +147,20 @@ Performance tips
- If using vLLM CPU backend on a multi-socket machine with NUMA, be aware to set CPU cores using ``VLLM_CPU_OMP_THREADS_BIND`` to avoid cross NUMA node memory access.
CPU Backend Considerations
--------------------------
- The CPU backend significantly differs from the GPU backend since the vLLM architecture was originally optimized for GPU use. A number of optimizations are needed to enhance its performance.
- Decouple the HTTP serving components from the inference components. In a GPU backend configuration, the HTTP serving and tokenization tasks operate on the CPU, while inference runs on the GPU, which typically does not pose a problem. However, in a CPU-based setup, the HTTP serving and tokenization can cause significant context switching and reduced cache efficiency. Therefore, it is strongly recommended to segregate these two components for improved performance.
- On CPU based setup with NUMA enabled, the memory access performance may be largely impacted by the `topology <https://github.com/intel/intel-extension-for-pytorch/blob/main/docs/tutorials/performance_tuning/tuning_guide.md#non-uniform-memory-access-numa>`_. For NUMA architecture, two optimizations are to recommended: Tensor Parallel or Data Parallel.
* Using Tensor Parallel for a latency constraints deployment: following GPU backend design, a Megatron-LM's parallel algorithm will be used to shard the model, based on the number of NUMA nodes (e.g. TP = 2 for a two NUMA node system). With `TP feature on CPU <https://github.com/vllm-project/vllm/pull/6125>`_ merged, Tensor Parallel is supported for serving and offline inferencing. In general each NUMA node is treated as one GPU card. Below is the example script to enable Tensor Parallel = 2 for serving:
* Using Data Parallel for maximum throughput: to launch an LLM serving endpoint on each NUMA node along with one additional load balancer to dispatch the requests to those endpoints. Common solutions like `Nginx <../serving/deploying_with_nginx.html>`_ or HAProxy are recommended. Anyscale Ray project provides the feature on LLM `serving <https://docs.ray.io/en/latest/serve/index.html>`_. Here is the example to setup a scalable LLM serving with `Ray Serve <https://github.com/intel/llm-on-ray/blob/main/docs/setup.md>`_.
If you are testing with multi-nodes, adjust ``--nproc-per-node`` and ``--nnodes`` according to your setup and set ``MASTER_ADDR`` to the correct IP address of the master node, reachable from all nodes. Then, run:
LLM inference is a fast-evolving field, and the latest code may contain bug fixes, performance improvements, and new features that are not released yet. To allow users to try the latest code without waiting for the next release, vLLM provides wheels for Linux running on x86 platform with cuda 12 for every commit since v0.5.3. You can download and install the latest one with the following command:
LLM inference is a fast-evolving field, and the latest code may contain bug fixes, performance improvements, and new features that are not released yet. To allow users to try the latest code without waiting for the next release, vLLM provides wheels for Linux running on a x86 platform with CUDA 12 for every commit since ``v0.5.3``. You can download and install it with the following command:
.. code-block:: console
...
...
@@ -66,7 +66,7 @@ If you want to access the wheels for previous commits, you can specify the commi
$ export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
Note that the wheels are built with Python 3.8 abi (see `PEP 425 <https://peps.python.org/pep-0425/>`_ for more details about abi), so **they are compatible with Python 3.8 and later**. The version string in the wheel file name (``1.0.0.dev``) is just a placeholder to have a unified URL for the wheels. The actual versions of wheels are contained in the wheel metadata.
Note that the wheels are built with Python 3.8 ABI (see `PEP 425 <https://peps.python.org/pep-0425/>`_ for more details about ABI), so **they are compatible with Python 3.8 and later**. The version string in the wheel file name (``1.0.0.dev``) is just a placeholder to have a unified URL for the wheels. The actual versions of wheels are contained in the wheel metadata.
Another way to access the latest code is to use the docker images:
...
...
@@ -77,17 +77,17 @@ Another way to access the latest code is to use the docker images:
These docker images are used for CI and testing only, and they are not intended for production use. They will be expired after several days.
Latest code can contain bugs and may not be stable. Please use it with caution.
The latest code can contain bugs and may not be stable. Please use it with caution.
.. _build_from_source:
Build from source
==================
=================
.. _python-only-build:
Python-only build (without compilation)
----------------------------------------
---------------------------------------
If you only need to change Python code, you can simply build vLLM without compilation.
...
...
@@ -116,28 +116,28 @@ The script will:
Now, you can edit the Python code in the current directory, and the changes will be reflected when you run vLLM.
Once you have finished editing or want to install another vLLM wheel, you should exit the development environment using `the same script <https://github.com/vllm-project/vllm/blob/main/python_only_dev.py>`_ with the ``--quit-dev``(or ``-q`` for short) flag:
Once you have finished editing or want to install another vLLM wheel, you should exit the development environment using `the same script <https://github.com/vllm-project/vllm/blob/main/python_only_dev.py>`_ with the ``--quit-dev``(or ``-q`` for short) flag:
.. code-block:: console
$ python python_only_dev.py --quit-dev
The script with ``--quit-dev`` flag will:
The ``--quit-dev`` flag will:
* Remove the symbolic link from the current directory to the vLLM package.
* Restore the original vLLM package from the backup.
If you update the vLLM wheel and want to rebuild from the source and make further edits, you will need to start `all above <#python-only-build>`_ over again.
If you update the vLLM wheel and rebuild from the source to make further edits, you will need to repeat the `Python-only build <#python-only-build>`_ steps again.
.. note::
There is a possibility that your source code may have a different commit ID compared to the latest vLLM wheel, which could potentially lead to unknown errors.
It is recommended to use the same commit ID for the source code as the vLLM wheel you have installed. Please refer to `the above section <#install-the-latest-code>`_ for instructions on how to install a specified wheel.
It is recommended to use the same commit ID for the source code as the vLLM wheel you have installed. Please refer to `the section above <#install-the-latest-code>`_ for instructions on how to install a specified wheel.
Full build (with compilation)
---------------------------------
-----------------------------
If you want to modify C++ or CUDA code, you'll need to build vLLM from source. This can take several minutes:
If you want to modify C++ or CUDA code, you'll need to build vLLM from source. This can take several minutes:
.. code-block:: console
...
...
@@ -153,7 +153,7 @@ If you want to modify C++ or CUDA code, you'll need to build vLLM from source. T
Use an existing PyTorch installation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There are scenarios where the PyTorch dependency cannot be easily installed via pip, e.g.:
* Building vLLM with PyTorch nightly or a custom PyTorch build.
...
...
@@ -171,7 +171,7 @@ To build vLLM using an existing PyTorch installation:
Troubleshooting
~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~
To avoid your system being overloaded, you can limit the number of compilation jobs
to be run simultaneously, via the environment variable ``MAX_JOBS``. For example:
...
...
@@ -207,7 +207,7 @@ Here is a sanity check to verify that the CUDA Toolkit is correctly installed:
Unsupported OS build
----------------------
--------------------
vLLM can fully run only on Linux but for development purposes, you can still build it on other systems (for example, macOS), allowing for imports and a more convenient development environment. The binaries will not be compiled and won't work on non-Linux systems.
#. :ref:`Verify That vLLM Servers Are Ready <nginxloadbalancer_nginx_verify_nginx>`
.. _nginxloadbalancer_nginx_build:
Build Nginx Container
---------------------
This guide assumes that you have just cloned the vLLM project and you're currently in the vllm root directory.
.. code-block:: console
export vllm_root=`pwd`
Create a file named ``Dockerfile.nginx``:
.. code-block:: console
FROM nginx:latest
RUN rm /etc/nginx/conf.d/default.conf
EXPOSE 80
CMD ["nginx", "-g", "daemon off;"]
Build the container:
.. code-block:: console
docker build . -f Dockerfile.nginx --tag nginx-lb
.. _nginxloadbalancer_nginx_conf:
Create Simple Nginx Config file
-------------------------------
Create a file named ``nginx_conf/nginx.conf``. Note that you can add as many servers as you'd like. In the below example we'll start with two. To add more, add another ``server vllmN:8000 max_fails=3 fail_timeout=10000s;`` entry to ``upstream backend``.
.. code-block:: console
upstream backend {
least_conn;
server vllm0:8000 max_fails=3 fail_timeout=10000s;
server vllm1:8000 max_fails=3 fail_timeout=10000s;
* If you have your HuggingFace models cached somewhere else, update ``hf_cache_dir`` below.
* If you don't have an existing HuggingFace cache you will want to start ``vllm0`` and wait for the model to complete downloading and the server to be ready. This will ensure that ``vllm1`` can leverage the model you just downloaded and it won't have to be downloaded again.
* The below example assumes GPU backend used. If you are using CPU backend, remove ``--gpus all``, add ``VLLM_CPU_KVCACHE_SPACE`` and ``VLLM_CPU_OMP_THREADS_BIND`` environment variables to the docker run command.
* Adjust the model name that you want to use in your vLLM servers if you don't want to use ``Llama-2-7b-chat-hf``.
.. code-block:: console
mkdir -p ~/.cache/huggingface/hub/
hf_cache_dir=~/.cache/huggingface/
docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8081:8000 --name vllm0 vllm --model meta-llama/Llama-2-7b-chat-hf
docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf
.. note::
If you are behind proxy, you can pass the proxy settings to the docker run command via ``-e http_proxy=$http_proxy -e https_proxy=$https_proxy``.
@@ -157,7 +157,7 @@ vLLM will use guided decoding to ensure the response matches the tool parameter
To enable this feature, you should set the following flags:
*`--enable-auto-tool-choice` -- **mandatory** Auto tool choice. tells vLLM that you want to enable the model to generate its own tool calls when it
deems appropriate.
*`--tool-call-parser` -- select the tool parser to use - currently either `hermes` or `mistral` or `llama3_json` or `internlm`. Additional tool parsers
*`--tool-call-parser` -- select the tool parser to use (listed below). Additional tool parsers
will continue to be added in the future, and also can register your own tool parsers in the `--tool-parser-plugin`.
*`--tool-parser-plugin` -- **optional** tool parser plugin used to register user defined tool parsers into vllm, the registered tool parser name can be specified in `--tool-call-parser`.
*`--chat-template` -- **optional** for auto tool choice. the path to the chat template which handles `tool`-role messages and `assistant`-role messages
...
...
@@ -168,7 +168,7 @@ from HuggingFace; and you can find an example of this in a `tokenizer_config.jso
If your favorite tool-calling model is not supported, please feel free to contribute a parser & tool use chat template!
#### Hermes Models
#### Hermes Models (`hermes`)
All Nous Research Hermes-series models newer than Hermes 2 Pro should be supported.
*`NousResearch/Hermes-2-Pro-*`
*`NousResearch/Hermes-2-Theta-*`
...
...
@@ -180,7 +180,7 @@ step in their creation_.
Flags: `--tool-call-parser hermes`
#### Mistral Models
#### Mistral Models (`mistral`)
Supported models:
*`mistralai/Mistral-7B-Instruct-v0.3` (confirmed)
* Additional mistral function-calling models are compatible as well.
...
...
@@ -199,7 +199,7 @@ when tools are provided, that results in much better reliability when working wi
@@ -9,4 +9,7 @@ shorter Pod startup times and CPU memory usage. Tensor encryption is also suppor
For more information on CoreWeave's Tensorizer, please refer to
`CoreWeave's Tensorizer documentation <https://github.com/coreweave/tensorizer>`_. For more information on serializing a vLLM model, as well a general usage guide to using Tensorizer with vLLM, see
the `vLLM example script <https://docs.vllm.ai/en/stable/getting_started/examples/tensorize_vllm_model.html>`_.
\ No newline at end of file
the `vLLM example script <https://docs.vllm.ai/en/stable/getting_started/examples/tensorize_vllm_model.html>`_.
.. note::
Note that to use this feature you will need to install `tensorizer` by running `pip install vllm[tensorizer]`.