<!-- markdownlint-disable MD041 -->
--8<-- [start:installation]

vLLM supports basic model inference and serving on the x86 CPU platform with the FP32, FP16, and BF16 data types.

--8<-- [end:installation]
--8<-- [start:requirements]

- OS: Linux
- CPU flags: `avx512f` (Recommended), `avx2` (Limited features)

!!! tip
    Use `lscpu` to check the CPU flags.
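
The flag check can also be scripted. The following is a minimal sketch, not part of vLLM: the helper name `cpu_support_level` is hypothetical, and it assumes a Linux system where `/proc/cpuinfo` exposes a `flags` line.

```bash
# Hypothetical helper (not a vLLM command): classify a space-separated
# CPU flag list against the requirements listed above.
cpu_support_level() {
    local flags=" $1 "
    case "$flags" in
        *" avx512f "*) echo "avx512f (recommended)" ;;
        *" avx2 "*)    echo "avx2 (limited features)" ;;
        *)             echo "unsupported" ;;
    esac
}

# Classify the first CPU listed in /proc/cpuinfo (assumes Linux).
if [ -r /proc/cpuinfo ]; then
    cpu_support_level "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
fi
```

Padding the flag list with spaces makes the match whole-word, so a flag such as `avx512f` is not confused with longer flag names that merely contain it.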

--8<-- [end:requirements]
--8<-- [start:set-up-using-python]

--8<-- [end:set-up-using-python]
--8<-- [start:pre-built-wheels]

Pre-built vLLM wheels for x86 with AVX512/AVX2 have been available since version 0.17.0. To install the release wheels:

```bash
export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//')

# use uv
uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl --torch-backend cpu
```
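To pin a specific release instead of the latest one, the wheel URL can be assembled from a version string. This is a sketch under the assumption that the pinned release uses the same wheel file name pattern as the command above; `vllm_cpu_wheel_url` is a hypothetical helper, not a vLLM tool.

```bash
# Hypothetical helper: build the release wheel URL for a given vLLM version.
# Assumes the file name pattern shown in the install command above.
vllm_cpu_wheel_url() {
    local version="${1#v}"   # accept "0.17.0" or "v0.17.0"
    echo "https://github.com/vllm-project/vllm/releases/download/v${version}/vllm-${version}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl"
}

vllm_cpu_wheel_url "0.17.0"
```

The result can then be passed to the installer, for example `uv pip install "$(vllm_cpu_wheel_url 0.17.0)" --torch-backend cpu`.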

??? console "pip"
    ```bash
    # use pip
    pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cpu
    ```

!!! warning "set `LD_PRELOAD`"
    Before using vLLM CPU installed from wheels, make sure TCMalloc and Intel OpenMP are installed and added to `LD_PRELOAD`:
    ```bash
    # install TCMalloc, Intel OpenMP is installed with vLLM CPU
    sudo apt-get install -y --no-install-recommends libtcmalloc-minimal4

    # manually find the path
    sudo find / -iname '*libtcmalloc_minimal.so.4'
    sudo find / -iname '*libiomp5.so'
    TC_PATH=...
    IOMP_PATH=...

    # add them to LD_PRELOAD
    export LD_PRELOAD="$TC_PATH:$IOMP_PATH:$LD_PRELOAD"
    ```
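
The manual lookup in the warning above can be wrapped in small helpers. This is only a sketch: `find_first_lib` and `preload_tcmalloc_iomp` are hypothetical names, the search roots are left to the caller because library locations vary by system, and `-quit` assumes GNU `find`.

```bash
# Hypothetical helper: print the first file whose name matches a pattern
# under the given search roots (relies on GNU find's -quit).
find_first_lib() {
    local pattern="$1"; shift
    find "$@" -iname "$pattern" -print -quit 2>/dev/null
}

# Hypothetical helper: prepend both libraries to LD_PRELOAD, and leave
# the environment untouched if either file is missing.
preload_tcmalloc_iomp() {
    local tc="$1" iomp="$2"
    if [ ! -f "$tc" ] || [ ! -f "$iomp" ]; then
        echo "library not found: tc='$tc' iomp='$iomp'" >&2
        return 1
    fi
    export LD_PRELOAD="$tc:$iomp${LD_PRELOAD:+:$LD_PRELOAD}"
}
```

A possible call sequence is `TC_PATH="$(find_first_lib '*libtcmalloc_minimal.so.4' /usr/lib)"`, then the same for `IOMP_PATH` with the root of your Python environment, then `preload_tcmalloc_iomp "$TC_PATH" "$IOMP_PATH"`. The search roots here are assumptions; adjust them to where the libraries live on your machine.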

#### Install the latest code

To install the wheel built from the latest main branch:

```bash
uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly/cpu --index-strategy first-index --torch-backend cpu
```

#### Install specific revisions

To install wheels built from earlier commits (e.g. to bisect a behavior change or a performance regression), specify the commit hash in the URL:

```bash
export VLLM_COMMIT=730bd35378bf2a5b56b6d3a45be28b3092d26519 # use full commit hash from the main branch
uv pip install vllm --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT}/cpu --index-strategy first-index --torch-backend cpu
```
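Because the URL expects a full commit hash, it can help to validate the hash before installing. The sketch below assumes a 40-character lowercase hexadecimal SHA-1, like the example above; `vllm_commit_index_url` is a hypothetical helper, not part of vLLM.

```bash
# Hypothetical helper: print the wheel index URL for a full commit hash,
# or fail if the argument is not 40 lowercase hexadecimal characters.
vllm_commit_index_url() {
    local commit="$1"
    if ! printf '%s' "$commit" | grep -Eq '^[0-9a-f]{40}$'; then
        echo "expected a full 40-character commit hash, got: '$commit'" >&2
        return 1
    fi
    echo "https://wheels.vllm.ai/${commit}/cpu"
}

vllm_commit_index_url 730bd35378bf2a5b56b6d3a45be28b3092d26519
```

An abbreviated hash such as `730bd35` is rejected with an error on stderr, which surfaces the mistake before the installer fails with a less obvious message.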

--8<-- [end:pre-built-wheels]
--8<-- [start:build-wheel-from-source]

Install the recommended compiler. We recommend using `gcc/g++ >= 12.3.0` as the default compiler to avoid potential problems. For example, on Ubuntu 22.04, you can run:

```bash
sudo apt-get update -y
sudo apt-get install -y gcc-12 g++-12 libnuma-dev
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
```
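To confirm that the default compiler meets the recommendation, the versions can be compared in the shell. This is a minimal sketch: `version_ge` is a hypothetical helper, and it assumes GNU `sort` with the `-V` (version sort) option.

```bash
# Hypothetical helper: succeed if version $1 >= version $2.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Passing both flags prints the full version on old and new GCC releases.
if command -v gcc >/dev/null 2>&1; then
    gcc_version="$(gcc -dumpfullversion -dumpversion)"
    if version_ge "$gcc_version" "12.3.0"; then
        echo "gcc $gcc_version meets the recommendation"
    else
        echo "gcc $gcc_version is older than the recommended 12.3.0"
    fi
fi
```

Version sort is used instead of plain string comparison so that a release such as `12.10.0` is correctly ordered after `12.3.0`.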

--8<-- "docs/getting_started/installation/python_env_setup.inc.md"

Clone the vLLM project:

```bash
git clone https://github.com/vllm-project/vllm.git vllm_source
cd vllm_source
```

Install the required dependencies:

```bash
uv pip install -r requirements/cpu-build.txt --torch-backend cpu
uv pip install -r requirements/cpu.txt --torch-backend cpu
```

??? console "pip"
    ```bash
    pip install --upgrade pip
    pip install -v -r requirements/cpu-build.txt --extra-index-url https://download.pytorch.org/whl/cpu
    pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
    ```

Build and install vLLM:

```bash
VLLM_TARGET_DEVICE=cpu uv pip install . --no-build-isolation
```

If you want to develop vLLM, install it in editable mode instead.

```bash
VLLM_TARGET_DEVICE=cpu python3 setup.py develop
```

Optionally, build a portable wheel which you can then install elsewhere:

```bash
VLLM_TARGET_DEVICE=cpu uv build --wheel --no-build-isolation
```

```bash
uv pip install dist/*.whl
```

??? console "pip"
    ```bash
    VLLM_TARGET_DEVICE=cpu python -m build --wheel --no-isolation
    ```

    ```bash
    pip install dist/*.whl
    ```

!!! warning "set `LD_PRELOAD`"
    Before using vLLM CPU installed from wheels, make sure TCMalloc and Intel OpenMP are installed and added to `LD_PRELOAD`:
    ```bash
    # install TCMalloc, Intel OpenMP is installed with vLLM CPU
    sudo apt-get install -y --no-install-recommends libtcmalloc-minimal4

    # manually find the path
    sudo find / -iname '*libtcmalloc_minimal.so.4'
    sudo find / -iname '*libiomp5.so'
    TC_PATH=...
    IOMP_PATH=...

    # add them to LD_PRELOAD
    export LD_PRELOAD="$TC_PATH:$IOMP_PATH:$LD_PRELOAD"
    ```

!!! example "Troubleshooting"
    - **NumPy ≥2.0 error**: Downgrade using `pip install "numpy<2.0"`.
    - **CMake picks up CUDA**: Add `CMAKE_DISABLE_FIND_PACKAGE_CUDA=ON` to prevent CUDA detection during CPU builds, even if CUDA is installed.
    - **AMD CPUs**: 4th-generation processors (Zen 4/Genoa) or newer are required for [AVX512](https://www.phoronix.com/review/amd-zen4-avx512) support when running vLLM on CPU.
    - If you receive an error such as: `Could not find a version that satisfies the requirement torch==X.Y.Z+cpu+cpu`, consider updating [pyproject.toml](https://github.com/vllm-project/vllm/blob/main/pyproject.toml) to help pip resolve the dependency.
    ```toml title="pyproject.toml"
    [build-system]
    requires = [
      "cmake>=3.26.1",
      ...
      "torch==X.Y.Z+cpu"   # <-------
    ]
    ```

--8<-- [end:build-wheel-from-source]
--8<-- [start:pre-built-images]

You can pull the latest available CPU image from Docker Hub:

```bash
docker pull vllm/vllm-openai-cpu:latest-x86_64
```

To pull an image for a specific vLLM version:

```bash
export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//')
docker pull vllm/vllm-openai-cpu:v${VLLM_VERSION}-x86_64
```

All available image tags are here: [https://hub.docker.com/r/vllm/vllm-openai-cpu/tags](https://hub.docker.com/r/vllm/vllm-openai-cpu/tags)

You can run these images via:

```bash
docker run \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --env "HF_TOKEN=<secret>" \
    vllm/vllm-openai-cpu:latest-x86_64 <args...>
```

--8<-- [end:pre-built-images]
--8<-- [start:build-image-from-source]

#### Building for your target CPU

```bash
# VLLM_CPU_X86 is for cross-compilation: <false (default)|true>
docker build -f docker/Dockerfile.cpu \
        --build-arg VLLM_CPU_X86=<false (default)|true> \
        --tag vllm-cpu-env \
        --target vllm-openai .
```

#### Launching the OpenAI server

```bash
docker run --rm \
            --security-opt seccomp=unconfined \
            --cap-add SYS_NICE \
            --shm-size=4g \
            -p 8000:8000 \
            -e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
            vllm-cpu-env \
            meta-llama/Llama-3.2-1B-Instruct \
            --dtype=bfloat16 \
            other vLLM OpenAI server arguments
```

--8<-- [end:build-image-from-source]
--8<-- [start:extra-information]
--8<-- [end:extra-information]