<!-- markdownlint-disable MD041 -->
--8<-- [start:installation]

vLLM supports basic model inference and serving on the x86 CPU platform, with data types FP32, FP16, and BF16.

--8<-- [end:installation]
--8<-- [start:requirements]

- OS: Linux
- CPU flags: `avx512f` (Recommended), `avx512_bf16` (Optional), `avx512_vnni` (Optional)

!!! tip
    Use `lscpu` to check the CPU flags.
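The check in the tip can also be scripted. The following is a minimal sketch, assuming a Linux system that exposes `/proc/cpuinfo`; `has_cpu_flag` is a hypothetical helper name and is not part of vLLM:

```bash
# Hypothetical helper (not part of vLLM): succeed if a CPU flag is listed
# in a cpuinfo-style file.
# Usage: has_cpu_flag <flag> [cpuinfo_file]
has_cpu_flag() {
    local flag="$1"
    local cpuinfo="${2:-/proc/cpuinfo}"
    # -w matches whole words only, so a flag is never matched as a
    # substring of a longer flag name.
    grep -m1 '^flags' "$cpuinfo" 2>/dev/null | grep -qw -- "$flag"
}

# Report the three flags named in the requirements list.
for flag in avx512f avx512_bf16 avx512_vnni; do
    if has_cpu_flag "$flag"; then
        echo "$flag: supported"
    else
        echo "$flag: not found"
    fi
done
```

On systems without `/proc/cpuinfo` the sketch reports every flag as not found, so treat a negative result there as inconclusive.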

--8<-- [end:requirements]
--8<-- [start:set-up-using-python]

--8<-- [end:set-up-using-python]
--8<-- [start:pre-built-wheels]

Pre-built vLLM wheels for x86 with AVX512 have been available since version 0.13.0. To install the latest release wheel:

```bash
export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//')

# use uv
uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl --torch-backend cpu
```

??? console "pip"
    ```bash
    # use pip
    pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cpu
    ```
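To pin a specific release instead of the latest one, the wheel URL can be assembled by hand. The sketch below only reproduces the URL pattern used in the commands above; `vllm_cpu_wheel_url` is a hypothetical helper name, and it does not check that a wheel was actually published for the given version:

```bash
# Hypothetical helper (not part of vLLM): print the release wheel URL for a
# tag or version, following the URL pattern of the install commands.
vllm_cpu_wheel_url() {
    local version="${1#v}"   # accept either "v0.13.0" or "0.13.0"
    echo "https://github.com/vllm-project/vllm/releases/download/v${version}/vllm-${version}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl"
}

# Example: the first release that ships x86 CPU wheels.
vllm_cpu_wheel_url v0.13.0
```

The printed URL can then be passed to `uv pip install` or `pip install` in place of the URL built from `VLLM_VERSION`.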
!!! warning "set `LD_PRELOAD`"
    Before using vLLM CPU installed via wheels, make sure TCMalloc and Intel OpenMP are installed and added to `LD_PRELOAD`:
    ```bash
    # install TCMalloc, Intel OpenMP is installed with vLLM CPU
    sudo apt-get install -y --no-install-recommends libtcmalloc-minimal4

    # manually find the path
    sudo find / -iname '*libtcmalloc_minimal.so.4'
    sudo find / -iname '*libiomp5.so'
    TC_PATH=...
    IOMP_PATH=...

    # add them to LD_PRELOAD
    export LD_PRELOAD="$TC_PATH:$IOMP_PATH:$LD_PRELOAD"
    ```
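When `LD_PRELOAD` is initially unset, the `export` line above leaves a trailing colon in the value. A small sketch of one way to avoid that follows; `join_preload` is a hypothetical helper and the two library paths are placeholders that must be replaced with the paths found on your system:

```bash
# Hypothetical helper (not part of vLLM): join the non-empty arguments with
# ":" so that an empty or unset LD_PRELOAD adds no stray separator.
join_preload() {
    local result="" p
    for p in "$@"; do
        [ -n "$p" ] || continue
        result="${result:+$result:}$p"
    done
    echo "$result"
}

# Placeholder paths: substitute the results of the find commands.
TC_PATH="/path/to/libtcmalloc_minimal.so.4"
IOMP_PATH="/path/to/libiomp5.so"

# Print the value rather than exporting it, so the sketch has no side effects.
echo "LD_PRELOAD=$(join_preload "$TC_PATH" "$IOMP_PATH" "${LD_PRELOAD:-}")"
```

Once the printed value looks right, export it in the shell that launches vLLM.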

#### Install the latest code

To install the wheel built from the latest main branch:

```bash
uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly/cpu --index-strategy first-index --torch-backend cpu
```

#### Install specific revisions

If you want to access the wheels for previous commits (e.g. to bisect a behavior change or performance regression), you can specify the commit hash in the URL:

```bash
export VLLM_COMMIT=730bd35378bf2a5b56b6d3a45be28b3092d26519 # use full commit hash from the main branch
uv pip install vllm --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT}/cpu --index-strategy first-index --torch-backend cpu
```
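Because the index URL needs the full commit hash, an abbreviated hash is an easy mistake to make. The following sketch validates the value before it is used; `is_full_commit_hash` is a hypothetical helper, and it checks only the format, not whether wheels exist for that commit:

```bash
# Hypothetical helper (not part of vLLM): succeed only for a full
# 40-character lowercase hexadecimal commit hash.
is_full_commit_hash() {
    [ "${#1}" -eq 40 ] || return 1
    case "$1" in
        *[!0-9a-f]*) return 1 ;;
    esac
    return 0
}

VLLM_COMMIT=730bd35378bf2a5b56b6d3a45be28b3092d26519
if is_full_commit_hash "$VLLM_COMMIT"; then
    echo "https://wheels.vllm.ai/${VLLM_COMMIT}/cpu"
else
    echo "expected a full 40-character commit hash" >&2
fi
```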

--8<-- [end:pre-built-wheels]
--8<-- [start:build-wheel-from-source]

Install the recommended compiler. We recommend using `gcc/g++ >= 12.3.0` as the default compiler to avoid potential problems. For example, on Ubuntu 22.04, you can run:

```bash
sudo apt-get update -y
sudo apt-get install -y gcc-12 g++-12 libnuma-dev
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
```
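To confirm that the default compiler meets the recommendation, the version can be compared in a script. This is a minimal sketch, assuming a `sort` that supports `-V` (GNU coreutils does); `version_ge` is a hypothetical helper name:

```bash
# Hypothetical helper (not part of vLLM): succeed if version $1 is greater
# than or equal to version $2, using version-aware sorting.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

required="12.3.0"
# Passing both options prints the full version on old and new GCC releases.
found="$(gcc -dumpfullversion -dumpversion 2>/dev/null || true)"

if [ -z "$found" ]; then
    echo "gcc not found or its version could not be determined"
elif version_ge "$found" "$required"; then
    echo "gcc $found satisfies >= $required"
else
    echo "gcc $found is older than $required"
fi
```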

--8<-- "docs/getting_started/installation/python_env_setup.inc.md"

Clone the vLLM project:

```bash
git clone https://github.com/vllm-project/vllm.git vllm_source
cd vllm_source
```

Install the required dependencies:

```bash
uv pip install -r requirements/cpu-build.txt --torch-backend cpu
uv pip install -r requirements/cpu.txt --torch-backend cpu
```

??? console "pip"
    ```bash
    pip install --upgrade pip
    pip install -v -r requirements/cpu-build.txt --extra-index-url https://download.pytorch.org/whl/cpu
    pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
    ```

Build and install vLLM:

```bash
VLLM_TARGET_DEVICE=cpu uv pip install . --no-build-isolation
```

If you want to develop vLLM, install it in editable mode instead.

```bash
VLLM_TARGET_DEVICE=cpu uv pip install -e . --no-build-isolation
```

Optionally, build a portable wheel which you can then install elsewhere:

```bash
VLLM_TARGET_DEVICE=cpu uv build --wheel
```

```bash
uv pip install dist/*.whl
```

??? console "pip"
    ```bash
    VLLM_TARGET_DEVICE=cpu python -m build --wheel --no-isolation
    ```

    ```bash
    pip install dist/*.whl
    ```

!!! warning "set `LD_PRELOAD`"
    Before using vLLM CPU installed via wheels, make sure TCMalloc and Intel OpenMP are installed and added to `LD_PRELOAD`:
    ```bash
    # install TCMalloc, Intel OpenMP is installed with vLLM CPU
    sudo apt-get install -y --no-install-recommends libtcmalloc-minimal4

    # manually find the path
    sudo find / -iname '*libtcmalloc_minimal.so.4'
    sudo find / -iname '*libiomp5.so'
    TC_PATH=...
    IOMP_PATH=...

    # add them to LD_PRELOAD
    export LD_PRELOAD="$TC_PATH:$IOMP_PATH:$LD_PRELOAD"
    ```

!!! example "Troubleshooting"
    - **NumPy ≥2.0 error**: Downgrade using `pip install "numpy<2.0"`.
    - **CMake picks up CUDA**: Add `CMAKE_DISABLE_FIND_PACKAGE_CUDA=ON` to prevent CUDA detection during CPU builds, even if CUDA is installed.
    - **AMD CPUs**: 4th-gen processors (Zen 4/Genoa) or newer are required for [AVX512](https://www.phoronix.com/review/amd-zen4-avx512) support when running vLLM on CPU.
    - If you receive an error such as: `Could not find a version that satisfies the requirement torch==X.Y.Z+cpu+cpu`, consider updating [pyproject.toml](https://github.com/vllm-project/vllm/blob/main/pyproject.toml) to help pip resolve the dependency.
    ```toml title="pyproject.toml"
    [build-system]
    requires = [
      "cmake>=3.26.1",
      ...
      "torch==X.Y.Z+cpu"   # <-------
    ]
    ```

--8<-- [end:build-wheel-from-source]
--8<-- [start:pre-built-images]

You can pull the latest available CPU image from Docker Hub:

```bash
docker pull vllm/vllm-openai-cpu:latest-x86_64
```

To pull an image for a specific vLLM version:

```bash
export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//')
docker pull vllm/vllm-openai-cpu:v${VLLM_VERSION}-x86_64
```

All available image tags are here: [https://hub.docker.com/r/vllm/vllm-openai-cpu/tags](https://hub.docker.com/r/vllm/vllm-openai-cpu/tags)

You can run these images via:

```bash
docker run \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --env "HF_TOKEN=<secret>" \
    vllm/vllm-openai-cpu:latest-x86_64 <args...>
```

!!! warning
    If deploying the pre-built images on machines without `avx512f`, `avx512_bf16`, or `avx512_vnni` support, an `Illegal instruction` error may be raised. See the build-image-from-source section below for build arguments to match your target CPU capabilities.

--8<-- [end:pre-built-images]
--8<-- [start:build-image-from-source]

#### Building for your target CPU

```bash
docker build -f docker/Dockerfile.cpu \
        --build-arg VLLM_CPU_DISABLE_AVX512=<false (default)|true> \
        --build-arg VLLM_CPU_AVX2=<false (default)|true> \
        --build-arg VLLM_CPU_AVX512=<false (default)|true> \
        --build-arg VLLM_CPU_AVX512BF16=<false (default)|true> \
        --build-arg VLLM_CPU_AVX512VNNI=<false (default)|true> \
        --build-arg VLLM_CPU_AMXBF16=<false|true (default)> \
        --tag vllm-cpu-env \
        --target vllm-openai .
```

!!! note "Auto-detection by default"
    By default, CPU instruction sets (AVX512, AVX2, etc.) are automatically detected from the build system's CPU flags. Build arguments like `VLLM_CPU_AVX2`, `VLLM_CPU_AVX512`, `VLLM_CPU_AVX512BF16`, `VLLM_CPU_AVX512VNNI`, and `VLLM_CPU_AMXBF16` are used for cross-compilation:

    - `VLLM_CPU_{ISA}=true` - Force-enable the instruction set (build with ISA regardless of build system capabilities)
    - `VLLM_CPU_{ISA}=false` - Rely on auto-detection (default)

##### Examples

###### Auto-detection build (default)

```bash
docker build -f docker/Dockerfile.cpu --tag vllm-cpu-env --target vllm-openai .
```

###### Cross-compile for AVX512

```bash
docker build -f docker/Dockerfile.cpu \
        --build-arg VLLM_CPU_AVX512=true \
        --build-arg VLLM_CPU_AVX512BF16=true \
        --build-arg VLLM_CPU_AVX512VNNI=true \
        --tag vllm-cpu-avx512 \
        --target vllm-openai .
```

###### Cross-compile for AVX2

```bash
docker build -f docker/Dockerfile.cpu \
        --build-arg VLLM_CPU_AVX2=true \
        --tag vllm-cpu-avx2 \
        --target vllm-openai .
```
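When cross-compiling, the build arguments can be derived from the CPU flags of the machine that will run the image. The sketch below mirrors the two cross-compile examples: it emits the AVX512 arguments when `avx512f` is present and falls back to AVX2 otherwise. The mapping from CPU flag names to build arguments is an assumption based on their names, `flag_in` and `isa_build_args` are hypothetical helpers, and the output should be reviewed before use:

```bash
# Hypothetical helper: succeed if <flag> appears as a whole word in <flags line>.
flag_in() {
    printf '%s\n' "$2" | grep -qw -- "$1"
}

# Hypothetical helper: print --build-arg options for a cpuinfo-style file
# taken from the target machine. The flag-to-argument mapping is an assumption.
isa_build_args() {
    local flags
    flags="$(grep -m1 '^flags' "$1" 2>/dev/null)"
    if flag_in avx512f "$flags"; then
        echo "--build-arg VLLM_CPU_AVX512=true"
        if flag_in avx512_bf16 "$flags"; then
            echo "--build-arg VLLM_CPU_AVX512BF16=true"
        fi
        if flag_in avx512_vnni "$flags"; then
            echo "--build-arg VLLM_CPU_AVX512VNNI=true"
        fi
    elif flag_in avx2 "$flags"; then
        echo "--build-arg VLLM_CPU_AVX2=true"
    fi
}

# Example: inspect the local machine; pass a copied cpuinfo file instead
# when the target differs from the build host.
isa_build_args /proc/cpuinfo
```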

#### Launching the OpenAI server

```bash
docker run --rm \
            --security-opt seccomp=unconfined \
            --cap-add SYS_NICE \
            --shm-size=4g \
            -p 8000:8000 \
            -e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
            vllm-cpu-env \
            meta-llama/Llama-3.2-1B-Instruct \
            --dtype=bfloat16 \
            other vLLM OpenAI server arguments
```
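Model loading can take a while, so scripts that talk to the container usually wait until the server answers. The following is a minimal sketch, assuming `curl` is installed and that the server exposes a `/health` endpoint on the published port (an assumption not stated in this section); `wait_for_server` is a hypothetical helper:

```bash
# Hypothetical helper (not part of vLLM): poll a URL once per second until it
# returns a successful HTTP status or the attempt budget is exhausted.
# Usage: wait_for_server <url> [attempts]
wait_for_server() {
    local url="$1"
    local attempts="${2:-30}"
    local i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f makes curl fail on HTTP errors; -s silences progress output.
        if curl -sf "$url" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}
```

For example, `wait_for_server http://localhost:8000/health 120 && echo "server is ready"` waits up to about two minutes for the container started above.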

--8<-- [end:build-image-from-source]
--8<-- [start:extra-information]
--8<-- [end:extra-information]