---
toc_depth: 3
---

# CPU

vLLM is a Python library that supports the following CPU variants. Select your CPU type to see vendor-specific instructions:

=== "Intel/AMD x86"

    --8<-- "docs/getting_started/installation/cpu.x86.inc.md:installation"

=== "ARM AArch64"

    --8<-- "docs/getting_started/installation/cpu.arm.inc.md:installation"

=== "Apple silicon"

    --8<-- "docs/getting_started/installation/cpu.apple.inc.md:installation"

=== "IBM Z (S390X)"

    --8<-- "docs/getting_started/installation/cpu.s390x.inc.md:installation"

## Technical Discussions

The main discussions happen in the `#sig-cpu` channel of [vLLM Slack](https://slack.vllm.ai/).

When opening a GitHub issue about the CPU backend, please add `[CPU Backend]` to the title so that it is labeled `cpu` for better visibility.

## Requirements

- Python: 3.10 -- 3.13

=== "Intel/AMD x86"

    --8<-- "docs/getting_started/installation/cpu.x86.inc.md:requirements"

=== "ARM AArch64"

    --8<-- "docs/getting_started/installation/cpu.arm.inc.md:requirements"

=== "Apple silicon"

    --8<-- "docs/getting_started/installation/cpu.apple.inc.md:requirements"

=== "IBM Z (S390X)"

    --8<-- "docs/getting_started/installation/cpu.s390x.inc.md:requirements"

## Set up using Python

### Create a new Python environment

--8<-- "docs/getting_started/installation/python_env_setup.inc.md"

### Pre-built wheels

When specifying the index URL, please make sure to use the `cpu` variant subdirectory.
For example, the nightly build index is: `https://wheels.vllm.ai/nightly/cpu/`.

=== "Intel/AMD x86"

    --8<-- "docs/getting_started/installation/cpu.x86.inc.md:pre-built-wheels"

=== "ARM AArch64"

    --8<-- "docs/getting_started/installation/cpu.arm.inc.md:pre-built-wheels"

=== "Apple silicon"

    --8<-- "docs/getting_started/installation/cpu.apple.inc.md:pre-built-wheels"

=== "IBM Z (S390X)"

    --8<-- "docs/getting_started/installation/cpu.s390x.inc.md:pre-built-wheels"

### Build wheel from source

#### Set up using Python-only build (without compilation) {#python-only-build}

This method requires [pre-built wheels](#pre-built-wheels) for your platform.

Please refer to the instructions for [Python-only build on GPU](./gpu.md#python-only-build), and replace the build commands with:

```bash
VLLM_USE_PRECOMPILED=1 VLLM_PRECOMPILED_WHEEL_VARIANT=cpu VLLM_TARGET_DEVICE=cpu uv pip install --editable .
```

#### Full build (with compilation) {#full-build}

=== "Intel/AMD x86"

    --8<-- "docs/getting_started/installation/cpu.x86.inc.md:build-wheel-from-source"

=== "ARM AArch64"

    --8<-- "docs/getting_started/installation/cpu.arm.inc.md:build-wheel-from-source"

=== "Apple silicon"

    --8<-- "docs/getting_started/installation/cpu.apple.inc.md:build-wheel-from-source"

=== "IBM Z (S390X)"

    --8<-- "docs/getting_started/installation/cpu.s390x.inc.md:build-wheel-from-source"

## Set up using Docker

### Pre-built images

=== "Intel/AMD x86"

    --8<-- "docs/getting_started/installation/cpu.x86.inc.md:pre-built-images"

=== "ARM AArch64"

    --8<-- "docs/getting_started/installation/cpu.arm.inc.md:pre-built-images"

=== "Apple silicon"

    --8<-- "docs/getting_started/installation/cpu.apple.inc.md:pre-built-images"

=== "IBM Z (S390X)"

    --8<-- "docs/getting_started/installation/cpu.s390x.inc.md:pre-built-images"

### Build image from source

=== "Intel/AMD x86"

    --8<-- "docs/getting_started/installation/cpu.x86.inc.md:build-image-from-source"

=== "ARM AArch64"

    --8<-- "docs/getting_started/installation/cpu.arm.inc.md:build-image-from-source"

=== "Apple silicon"

    --8<-- "docs/getting_started/installation/cpu.apple.inc.md:build-image-from-source"

=== "IBM Z (S390X)"

    --8<-- "docs/getting_started/installation/cpu.s390x.inc.md:build-image-from-source"

## Related runtime environment variables

- `VLLM_CPU_KVCACHE_SPACE`: specifies the KV cache size in GiB (e.g., `VLLM_CPU_KVCACHE_SPACE=40` reserves 40 GiB for the KV cache). A larger value allows vLLM to run more requests in parallel. Set this parameter based on the hardware configuration and the users' memory management pattern. Default value is `0`.
- `VLLM_CPU_OMP_THREADS_BIND`: specifies the CPU cores dedicated to the OpenMP threads. It can be set to a list of CPU IDs, `auto` (the default), or `nobind` (to disable binding to individual CPU cores and inherit user-defined OpenMP variables). For example, `VLLM_CPU_OMP_THREADS_BIND=0-31` means 32 OpenMP threads are bound to CPU cores 0-31. `VLLM_CPU_OMP_THREADS_BIND=0-31|32-63` means there are 2 tensor parallel processes: the 32 OpenMP threads of rank 0 are bound to CPU cores 0-31, and the OpenMP threads of rank 1 are bound to CPU cores 32-63. With `auto`, the OpenMP threads of each rank are bound to the CPU cores of that rank's NUMA node. With `nobind`, the number of OpenMP threads is determined by the standard `OMP_NUM_THREADS` environment variable.
- `VLLM_CPU_NUM_OF_RESERVED_CPU`: specifies the number of CPU cores per rank that are not dedicated to the OpenMP threads. The variable only takes effect when `VLLM_CPU_OMP_THREADS_BIND` is set to `auto`. Default value is `None`. If the value is not set and `auto` thread binding is used, no CPU is reserved when `world_size == 1`, and 1 CPU per rank is reserved when `world_size > 1`.
- `CPU_VISIBLE_MEMORY_NODES`: specifies the NUMA memory nodes visible to vLLM CPU workers, similar to `CUDA_VISIBLE_DEVICES`. The variable only takes effect when `VLLM_CPU_OMP_THREADS_BIND` is set to `auto`. It provides more control over the auto thread-binding feature, such as masking nodes and changing the node binding sequence.
- `VLLM_CPU_SGL_KERNEL` (x86 only, experimental): whether to use small-batch optimized kernels for linear and MoE layers, especially for low-latency scenarios such as online serving. The kernels require the AMX instruction set, the BFloat16 weight type, and weight shapes divisible by 32. Default is `0` (False).

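As an illustration of the `|`-separated format described for `VLLM_CPU_OMP_THREADS_BIND`, the following is a minimal sketch that builds a binding string for several ranks. The helper name `build_omp_bind` is hypothetical and not part of vLLM, and the sketch assumes each rank owns a block of consecutive CPU IDs starting at 0, which may not match your machine's topology (check with `lscpu -e`).

```bash
# Hypothetical helper (not part of vLLM): build a VLLM_CPU_OMP_THREADS_BIND
# value for `ranks` ranks, each owning `cores` consecutive CPU IDs.
# Assumption: CPU IDs are numbered contiguously per rank, starting at 0.
build_omp_bind() (
    cores=$1
    ranks=$2
    bind=""
    rank=0
    while [ "$rank" -lt "$ranks" ]; do
        first=$((rank * cores))
        last=$((first + cores - 1))
        # Ranks are separated by "|", as in the `0-31|32-63` example.
        bind="${bind:+$bind|}${first}-${last}"
        rank=$((rank + 1))
    done
    echo "$bind"
)

# Two tensor parallel ranks with 32 cores each.
export VLLM_CPU_OMP_THREADS_BIND="$(build_omp_bind 32 2)"
echo "$VLLM_CPU_OMP_THREADS_BIND"  # prints 0-31|32-63
```
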
## FAQ

### Which `dtype` should be used?

- Currently, vLLM CPU uses the model's default setting as `dtype`. However, because float16 support in torch CPU is unstable, it is recommended to explicitly set `dtype=bfloat16` if you encounter any performance or accuracy problems.

### How to launch a vLLM service on CPU?

- For online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserve CPU 31 for the framework and use CPUs 0-30 for inference threads:

```bash
export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_OMP_THREADS_BIND=0-30
vllm serve facebook/opt-125m --dtype=bfloat16
```

Or, using the default `auto` thread binding:

```bash
export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_NUM_OF_RESERVED_CPU=1
vllm serve facebook/opt-125m --dtype=bfloat16
```

Note: it is recommended to manually reserve 1 CPU for the vLLM front-end process when `world_size == 1`.

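The manual core range in the first example can be derived from the core count rather than typed by hand. The sketch below is only an illustration: `inference_core_range` is a hypothetical helper, and it assumes the reserved cores are the highest-numbered ones, as in the 32-core example.

```bash
# Hypothetical helper (not part of vLLM): print the core range left for
# inference threads after reserving the last `reserved` cores for the
# serving framework. Assumption: the reserved cores are the highest IDs.
inference_core_range() (
    total=$1
    reserved=$2
    echo "0-$((total - reserved - 1))"
)

# 32 physical cores with 1 reserved: inference threads use cores 0-30.
export VLLM_CPU_OMP_THREADS_BIND="$(inference_core_range 32 1)"
echo "$VLLM_CPU_OMP_THREADS_BIND"  # prints 0-30
```
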
### What are supported models on CPU?

For the full and up-to-date list of models validated on CPU platforms, please see the official documentation: [Supported Models on CPU](../../models/hardware_supported_models/cpu.md).

### How to find benchmark configuration examples for supported CPU models?

For any model listed under [Supported Models on CPU](../../models/hardware_supported_models/cpu.md), optimized runtime configurations are provided in the vLLM Benchmark Suite's CPU test cases, defined in `serving-tests-cpu.json`. Full test cases for text-only, multi-modal, and embedding models are defined in `serving-tests-cpu-text.json`, `serving-tests-cpu-multimodal.json`, and `serving-tests-cpu-embed.json`, respectively.

For details on how these optimized configurations are determined, see: [performance-benchmark-details](../../../.buildkite/performance-benchmarks/README.md#performance-benchmark-details).
To benchmark the supported models using these optimized settings, follow the steps in [running vLLM Benchmark Suite manually](../../benchmarking/dashboard.md#manually-trigger-the-benchmark) and run the Benchmark Suite on a CPU environment.

Below is an example command to benchmark all CPU-supported models using optimized configurations.

```bash
ON_CPU=1 bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh
```

The benchmark results will be saved in `./benchmark/results/`.
In the directory, the generated `.commands` files contain all example commands for the benchmark.

We recommend configuring `tensor-parallel-size` to match the number of NUMA nodes on your system. Note that the current release does not support `tensor-parallel-size=6`.
To determine the number of NUMA nodes available, use the following command:

```bash
lscpu | grep "NUMA node(s):" | awk '{print $3}'
```
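To feed that number directly into a launch command, the count can be captured in a variable. The following is a rough sketch: `numa_node_count` is a hypothetical helper that parses `lscpu` output from standard input, and the sample input line stands in for real `lscpu` output.

```bash
# Hypothetical helper (not part of vLLM): extract the NUMA node count
# from `lscpu` output supplied on stdin.
numa_node_count() {
    awk -F: '/^NUMA node\(s\)/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Example with a captured output line; on a real host use:
#   tp_size="$(lscpu | numa_node_count)"
tp_size="$(printf 'NUMA node(s):       2\n' | numa_node_count)"
echo "--tensor-parallel-size=${tp_size}"  # prints --tensor-parallel-size=2
```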

For performance reference, users may also consult the [vLLM Performance Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm&deviceName=cpu), which publishes default-model CPU results produced using the same Benchmark Suite.

#### Dry-Run

For users who only need the optimized runtime configurations without running the benchmark, a dry-run mode is provided.
When the environment variable `DRY_RUN=1` is passed to `run-performance-benchmarks.sh`,
all commands are generated under `./benchmark/results/`.

```bash
ON_CPU=1 DRY_RUN=1 bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh
```

By providing a different JSON file, users can get runtime configurations for other model types, such as embedding models.

```bash
ON_CPU=1 SERVING_JSON=serving-tests-cpu-embed.json DRY_RUN=1 bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh
```

When `MODEL_FILTER` and `DTYPE_FILTER` are provided, only commands for the matching model ID and data type are generated.

```bash
ON_CPU=1 SERVING_JSON=serving-tests-cpu-text.json DRY_RUN=1 MODEL_FILTER=meta-llama/Llama-3.1-8B-Instruct DTYPE_FILTER=bfloat16 bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh
```
```
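After a dry run, the generated commands can be inspected with standard tools. The sketch below assumes only what is stated above: that the files end in `.commands` and live somewhere under `./benchmark/results/`. The helper name `list_command_files` is hypothetical.

```bash
# Hypothetical helper: list every generated .commands file under a results
# directory (defaults to ./benchmark/results), sorted by path.
list_command_files() {
    find "${1:-./benchmark/results}" -type f -name '*.commands' | sort
}

# Print each file name followed by its contents, if the directory exists.
if [ -d ./benchmark/results ]; then
    list_command_files | while IFS= read -r f; do
        echo "== $f"
        cat "$f"
    done
fi
```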

### How to decide `VLLM_CPU_OMP_THREADS_BIND`?

- The default `auto` thread binding is recommended for most cases. Ideally, each OpenMP thread is bound to a dedicated physical core, the threads of each rank are bound to the same NUMA node, and 1 CPU per rank is reserved for other vLLM components when `world_size > 1`. If you encounter performance problems or unexpected binding behavior, try binding the threads manually as follows.

- On a hyper-threading enabled platform with 16 logical CPU cores / 8 physical CPU cores:

??? console "Commands"

    ```console
    $ lscpu -e # check the mapping between logical CPU cores and physical CPU cores

    # The "CPU" column means the logical CPU core IDs, and the "CORE" column means the physical core IDs. On this platform, two logical cores are sharing one physical core.
    CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ   MINMHZ      MHZ
    0    0      0    0 0:0:0:0          yes 2401.0000 800.0000  800.000
    1    0      0    1 1:1:1:0          yes 2401.0000 800.0000  800.000
    2    0      0    2 2:2:2:0          yes 2401.0000 800.0000  800.000
    3    0      0    3 3:3:3:0          yes 2401.0000 800.0000  800.000
    4    0      0    4 4:4:4:0          yes 2401.0000 800.0000  800.000
    5    0      0    5 5:5:5:0          yes 2401.0000 800.0000  800.000
    6    0      0    6 6:6:6:0          yes 2401.0000 800.0000  800.000
    7    0      0    7 7:7:7:0          yes 2401.0000 800.0000  800.000
    8    0      0    0 0:0:0:0          yes 2401.0000 800.0000  800.000
    9    0      0    1 1:1:1:0          yes 2401.0000 800.0000  800.000
    10   0      0    2 2:2:2:0          yes 2401.0000 800.0000  800.000
    11   0      0    3 3:3:3:0          yes 2401.0000 800.0000  800.000
    12   0      0    4 4:4:4:0          yes 2401.0000 800.0000  800.000
    13   0      0    5 5:5:5:0          yes 2401.0000 800.0000  800.000
    14   0      0    6 6:6:6:0          yes 2401.0000 800.0000  800.000
    15   0      0    7 7:7:7:0          yes 2401.0000 800.0000  800.000

    # On this platform, it is recommended to bind OpenMP threads only to logical CPU cores 0-7 or 8-15
    $ export VLLM_CPU_OMP_THREADS_BIND=0-7
    $ python examples/offline_inference/basic/basic.py
    ```

- When deploying the vLLM CPU backend on a multi-socket machine with NUMA and tensor parallel or pipeline parallel enabled, each NUMA node is treated as a TP/PP rank. Make sure the CPU cores of a single rank are on the same NUMA node to avoid cross-NUMA-node memory access.

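For the hyper-threading case described in this answer, picking one logical CPU per physical core can be scripted. This is a sketch under stated assumptions: `one_cpu_per_core` and `sample_topology` are hypothetical helpers, the input is the `CPU,CORE` pair format printed by `lscpu -p=CPU,CORE`, core IDs are assumed unique across the machine, and it is assumed that `VLLM_CPU_OMP_THREADS_BIND` accepts a comma-separated ID list.

```bash
# Hypothetical helper (not part of vLLM): read "CPU,CORE" pairs on stdin
# and print the first logical CPU seen for each physical core.
one_cpu_per_core() {
    awk -F, '
        /^#/ { next }                  # skip lscpu header comments
        !($2 in seen) {
            seen[$2] = 1
            sep = (ids == "") ? "" : ","
            ids = ids sep $1
        }
        END { print ids }
    '
}

# Sample topology: 16 logical CPUs on 8 physical cores, where logical CPU i
# sits on core (i mod 8), matching the `lscpu -e` listing above.
sample_topology() (
    i=0
    while [ "$i" -lt 16 ]; do
        echo "$i,$((i % 8))"
        i=$((i + 1))
    done
)

# On a real host: lscpu -p=CPU,CORE | one_cpu_per_core
sample_topology | one_cpu_per_core  # prints 0,1,2,3,4,5,6,7
```
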
### How to decide `VLLM_CPU_KVCACHE_SPACE`?

This value is 4 GiB by default. A larger space supports more concurrent requests and longer context lengths. However, pay attention to the memory capacity of each NUMA node. The memory usage of each TP rank is the sum of `weight shard size` and `VLLM_CPU_KVCACHE_SPACE`. If it exceeds the capacity of a single NUMA node, the TP worker will be killed with `exitcode 9` due to out-of-memory.

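The capacity rule above is simple arithmetic and can be checked before launching. The sketch below is illustrative only: `fits_numa_node` is a hypothetical helper, all sizes are whole GiB, the numbers are made up, and other runtime memory overhead is ignored.

```bash
# Hypothetical helper (not part of vLLM): succeed if the weight shard plus
# the KV cache fits in one NUMA node. All arguments are whole GiB.
# Per-node memory sizes can be read from, e.g., `numactl --hardware`.
fits_numa_node() (
    weight_shard_gib=$1
    kv_cache_gib=$2
    node_mem_gib=$3
    [ $((weight_shard_gib + kv_cache_gib)) -le "$node_mem_gib" ]
)

# Illustrative numbers: 16 GiB weight shard, 40 GiB KV cache, 64 GiB node.
if fits_numa_node 16 40 64; then
    echo "fits"
else
    echo "exceeds NUMA node capacity"
fi
```
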
### How to do performance tuning for vLLM CPU?

First of all, please make sure the thread binding and KV cache space are properly set and take effect. You can check the thread binding by running a vLLM benchmark and observing CPU core usage via `htop`.

Use multiples of 32 as `--block-size`, which is 128 by default.

Inference batch size is an important performance parameter. A larger batch usually provides higher throughput, while a smaller batch provides lower latency. Tuning the max batch size, starting from the default value, to balance throughput and latency is an effective way to improve vLLM CPU performance on a specific platform. There are two important related parameters in vLLM:

- `--max-num-batched-tokens` defines the maximum number of tokens in a single batch and has more impact on first-token performance. The default value is set as:
    - Offline Inference: `4096 * world_size`
    - Online Serving: `2048 * world_size`
- `--max-num-seqs` defines the maximum number of sequences in a single batch and has more impact on output-token performance. The default value is set as:
    - Offline Inference: `256 * world_size`
    - Online Serving: `128 * world_size`

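As a starting point for tuning, the defaults listed above can be computed for a given `world_size`. This sketch only restates that arithmetic; `default_batch_limits` is a hypothetical helper, not a vLLM command.

```bash
# Hypothetical helper (not part of vLLM): print the default limits listed
# above as command-line flags for a given mode and world_size.
default_batch_limits() (
    mode=$1        # "offline" or "online"
    world_size=$2
    case "$mode" in
        offline) tokens=4096; seqs=256 ;;
        online)  tokens=2048; seqs=128 ;;
        *) echo "unknown mode: $mode" >&2; exit 1 ;;
    esac
    echo "--max-num-batched-tokens=$((tokens * world_size)) --max-num-seqs=$((seqs * world_size))"
)

# Online serving with world_size 2.
default_batch_limits online 2  # prints --max-num-batched-tokens=4096 --max-num-seqs=256
```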
vLLM CPU supports data parallel (DP), tensor parallel (TP) and pipeline parallel (PP) to leverage multiple CPU sockets and memory nodes. For more details of tuning DP, TP and PP, please refer to [Optimization and Tuning](../../configuration/optimization.md). For vLLM CPU, it is recommended to use DP, TP and PP together if there are enough CPU sockets and memory nodes.

### Which quantization configs does vLLM CPU support?

- vLLM CPU supports the following quantization methods:
    - AWQ (x86 only)
    - GPTQ (x86 only)
    - compressed-tensors INT8 W8A8 (x86, s390x)

### Why do I see `get_mempolicy: Operation not permitted` when running in Docker?

In some container environments (like Docker), NUMA-related syscalls used by vLLM (e.g., `get_mempolicy`, `migrate_pages`) are blocked/denied in the runtime's default seccomp/capabilities settings. This may lead to warnings like `get_mempolicy: Operation not permitted`. Functionality is not affected, but NUMA memory binding/migration optimizations may not take effect and performance can be suboptimal.

To enable these optimizations inside Docker with the least privilege, add the following options:

```bash
docker run ... --cap-add SYS_NICE --security-opt seccomp=unconfined  ...

# 1) `--cap-add SYS_NICE` is to address `get_mempolicy` EPERM issue.

# 2) `--security-opt seccomp=unconfined` is to enable `migrate_pages` for `numa_migrate_pages()`.
# Actually, `seccomp=unconfined` bypasses the seccomp for container,
# if it's unacceptable, you can customize your own seccomp profile,
# based on docker/runtime default.json and add `migrate_pages` to `SCMP_ACT_ALLOW` list.

# reference : https://docs.docker.com/engine/security/seccomp/
```
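The two variants described in the comments above (disabling seccomp entirely, or using a custom profile) can be captured in a small helper. This is only a sketch: `numa_docker_flags` and the profile file name `./numa-seccomp.json` are hypothetical, and the custom profile is assumed to already allow `migrate_pages`.

```bash
# Hypothetical helper: print the Docker options discussed above.
# With no argument, seccomp is disabled for the container; with a profile
# path, that custom profile (which must allow `migrate_pages`) is used.
numa_docker_flags() {
    if [ -n "${1:-}" ]; then
        echo "--cap-add SYS_NICE --security-opt seccomp=$1"
    else
        echo "--cap-add SYS_NICE --security-opt seccomp=unconfined"
    fi
}

numa_docker_flags                      # prints --cap-add SYS_NICE --security-opt seccomp=unconfined
numa_docker_flags ./numa-seccomp.json  # prints --cap-add SYS_NICE --security-opt seccomp=./numa-seccomp.json
```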

Alternatively, running with `--privileged=true` also works but is broader and not generally recommended.

In Kubernetes, the following configuration can be added to the workload YAML to achieve the same effect:

```yaml
securityContext:
  seccompProfile:
    type: Unconfined
  capabilities:
    add:
    - SYS_NICE
```