# CPU

vLLM is a Python library that supports the following CPU variants. Select your CPU type to see vendor-specific instructions:

=== "Intel/AMD x86"

    --8<-- "docs/getting_started/installation/cpu/x86.inc.md:installation"

=== "ARM AArch64"

    --8<-- "docs/getting_started/installation/cpu/arm.inc.md:installation"

=== "Apple silicon"

    --8<-- "docs/getting_started/installation/cpu/apple.inc.md:installation"

=== "IBM Z (S390X)"

    --8<-- "docs/getting_started/installation/cpu/s390x.inc.md:installation"

## Requirements

- Python: 3.9 -- 3.12

=== "Intel/AMD x86"

    --8<-- "docs/getting_started/installation/cpu/x86.inc.md:requirements"

=== "ARM AArch64"

    --8<-- "docs/getting_started/installation/cpu/arm.inc.md:requirements"

=== "Apple silicon"

    --8<-- "docs/getting_started/installation/cpu/apple.inc.md:requirements"

=== "IBM Z (S390X)"

    --8<-- "docs/getting_started/installation/cpu/s390x.inc.md:requirements"

## Set up using Python

### Create a new Python environment

--8<-- "docs/getting_started/installation/python_env_setup.inc.md"

### Pre-built wheels

Currently, there are no pre-built CPU wheels.

### Build wheel from source

=== "Intel/AMD x86"

    --8<-- "docs/getting_started/installation/cpu/x86.inc.md:build-wheel-from-source"

=== "ARM AArch64"

    --8<-- "docs/getting_started/installation/cpu/arm.inc.md:build-wheel-from-source"

=== "Apple silicon"

    --8<-- "docs/getting_started/installation/cpu/apple.inc.md:build-wheel-from-source"

=== "IBM Z (S390X)"

    --8<-- "docs/getting_started/installation/cpu/s390x.inc.md:build-wheel-from-source"

## Set up using Docker

### Pre-built images

=== "Intel/AMD x86"

    --8<-- "docs/getting_started/installation/cpu/x86.inc.md:pre-built-images"

### Build image from source

??? Commands

    ```console
    $ docker build -f docker/Dockerfile.cpu \
            --tag vllm-cpu-env \
            --target vllm-openai .

    # Launch the OpenAI-compatible server
    $ docker run --rm \
                --privileged=true \
                --shm-size=4g \
                -p 8000:8000 \
                -e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
                -e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
                vllm-cpu-env \
                --model=meta-llama/Llama-3.2-1B-Instruct \
                --dtype=bfloat16 \
                <other vLLM OpenAI server arguments>
    ```

!!! tip
    For ARM or Apple silicon, use `docker/Dockerfile.arm`.

!!! tip
    For IBM Z (s390x), use `docker/Dockerfile.s390x`, and pass the flag `--dtype float` in the `docker run` command.
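
Once the container from the `docker run` command above is listening on port 8000, the server can be queried over HTTP. The following is a minimal sketch using only the Python standard library. It assumes the server is reachable at `http://localhost:8000` and serves the model named in the command above; the helper names `build_completion_request` and `complete` are hypothetical and not part of vLLM.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    model: str = "meta-llama/Llama-3.2-1B-Instruct",  # model from the docker run example
    base_url: str = "http://localhost:8000",  # assumption: port mapping -p 8000:8000
    max_tokens: int = 32,
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str) -> str:
    """Send the request and return the text of the first completion."""
    with urllib.request.urlopen(build_completion_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Calling `complete("The capital of France is")` only works while the container is running; building the request itself has no network side effects.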

## Supported features

The vLLM CPU backend supports the following vLLM features:

- Tensor Parallel
- Model Quantization (`INT8 W8A8`, `AWQ`, `GPTQ`)
- Chunked-prefill
- Prefix-caching
- FP8-E5M2 KV cache

## Related runtime environment variables

- `VLLM_CPU_KVCACHE_SPACE`: specifies the KV cache size in GiB (e.g., `VLLM_CPU_KVCACHE_SPACE=40` means 40 GiB of space for the KV cache). A larger setting allows vLLM to run more requests in parallel. Set this parameter based on your hardware configuration and memory management pattern. Default value is `0`.
- `VLLM_CPU_OMP_THREADS_BIND`: specifies the CPU cores dedicated to the OpenMP threads. For example, `VLLM_CPU_OMP_THREADS_BIND=0-31` means there will be 32 OpenMP threads bound to CPU cores 0-31. `VLLM_CPU_OMP_THREADS_BIND=0-31|32-63` means there will be 2 tensor parallel processes: the 32 OpenMP threads of rank 0 are bound to CPU cores 0-31, and the OpenMP threads of rank 1 are bound to CPU cores 32-63. When set to `auto`, the OpenMP threads of each rank are bound to the CPU cores in each NUMA node. When set to `all`, the OpenMP threads of each rank use all CPU cores available on the system. Default value is `auto`.
- `VLLM_CPU_NUM_OF_RESERVED_CPU`: specifies the number of CPU cores that are not dedicated to the OpenMP threads for each rank. The variable only takes effect when `VLLM_CPU_OMP_THREADS_BIND` is set to `auto`. Default value is `0`.
- `VLLM_CPU_MOE_PREPACK`: whether to use prepack for the MoE layer. This is passed to `ipex.llm.modules.GatedMLPMOE`. Default value is `1` (True). On unsupported CPUs, you might need to set this to `0` (False).
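
To make the explicit `VLLM_CPU_OMP_THREADS_BIND` syntax concrete, here is a small sketch that expands a value such as `0-31|32-63` into one list of core IDs per tensor parallel rank. `parse_omp_threads_bind` is a hypothetical helper written for illustration, not vLLM's own parser, and support for comma-separated IDs is an assumption that the text above does not confirm.

```python
def parse_omp_threads_bind(value: str) -> list[list[int]]:
    """Expand an explicit binding such as "0-31|32-63" into per-rank core IDs.

    Illustrative only: "auto" and "all" are resolved by vLLM at runtime,
    so there is nothing to expand for them here.
    """
    if value in ("auto", "all"):
        raise ValueError(f"{value!r} is resolved at runtime")
    ranks = []
    for rank_spec in value.split("|"):  # "|" separates tensor parallel ranks
        cores = []
        for part in rank_spec.split(","):  # assumption: comma lists are allowed
            if "-" in part:
                start, end = part.split("-")
                cores.extend(range(int(start), int(end) + 1))  # ranges are inclusive
            else:
                cores.append(int(part))
        ranks.append(cores)
    return ranks


for rank, cores in enumerate(parse_omp_threads_bind("0-31|32-63")):
    print(f"rank {rank}: {len(cores)} OpenMP threads on cores {cores[0]}-{cores[-1]}")
```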

## Performance tips

- We highly recommend using TCMalloc for high-performance memory allocation and better cache locality. For example, on Ubuntu 22.04, you can run:

```console
sudo apt-get install libtcmalloc-minimal4 # install TCMalloc library
find / -name "*libtcmalloc*" # find the dynamic link library path
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD # prepend the library to LD_PRELOAD
python examples/offline_inference/basic/basic.py # run vLLM
```

- When using online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserve CPUs 30 and 31 for the framework and use CPUs 0-29 for OpenMP:

```console
export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_OMP_THREADS_BIND=0-29
vllm serve facebook/opt-125m
```

or using the default auto thread binding:

```console
export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_NUM_OF_RESERVED_CPU=2
vllm serve facebook/opt-125m
```
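
The arithmetic behind the explicit binding above can be sketched as follows. `bind_with_reserved_cores` is a hypothetical helper, and it assumes physical cores are numbered contiguously from 0 and that the reserved cores are the highest-numbered ones, as in the 32-core example.

```python
def bind_with_reserved_cores(total_cores: int, reserved: int) -> str:
    """Return a core range that leaves the last `reserved` cores unbound."""
    if not 0 <= reserved < total_cores:
        raise ValueError("reserved must be in [0, total_cores)")
    return f"0-{total_cores - reserved - 1}"


# 32 physical cores, 2 reserved for the serving framework.
print(bind_with_reserved_cores(32, 2))  # prints 0-29
```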

- If using the vLLM CPU backend on a machine with hyper-threading, it is recommended to bind only one OpenMP thread to each physical CPU core, either with `VLLM_CPU_OMP_THREADS_BIND` or with the default auto thread binding. For example, on a hyper-threading enabled platform with 16 logical CPU cores / 8 physical CPU cores:

??? Commands

    ```console
    $ lscpu -e # check the mapping between logical CPU cores and physical CPU cores

    # The "CPU" column means the logical CPU core IDs, and the "CORE" column means the physical core IDs. On this platform, two logical cores share one physical core.
    CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ   MINMHZ      MHZ
    0    0      0    0 0:0:0:0          yes 2401.0000 800.0000  800.000
    1    0      0    1 1:1:1:0          yes 2401.0000 800.0000  800.000
    2    0      0    2 2:2:2:0          yes 2401.0000 800.0000  800.000
    3    0      0    3 3:3:3:0          yes 2401.0000 800.0000  800.000
    4    0      0    4 4:4:4:0          yes 2401.0000 800.0000  800.000
    5    0      0    5 5:5:5:0          yes 2401.0000 800.0000  800.000
    6    0      0    6 6:6:6:0          yes 2401.0000 800.0000  800.000
    7    0      0    7 7:7:7:0          yes 2401.0000 800.0000  800.000
    8    0      0    0 0:0:0:0          yes 2401.0000 800.0000  800.000
    9    0      0    1 1:1:1:0          yes 2401.0000 800.0000  800.000
    10   0      0    2 2:2:2:0          yes 2401.0000 800.0000  800.000
    11   0      0    3 3:3:3:0          yes 2401.0000 800.0000  800.000
    12   0      0    4 4:4:4:0          yes 2401.0000 800.0000  800.000
    13   0      0    5 5:5:5:0          yes 2401.0000 800.0000  800.000
    14   0      0    6 6:6:6:0          yes 2401.0000 800.0000  800.000
    15   0      0    7 7:7:7:0          yes 2401.0000 800.0000  800.000

    # On this platform, it is recommended to bind OpenMP threads only to logical CPU cores 0-7 or 8-15
    $ export VLLM_CPU_OMP_THREADS_BIND=0-7
    $ python examples/offline_inference/basic/basic.py
    ```
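
The manual reading of the `lscpu -e` table above can also be scripted. The sketch below picks the first logical CPU listed for each physical core and prints the result as a binding string. It assumes the output has `CPU` and `CORE` columns as shown above. The helper names are hypothetical, the sample text is a trimmed stand-in for real `lscpu -e` output, and the comma-separated form produced for non-contiguous IDs is an assumption about the accepted syntax.

```python
def one_cpu_per_core(lscpu_output: str) -> list[int]:
    """Return the first logical CPU ID listed for each physical core."""
    lines = [ln.split() for ln in lscpu_output.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    cpu_col, core_col = header.index("CPU"), header.index("CORE")
    first_cpu: dict[int, int] = {}
    for row in rows:
        # setdefault keeps the first CPU seen for a core and ignores its siblings
        first_cpu.setdefault(int(row[core_col]), int(row[cpu_col]))
    return sorted(first_cpu.values())


def to_ranges(ids: list[int]) -> str:
    """Compress sorted IDs into range syntax, e.g. [0, 1, 2, 8] -> "0-2,8"."""
    if not ids:
        return ""
    parts = []
    start = prev = ids[0]
    for cpu in ids[1:]:
        if cpu != prev + 1:  # gap found: close the current run
            parts.append(f"{start}-{prev}" if start != prev else str(start))
            start = cpu
        prev = cpu
    parts.append(f"{start}-{prev}" if start != prev else str(start))
    return ",".join(parts)


# Same layout as the table above: logical CPU n sits on physical core n % 8.
SAMPLE = "CPU NODE SOCKET CORE ONLINE\n" + "\n".join(
    f"{cpu} 0 0 {cpu % 8} yes" for cpu in range(16)
)
print(to_ranges(one_cpu_per_core(SAMPLE)))  # prints 0-7
```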

- If using the vLLM CPU backend on a multi-socket machine with NUMA, take care to set the CPU cores with `VLLM_CPU_OMP_THREADS_BIND` so that cross-NUMA-node memory access is avoided.
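
One way to sanity-check a binding against the NUMA layout is sketched below. The topology dictionary is a hypothetical two-node machine; on a real system the node-to-core mapping would come from a tool such as `lscpu`. `numa_nodes_spanned` is an illustrative helper, not a vLLM function.

```python
def numa_nodes_spanned(cores: list[int], node_to_cpus: dict[int, set[int]]) -> set[int]:
    """Return the NUMA nodes that own at least one of the given cores."""
    wanted = set(cores)
    return {node for node, cpus in node_to_cpus.items() if cpus & wanted}


# Hypothetical topology: node 0 owns cores 0-31, node 1 owns cores 32-63.
topology = {0: set(range(0, 32)), 1: set(range(32, 64))}

for bind in ("0-31|32-63", "0-39|40-63"):
    for rank, spec in enumerate(bind.split("|")):
        start, end = map(int, spec.split("-"))
        nodes = numa_nodes_spanned(list(range(start, end + 1)), topology)
        status = "ok" if len(nodes) == 1 else "crosses NUMA nodes"
        print(f"{bind!r} rank {rank}: nodes {sorted(nodes)} -> {status}")
```

A rank whose core list touches more than one node is the case the tip above warns about.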

## Other considerations

- The CPU backend significantly differs from the GPU backend since the vLLM architecture was originally optimized for GPU use. A number of optimizations are needed to enhance its performance.

- Decouple the HTTP serving components from the inference components. In a GPU backend configuration, the HTTP serving and tokenization tasks operate on the CPU, while inference runs on the GPU, which typically does not pose a problem. However, in a CPU-based setup, the HTTP serving and tokenization can cause significant context switching and reduced cache efficiency. Therefore, it is strongly recommended to segregate these two components for improved performance.

- On a CPU-based setup with NUMA enabled, memory access performance may be heavily affected by the [topology](https://github.com/intel/intel-extension-for-pytorch/blob/main/docs/tutorials/performance_tuning/tuning_guide.md#non-uniform-memory-access-numa). On NUMA architectures, Tensor Parallel is an option for better performance.

  - Tensor Parallel is supported for serving and offline inference. In general, each NUMA node is treated as one GPU card. The example below enables Tensor Parallel = 2 for serving:

    ```console
    VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" \
        vllm serve meta-llama/Llama-2-7b-chat-hf \
        -tp=2 \
        --distributed-executor-backend mp
    ```

    or using the default auto thread binding:

    ```console
    VLLM_CPU_KVCACHE_SPACE=40 \
        vllm serve meta-llama/Llama-2-7b-chat-hf \
        -tp=2 \
        --distributed-executor-backend mp
    ```

  - For each thread ID list in `VLLM_CPU_OMP_THREADS_BIND`, make sure that all threads in the list belong to the same NUMA node.

  - Also pay attention to the memory capacity of each NUMA node. The memory usage of each TP rank is the sum of `weight shard size` and `VLLM_CPU_KVCACHE_SPACE`. If it exceeds the capacity of a single NUMA node, the TP worker will be killed due to out-of-memory.
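
The per-rank memory rule above can be checked with simple arithmetic before launching. This is a rough sketch: it assumes the weights shard evenly across TP ranks, which is an approximation, and all numbers (13 GiB of weights, 64 GiB and 32 GiB per NUMA node) are hypothetical. Both helper names are illustrative.

```python
def tp_rank_memory_gib(total_weight_gib: float, tp_size: int, kvcache_space_gib: float) -> float:
    """Estimate per-rank memory: weight shard size plus VLLM_CPU_KVCACHE_SPACE.

    Assumes an even weight split across ranks (an approximation).
    """
    return total_weight_gib / tp_size + kvcache_space_gib


def fits_numa_node(rank_gib: float, node_capacity_gib: float) -> bool:
    """True if a rank's estimated memory fits in a single NUMA node."""
    return rank_gib <= node_capacity_gib


# Hypothetical: 13 GiB of weights, TP = 2, VLLM_CPU_KVCACHE_SPACE=40.
usage = tp_rank_memory_gib(total_weight_gib=13.0, tp_size=2, kvcache_space_gib=40.0)
print(usage)  # prints 46.5
print(fits_numa_node(usage, node_capacity_gib=64.0))  # prints True
print(fits_numa_node(usage, node_capacity_gib=32.0))  # prints False
```

If the estimate does not fit, lowering `VLLM_CPU_KVCACHE_SPACE` is the most direct knob, since the weight shard size is fixed by the model and the TP degree.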