# CPU

vLLM is a Python library that supports the following CPU variants. Select your CPU type to see vendor-specific instructions:

=== "Intel/AMD x86"

    --8<-- "docs/getting_started/installation/cpu/x86.inc.md:installation"

=== "ARM AArch64"

    --8<-- "docs/getting_started/installation/cpu/arm.inc.md:installation"

=== "Apple silicon"

    --8<-- "docs/getting_started/installation/cpu/apple.inc.md:installation"

=== "IBM Z (S390X)"

    --8<-- "docs/getting_started/installation/cpu/s390x.inc.md:installation"

## Requirements

- Python: 3.9 -- 3.12

=== "Intel/AMD x86"

    --8<-- "docs/getting_started/installation/cpu/x86.inc.md:requirements"

=== "ARM AArch64"

    --8<-- "docs/getting_started/installation/cpu/arm.inc.md:requirements"

=== "Apple silicon"

    --8<-- "docs/getting_started/installation/cpu/apple.inc.md:requirements"

=== "IBM Z (S390X)"

    --8<-- "docs/getting_started/installation/cpu/s390x.inc.md:requirements"

## Set up using Python

### Create a new Python environment

--8<-- "docs/getting_started/installation/python_env_setup.inc.md"

### Pre-built wheels

Currently, there are no pre-built CPU wheels.

### Build wheel from source

=== "Intel/AMD x86"

    --8<-- "docs/getting_started/installation/cpu/x86.inc.md:build-wheel-from-source"

=== "ARM AArch64"

    --8<-- "docs/getting_started/installation/cpu/arm.inc.md:build-wheel-from-source"

=== "Apple silicon"

    --8<-- "docs/getting_started/installation/cpu/apple.inc.md:build-wheel-from-source"

=== "IBM Z (S390X)"

    --8<-- "docs/getting_started/installation/cpu/s390x.inc.md:build-wheel-from-source"

## Set up using Docker

### Pre-built images

=== "Intel/AMD x86"

    --8<-- "docs/getting_started/installation/cpu/x86.inc.md:pre-built-images"

### Build image from source

=== "Intel/AMD x86"

    --8<-- "docs/getting_started/installation/cpu/x86.inc.md:build-image-from-source"

=== "ARM AArch64"

    --8<-- "docs/getting_started/installation/cpu/arm.inc.md:build-image-from-source"

=== "Apple silicon"

    --8<-- "docs/getting_started/installation/cpu/arm.inc.md:build-image-from-source"

=== "IBM Z (S390X)"

    --8<-- "docs/getting_started/installation/cpu/s390x.inc.md:build-image-from-source"

## Related runtime environment variables

- `VLLM_CPU_KVCACHE_SPACE`: specifies the KV cache size in GiB (e.g., `VLLM_CPU_KVCACHE_SPACE=40` means 40 GiB of space for the KV cache). A larger value allows vLLM to run more requests in parallel. Set this parameter according to your hardware configuration and memory management pattern. Default value is `0`.
- `VLLM_CPU_OMP_THREADS_BIND`: specifies the CPU cores dedicated to the OpenMP threads. For example, `VLLM_CPU_OMP_THREADS_BIND=0-31` means there are 32 OpenMP threads bound to CPU cores 0-31. `VLLM_CPU_OMP_THREADS_BIND=0-31|32-63` means there are 2 tensor parallel processes: the 32 OpenMP threads of rank 0 are bound to CPU cores 0-31, and the OpenMP threads of rank 1 are bound to CPU cores 32-63. When set to `auto`, the OpenMP threads of each rank are bound to the CPU cores in each NUMA node. When set to `all`, the OpenMP threads of each rank use all CPU cores available on the system. Default value is `auto`.
- `VLLM_CPU_NUM_OF_RESERVED_CPU`: specifies the number of CPU cores per rank that are not dedicated to the OpenMP threads. This variable only takes effect when `VLLM_CPU_OMP_THREADS_BIND` is set to `auto`. Default value is `0`.
- `VLLM_CPU_MOE_PREPACK` (x86 only): whether to use prepack for the MoE layer. The value is passed to `ipex.llm.modules.GatedMLPMOE`. Default value is `1` (True). On unsupported CPUs, you might need to set this to `0` (False).
- `VLLM_CPU_SGL_KERNEL` (x86 only, experimental): whether to use small-batch optimized kernels for the linear and MoE layers, especially for low-latency requirements such as online serving. The kernels require the AMX instruction set, BFloat16 weights, and weight shapes divisible by 32. Default value is `0` (False).

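The `VLLM_CPU_OMP_THREADS_BIND` format described above can be illustrated with a small sketch that expands a binding string into one core list per rank. The helper name `parse_omp_threads_bind` is hypothetical and this is not vLLM's own parser; it only handles explicit ranges separated by `|` and does not cover the `auto` and `all` keywords.

```python
def parse_omp_threads_bind(value: str) -> list[list[int]]:
    """Expand a string such as "0-31|32-63" into one list of core IDs per rank.

    Hypothetical helper for illustration only; not part of vLLM.
    """
    ranks = []
    for rank_spec in value.split("|"):  # "|" separates the ranks
        start, end = (int(x) for x in rank_spec.strip().split("-"))
        ranks.append(list(range(start, end + 1)))  # the range is inclusive
    return ranks


bindings = parse_omp_threads_bind("0-31|32-63")
# Two ranks, 32 cores each; rank 1 covers cores 32 through 63.
print(len(bindings), len(bindings[0]), bindings[1][0], bindings[1][-1])
```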
## FAQ

### Which `dtype` should be used?

- Currently, vLLM CPU uses the model's default setting as `dtype`. However, because float16 support in torch CPU is unstable, it is recommended to explicitly set `dtype=bfloat16` if you encounter any performance or accuracy problems.

### How to launch a vLLM service on CPU?

- When using online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserve CPU 31 for the framework and use CPUs 0-30 for the inference threads:

```bash
export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_OMP_THREADS_BIND=0-30
vllm serve facebook/opt-125m --dtype=bfloat16
```

Or use the default auto thread binding:

```bash
export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_NUM_OF_RESERVED_CPU=1
vllm serve facebook/opt-125m --dtype=bfloat16
```

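The explicit binding used above (`0-30` on a 32-core machine) can be derived rather than typed by hand. The following is a minimal sketch, assuming cores are numbered from 0 and the highest-numbered cores are the ones left free; `bind_range_with_reserved` is a hypothetical helper, not a vLLM function.

```python
def bind_range_with_reserved(total_cores: int, reserved: int = 1) -> str:
    """Return a core range that leaves the highest-numbered cores unbound.

    Hypothetical helper for illustration only; not part of vLLM.
    """
    if reserved < 0 or reserved >= total_cores:
        raise ValueError("reserved must be >= 0 and smaller than total_cores")
    return f"0-{total_cores - reserved - 1}"


# Environment values matching the first example: 32 cores, 1 reserved.
env_overrides = {
    "VLLM_CPU_KVCACHE_SPACE": "40",
    "VLLM_CPU_OMP_THREADS_BIND": bind_range_with_reserved(32, reserved=1),
}
print(env_overrides["VLLM_CPU_OMP_THREADS_BIND"])  # 0-30
```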
### How to decide `VLLM_CPU_OMP_THREADS_BIND`?

- Bind each OpenMP thread to a dedicated physical CPU core, or use the auto thread binding feature, which is the default. For example, on a hyper-threading enabled platform with 16 logical CPU cores and 8 physical CPU cores:

??? console "Commands"

    ```console
    $ lscpu -e # check the mapping between logical CPU cores and physical CPU cores

    # The "CPU" column lists the logical CPU core IDs, and the "CORE" column lists the physical core IDs. On this platform, two logical cores share one physical core.
    CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ   MINMHZ      MHZ
    0    0      0    0 0:0:0:0          yes 2401.0000 800.0000  800.000
    1    0      0    1 1:1:1:0          yes 2401.0000 800.0000  800.000
    2    0      0    2 2:2:2:0          yes 2401.0000 800.0000  800.000
    3    0      0    3 3:3:3:0          yes 2401.0000 800.0000  800.000
    4    0      0    4 4:4:4:0          yes 2401.0000 800.0000  800.000
    5    0      0    5 5:5:5:0          yes 2401.0000 800.0000  800.000
    6    0      0    6 6:6:6:0          yes 2401.0000 800.0000  800.000
    7    0      0    7 7:7:7:0          yes 2401.0000 800.0000  800.000
    8    0      0    0 0:0:0:0          yes 2401.0000 800.0000  800.000
    9    0      0    1 1:1:1:0          yes 2401.0000 800.0000  800.000
    10   0      0    2 2:2:2:0          yes 2401.0000 800.0000  800.000
    11   0      0    3 3:3:3:0          yes 2401.0000 800.0000  800.000
    12   0      0    4 4:4:4:0          yes 2401.0000 800.0000  800.000
    13   0      0    5 5:5:5:0          yes 2401.0000 800.0000  800.000
    14   0      0    6 6:6:6:0          yes 2401.0000 800.0000  800.000
    15   0      0    7 7:7:7:0          yes 2401.0000 800.0000  800.000

    # On this platform, it is recommended to bind OpenMP threads only to logical CPU cores 0-7 or 8-15
    $ export VLLM_CPU_OMP_THREADS_BIND=0-7
    $ python examples/offline_inference/basic/basic.py
    ```

- When deploying the vLLM CPU backend on a multi-socket machine with NUMA and enabling tensor parallel or pipeline parallel, each NUMA node is treated as a TP/PP rank. Make sure the CPU cores of a single rank are on the same NUMA node to avoid cross-NUMA-node memory access.

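Choosing one logical CPU per physical core from an `lscpu -e` table can also be scripted. The sketch below assumes the table has `CPU`, `SOCKET`, and `CORE` columns, as in the output shown in this section, and uses a shortened hypothetical sample with 4 logical CPUs on 2 physical cores; `first_cpu_per_core` is an illustrative helper, not part of vLLM.

```python
def first_cpu_per_core(lscpu_table: str) -> list[int]:
    """Return the first logical CPU ID listed for each (socket, core) pair.

    Hypothetical helper for illustration only; not part of vLLM.
    """
    lines = [ln for ln in lscpu_table.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    cpu_col = header.index("CPU")
    socket_col = header.index("SOCKET")
    core_col = header.index("CORE")

    chosen = {}
    for line in lines[1:]:
        fields = line.split()
        key = (int(fields[socket_col]), int(fields[core_col]))
        # Keep only the first logical CPU seen for each physical core.
        chosen.setdefault(key, int(fields[cpu_col]))
    return sorted(chosen.values())


# Shortened sample in the same layout as the `lscpu -e` output.
sample = """
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
0    0      0    0 0:0:0:0          yes
1    0      0    1 1:1:1:0          yes
2    0      0    0 0:0:0:0          yes
3    0      0    1 1:1:1:0          yes
"""
print(first_cpu_per_core(sample))  # [0, 1]
```

The `NODE` column of the same table shows which NUMA node each logical CPU belongs to, which helps keep the cores of one rank on a single node.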
### How to decide `VLLM_CPU_KVCACHE_SPACE`?

  - This value is 4 GiB by default. A larger space supports more concurrent requests and longer context lengths. However, pay attention to the memory capacity of each NUMA node. The memory usage of each TP rank is the sum of `weight shard size` and `VLLM_CPU_KVCACHE_SPACE`. If it exceeds the capacity of a single NUMA node, the TP worker is killed with `exitcode 9` due to out-of-memory.

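The sizing rule above is simple arithmetic, sketched here as a quick check. The function name and all figures (a 13 GiB weight shard, 40 GiB of KV cache, 64 GiB and 48 GiB NUMA nodes) are hypothetical, and the check uses only the two terms named in this section.

```python
def fits_numa_node(weight_shard_gib: float,
                   kvcache_space_gib: float,
                   numa_node_gib: float) -> bool:
    """Check whether one TP rank fits in the memory of a single NUMA node.

    Hypothetical helper: per-rank usage is taken as the weight shard size
    plus VLLM_CPU_KVCACHE_SPACE, the two terms named in this section.
    """
    rank_usage_gib = weight_shard_gib + kvcache_space_gib
    return rank_usage_gib <= numa_node_gib


# 13 + 40 = 53 GiB per rank (hypothetical figures).
print(fits_numa_node(13, 40, numa_node_gib=64))  # True
print(fits_numa_node(13, 40, numa_node_gib=48))  # False
```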
### Which quantization configs does vLLM CPU support?

  - vLLM CPU supports the following quantization methods:
    - AWQ (x86 only)
    - GPTQ (x86 only)
    - compressed-tensor INT8 W8A8 (x86, s390x)

### (x86 only) What is the purpose of `VLLM_CPU_MOE_PREPACK` and `VLLM_CPU_SGL_KERNEL`?

  - Both require the `amx` CPU flag.
    - `VLLM_CPU_MOE_PREPACK` can provide better performance for MoE models.
    - `VLLM_CPU_SGL_KERNEL` can provide better performance for MoE models and small-batch scenarios.
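
Whether a host exposes the `amx` flag can be checked before enabling either variable. The following is a minimal sketch, assuming a Linux x86 host where `/proc/cpuinfo` reports AMX support as flags beginning with `amx` (such as `amx_tile`); `has_amx` is a hypothetical helper, and the sample strings stand in for real `/proc/cpuinfo` content.

```python
def has_amx(cpuinfo_text: str) -> bool:
    """Return True if the first "flags" line lists any flag starting with "amx".

    Hypothetical helper for illustration only; not part of vLLM.
    """
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags = line.split(":", 1)[1].split()
            return any(flag.startswith("amx") for flag in flags)
    return False


# Sample lines in the /proc/cpuinfo layout (assumed, shortened).
with_amx = "flags\t\t: fpu sse2 avx512f amx_bf16 amx_tile amx_int8"
without_amx = "flags\t\t: fpu sse2 avx2"
print(has_amx(with_amx), has_amx(without_amx))  # True False

# On a Linux host, the real file could be read like this:
# from pathlib import Path
# print(has_amx(Path("/proc/cpuinfo").read_text()))
```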