sglang / Commits / 007b849b

Unverified commit 007b849b, authored Oct 23, 2025 by Zaili Wang, committed by GitHub on Oct 22, 2025
[CPU] misc updates (#11906)
parent 8612811d

Showing 3 changed files with 33 additions and 28 deletions (+33 -28):
docs/platforms/cpu_server.md       +23 -22
python/sglang/srt/utils/common.py  +9 -4
sgl-kernel/pyproject_cpu.toml      +1 -2
docs/platforms/cpu_server.md
 # CPU Servers
 The document addresses how to set up the [SGLang](https://github.com/sgl-project/sglang) environment and run LLM inference on CPU servers.
-Specifically, SGLang is well optimized on the CPUs equipped with Intel® AMX® Instructions,
+SGLang is enabled and optimized on the CPUs equipped with Intel® AMX® Instructions,
 which are 4th generation or newer Intel® Xeon® Scalable Processors.
 ## Optimized Model List
 A list of popular LLMs are optimized and run efficiently on CPU,
 including the most notable open-source models like Llama series, Qwen series,
-and the phenomenal high-quality reasoning model DeepSeek-R1.
+and DeepSeek series like DeepSeek-R1 and DeepSeek-V3.1-Terminus.
-| Model Name | BF16 | w8a8_int8 | FP8 |
+| Model Name | BF16 | W8A8_INT8 | FP8 |
 |:---:|:---:|:---:|:---:|
 | DeepSeek-R1 | | [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8) | [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) |
+| DeepSeek-V3.1-Terminus | | [IntervitensInc/DeepSeek-V3.1-Terminus-Channel-int8](https://huggingface.co/IntervitensInc/DeepSeek-V3.1-Terminus-Channel-int8) | [deepseek-ai/DeepSeek-V3.1-Terminus](https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus) |
 | Llama-3.2-3B | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | [RedHatAI/Llama-3.2-3B-quantized.w8a8](https://huggingface.co/RedHatAI/Llama-3.2-3B-Instruct-quantized.w8a8) | |
 | Llama-3.1-8B | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | [RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8](https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8) | |
 | QwQ-32B | | [RedHatAI/QwQ-32B-quantized.w8a8](https://huggingface.co/RedHatAI/QwQ-32B-quantized.w8a8) | |
...
@@ -36,7 +37,7 @@ git clone https://github.com/sgl-project/sglang.git
 cd sglang/docker
 # Build the docker image
-docker build -t sglang-cpu:main -f Dockerfile.xeon .
+docker build -t sglang-cpu:latest -f Dockerfile.xeon .
 # Initiate a docker container
 docker run \
...
@@ -48,7 +49,7 @@ docker run \
   -v ~/.cache/huggingface:/root/.cache/huggingface \
   -p 30000:30000 \
   -e "HF_TOKEN=<secret>" \
-  sglang-cpu:main /bin/bash
+  sglang-cpu:latest /bin/bash
 ```
 ### Install From Source
...
@@ -121,9 +122,9 @@ Notes:
 2. The flag `--tp 6` specifies that tensor parallelism will be applied using 6 ranks (TP6).
 The number of TP specified is how many TP ranks will be used during the execution.
-In a CPU platform, a TP rank means a sub-NUMA cluster (SNC).
+On a CPU platform, a TP rank means a sub-NUMA cluster (SNC).
-Usually we can get the SNC information (How many available) from Operation System.
+Usually we can get the SNC information (How many available) from the Operating System.
-User can specify TP to be no more than the total available SNCs in current system.
+Users can specify TP to be no more than the total available SNCs in current system.
 If the specified TP rank number differs from the total SNC count,
 the system will automatically utilize the first `n` SNCs.
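
As the note above says, with SNC enabled each sub-NUMA cluster is exposed by Linux as an ordinary NUMA node, so the upper bound for `--tp` can be read straight from sysfs. A minimal sketch of such a check (not part of this commit; the helper name and sysfs path are illustrative assumptions):

```python
# Hypothetical helper: count NUMA nodes (SNCs appear as nodes when enabled),
# which is the suggested upper bound for the --tp value on CPU.
import os
import re


def count_numa_nodes(sysfs_dir: str = "/sys/devices/system/node") -> int:
    try:
        entries = os.listdir(sysfs_dir)
    except FileNotFoundError:
        return 1  # no sysfs (non-Linux) -> assume a single node
    return sum(1 for name in entries if re.fullmatch(r"node\d+", name)) or 1


if __name__ == "__main__":
    print(f"Available NUMA nodes / SNCs: {count_numa_nodes()}")
```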
...
@@ -175,13 +176,13 @@ Additionally, the requests can be formed with
 [OpenAI Completions API](https://docs.sglang.ai/basic_usage/openai_api_completions.html)
 and sent via the command line (e.g. using `curl`) or via your own script.
-## Example: Running DeepSeek-R1
+## Example: Running DeepSeek-V3.1-Terminus
-An example command to launch service for W8A8 DeepSeek-R1 on a Xeon® 6980P server:
+An example command to launch service for W8A8_INT8 DeepSeek-V3.1-Terminus on a Xeon® 6980P server:
 ```bash
 python -m sglang.launch_server \
-    --model meituan/DeepSeek-R1-Channel-INT8 \
+    --model IntervitensInc/DeepSeek-V3.1-Terminus-Channel-int8 \
     --trust-remote-code \
     --disable-overlap-schedule \
     --device cpu \
...
@@ -193,11 +194,11 @@ python -m sglang.launch_server \
     --tp 6
 ```
-Similarly, an example command to launch service for FP8 DeepSeek-R1 would be:
+Similarly, an example command to launch service for FP8 DeepSeek-V3.1-Terminus would be:
 ```bash
 python -m sglang.launch_server \
-    --model deepseek-ai/DeepSeek-R1 \
+    --model deepseek-ai/DeepSeek-V3.1-Terminus \
     --trust-remote-code \
     --disable-overlap-schedule \
     --device cpu \
...
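
Once a server like the one above is up, the OpenAI Completions API mentioned in the doc can be exercised from a short script instead of `curl`. A sketch under the assumptions that the container's port 30000 is reachable on localhost and that the model field matches the path passed to `--model` (prompt and sampling parameters are placeholders):

```python
# Illustrative client for the OpenAI-compatible Completions endpoint exposed
# by the launched SGLang server (default port 30000 mapped by the docker run).
import requests

resp = requests.post(
    "http://localhost:30000/v1/completions",
    json={
        "model": "deepseek-ai/DeepSeek-V3.1-Terminus",  # same path as --model
        "prompt": "The capital of France is",
        "max_tokens": 32,
        "temperature": 0,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```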
python/sglang/srt/utils/common.py
...
@@ -1623,13 +1623,18 @@ def get_cpu_memory_capacity():
         for numa_id in range(n_numa_node):
             file_meminfo = f"node{numa_id}/meminfo"
             with open(os.path.join(file_prefix, file_meminfo), "r") as f:
-                # 1st line contains 'MemTotal'
+                # MemTotal info is at the 1st line
-                line = f.read().split("\n")[0]
+                line = f.readline()
-                numa_mem_list.append(int(line.split()[3]))
+                # Expected format: "Node 0 MemTotal: 100000000 kB"
+                parts = line.split()
+                if len(parts) >= 4 and parts[2] == "MemTotal:":
+                    numa_mem_list.append(int(parts[3]))
+                else:
+                    raise ValueError(f"Unexpected format in {file_meminfo}: {line}")
         # Retrieved value in KB, need MB
         numa_mem = float(min(numa_mem_list) // 1024)
         return numa_mem
-    except FileNotFoundError:
+    except (FileNotFoundError, ValueError, IndexError):
         numa_mem = psutil.virtual_memory().total / n_numa_node
         # Retrieved value in Byte, need MB
         return float(numa_mem // (1 << 20))
...
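
The hunk above reads only the first line of each node's meminfo, validates the `MemTotal:` field, and lets the raised `ValueError` fall through to the psutil-based fallback. A self-contained sketch of that flow, simplified from the sglang code and assuming the standard Linux sysfs location for the per-node meminfo files:

```python
# Simplified sketch (not the exact sglang code): per-NUMA-node memory capacity
# in MB, with a psutil fallback when sysfs is missing or malformed.
import os

import psutil


def cpu_memory_per_numa_node_mb(n_numa_node: int) -> float:
    file_prefix = "/sys/devices/system/node"  # assumed sysfs location
    numa_mem_list = []
    try:
        for numa_id in range(n_numa_node):
            file_meminfo = f"node{numa_id}/meminfo"
            with open(os.path.join(file_prefix, file_meminfo), "r") as f:
                # Expected format: "Node 0 MemTotal: 100000000 kB"
                parts = f.readline().split()
                if len(parts) >= 4 and parts[2] == "MemTotal:":
                    numa_mem_list.append(int(parts[3]))
                else:
                    raise ValueError(f"Unexpected format in {file_meminfo}")
        # Values are reported in kB; return the smallest node in MB.
        return float(min(numa_mem_list) // 1024)
    except (FileNotFoundError, ValueError, IndexError):
        # Fallback: split total memory (bytes) evenly across nodes, in MB.
        return float(psutil.virtual_memory().total / n_numa_node // (1 << 20))
```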
sgl-kernel/pyproject_cpu.toml
...
@@ -15,8 +15,7 @@ requires-python = ">=3.10"
 license = { file = "LICENSE" }
 classifiers = [
     "Programming Language :: Python :: 3",
-    "License :: OSI Approved :: Apache Software License",
-    "Environment :: CPU"
+    "License :: OSI Approved :: Apache Software License"
 ]
 dependencies = []
...