sglang / Commits / Commit 007b849b (Unverified)
Authored Oct 23, 2025 by Zaili Wang; committed by GitHub Oct 22, 2025.
[CPU] misc updates (#11906)
Parent: 8612811d

Showing 3 changed files with 33 additions and 28 deletions:
- docs/platforms/cpu_server.md (+23, -22)
- python/sglang/srt/utils/common.py (+9, -4)
- sgl-kernel/pyproject_cpu.toml (+1, -2)
docs/platforms/cpu_server.md @ 007b849b
````diff
 # CPU Servers

 The document addresses how to set up the [SGLang](https://github.com/sgl-project/sglang) environment and run LLM inference on CPU servers.
-Specifically, SGLang is well optimized on the CPUs equipped with Intel® AMX® Instructions,
+SGLang is enabled and optimized on the CPUs equipped with Intel® AMX® Instructions,
 which are 4th generation or newer Intel® Xeon® Scalable Processors.

 ## Optimized Model List

 A list of popular LLMs are optimized and run efficiently on CPU,
 including the most notable open-source models like Llama series, Qwen series,
-and the phenomenal high-quality reasoning model DeepSeek-R1.
+and DeepSeek series like DeepSeek-R1 and DeepSeek-V3.1-Terminus.

-| Model Name | BF16 | w8a8_int8 | FP8 |
+| Model Name | BF16 | W8A8_INT8 | FP8 |
 |:---:|:---:|:---:|:---:|
 | DeepSeek-R1 | | [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8) | [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) |
+| DeepSeek-V3.1-Terminus | | [IntervitensInc/DeepSeek-V3.1-Terminus-Channel-int8](https://huggingface.co/IntervitensInc/DeepSeek-V3.1-Terminus-Channel-int8) | [deepseek-ai/DeepSeek-V3.1-Terminus](https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus) |
 | Llama-3.2-3B | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | [RedHatAI/Llama-3.2-3B-quantized.w8a8](https://huggingface.co/RedHatAI/Llama-3.2-3B-Instruct-quantized.w8a8) | |
 | Llama-3.1-8B | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | [RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8](https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8) | |
 | QwQ-32B | | [RedHatAI/QwQ-32B-quantized.w8a8](https://huggingface.co/RedHatAI/QwQ-32B-quantized.w8a8) | |
````
````diff
@@ -36,7 +37,7 @@ git clone https://github.com/sgl-project/sglang.git
 cd sglang/docker
 # Build the docker image
-docker build -t sglang-cpu:main -f Dockerfile.xeon .
+docker build -t sglang-cpu:latest -f Dockerfile.xeon .
 # Initiate a docker container
 docker run \
````
````diff
@@ -48,7 +49,7 @@ docker run \
   -v ~/.cache/huggingface:/root/.cache/huggingface \
   -p 30000:30000 \
   -e "HF_TOKEN=<secret>" \
-  sglang-cpu:main /bin/bash
+  sglang-cpu:latest /bin/bash
 ```

 ### Install From Source
````
````diff
@@ -121,9 +122,9 @@ Notes:
 2. The flag `--tp 6` specifies that tensor parallelism will be applied using 6 ranks (TP6).
-The number of TP specified is how many TP ranks will be used during the execution.
-In a CPU platform, a TP rank means a sub-NUMA cluster (SNC).
-Usually we can get the SNC information (How many available) from Operation System.
-User can specify TP to be no more than the total available SNCs in current system.
+On a CPU platform, a TP rank means a sub-NUMA cluster (SNC).
+Usually we can get the SNC information (How many available) from the Operating System.
+Users can specify TP to be no more than the total available SNCs in current system.
 If the specified TP rank number differs from the total SNC count,
 the system will automatically utilize the first `n` SNCs.
````
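As context for this note: the number of SNCs the OS exposes, and therefore a sensible upper bound for `--tp`, can be read from sysfs. Below is a minimal sketch in Python, assuming the standard Linux `/sys/devices/system/node/node<N>` layout (the same layout `get_cpu_memory_capacity()` in the next file reads); `count_numa_nodes` is a hypothetical helper, not an sglang API, and `numactl --hardware` reports the same information.

```python
# Minimal sketch: count the NUMA nodes (SNCs) that Linux exposes.
# Assumes the standard sysfs layout /sys/devices/system/node/node<N>.
import glob


def count_numa_nodes() -> int:
    # Each sub-NUMA cluster appears as a node<N> directory under sysfs.
    return len(glob.glob("/sys/devices/system/node/node[0-9]*"))


if __name__ == "__main__":
    n = count_numa_nodes()
    print(f"{n} SNCs available; keep --tp <= {n}")
```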
````diff
@@ -175,29 +176,29 @@ Additionally, the requests can be formed with
 [OpenAI Completions API](https://docs.sglang.ai/basic_usage/openai_api_completions.html)
 and sent via the command line (e.g. using `curl`) or via your own script.

-## Example: Running DeepSeek-R1
+## Example: Running DeepSeek-V3.1-Terminus

-An example command to launch service for W8A8 DeepSeek-R1 on a Xeon® 6980P server:
+An example command to launch service for W8A8_INT8 DeepSeek-V3.1-Terminus on a Xeon® 6980P server:

 ```bash
 python -m sglang.launch_server \
-    --model meituan/DeepSeek-R1-Channel-INT8 \
+    --model IntervitensInc/DeepSeek-V3.1-Terminus-Channel-int8 \
     --trust-remote-code \
     --disable-overlap-schedule \
     --device cpu \
     --quantization w8a8_int8 \
     --host 0.0.0.0 \
     --mem-fraction-static 0.8 \
     --enable-torch-compile \
     --torch-compile-max-bs 4 \
     --tp 6
 ```

-Similarly, an example command to launch service for FP8 DeepSeek-R1 would be:
+Similarly, an example command to launch service for FP8 DeepSeek-V3.1-Terminus would be:

 ```bash
 python -m sglang.launch_server \
-    --model deepseek-ai/DeepSeek-R1 \
+    --model deepseek-ai/DeepSeek-V3.1-Terminus \
     --trust-remote-code \
     --disable-overlap-schedule \
     --device cpu \
...
````
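Once a server like the one above is up, requests can come from any HTTP client, as the hunk's context notes. A minimal sketch using Python's `requests` against the OpenAI-compatible completions endpoint; the model name, prompt, and port 30000 mirror the examples above and should be adjusted to your own settings.

```python
# Minimal sketch: send one completion request to the server launched above.
# Payload follows the OpenAI Completions schema the doc links to.
import requests

resp = requests.post(
    "http://localhost:30000/v1/completions",
    json={
        "model": "IntervitensInc/DeepSeek-V3.1-Terminus-Channel-int8",
        "prompt": "The capital of France is",
        "max_tokens": 32,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```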
python/sglang/srt/utils/common.py @ 007b849b
````diff
@@ -1623,13 +1623,18 @@ def get_cpu_memory_capacity():
         for numa_id in range(n_numa_node):
             file_meminfo = f"node{numa_id}/meminfo"
             with open(os.path.join(file_prefix, file_meminfo), "r") as f:
-                # 1st line contains 'MemTotal'
-                line = f.read().split("\n")[0]
-                numa_mem_list.append(int(line.split()[3]))
+                # MemTotal info is at the 1st line
+                line = f.readline()
+                # Expected format: "Node 0 MemTotal: 100000000 kB"
+                parts = line.split()
+                if len(parts) >= 4 and parts[2] == "MemTotal:":
+                    numa_mem_list.append(int(parts[3]))
+                else:
+                    raise ValueError(f"Unexpected format in {file_meminfo}: {line}")
         # Retrieved value in KB, need MB
         numa_mem = float(min(numa_mem_list) // 1024)
         return numa_mem
-    except FileNotFoundError:
+    except (FileNotFoundError, ValueError, IndexError):
         numa_mem = psutil.virtual_memory().total / n_numa_node
         # Retrieved value in Byte, need MB
         return float(numa_mem // (1 << 20))
````
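For context on the change above: the old code took the first line via `f.read().split("\n")[0]` and indexed `line.split()[3]` blindly; the new code validates the line layout first and raises `ValueError` otherwise, which the widened `except (FileNotFoundError, ValueError, IndexError)` turns into the `psutil` fallback. A standalone sketch of that parsing, run on an illustrative sample line (the value is made up):

```python
# Illustrative first line of a node meminfo file
# (e.g. /sys/devices/system/node/node0/meminfo).
line = "Node 0 MemTotal:       100000000 kB"
parts = line.split()
# Index parts[3] only after validating the layout, as the new code does.
if len(parts) >= 4 and parts[2] == "MemTotal:":
    mem_total_kb = int(parts[3])        # value is reported in kB
    print(mem_total_kb // 1024, "MB")   # -> 97656 MB
else:
    raise ValueError(f"Unexpected format: {line}")
```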
sgl-kernel/pyproject_cpu.toml @ 007b849b
````diff
@@ -15,8 +15,7 @@ requires-python = ">=3.10"
 license = { file = "LICENSE" }
 classifiers = [
     "Programming Language :: Python :: 3",
-    "License :: OSI Approved :: Apache Software License",
-    "Environment :: CPU"
+    "License :: OSI Approved :: Apache Software License"
 ]
 dependencies = []
````