<!-- omit in toc -->
# GPT-4/o1-level Local VSCode Copilot on a Desktop with only 24GB VRAM
- [SUMMARY](#summary)
	- [Prerequisites](#prerequisites)
	- [Bench Result](#bench-result)
		- [V0.2](#v02)
			- [Settings](#settings)
			- [Memory consumption](#memory-consumption)
			- [Benchmark Results](#benchmark-results)
		- [V0.3-Preview](#v03-preview)
			- [Settings](#settings-1)
			- [Memory consumption](#memory-consumption-1)
			- [Benchmark Results](#benchmark-results-1)
	- [How to Run](#how-to-run)
		- [V0.2 Showcase](#v02-showcase)
			- [Single socket version (32 cores)](#single-socket-version-32-cores)
			- [Dual socket version (64 cores)](#dual-socket-version-64-cores)
		- [V0.3 Showcase](#v03-showcase)
			- [Dual socket version (64 cores)](#dual-socket-version-64-cores-1)
	- [Some Explanations](#some-explanations)
	- [FAQ](#faq)

# SUMMARY

> **Feb 10, 2025**: Support for DeepSeek-R1 and V3 on a single GPU (24GB VRAM) or multiple GPUs plus 382GB DRAM, with up to 3~28x speedup.<br>

Hi, we're the KTransformers team (formerly known for our local CPU/GPU hybrid inference open source project with DeepSeek-V2).  

We've heard your requests for DeepSeek-R1/V3 support—and we're excited to finally deliver! 
Apologies for the wait, but we've been cooking up something truly amazing!

Today, we're proud to announce that we now support DeepSeek-R1/V3, as showcased in the video below:  

https://github.com/user-attachments/assets/ebd70bfa-b2c1-4abb-ae3b-296ed38aa285


- **[NEW!!!] Local 671B DeepSeek-Coder-V3/R1:** Running its Q4_K_M version using only 14GB VRAM and 382GB DRAM.
	- Prefill Speed (tokens/s): 
 		- KTransformers: 54.21 (32 cores) → 74.362 (dual-socket, 2×32 cores) → 255.26 (optimized AMX-based MoE kernel, V0.3 only) → 286.55 (selectively using 6 experts, V0.3 only)  
 		- Compared to 10.31 tokens/s in llama.cpp with 2×32 cores, achieving up to **27.79× speedup**.  
 	- Decode Speed (tokens/s):  
 		- KTransformers: 8.73 (32 cores) → 11.26 (dual-socket, 2×32 cores) → 13.69 (selectively using 6 experts, V0.3 only)  
 		- Compared to 4.51 tokens/s in llama.cpp with 2×32 cores, achieving up to **3.03× speedup**.  
  

We also preview our upcoming optimizations, including an Intel AMX-accelerated kernel and a selective expert activation method, which will significantly enhance performance. With V0.3-preview, we achieve up to 286 tokens/s for prefill, making it up to **28× faster than llama.cpp** for local inference.
The binary distribution is available now and the source code will come ASAP! Check out the wheel package [here](https://github.com/kvcache-ai/ktransformers/releases/download/v0.1.4/ktransformers-0.3.0rc0+cu126torch26fancy-cp311-cp311-linux_x86_64.whl)  


## Prerequisites
We run our best-performance tests (V0.2) on: <br>
CPU: Intel(R) Xeon(R) Gold 6454S, 2 sockets (2 NUMA nodes) <br>
GPU: RTX 4090D, 24GB VRAM <br>
Memory: standard DDR5-4800 server DRAM (1TB in total)
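
To check whether your own machine is comparable, here is a minimal sketch using standard Linux tools (`lscpu`, `free`, `nvidia-smi`); nothing in it is specific to KTransformers:

``` shell
# CPU model, socket count, and NUMA layout
lscpu | grep -E 'Model name|Socket|NUMA'
# Total DRAM
free -h
# GPU model and VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv
```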
## Bench Result
### V0.2
#### Settings
- Model: DeepSeek-V3, Q4_K_M (int4)
- CPU: Intel(R) Xeon(R) Gold 6454S, 32 cores per socket, 2 sockets, 2 NUMA nodes
- GPU: RTX 4090D, 24GB VRAM
- We test after sufficient warm-up
#### Memory consumption
- Single socket: 382GB DRAM, at least 14GB VRAM
- Dual socket: 1TB DRAM, at least 14GB VRAM

#### Benchmark Results

"6 experts" case is part of V0.3's preview

| Prompt<br>(500 tokens) | Dual-socket KTrans (6 experts) | Dual-socket KTrans (8 experts) | Single-socket KTrans (6 experts) | Single-socket KTrans (8 experts) | llama.cpp (8 experts) |
| --- | --- | --- | --- | --- | --- | 
| Prefill token/s | 97.32 | 82.94 | 65.14 | 54.21 | 10.31 |
| Decode token/s | 13.69 | 12.208 | 10.303 | 8.73 | 4.51 |

**The highest speedup reaches up to <u>3.03x</u> in decoding and <u>9.44x</u> in prefill.**

### V0.3-Preview
#### Settings
- Model: DeepSeek-V3 BF16 (quantized online to int8 for CPU and int4 for GPU)
- CPU: Intel(R) Xeon(R) Gold 6454S, 32 cores per socket, 2 sockets, 2 NUMA nodes
- GPU: (1~4)x RTX 4090D, 24GB VRAM each (longer prompts require more VRAM)

#### Memory consumption
- 644GB DRAM, at least 14GB VRAM

#### Benchmark Results
| Prompt length  | 1K  | 2K  | 4K  | 8K |
|---------------|-----|-----|-----|-----|
| KTrans (8 experts) Prefill token/s |   185.96  |  255.26   |  252.58   |  195.62   |
| KTrans (6 experts) Prefill token/s |   203.70  |  286.55   |  271.08   |  207.20   |

**The prefill speed of KTrans V0.3 is up to <u>3.45x</u> faster than KTrans V0.2, and up to <u>27.79x</u> faster than llama.cpp.**
**The decode speed is the same as in KTrans V0.2 (6-expert version), so it is omitted.**

The main acceleration comes from:
- The Intel AMX instruction set and our specially designed cache-friendly memory layout
- An expert selection strategy that activates fewer experts, based on offline profiling results on out-of-domain data


*Our research on DeepSeek-V2, DeepSeek-V3, and DeepSeek-R1 shows that when we slightly decrease the number of activated experts during inference, the output quality doesn't change, while both prefill and decoding speed up, which is encouraging. Our showcase therefore makes use of this finding.*

## How to Run
### V0.2 Showcase
#### Single socket version (32 cores)
Our local_chat test command is:
``` shell
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
numactl -N 1 -m 1 python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path <your gguf path>  --prompt_file <your prompt txt file>  --cpu_infer 33  --cache_lens 1536 
# When the chat prompt appears, press Enter to load the text from prompt_file
```
\<your model path\> can be a local path or an online Hugging Face model ID such as deepseek-ai/DeepSeek-V3. If the online download runs into connection problems, try the mirror (hf-mirror.com). <br>
\<your gguf path\> can also be online, but since the files are large we recommend downloading them and quantizing the model to the format you want (a download sketch is shown below). <br>
The `numactl -N 1 -m 1` prefix binds execution and memory allocation to one NUMA node to avoid data transfer between NUMA nodes.
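
For reference, a hedged sketch of fetching a GGUF copy locally and inspecting the NUMA layout before choosing which node to bind; the repo ID is a placeholder, and `huggingface-cli` ships with the `huggingface_hub` package:

``` shell
# Download only the Q4_K_M GGUF files to a local directory (replace the placeholder repo ID)
pip install -U huggingface_hub
huggingface-cli download <gguf repo id> --include "*Q4_K_M*" --local-dir ./DeepSeek-V3-GGUF

# Inspect the NUMA layout to see which node IDs are available for numactl -N/-m
numactl --hardware
```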
#### Dual socket version (64 cores)
Make sure to set the environment variable `USE_NUMA=1` (e.g. `export USE_NUMA=1`) before installing (with install.sh or `make dev_install`); if you have already installed, reinstall with this variable set. <br>
Our local_chat test command is:
``` shell
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
export USE_NUMA=1
make dev_install # or sh ./install.sh
python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path <your gguf path>  --prompt_file <your prompt txt file>  --cpu_infer 65  --cache_lens 1536 
# When the chat prompt appears, press Enter to load the text from prompt_file
```
The parameters have the same meaning as above; since we use dual sockets, we set `--cpu_infer` to 65.

### V0.3 Showcase
#### Dual socket version (64 cores)
Our local_chat test command is:
``` shell
wget https://github.com/kvcache-ai/ktransformers/releases/download/v0.1.4/ktransformers-0.3.0rc0+cu126torch26fancy-cp311-cp311-linux_x86_64.whl
pip install ./ktransformers-0.3.0rc0+cu126torch26fancy-cp311-cp311-linux_x86_64.whl
python -m ktransformers.local_chat --model_path <your model path> --gguf_path <your gguf path>  --prompt_file <your prompt txt file>  --cpu_infer 65  --cache_lens 1536 
# When the chat prompt appears, press Enter to load the text from prompt_file
```
The parameters have the same meaning as in V0.2; since we use dual sockets, we set `--cpu_infer` to 65.

## Some Explanations
1. We also want to make further use of the two NUMA nodes on the Xeon Gold CPU.
To avoid the cost of data transfer between nodes, we "copy" the critical matrices onto
both nodes, which consumes more memory but accelerates both prefill and decoding.
The downside is that loading weights takes a lot of memory and time, so be patient during loading
and monitor the memory usage (see the check sketch after this list). We are going to optimize this memory overhead. Stay tuned~ <br>
2. The command argument `--cpu_infer 65` specifies how many cores to use (it is fine if it exceeds the physical number,
but more is not always better; adjust it to slightly below your actual number of cores, which you can check as shown after this list).<br>

3. Why CPU/GPU Hybrid Inference?
DeepSeek's MLA operators are highly computationally intensive. While running everything on CPU is possible, offloading the heavy computations to the GPU results in a massive performance boost.  

4. Where Does the Speedup Come From?

   - Expert Offload: Unlike traditional layer-based or KVCache offloading (as seen in llama.cpp), we offload the expert computation to the CPU and MLA/KVCache to GPU, aligning perfectly with DeepSeek’s architecture for optimal efficiency.  
   - Intel AMX Optimization: Our AMX-accelerated kernel is meticulously tuned, running several times faster than existing llama.cpp implementations. We plan to open-source this kernel after cleaning it up and are considering upstream contributions to llama.cpp.  

5. Why Intel CPUs?
Intel is currently the only CPU vendor that supports AMX-like instructions, which deliver significantly better performance than AVX-only alternatives. You can verify AMX support, core counts, and per-node memory usage with the sketch below.
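
As referenced in points 1, 2, and 5 above, here is a minimal sketch for checking core counts, AMX support, and per-NUMA-node memory usage while the weights load; it assumes the usual Linux utilities (`lscpu`, `numastat`, `watch`) rather than anything provided by KTransformers:

``` shell
# Physical core counts (for choosing a --cpu_infer value)
lscpu | grep -E 'Socket|Core|Thread'

# AMX support: look for amx_tile / amx_int8 / amx_bf16 among the CPU flags
grep -o 'amx[^ ]*' /proc/cpuinfo | sort -u

# Watch overall and per-NUMA-node memory usage while the weights are loading
watch -n 5 "free -h; numastat -m | head -n 20"
```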
## FAQ
[See detail](./FAQ.md)