# GPT-4/o1-level Local VSCode Copilot on a Desktop with only 24GB VRAM
# SUMMARY

- **Feb 10, 2025**: Support DeepSeek-R1 and V3 on single GPU (24GB VRAM)/multi-GPU and 382GB DRAM, up to 3~28x speedup.<br>

https://github.com/user-attachments/assets/ebd70bfa-b2c1-4abb-ae3b-296ed38aa285


- **[NEW!!!] Local 671B DeepSeek-Coder-V3/R1:** Running its Q4_K_M version using only 14GB VRAM and 382GB DRAM.
	- Prefill Speed (tokens/s): 
		- KTransformers: 54.21 (32 cores) → 74.362 (dual-socket, 2×32 cores) → 255.26 (optimized AMX-based MoE kernel, V0.3 only) → 286.55 (selectively using 6 experts, V0.3 only)  
		- Compared to 10.31 tokens/s in llama.cpp with 2×32 cores, achieving up to **27.79× speedup**.  
	- Decode Speed (tokens/s):  
		- KTransformers: 8.73 (32 cores) → 11.26 (dual-socket, 2×32 cores) → 13.69 (selectively using 6 experts, V0.3 only)  
 		- Compared to 4.51 tokens/s in llama.cpp with 2×32 cores, achieving up to **3.03× speedup**.  
	- Upcoming Open Source Release:
		- AMX optimizations and selective expert activation will be open-sourced in V0.3.  
		- Currently available only in preview binary distribution, which can be found [here](xxx).  


## Prerequisites
We run our best performance tests (V0.2) on: <br>
CPU: Intel (R) Xeon (R) Gold 6454S, 1TB DRAM (2 NUMA nodes) <br>
GPU: 4090D, 24GB VRAM <br>
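
If you want to double-check that your machine matches this setup before running, here is a minimal sketch using standard Linux/NVIDIA tools (not part of KTransformers):

``` shell
lscpu | grep -E "Model name|Socket|NUMA"                 # CPU model, sockets, NUMA nodes
free -h                                                  # total DRAM
nvidia-smi --query-gpu=name,memory.total --format=csv    # GPU model and VRAM
```
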
## Bench Result
### V0.2
#### Settings
- Model: DeepseekV3-q4km (int4)<br>
- CPU: Intel (R) Xeon (R) Gold 6454S, 32 cores per socket, 2 sockets, 2 NUMA nodes
- GPU: 4090D, 24GB VRAM
- We test after sufficient warm-up
#### Memory consumption:
  - Single socket: 382GB DRAM, at least 14GB VRAM
  - Dual socket: 1TB DRAM, at least 14GB VRAM

#### Benchmark Results

"6 experts" case is part of V0.3's preview

| Prompt<br>(500 tokens) | Dual socket KTrans (6 experts) | Dual socket KTrans (8 experts) | Single socket KTrans (6 experts) | Single socket KTrans (8 experts) | llama.cpp (8 experts) |
| --- | --- | --- | --- | --- | --- | 
| Prefill token/s | 97.32 | 82.94 | 65.14 | 54.21 | 10.31 |
| Decode token/s | 13.69 | 12.208 | 10.303 | 8.73 | 4.51 |

**The highest speedup reaches up to <u>3.03x</u> in decoding (13.69 vs. 4.51 tokens/s) and <u>9.44x</u> in prefill (97.32 vs. 10.31 tokens/s).**

### V0.3-Preview
#### Settings
- Model: DeepseekV3-BF16 (quantized online to int8 for CPU and int4 for GPU)
- CPU: Intel (R) Xeon (R) Gold 6454S, 32 cores per socket, 2 sockets, 2 NUMA nodes
- GPU: (1~4)x 4090D, 24GB VRAM (longer prompts require more VRAM)

#### Memory consumption:
- 644GB DRAM, at least 14GB VRAM

#### Benchmark Results
| Prompt length  | 1K  | 2K  | 4K  | 8K |
|---------------|-----|-----|-----|-----|
| KTrans (8 experts) Prefill token/s |   185.96  |  255.26   |  252.58   |  195.62   |
| KTrans (6 experts) Prefill token/s |   203.70  |  286.55   |  271.08   |  207.20   |

**The prefill of KTrans V0.3 is up to <u>3.45x</u> faster than KTrans V0.2, and up to <u>27.79x</u> faster than llama.cpp.**
**The decode speed is the same as KTrans V0.2 (6-expert version), so it is omitted.**

The main acceleration comes from:
- The Intel AMX instruction set and our specially designed cache-friendly memory layout
- An expert selection strategy that activates fewer experts, based on offline profiling results on out-of-domain data


*From our research on DeepSeek-V2, DeepSeek-V3 and DeepSeek-R1: when we slightly decrease the number of activated experts during inference,
the output quality does not change, but both decoding and prefill speed up, which is inspiring. Our showcase makes use of this finding.*

## How to Run
### V0.2 Showcase
#### Single socket version (32 cores)
Our local_chat test command is:
``` shell
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
numactl -N 1 -m 1 python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path <your gguf path>  --prompt_file <your prompt txt file>  --cpu_infer 33  --cache_lens 1536 
# when you see the chat prompt, press enter to load the text from prompt_file
```
\<your model path\> can be a local path or an online Hugging Face path such as deepseek-ai/DeepSeek-V3. If the online download runs into connection problems, try using the mirror (hf-mirror.com). <br>
\<your gguf path\> can also be online, but since the files are large we recommend downloading them and quantizing the model to the format you want (one possible way is sketched below). <br>
The `numactl -N 1 -m 1` prefix aims to avoid data transfer between NUMA nodes.
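
As one concrete way to fetch the GGUF weights locally, here is a minimal sketch using the Hugging Face CLI; the repository id and file pattern are placeholders for whichever DeepSeek-V3/R1 Q4_K_M GGUF source you actually use, and the mirror line is optional:

``` shell
# optional: route Hugging Face downloads through the mirror
export HF_ENDPOINT=https://hf-mirror.com

pip install -U "huggingface_hub[cli]"

# download only the Q4_K_M shards into a local folder
# (replace <gguf repo id> with the GGUF repository you use)
huggingface-cli download <gguf repo id> \
    --include "*Q4_K_M*" \
    --local-dir ./DeepSeek-V3-GGUF-Q4_K_M
```
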
#### Dual socket version (64 cores)
Make sure that before you install (using install.sh or `make dev_install`), you set the env var `USE_NUMA=1` by `export USE_NUMA=1` (if already installed, reinstall with this env var set). <br>
Our local_chat test command is:
``` shell
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
export USE_NUMA=1
make dev_install # or sh ./install.sh
python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path <your gguf path>  --prompt_file <your prompt txt file>  --cpu_infer 65  --cache_lens 1536 
# when you see the chat prompt, press enter to load the text from prompt_file
```
The parameters' meaning is the same. But As we  use dual socket, we set cpu_infer to 65
## Some Explanations
1. We also want to make further use of the two NUMA nodes on the Xeon Gold CPU.
To avoid the cost of data transfer between nodes, we "copy" the critical matrices onto
both nodes, which consumes more memory but accelerates the prefill and decoding process.
The downside is that this method takes a lot of memory and is slow when loading weights, so be patient during loading
and monitor the memory usage (we are considering making this method optional). We are going to optimize this memory overhead. Stay tuned~ <br>
2. The command argument `--cpu_infer 65` specifies how many cores to use (it is OK if it exceeds the physical number,
but more is not always better; adjust it to slightly below your actual number of cores). The sketch after this list shows one way to check the core count and monitor memory usage.<br>
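
As an optional helper for points 1 and 2 above, here is a minimal sketch using standard Linux tools (not part of KTransformers) to check the physical core / NUMA layout before choosing `--cpu_infer`, and to watch DRAM usage while the weights load:

``` shell
# physical cores per socket, socket count and NUMA layout
lscpu | grep -E "^(Socket|Core|NUMA)"
numactl --hardware

# refresh memory usage every 5 seconds while local_chat is loading weights
watch -n 5 free -h
```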