# Report
## Prerequisites
We run our best performance tests (V0.2) on: <br>
CPU: Intel(R) Xeon(R) Gold 6454S with 1TB DRAM (2 NUMA nodes) <br>
GPU: 4090D with 24GB VRAM <br>
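
If you want to verify that your machine matches this topology before reproducing the numbers (a quick check added here, not part of the original report; it assumes `lscpu` and `numactl` are installed):
``` shell
# Show socket and NUMA layout; this report assumes 2 sockets / 2 NUMA nodes.
lscpu | grep -E 'Socket\(s\)|NUMA node'
# Per-node CPU lists and memory sizes.
numactl --hardware
```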
## Bench Result
### V0.2
#### Settings
- Model: DeepseekV3-q4km (int4) <br>
- CPU: Intel(R) Xeon(R) Gold 6454S, 32 cores per socket, 2 sockets, 2 NUMA nodes
- GPU: 4090D, 24GB VRAM
- We test after sufficient warm-up
#### Memory consumption:
  - Single socket: 382GB DRAM, at least 12GB VRAM
  - Dual socket: 1TB DRAM, at least 12GB VRAM
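
As a rough sanity check before loading the model (our addition, not part of the original measurements), you can confirm how much DRAM and VRAM are actually free:
``` shell
# Host memory (DRAM)
free -h
# GPU memory (VRAM)
nvidia-smi --query-gpu=memory.total,memory.free --format=csv
```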

#### Benchmark Results

"6 experts" case is part of v0.3's preview

| Prompt<br>(500 tokens) | Dual socket Ktrans (6 experts) | Dual socket Ktrans (8 experts) | Single socket Ktrans (6 experts) | Single socket Ktrans (8 experts)| llama.cpp (8 experts) | 
| --- | --- | --- | --- | --- | --- | 
| Prefill token/s | 97.32 | 82.94 | 65.14 | 54.21 | 10.31 |
| Decode token/s | 13.69 | 12.208 | 10.303 | 8.73 | 4.51 |

**The highest speedup is up to <u>3.03x</u> in decoding and <u>9.44x</u> in prefill.**

### V0.3-Preview
#### Settings
- Model: DeepseekV3-BF16 (quantized online to int8 for CPU and int4 for GPU)
- CPU: Intel(R) Xeon(R) Gold 6454S, 32 cores per socket, 2 sockets, 2 NUMA nodes
- GPU: (1~4)x 4090D, 24GB VRAM each (longer prompts require more VRAM)

#### Memory consumption:
- 644GB DRAM, at least 12GB VRAM

#### Benchmark Results
| Prompt length  | 1K  | 2K  | 4K  | 8K |
|---------------|-----|-----|-----|-----|
| KTrans (8 experts) Prefill token/s |   185.96  |  255.26   |  252.58   |  195.62   |
| KTrans (6 experts) Prefill token/s |   203.70  |  286.55   |  271.08   |  207.20   |

**The prefill of KTrans V0.3 is up to <u>3.45x</u> faster than KTrans V0.2, and up to <u>63.53x</u> faster than llama.cpp.**
**The decoding speed is the same as in KTrans V0.2 (6-expert version), so it is omitted here.**

The main acceleration comes from:
- the Intel AMX instruction set and our specially designed, cache-friendly memory layout (see the quick check below)
- an expert selection strategy that activates fewer experts, based on offline profiling results on out-of-domain data

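To check that your CPU actually exposes AMX (a quick check added here; availability depends on the CPU generation, e.g. 4th-gen Xeon Scalable):
``` shell
# AMX support shows up as amx_tile / amx_int8 / amx_bf16 CPU flags.
grep -o 'amx_[a-z0-9]*' /proc/cpuinfo | sort -u
```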

*From our research on DeepSeekV2, DeepSeekV3, and DeepSeekR1,
we found that slightly decreasing the number of activated experts during inference
does not change the output quality, while both prefill and decoding
speed up, which is inspiring. Our showcase therefore makes use of this finding.*

## How to Run
### V0.2 Showcase
#### Single socket version (32 cores)
Our local_chat test command is:
``` shell
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
numactl -N 1 -m 1 python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path <your gguf path>  --prompt_file <your prompt txt file>  --cpu_infer 33  --cache_lens 1536 
# when the "chat" prompt appears, press enter to load the text from prompt_file
```
\<your model path\> can be a local path or an online Hugging Face repo id such as deepseek-ai/DeepSeek-V3. If the online download hits connection problems, try the mirror (hf-mirror.com); see the sketch below. <br>
\<your gguf path\> can also be online, but since the files are large we recommend downloading them and quantizing the model to the format you want <br>
The `numactl -N 1 -m 1` prefix avoids data transfer between NUMA nodes.
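
If you need the mirror mentioned above, one common approach (an assumption on our side: it relies on the model being downloaded through huggingface_hub, which honors the `HF_ENDPOINT` environment variable) is:
``` shell
# Route Hugging Face downloads through the mirror before running local_chat.py.
export HF_ENDPOINT=https://hf-mirror.com
```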
#### Dual socket version (64 cores)
Make sure that before you install (using install.sh or `make dev_install`), you set the environment variable `USE_NUMA=1`, e.g. via `export USE_NUMA=1` (if already installed, reinstall with this environment variable set) <br>
Our local_chat test command is:
``` shell
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
export USE_NUMA=1
make dev_install # or sh ./install.sh
python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path <your gguf path>  --prompt_file <your prompt txt file>  --cpu_infer 65  --cache_lens 1536 
# when the "chat" prompt appears, press enter to load the text from prompt_file
```
The parameters have the same meaning. Since we use dual sockets, we set `--cpu_infer` to 65.
## Some Explanations
1. We also want to make further use of the two NUMA nodes on the Xeon Gold CPU. 
To avoid the cost of data transfer between nodes, we "copy" the critical matrices onto 
both nodes, which consumes more memory but accelerates both prefill and decoding. 
However, this method uses a lot of memory and is slow when loading weights, so please be patient during loading 
and monitor the memory usage (we are considering making this method optional). We are going to optimize this large memory overhead. Stay tuned~ <br>
2. The command-line argument `--cpu_infer 65` specifies how many cores to use (it is fine if it exceeds the physical count, 
but more is not always better; adjust it down to roughly your actual number of cores, as in the helper below).<br>
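
A small helper (our addition; it just parses `lscpu` output) for finding the physical core count that `--cpu_infer` should be based on:
``` shell
# Physical cores = sockets x cores per socket; the tutorial sets --cpu_infer
# to roughly this number (e.g. 65 for 2 sockets x 32 cores).
sockets=$(lscpu | awk '/^Socket\(s\):/ {print $2}')
cores=$(lscpu | awk '/^Core\(s\) per socket:/ {print $4}')
echo "physical cores: $((sockets * cores))"
```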