<!-- omit in toc -->

# GPT-4/o1-level Local VSCode Copilot on a Desktop with only 24GB VRAM

- [SUMMARY](#summary)
  - [Show Case Environment](#show-case-environment)
  - [Bench Result](#bench-result)
    - [V0.2.1](#v021)
      - [Memory consumption:](#memory-consumption)
      - [Change Log](#change-log)
      - [Benchmark Results](#benchmark-results)
    - [V0.2](#v02)
      - [Settings](#settings)
      - [Memory consumption:](#memory-consumption-1)
      - [Benchmark Results](#benchmark-results-1)
    - [V0.3-Preview](#v03-preview)
      - [Settings](#settings-1)
      - [Memory consumptions:](#memory-consumptions)
      - [Benchmark results](#benchmark-results-2)
  - [How to Run](#how-to-run)
    - [v0.2.4](#v024)
    - [v0.2.2 \& v0.2.3 longer context \& FP8 kernel](#v022--v023-longer-context--fp8-kernel)
      - [longer context](#longer-context)
      - [FP8 kernel](#fp8-kernel)
    - [V0.2 \& V0.2.1 Showcase](#v02--v021-showcase)
      - [Single socket version (32 cores)](#single-socket-version-32-cores)
      - [Dual socket version (64 cores)](#dual-socket-version-64-cores)
    - [V0.3 Showcase](#v03-showcase)
      - [Dual socket version (64 cores)](#dual-socket-version-64-cores-1)
  - [Some Explanations](#some-explanations)
  - [Next](#next)
    - [Faster](#faster)
    - [Easier](#easier)
  - [FAQ](#faq)
    - [R1 No Thinking](#r1-no-thinking)
    - [More FAQ](#more-faq)

# SUMMARY

> **Feb 10, 2025**: Support DeepSeek-R1 and V3 on a single GPU (24GB VRAM) or multiple GPUs and 382GB DRAM, with up to a 3~28x speedup.<br>

Hi, we're the KTransformers team (formerly known for our local CPU/GPU hybrid inference open source project with DeepSeek-V2).

We've heard your requests for DeepSeek-R1/V3 support—and we're excited to finally deliver!
Apologies for the wait, but we've been cooking up something truly amazing!

Today, we're proud to announce support for DeepSeek-R1/V3, as showcased in the video below:

https://github.com/user-attachments/assets/ebd70bfa-b2c1-4abb-ae3b-296ed38aa285


- **[NEW!!!] Local 671B DeepSeek-Coder-V3/R1:** Running its Q4_K_M version using only 14GB VRAM and 382GB DRAM.
  - Prefill Speed (tokens/s):
    - KTransformers: 54.21 (32 cores) → 74.362 (dual-socket, 2×32 cores) → 255.26 (optimized AMX-based MoE kernel, V0.3 only) → 286.55 (selectively using 6 experts, V0.3 only)
    - Compared to 10.31 tokens/s in llama.cpp with 2×32 cores, achieving up to **27.79× speedup**.
  - Decode Speed (tokens/s):
    - KTransformers: 8.73 (32 cores) → 11.26 (dual-socket, 2×32 cores) → 13.69 (selectively using 6 experts, V0.3 only)
    - Compared to 4.51 tokens/s in llama.cpp with 2×32 cores, achieving up to **3.03× speedup**.

We also preview our upcoming optimizations, including an Intel AMX-accelerated kernel and a selective expert activation method, which will further enhance performance. With V0.3-preview, we achieve up to 286 tokens/s for prefill, making it up to **28× faster than llama.cpp** for local inference.
The binary distribution is available now and the source code will come ASAP! Check out the wheel package [here](https://github.com/kvcache-ai/ktransformers/releases/download/v0.1.4/ktransformers-0.3.0rc0+cu126torch26fancy-cp311-cp311-linux_x86_64.whl).

> **Feb 15, 2025**: KTransformers V0.2.1: Longer Context (from 4K to 8K for 24GB VRAM) & Slightly Faster Speed (+15%) (Up to 16 Tokens/s). Docs updated [here](./doc/en/DeepseekR1_V3_tutorial.md) and in the [online book](https://kvcache-ai.github.io/ktransformers/).

We have sped up decode and prefill a little bit. The performance improvement is limited mainly because inference is still constrained by the CPU's computational speed and memory bandwidth; the MLA part handled by the GPU accounts for a relatively small proportion of the total time.

Besides the improvements in speed, we've also significantly updated the documentation to enhance usability, including:<br>

- Added Multi-GPU configuration tutorial.
- Consolidated installation guide.
- Added a detailed tutorial on registering extra GPU memory with ExpertMarlin.

## Show Case Environment

We run our best performance tests (V0.2) on <br>
CPU: Intel (R) Xeon (R) Gold 6454S, 2 sockets (2 NUMA nodes) <br>
GPU: 4090D 24G VRAM <br>
Memory: standard DDR5-4800 server DRAM (1 TB in total), 8×DDR5-4800 per socket
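
If you want to confirm that your own CPU exposes the AMX instructions used by the V0.3 kernel preview, a quick check on Linux (a minimal sketch; it simply looks for the standard `amx_tile`/`amx_bf16`/`amx_int8` CPU flags):

```shell
# List any AMX-related CPU flags reported by the kernel.
lscpu | grep -o 'amx[^ ]*' | sort -u
```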

## Bench Result

### V0.2.1

- Model: DeepseekV3-q4km (int4)<br>
- CPU: Intel (R) Xeon (R) Gold 6454S, 32 cores per socket, 2 sockets, 2 NUMA nodes
- GPU: 4090 24G VRAM
- We test after sufficient warm-up

#### Memory consumption:

- Single socket: 382G DRAM, at least 14GB VRAM
- Dual socket: 1T DRAM, at least 14GB VRAM

#### Change Log

- Longer Context (from 4K to 8K for 24GB VRAM) and Slightly Faster Speed (+15%):<br>
  Integrated the highly efficient Triton MLA kernel from the fantastic sglang project, enabling a much longer context length and slightly faster prefill/decode speed
- We suspect that some of the improvements come from the change of hardware platform (4090D->4090)

#### Benchmark Results

"6 experts" case is part of V0.3's preview


105
106
107
108
109
110
111
112
113
114
115
116
| Prompt               | hi (2)    | 1K (969)   | 2K (1930)  | 4K (3846)                | 8K (7678)  |
| -------------------- | --------- | ---------- | ---------- | ------------------------ | ---------- |
| Output length        | 10 tokens | 300 tokens | 300 tokens | 300 tokens               | 300 tokens |
| **6 experts V0.2.0** |           |            |            |                          |            |
| Prefill token/s      | 13        | 105        | 102        | 88                       | CUDA OOM   |
| Decode token/s       | 16.8      | 15.4       | 14.2       | 13.0                     | CUDA OOM   |
| **6 experts V0.2.1** |           |            |            |                          |            |
| Prefill token/s      | 13        | 111        | 112.5      | 102 **(1.16x speedup)**  | 101        |
| Decode token/s       | 16.8      | 15.9       | 15.4       | 14.9 **(1.15x speedup)** | 13.9       |
| **8 experts V0.2.1** |           |            |            |                          |            |
| Prefill token/s      | 12.2      | 88.2       | 88.5       | 81.9                     | 80         |
| Decode token/s       | 13.4      | 13.5       | 13.4       | 13.2                     | 12.4       |

### V0.2

#### Settings

- Model: DeepseekV3-q4km (int4)<br>
- CPU: Intel (R) Xeon (R) Gold 6454S, 32 cores per socket, 2 sockets, 2 NUMA nodes
- GPU: 4090D 24G VRAM
- We test after sufficient warm-up

#### Memory consumption:

- Single socket: 382G DRAM, at least 14GB VRAM
- Dual socket: 1T DRAM, at least 14GB VRAM

#### Benchmark Results

"6 experts" case is part of V0.3's preview
liam's avatar
liam committed
135


| Prompt<br>(500 tokens) | Dual socket Ktrans (6 experts) | Dual socket Ktrans (8 experts) | Single socket Ktrans (6 experts) | Single socket Ktrans (8 experts) | llama.cpp (8 experts) |
| ---------------------- | ------------------------------ | ------------------------------ | -------------------------------- | -------------------------------- | --------------------- |
| Prefill token/s        | 97.32                          | 82.94                          | 65.14                            | 54.21                            | 10.31                 |
| Decode token/s         | 13.69                          | 12.208                         | 10.303                           | 8.73                             | 4.51                  |

**The highest speedup reaches up to <u>3.03x</u> in decoding and <u>9.44x</u> in prefill.**

### V0.3-Preview

#### Settings

- Model: DeepseekV3-BF16 (quantized online to int8 for CPU and int4 for GPU)
- CPU: Intel (R) Xeon (R) Gold 6454S, 32 cores per socket, 2 sockets, 2 NUMA nodes
- GPU: (1~4)x 4090D 24GB VRAM (longer prompts require more VRAM)

#### Memory consumptions:

- 644GB DRAM, at least 14GB VRAM

#### Benchmark results


| Prompt length                      | 1K     | 2K     | 4K     | 8K     |
| ---------------------------------- | ------ | ------ | ------ | ------ |
| KTrans (8 experts) Prefill token/s | 185.96 | 255.26 | 252.58 | 195.62 |
| KTrans (6 experts) Prefill token/s | 203.70 | 286.55 | 271.08 | 207.20 |

**The prefill of KTrans V0.3 is up to <u>3.45x</u> faster than KTrans V0.2, and up to <u>27.79x</u> faster than llama.cpp.**
**The decoding speed is the same as KTrans V0.2 (6 experts version), so it is omitted.**

The main acceleration comes from:

- The Intel AMX instruction set and our specially designed cache-friendly memory layout
- An expert selection strategy that selects fewer experts based on offline profiling results on out-of-domain data

*From our research on DeepSeek-V2, DeepSeek-V3, and DeepSeek-R1, when we slightly decrease the number of activated experts during inference, the output quality does not change, but decoding and prefill speed up, which is inspiring. So our showcase makes use of this finding.*

## How to Run

### v0.2.4 
We provide a server script that supports multi-concurrency in version v0.2.4.

```
python ktransformers/server/main.py --model_path /mnt/data/models/DeepSeek-V3 --gguf_path /mnt/data/models/DeepSeek-V3-GGUF/DeepSeek-V3-Q4_K_M/ --cpu_infer 62 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml --port 10002 --chunk_size 256 --max_new_tokens 1024 --max_batch_size 4 --cache_lens 32768 --backend_type balance_serve
```
It features the following arguments (a sample request follows the list):

- `--chunk_size`: Maximum number of tokens processed in a single run by the engine.
- `--cache_lens`: Total length of the KV cache allocated by the scheduler. All requests share a KV cache space corresponding to 32768 tokens; the space a request occupies is released after it completes.
- `--backend_type`: `balance_serve` is a multi-concurrency backend engine introduced in version v0.2.4. The original single-concurrency engine is `ktransformers`.
- `--max_batch_size`: Maximum number of requests (prefill + decode) processed in a single run by the engine. (Supported only by `balance_serve`)

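To sanity-check the running server, you can send it a request. This is a minimal sketch assuming the `balance_serve` backend exposes an OpenAI-compatible chat completions endpoint on the port chosen above; the model name in the payload is just a placeholder for whatever your deployment reports.

```shell
# Hypothetical smoke test against the server started above (port 10002).
curl -s http://localhost:10002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "DeepSeek-V3",
        "messages": [{"role": "user", "content": "Hello! Briefly introduce yourself."}],
        "max_tokens": 128
      }'
```
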
### v0.2.2 & v0.2.3 longer context & FP8 kernel

#### longer context

To use this feature, [install flashinfer](https://github.com/flashinfer-ai/flashinfer) first.

Note: The latest MLA kernel in FlashInfer still has a few minor issues. They are being fixed continuously on the main branch. If you are using FlashInfer, please install it from source on the main branch.

If you want to use a long context (longer than 20K tokens) for prefill, enable matrix-absorption MLA during the prefill phase, which significantly reduces the size of the KV cache. Modify the YAML file like this:

```yaml
- match:
    name: "^model\\.layers\\..*\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention # optimized MLA implementation
    kwargs:
      generate_device: "cuda"
      prefill_device: "cuda"
      absorb_for_prefill: True # set to True to enable long context (prefill may be slower)
```

If the VRAM is still insufficient, try reducing the `chunk_size` parameter (default is 8192) to further reduce the size of intermediate results during chunked prefill.
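
For example, if you launch via the server script shown earlier, the chunk size can be lowered on the command line (a sketch; `4096` is just an illustrative value, and other entry points may expose the setting differently):

```shell
# Halve the default chunk size to shrink intermediate buffers during chunked prefill.
python ktransformers/server/main.py --model_path /mnt/data/models/DeepSeek-V3 \
  --gguf_path /mnt/data/models/DeepSeek-V3-GGUF/DeepSeek-V3-Q4_K_M/ \
  --cpu_infer 62 --chunk_size 4096 --cache_lens 32768 --backend_type balance_serve
```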

#### FP8 kernel

The DeepSeek-AI team provides FP8 safetensors for DeepSeek-R1/V3 models. We optimize performance through the following work:

- **FP8 GPU Kernel Integration**: FP8 linear layer acceleration kernels integrated in KTransformers
- **Hybrid Quantization Architecture**:
  - Attention and Shared-Expert modules use FP8 precision (enhances computational accuracy)
  - Experts modules retain GGML quantization (GGUF format, reside in CPU to save GPU memory)

So those who are pursuing the best performance can use the FP8 linear kernel for DeepSeek-V3/R1.

The detailed guide is [here](./fp8_kernel.md).
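
If you have not fetched the FP8 safetensors yet, the following is a download sketch, assuming `huggingface-cli` is available and using the official `deepseek-ai/DeepSeek-R1` repository; see the detailed guide above for which files the FP8 path actually needs and where to place them.

```shell
# Install the Hugging Face CLI and fetch the official FP8 safetensors (very large download).
pip install -U "huggingface_hub[cli]"
huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir /mnt/data/models/DeepSeek-R1
```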

### V0.2 & V0.2.1 Showcase

#### Single socket version (32 cores)

Our local_chat test command is:

```shell
numactl -N 1 -m 1 python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path <your gguf path>  --prompt_file <your prompt txt file>  --cpu_infer 33 --max_new_tokens 1000
<when you see the chat prompt, press enter to load the text prompt_file>
```

`<your model path>` can be local or an online Hugging Face path such as deepseek-ai/DeepSeek-V3. If the online connection fails, try the mirror (hf-mirror.com); see the download sketch after these notes. <br>
`<your gguf path>` can also be online, but since it is large we recommend downloading it and quantizing the model to the format you want (note that this is a directory path). <br>
`--max_new_tokens 1000` is the maximum output token length. If you find the answer is truncated, you can increase this number for a longer answer (but be aware of OOM; increasing it also slows down the generation rate).
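
If the Hugging Face hub is unreachable from your network, the mirror mentioned above can be used through the `HF_ENDPOINT` environment variable. A minimal sketch with `huggingface-cli`; the repository name and file filter are placeholders for whichever GGUF build you actually use:

```shell
# Route Hugging Face downloads through the hf-mirror.com mirror.
export HF_ENDPOINT=https://hf-mirror.com
# Download a Q4_K_M GGUF directory, then pass that directory to --gguf_path.
huggingface-cli download <gguf repo id> --include "*Q4_K_M*" --local-dir ./DeepSeek-V3-GGUF
```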

The command `numactl -N 1 -m 1` aims to avoid data transfer between NUMA nodes.<br>
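
To check your machine's NUMA layout before choosing the `-N`/`-m` node index, you can inspect the topology (a quick sketch, assuming `numactl` is installed):

```shell
# Show NUMA nodes, their CPU ranges, and per-node memory sizes.
numactl --hardware
# Cross-check the core count and NUMA mapping.
lscpu | grep -i numa
```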
Attention! If you are testing R1, it may skip thinking. In that case, add the arg `--force_think true`. This is explained in the [FAQ](#faq) section.

#### Dual socket version (64 cores)

Before you install (using install.sh or `make dev_install`), make sure to set the env var `USE_NUMA=1`, e.g. `export USE_NUMA=1` (if already installed, reinstall with this env var set). You may check the doc [here](./install.md) for install details. <br>

Test Command:

```shell
# ---For those who have not installed ktransformers---
# git clone https://github.com/kvcache-ai/ktransformers.git
# cd ktransformers
# git submodule init
# git submodule update
# export USE_NUMA=1
# make dev_install # or sh ./install.sh
# ----------------------------------------------------
python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path <your gguf path>  --prompt_file <your prompt txt file>  --cpu_infer 65 --max_new_tokens 1000
<when you see the chat prompt, press enter to load the text prompt_file>
```

The parameters have the same meaning. But since we use dual sockets, we set `cpu_infer` to 65.

### V0.3 Showcase

#### Dual socket version (64 cores)

Our local_chat test command is:

```shell
wget https://github.com/kvcache-ai/ktransformers/releases/download/v0.1.4/ktransformers-0.3.0rc0+cu126torch26fancy-cp311-cp311-linux_x86_64.whl
pip install ./ktransformers-0.3.0rc0+cu126torch26fancy-cp311-cp311-linux_x86_64.whl
python -m ktransformers.local_chat --model_path <your model path> --gguf_path <your gguf path>  --prompt_file <your prompt txt file>  --cpu_infer 65 --max_new_tokens 1000
<when you see the chat prompt, press enter to load the text prompt_file>
```

The parameters have the same meaning as in V0.2. But since we use dual sockets, we set `cpu_infer` to 65.

## Some Explanations

1. We also want to make further use of the two NUMA nodes on the Xeon Gold CPU.
   To avoid the cost of data transfer between nodes, we "copy" the critical matrices onto
   both nodes, which consumes more memory but accelerates the prefill and decoding process.
   However, this method uses a huge amount of memory and is slow when loading weights, so be patient
   when loading and monitor the memory usage (see the quick check after this list). We are going to optimize this memory overhead. Stay tuned~ <br>
2. The command arg `--cpu_infer 65` specifies how many cores to use (it is OK if it exceeds the physical number,
   but more is not always better; adjust it to slightly below your actual number of cores).<br>
3. Why CPU/GPU Hybrid Inference?
   DeepSeek's MLA operators are highly computationally intensive. While running everything on CPU is possible, offloading the heavy computations to the GPU results in a massive performance boost.
4. Where Does the Speedup Come From?

   - Expert Offload: Unlike traditional layer-based or KVCache offloading (as seen in llama.cpp), we offload the expert computation to the CPU and MLA/KVCache to GPU, aligning perfectly with DeepSeek’s architecture for optimal efficiency.
   - Intel AMX Optimization – Our AMX-accelerated kernel is meticulously tuned, running several times faster than existing llama.cpp implementations. We plan to open-source this kernel after cleansing and are considering upstream contributions to llama.cpp.
5. Why Intel CPUs?
   Intel is currently the only CPU vendor that supports AMX-like instructions, which deliver significantly better performance than AVX-only alternatives.

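As a quick way to watch the memory pressure mentioned in point 1 while the weights load, you can poll host DRAM and GPU VRAM usage (a minimal sketch using standard tools):

```shell
# Host DRAM usage (the per-NUMA-node weight copies show up here).
free -h
# GPU VRAM usage for the MLA/KV-cache side.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```
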
## Next

### Faster

* The FlashInfer (https://github.com/flashinfer-ai/flashinfer) project is releasing an even more efficient fused MLA operator, promising further speedups
* vLLM has explored multi-token prediction in DeepSeek-V3, and support is on our roadmap for even better performance
* We are collaborating with Intel to enhance the AMX kernel (v0.3) and optimize for Xeon6/MRDIMM

### Easier

* Official Docker images to simplify installation
* Fix the server integration for web API access
* Fix local chat only accepting a single-line prompt (currently a newline starts generation)
* Support for more quantization types, including the highly requested dynamic quantization from unsloth

Stay tuned for more updates!

## FAQ

### R1 No Thinking

Attention! If you are testing R1, it may skip thinking. In that case, add the arg `--force_think true`. The details are in the [FAQ](./FAQ.md). <br>
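
For example, appending the flag to the dual-socket command used above:

```shell
python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path <your gguf path> --prompt_file <your prompt txt file> --cpu_infer 65 --max_new_tokens 1000 --force_think true
```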

### More FAQ

[See detail](./FAQ.md)