<!-- omit in toc -->
# How to Run DeepSeek-R1

- [How to Run DeepSeek-R1](#how-to-run-deepseek-r1)
  - [Preparation](#preparation)
  - [Installation](#installation)
    - [Attention](#attention)
    - [Supported models](#supported-models)
    - [Supported quantize formats](#supported-quantize-formats)

In this document, we will show you how to install and run KTransformers on your local machine. There are two versions:

* V0.2 is the current main branch.
* V0.3 is a preview version that, for now, is only available as a binary distribution.
* To reproduce our DeepSeek-R1/V3 results, please refer to [Deepseek-R1/V3 Tutorial](./DeepseekR1_V3_tutorial.md) for more detailed settings after installation.

## Preparation

Some preparation:

- CUDA 12.1 or above. If you don't have it yet, you can install it from [here](https://developer.nvidia.com/cuda-downloads).

  ```sh
  # Adding CUDA to PATH
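  # (You can append these lines to your shell profile, e.g. ~/.bashrc, so they persist across sessions.)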
  if [ -d "/usr/local/cuda/bin" ]; then
      export PATH=$PATH:/usr/local/cuda/bin
  fi

  if [ -d "/usr/local/cuda/lib64" ]; then
      export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
      # Or you can add it to /etc/ld.so.conf and run ldconfig as root:
      # echo "/usr/local/cuda-12.x/lib64" | sudo tee -a /etc/ld.so.conf
      # sudo ldconfig
  fi

  if [ -d "/usr/local/cuda" ]; then
      export CUDA_PATH=$CUDA_PATH:/usr/local/cuda
  fi
  ```
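
  You can verify that the toolkit is visible afterwards:

  ```sh
  nvcc --version
  ```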
- Linux-x86_64 with gcc, g++>=11 and cmake>=3.25 (using Ubuntu as an example)
- **Note**: The default CMake version in Ubuntu 22.04 LTS or higher may not support newer CUDA language dialects (e.g., CUDA20). This can cause errors such as `Target "cmTC_xxxxxx" requires the language dialect "CUDA20", but CMake does not know the compile flags to use to enable it`. To resolve this, install a newer CMake version, for instance by adding the Kitware APT repository (see the sketch after the block below).

  ```sh
  sudo apt-get update 
  sudo apt-get install build-essential cmake ninja-build patchelf
  ```
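
  If you hit the CUDA20 dialect error mentioned above, the following sketch adds the Kitware APT repository to get a newer CMake (commands follow Kitware's documented setup; `jammy` assumes Ubuntu 22.04, so substitute your release codename):

  ```sh
  # Import Kitware's signing key and add the repository
  wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | sudo tee /usr/share/keyrings/kitware-archive-keyring.gpg >/dev/null
  echo 'deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ jammy main' | sudo tee /etc/apt/sources.list.d/kitware.list >/dev/null
  sudo apt-get update && sudo apt-get install cmake
  ```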
- We recommend using [Miniconda3](https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh) or [Anaconda3](https://repo.anaconda.com/archive/Anaconda3-2024.10-1-Linux-x86_64.sh) to create a virtual environment with Python 3.11 to run our program. Assuming your Anaconda installation directory is `~/anaconda3`, you should ensure that the version identifiers of the GNU C++ standard library used by Anaconda include `GLIBCXX_3.4.32`.

  ```sh
  conda create --name ktransformers python=3.11
  conda activate ktransformers # you may need to run 'conda init' and reopen shell first

  conda install -c conda-forge libstdcxx-ng # Anaconda provides a package called `libstdcxx-ng` that includes a newer version of `libstdc++`, which can be installed via `conda-forge`.

  strings ~/anaconda3/envs/ktransformers/lib/libstdc++.so.6 | grep GLIBCXX
  ```
- Make sure that PyTorch, packaging, and ninja are installed. You can also [install previous versions of PyTorch](https://pytorch.org/get-started/previous-versions/).

  ```sh
  pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
  pip3 install packaging ninja cpufeature numpy
  ```
- You should also download and install the corresponding version of flash-attention from https://github.com/Dao-AILab/flash-attention/releases, as in the example below.
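
  For example (the wheel name below is purely illustrative; pick the release asset that matches your Python, PyTorch, and CUDA versions):

  ```sh
  # Hypothetical wheel name -- substitute the asset matching your environment
  pip install flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
  ```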

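Before moving on to installation, you can sanity-check the toolchain and the PyTorch/CUDA setup. This quick check is our suggestion, not an official step:

```sh
gcc --version | head -1    # expect gcc >= 11
cmake --version | head -1  # expect cmake >= 3.25
# Should print the torch version, its CUDA version, and "True"
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```
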
## Installation

### Attention

If you want to use NUMA support, you not only need to set `USE_NUMA=1`, but also need to make sure `libnuma-dev` is installed (`sudo apt-get install libnuma-dev` may help you).

[Optional] If you want to use the multi-concurrency version, please install the following dependencies:

```sh
sudo apt install libtbb-dev libssl-dev libcurl4-openssl-dev libaio1 libaio-dev libgflags-dev zlib1g-dev libfmt-dev
```

<!-- 1. ~~Use a Docker image, see [documentation for Docker](./doc/en/Docker.md)~~
   
   >We are working on the latest docker image, please wait for a while.

2. ~~You can install using Pypi (for linux):~~
    > We are working on the latest pypi package, please wait for a while.
   
   ```
   pip install ktransformers --no-build-isolation
   ```
   
   for windows we prepare a pre compiled whl package on [ktransformers-0.2.0+cu125torch24avx2-cp312-cp312-win_amd64.whl](https://github.com/kvcache-ai/ktransformers/releases/download/v0.2.0/ktransformers-0.2.0+cu125torch24avx2-cp312-cp312-win_amd64.whl), which require cuda-12.5, torch-2.4, python-3.11, more pre compiled package are being produced.  -->

Download source code and compile:

- Initialize the source code:

  ```sh
  git clone https://github.com/kvcache-ai/ktransformers.git
  cd ktransformers
  git submodule update --init --recursive
  ```
- [Optional] If you want to run with the website, please [compile the website](./api/server/website.md) before executing `bash install.sh`
- For Linux

  - For simple install:

    ```shell
    bash install.sh
    ```
  - For those who have two CPU sockets and 1TB of RAM:

    ```shell
    # Make sure your system has dual sockets and at least twice as much RAM as the model's size (e.g. 1TB RAM for a 512GB model)
    apt install libnuma-dev
    export USE_NUMA=1
    bash install.sh # or `make dev_install`
    ```
  - For multi-concurrency with 500GB of RAM:

    ```shell
    USE_BALANCE_SERVE=1 bash ./install.sh
    ```
  - For multi-concurrency with two CPU sockets and 1TB of RAM:

    ```shell
    USE_BALANCE_SERVE=1 USE_NUMA=1 bash ./install.sh
    ```
- For Windows (native Windows support is temporarily deprecated; please try WSL)

  ```shell
  install.bat
  ```

* If you are a developer, you can make use of the Makefile to compile and format the code. <br> The detailed usage of the Makefile is [here](./makefile_usage.md).

<h3>Local Chat</h3>
We provide a simple command-line local chat Python script that you can run for testing.

> Note: this is a very simple test tool that only supports one round of chat, without any memory of the previous input; if you want to try the full ability of the model, you may go to [RESTful API and Web UI](#id_666).

<h4>Run Example</h4>

```shell
# Begin from root of your cloned repo!
# Begin from root of your cloned repo!!
# Begin from root of your cloned repo!!! 

# Download a DeepSeek-V2-Lite GGUF from huggingface
mkdir DeepSeek-V2-Lite-Chat-GGUF
cd DeepSeek-V2-Lite-Chat-GGUF

wget https://huggingface.co/mradermacher/DeepSeek-V2-Lite-GGUF/resolve/main/DeepSeek-V2-Lite.Q4_K_M.gguf -O DeepSeek-V2-Lite-Chat.Q4_K_M.gguf

cd .. # Move to repo's root dir

# Start local chat
python -m ktransformers.local_chat --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF

# If you see “OSError: We couldn't connect to 'https://huggingface.co' to load this file”, try:
# GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite
# python -m ktransformers.local_chat --model_path ./DeepSeek-V2-Lite --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF
```

It features the following arguments:

- `--model_path` (required): Name of the model (such as "deepseek-ai/DeepSeek-V2-Lite-Chat", which will automatically download configs from [Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite)). Or, if you already have local files, you may use that path directly to initialize the model.

  > Note: <strong>.safetensors</strong> files are not required in the directory. We only need config files to build the model and tokenizer.
  >
- `--gguf_path` (required): Path of a directory containing GGUF files, which can be downloaded from [Hugging Face](https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/tree/main). Note that the directory should only contain GGUF files of the current model, which means you need one separate directory for each model.
- `--optimize_config_path` (required except for Qwen2Moe and DeepSeek-V2): Path of a YAML file containing optimize rules. There are two rule files pre-written in the [ktransformers/optimize/optimize_rules](ktransformers/optimize/optimize_rules) directory for optimizing DeepSeek-V2 and Qwen2-57B-A14B, two SOTA MoE models.
- `--max_new_tokens`: Int (default=1000). Maximum number of new tokens to generate.
- `--cpu_infer`: Int (default=10). The number of CPUs used for inference. Should ideally be set to (total number of cores - 2).
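
Putting these together, a typical invocation looks like this (reusing the paths from the run example above; adjust the values for your setup):

```sh
python -m ktransformers.local_chat \
  --model_path deepseek-ai/DeepSeek-V2-Lite-Chat \
  --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF \
  --max_new_tokens 512 \
  --cpu_infer 30
```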

<h3>Start Server</h3>
We provide a server script, which supports multi-concurrency as of version v0.2.4.

```sh
python ktransformers/server/main.py --model_path /mnt/data/models/DeepSeek-V3 --gguf_path /mnt/data/models/DeepSeek-V3-GGUF/DeepSeek-V3-Q4_K_M/ --cpu_infer 62 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml --port 10002 --chunk_size 256 --max_new_tokens 1024 --max_batch_size 4 --cache_lens 32768 --backend_type balance_serve
```

It features the following arguments:

- `--chunk_size`: Maximum number of tokens processed in a single run by the engine.
- `--cache_lens`: Total length of the kvcache allocated by the scheduler. All requests share this kvcache space (32768 tokens in the example above), and the space occupied by a request is released after it completes.
- `--backend_type`: `balance_serve` is a multi-concurrency backend engine introduced in version v0.2.4. The original single-concurrency engine is `ktransformers`.
- `--max_batch_size`: Maximum number of requests (prefill + decode) processed in a single run by the engine. (Supported only by `balance_serve`)
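
For example, with `--cache_lens 32768` and `--max_batch_size 4`, up to four concurrent requests share the 32768-token kvcache budget, i.e. roughly 8192 tokens per request if it is split evenly.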

<details>
<summary>Supported Models/quantization</summary>

### Supported models


| ✅**Supported Models** | ❌**Deprecated Models**    |
| ---------------------- | -------------------------- |
| DeepSeek-R1            | ~~InternLM2.5-7B-Chat-1M~~ |
| DeepSeek-V3            |                            |
| DeepSeek-V2            |                            |
| DeepSeek-V2.5          |                            |
| Qwen2-57B              |                            |
| DeepSeek-V2-Lite       |                            |
| Mixtral-8x7B           |                            |
| Mixtral-8x22B          |                            |

### Supported quantize formats


| ✅**Supported Formats** | ❌**Deprecated Formats** |
| ----------------------- | ------------------------ |
| IQ1_S                   | ~~IQ2_XXS~~              |
| IQ2_XXS                 |                          |
| Q2_K_L                  |                          |
| Q2_K_XS                 |                          |
| Q3_K_M                  |                          |
| Q4_K_M                  |                          |
| Q5_K_M                  |                          |
| Q6_K                    |                          |
| Q8_0                    |                          |

</details>

<details>
<summary>Suggested Model</summary>


| Model Name                     | Model Size | VRAM  | Minimum DRAM    | Recommended DRAM  |
| ------------------------------ | ---------- | ----- | --------------- | ----------------- |
| DeepSeek-R1-q4_k_m             | 377G       | 14G   | 382G            | 512G              |
| DeepSeek-V3-q4_k_m             | 377G       | 14G   | 382G            | 512G              |
| DeepSeek-V2-q4_k_m             | 133G       | 11G   | 136G            | 192G              |
| DeepSeek-V2.5-q4_k_m           | 133G       | 11G   | 136G            | 192G              |
| DeepSeek-V2.5-IQ4_XS           | 117G       | 10G   | 107G            | 128G              |
| Qwen2-57B-A14B-Instruct-q4_k_m | 33G        | 8G    | 34G             | 64G               |
| DeepSeek-V2-Lite-q4_k_m        | 9.7G       | 3G    | 13G             | 16G               |
| Mixtral-8x7B-q4_k_m            | 25G        | 1.6G  | 51G             | 64G               |
| Mixtral-8x22B-q4_k_m           | 80G        | 4G    | 86.1G           | 96G               |
| InternLM2.5-7B-Chat-1M         | 15.5G      | 15.5G | 8G(32K context) | 150G (1M context) |

More will come soon. Please let us know which models you are most interested in.

Be aware that you are subject to the corresponding model licenses when using [DeepSeek](https://huggingface.co/deepseek-ai/DeepSeek-V2/blob/main/LICENSE) and [Qwen](https://huggingface.co/Qwen/Qwen2-72B-Instruct/blob/main/LICENSE).

</details>

<details>
  <summary>Click To Show how to run other examples</summary>

* Qwen2-57B

  ```sh
  pip install flash_attn # For Qwen2

  mkdir Qwen2-57B-GGUF && cd Qwen2-57B-GGUF

  wget "https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct-GGUF/resolve/main/qwen2-57b-a14b-instruct-q4_k_m.gguf?download=true" -O qwen2-57b-a14b-instruct-q4_k_m.gguf

  cd ..

  python -m ktransformers.local_chat --model_path Qwen/Qwen2-57B-A14B-Instruct --gguf_path ./Qwen2-57B-GGUF

  # If you see “OSError: We couldn't connect to 'https://huggingface.co' to load this file”, try:
  # GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct
  # python -m ktransformers.local_chat --model_path ./Qwen2-57B-A14B-Instruct --gguf_path ./Qwen2-57B-GGUF
  ```
* Deepseek-V2

  ```sh
  mkdir DeepSeek-V2-Chat-0628-GGUF && cd DeepSeek-V2-Chat-0628-GGUF
  # Download weights
  wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00001-of-00004.gguf -O DeepSeek-V2-Chat-0628-Q4_K_M-00001-of-00004.gguf
  wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00002-of-00004.gguf -O DeepSeek-V2-Chat-0628-Q4_K_M-00002-of-00004.gguf
  wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00003-of-00004.gguf -O DeepSeek-V2-Chat-0628-Q4_K_M-00003-of-00004.gguf
  wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00004-of-00004.gguf -O DeepSeek-V2-Chat-0628-Q4_K_M-00004-of-00004.gguf

  cd ..

  python -m ktransformers.local_chat --model_path deepseek-ai/DeepSeek-V2-Chat-0628 --gguf_path ./DeepSeek-V2-Chat-0628-GGUF

  # If you see “OSError: We couldn't connect to 'https://huggingface.co' to load this file”, try:

  # GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat-0628

  # python -m ktransformers.local_chat --model_path ./DeepSeek-V2-Chat-0628 --gguf_path ./DeepSeek-V2-Chat-0628-GGUF
  ```


| Model Name       | Weights Download Link                                                                                                 |
| ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| Qwen2-57B        | [Qwen2-57B-A14B-gguf-Q4K-M](https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct-GGUF/tree/main)                       |
| DeepseekV2-coder | [DeepSeek-Coder-V2-Instruct-gguf-Q4K-M](https://huggingface.co/LoneStriker/DeepSeek-Coder-V2-Instruct-GGUF/tree/main) |
| DeepseekV2-chat  | [DeepSeek-V2-Chat-gguf-Q4K-M](https://huggingface.co/bullerwins/DeepSeek-V2-Chat-0628-GGUF/tree/main)                 |
| DeepseekV2-lite  | [DeepSeek-V2-Lite-Chat-GGUF-Q4K-M](https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/tree/main)                |
| DeepSeek-R1      | [DeepSeek-R1-gguf-Q4K-M](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-Q4_K_M)                |

</details>

<!-- pin block for jump -->

<span id='id_666'>

<h3>RESTful API and Web UI</h3>

Start without website:

```sh
ktransformers --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path /path/to/DeepSeek-V2-Lite-Chat-GGUF --port 10002
```
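
Once the server is up, you can test it from the command line. The sketch below assumes an OpenAI-style `/v1/chat/completions` endpoint on the chosen port; see the server documentation linked at the bottom of this page for the authoritative API:

```sh
curl http://localhost:10002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "DeepSeek-V2-Lite-Chat",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```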

Start with website:

```sh
ktransformers --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path /path/to/DeepSeek-V2-Lite-Chat-GGUF  --port 10002 --web True
```

Or, if you want to start the server with transformers, the model_path should include safetensors files:

```bash
ktransformers --type transformers --model_path /mnt/data/model/Qwen2-0.5B-Instruct --port 10002 --web True
```

Access the website with the URL [http://localhost:10002/web/index.html#/chat](http://localhost:10002/web/index.html#/chat):

<p align="center">
  <picture>
    <img alt="Web UI" src="https://github.com/user-attachments/assets/615dca9b-a08c-4183-bbd3-ad1362680faf" width=90%>
  </picture>
</p>

More information about the RESTful API server can be found [here](doc/en/api/server/server.md). You can also find an example of integrating with Tabby [here](doc/en/api/server/tabby.md).