README.md 17.9 KB
Newer Older
chenxl's avatar
chenxl committed
1
2
3
<div align="center">
  <!-- <h1>KTransformers</h1> -->
  <p align="center">
chenxl's avatar
chenxl committed
4
5

<picture>
UnicornChan's avatar
UnicornChan committed
6
    <img alt="KTransformers" src="https://github.com/user-attachments/assets/d5a2492f-a415-4456-af99-4ab102f13f8b" width=50%>
Mingxing Zhang's avatar
Mingxing Zhang committed
7

chenxl's avatar
chenxl committed
8
</picture>
UnicornChan's avatar
UnicornChan committed
9

chenxl's avatar
chenxl committed
10
</p>
chenxl's avatar
chenxl committed
11
12
13
14
15
16
17
18
19
20
21
22
23
  <h3>A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations</h3>
  <strong><a href="#show-cases">🔥 Show Cases</a> | <a href="#quick-start">🚀 Quick Start</a> | <a href="#tutorial">📃 Tutorial</a> | <a href="https://github.com/kvcache-ai/ktransformers/discussions">💬  Discussion </a> </strong>
</div>

<h2 id="intro">🎉 Introduction</h2>
KTransformers, pronounced as Quick Transformers, is designed to enhance your 🤗 <a href="https://github.com/huggingface/transformers">Transformers</a> experience with advanced kernel optimizations and placement/parallelism strategies.
<br/><br/>
KTransformers is a flexible, Python-centric framework designed with extensibility at its core. 
By implementing and injecting an optimized module with a single line of code, users gain access to a Transformers-compatible
interface, RESTful APIs compliant with OpenAI and Ollama, and even a simplified ChatGPT-like web UI. 
<br/><br/>
Our vision for KTransformers is to serve as a flexible platform for experimenting with innovative LLM inference optimizations. Please let us know if you need any other features.

chenxl's avatar
chenxl committed
24
<h2 id="Updates">🔥 Updates</h2>
Azure's avatar
Azure committed
25

TangJingqi's avatar
TangJingqi committed
26
* **Aug 28, 2024**: Support 1M context under the InternLM2.5-7B-Chat-1M model, utilizing 24GB of VRAM and 150GB of DRAM. The detailed tutorial is [here](./doc/en/long_context_tutorial.md).
TangJingqi's avatar
TangJingqi committed
27
* **Aug 28, 2024**: Decrease DeepseekV2's required VRAM from 21G to 11G.
Azure's avatar
Azure committed
28
* **Aug 15, 2024**: Update detailed [TUTORIAL](doc/en/injection_tutorial.md) for injection and multi-GPU. 
chenxl's avatar
chenxl committed
29
* **Aug 14, 2024**: Support llamfile as linear backend. 
Azure's avatar
Azure committed
30
31
* **Aug 12, 2024**: Support multiple GPU; Support new model: mixtral 8\*7B  and 8\*22B; Support q2k, q3k, q5k dequant on gpu.
* **Aug 9, 2024**: Support windows native.
chenxl's avatar
chenxl committed
32
33

<h2 id="show-cases">🔥 Show Cases</h2>
chenxl's avatar
chenxl committed
34
<h3>1M Context Local Inference on a Desktop with Only 24GB VRAM</h3>
chenxl's avatar
chenxl committed
35
<p align="center">
UnicornChan's avatar
UnicornChan committed
36

chenxl's avatar
chenxl committed
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
https://github.com/user-attachments/assets/a865e5e4-bca3-401e-94b8-af3c080e6c12

* **1M Context InternLM 2.5 7B**: Operates at full bf16 precision, utilizing 24GB VRAM and 150GB DRAM, which is feasible on a local desktop setup. It achieves a 92.88% success rate on the 1M "Needle In a Haystack" test and 100% on the 128K NIAH test.

<p align="center">
  <picture>
    <img alt="Single Needle Retrieval 128K" src="./doc/assets/needle_128K.png" width=100%>
  </picture>
</p>

<p align="center">
  <picture>
    <img alt="Single Needle Retrieval 1000K" src="./doc/assets/needle_1M.png" width=100%>
  </picture>
</p>

* **Enhanced Speed**: Reaches 16.91 tokens/s for generation with a 1M context using sparse attention, powered by llamafile kernels. This method is over 10 times faster than full attention approach of llama.cpp.

TangJingqi's avatar
TangJingqi committed
55
* **Flexible Sparse Attention Framework**: Offers a flexible block sparse attention framework for CPU offloaded decoding. Compatible with SnapKV, Quest, and InfLLm. Further information is available [here](./doc/en/long_context_introduction.md).
chenxl's avatar
chenxl committed
56
57
58
59
60

<div>
<h3>GPT-4-level Local VSCode Copilot on a Desktop with only 24GB VRAM</h3>
</div>

UnicornChan's avatar
UnicornChan committed
61
62
https://github.com/user-attachments/assets/0b9fa2da-66f0-48eb-b4b9-f0e1f06f8927

chenxl's avatar
chenxl committed
63
64
</p>

Azure's avatar
Azure committed
65
- **Local 236B DeepSeek-Coder-V2:** Running its Q4_K_M version using only 21GB VRAM and 136GB DRAM, attainable on a local desktop machine, which scores even better than GPT4-0613 in [BigCodeBench](https://huggingface.co/blog/leaderboard-bigcodebench).
chenxl's avatar
chenxl committed
66
67
68

<p align="center">
  <picture>
Allen's avatar
Allen committed
69
    <img alt="DeepSeek-Coder-V2 Score" src="https://github.com/user-attachments/assets/d052924e-8631-44de-aad2-97c54b965693" width=100%>
chenxl's avatar
chenxl committed
70
71
72
73
74
75
76
  </picture>
</p>

- **Faster Speed:** Achieving 126 tokens/s for 2K prompt prefill and 13.6 tokens/s for generation through MoE offloading and injecting advanced kernels from [Llamafile](https://github.com/Mozilla-Ocho/llamafile/tree/main) and [Marlin](https://github.com/IST-DASLab/marlin).
- **VSCode Integration:** Wrapped into an OpenAI and Ollama compatible API for seamless integration as a backend for [Tabby](https://github.com/TabbyML/tabby) and various other frontends.

<p align="center">
Mingxing Zhang's avatar
Mingxing Zhang committed
77

UnicornChan's avatar
UnicornChan committed
78
79
https://github.com/user-attachments/assets/4c6a8a38-05aa-497d-8eb1-3a5b3918429c

chenxl's avatar
chenxl committed
80
81
82
83
84
85
86
87
88
89
</p>

<strong>More advanced features will coming soon, so stay tuned!</strong>

<h2 id="quick-start">🚀 Quick Start</h2>

<h3>Preparation</h3>
Some preparation:

- CUDA 12.1 and above, if you didn't have it yet, you may install from [here](https://developer.nvidia.com/cuda-downloads).
Azure's avatar
Azure committed
90
91
92
  
  ```sh
  # Adding CUDA to PATH
chenxl's avatar
chenxl committed
93
94
95
  export PATH=/usr/local/cuda/bin:$PATH
  export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
  export CUDA_PATH=/usr/local/cuda
_HYX_'s avatar
_HYX_ committed
96
97
  ```

chenxl's avatar
chenxl committed
98
- Linux-x86_64 with gcc, g++ and cmake
chenxl's avatar
chenxl committed
99
  
chenxl's avatar
chenxl committed
100
101
102
103
  ```sh
  sudo apt-get update
  sudo apt-get install gcc g++ cmake ninja-build
  ```
chenxl's avatar
chenxl committed
104

chenxl's avatar
chenxl committed
105
- We recommend using [Conda](https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh) to create a virtual environment with Python=3.11 to run our program.
chenxl's avatar
chenxl committed
106
  
chenxl's avatar
chenxl committed
107
108
109
110
111
  ```sh
  conda create --name ktransformers python=3.11
  conda activate ktransformers # you may need to run ‘conda init’ and reopen shell first
  ```

chenxl's avatar
chenxl committed
112
- Make sure that PyTorch, packaging, ninja is installed
chenxl's avatar
chenxl committed
113
  
chenxl's avatar
chenxl committed
114
115
116
117
118
119
  ```
  pip install torch packaging ninja
  ```

<h3>Installation</h3>

Sam's avatar
Sam committed
120
1. Use a Docker image, see [documentation for Docker](./doc/en/Docker.md) 
chenxl's avatar
chenxl committed
121

chenxl's avatar
chenxl committed
122
123
2. You can install using Pypi (for linux):
   
chenxl's avatar
chenxl committed
124
   ```
125
   pip install ktransformers --no-build-isolation
chenxl's avatar
chenxl committed
126
   ```
chenxl's avatar
chenxl committed
127
   
128
   for windows we prepare a pre compiled whl package in [ktransformers-0.1.1+cu125torch24avx2-cp311-cp311-win_amd64.whl](https://github.com/kvcache-ai/ktransformers/releases/download/v0.1.1/ktransformers-0.1.1+cu125torch24avx2-cp311-cp311-win_amd64.whl), which require cuda-12.5, torch-2.4, python-3.11, more pre compiled package are being produced. 
chenxl's avatar
chenxl committed
129

130
3. Or you can download source code and compile:
chenxl's avatar
chenxl committed
131
   
132
   - init source code 
chenxl's avatar
chenxl committed
133
     
134
135
136
137
138
139
     ```sh
     git clone https://github.com/kvcache-ai/ktransformers.git
     cd ktransformers
     git submodule init
     git submodule update
     ```
Azure's avatar
Azure committed
140

141
   - [Optional] If you want to run with website, please [compile the website](./doc/en/api/server/website.md) before execute ```bash install.sh```
Azure's avatar
Azure committed
142

143
   - Compile and install (for Linux)
chenxl's avatar
chenxl committed
144
     
145
146
147
     ```
     bash install.sh
     ```
Azure's avatar
Azure committed
148

149
   - Compile and install(for Windows)
chenxl's avatar
chenxl committed
150
     
151
152
     ```
     install.bat
chenxl's avatar
chenxl committed
153
     ```
154

chenxl's avatar
chenxl committed
155
<h3>Local Chat</h3>
chenxl's avatar
chenxl committed
156
We provide a simple command-line local chat Python script that you can run for testing.
chenxl's avatar
chenxl committed
157

chenxl's avatar
chenxl committed
158
> Note that this is a very simple test tool only support one round chat without any memory about last input, if you want to try full ability of the model, you may go to [RESTful API and Web UI](#id_666). We use the DeepSeek-V2-Lite-Chat-GGUF model as an example here. But we also support other models, you can replace it with any other model that you want to test. 
chenxl's avatar
chenxl committed
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175

<h4>Run Example</h4>

```shell
# Begin from root of your cloned repo!
# Begin from root of your cloned repo!!
# Begin from root of your cloned repo!!! 

# Download mzwing/DeepSeek-V2-Lite-Chat-GGUF from huggingface
mkdir DeepSeek-V2-Lite-Chat-GGUF
cd DeepSeek-V2-Lite-Chat-GGUF

wget https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/resolve/main/DeepSeek-V2-Lite-Chat.Q4_K_M.gguf -O DeepSeek-V2-Lite-Chat.Q4_K_M.gguf

cd .. # Move to repo's root dir

# Start local chat
chenxl's avatar
chenxl committed
176
python -m ktransformers.local_chat --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF
chenxl's avatar
chenxl committed
177
178
179

# If you see “OSError: We couldn't connect to 'https://huggingface.co' to load this file”, try:
# GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite
chenxl's avatar
chenxl committed
180
# python  ktransformers.local_chat --model_path ./DeepSeek-V2-Lite --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF
chenxl's avatar
chenxl committed
181
182
183
184
185
```

It features the following arguments:

- `--model_path` (required): Name of the model (such as "deepseek-ai/DeepSeek-V2-Lite-Chat" which will automatically download configs from [Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite)). Or if you already got local files  you may directly use that path to initialize the model.  
chenxl's avatar
chenxl committed
186
187
188
  
  > Note: <strong>.safetensors</strong> files are not required in the directory. We only need config files to build model and tokenizer.

Azure's avatar
Azure committed
189
- `--gguf_path` (required): Path of a directory containing GGUF files which could that can be downloaded from [Hugging Face](https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/tree/main).
chenxl's avatar
chenxl committed
190

chenxl's avatar
chenxl committed
191
- `--optimize_rule_path` (required except for Qwen2Moe and DeepSeek-V2): Path of YAML file containing optimize rules. There are two rule files pre-written in the [ktransformers/optimize/optimize_rules](ktransformers/optimize/optimize_rules) directory for optimizing DeepSeek-V2 and Qwen2-57B-A14, two SOTA MoE models.
chenxl's avatar
chenxl committed
192

chenxl's avatar
chenxl committed
193
- `--max_new_tokens`: Int (default=1000). Maximum number of new tokens to generate.
chenxl's avatar
chenxl committed
194

chenxl's avatar
chenxl committed
195
196
- `--cpu_infer`: Int (default=10). The number of CPUs used for inference. Should ideally be set to the (total number of cores - 2).

Azure's avatar
Azure committed
197
<h3 id="supported-model"> Suggested Model</h3>
chenxl's avatar
chenxl committed
198

chenxl's avatar
chenxl committed
199
200
| Model Name                     | Model Size | VRAM  | Minimum DRAM    | Recommended DRAM  |
| ------------------------------ | ---------- | ----- | --------------- | ----------------- |
Azure's avatar
Azure committed
201
| DeepSeek-V2-q4_k_m             | 133G       | 24G   | 136G            | 192G              |
chenxl's avatar
chenxl committed
202
203
204
205
206
| Qwen2-57B-A14B-Instruct-q4_k_m | 33G        | 8G    | 34G             | 64G               |
| DeepSeek-V2-Lite-q4_k_m        | 9.7G       | 3G    | 13G             | 16G               |
| Mixtral-8x7B-q4_k_m            | 25G        | 1.6G  | 51G             | 64G               |
| Mixtral-8x22B-q4_k_m           | 80G        | 4G    | 86.1G           | 96G               |
| InternLM2.5-7B-Chat-1M         | 15.5G      | 15.5G | 8G(32K context) | 150G (1M context) |
chenxl's avatar
chenxl committed
207
208
209
210
211
212
213
214
215
216

More will come soon. Please let us know which models you are most interested in. 

Be aware that you need to be subject to their corresponding model licenses when using [DeepSeek](https://huggingface.co/deepseek-ai/DeepSeek-V2/blob/main/LICENSE) and [QWen](https://huggingface.co/Qwen/Qwen2-72B-Instruct/blob/main/LICENSE).

<details>
  <summary>Click To Show how to run other examples</summary>

* Qwen2-57B

Azure's avatar
Azure committed
217
218
  ```sh
  pip install flash_attn # For Qwen2
chenxl's avatar
chenxl committed
219

Azure's avatar
Azure committed
220
  mkdir Qwen2-57B-GGUF && cd Qwen2-57B-GGUF
chenxl's avatar
chenxl committed
221

Azure's avatar
Azure committed
222
  wget https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct-GGUF/resolve/main/qwen2-57b-a14b-instruct-q4_k_m.gguf?download=true -O qwen2-57b-a14b-instruct-q4_k_m.gguf
chenxl's avatar
chenxl committed
223

Azure's avatar
Azure committed
224
  cd ..
chenxl's avatar
chenxl committed
225

Azure's avatar
Azure committed
226
  python -m ktransformers.local_chat --model_name Qwen/Qwen2-57B-A14B-Instruct --gguf_path ./Qwen2-57B-GGUF
chenxl's avatar
chenxl committed
227

Azure's avatar
Azure committed
228
229
230
231
  # If you see “OSError: We couldn't connect to 'https://huggingface.co' to load this file”, try:
  # GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct
  # python  ktransformers/local_chat.py --model_path ./Qwen2-57B-A14B-Instruct --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF
  ```
chenxl's avatar
chenxl committed
232
233

* DeepseekV2
chenxl's avatar
chenxl committed
234
  
Azure's avatar
Azure committed
235
236
237
238
239
240
241
  ```sh
  mkdir DeepSeek-V2-Chat-0628-GGUF && cd DeepSeek-V2-Chat-0628-GGUF
  # Download weights
  wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00001-of-00004.gguf -o DeepSeek-V2-Chat-0628-Q4_K_M-00001-of-00004.gguf
  wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00002-of-00004.gguf -o DeepSeek-V2-Chat-0628-Q4_K_M-00002-of-00004.gguf
  wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00003-of-00004.gguf -o DeepSeek-V2-Chat-0628-Q4_K_M-00003-of-00004.gguf
  wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00004-of-00004.gguf -o DeepSeek-V2-Chat-0628-Q4_K_M-00004-of-00004.gguf
chenxl's avatar
chenxl committed
242

Azure's avatar
Azure committed
243
  cd ..
chenxl's avatar
chenxl committed
244

Azure's avatar
Azure committed
245
  python -m ktransformers.local_chat --model_name deepseek-ai/DeepSeek-V2-Chat-0628 --gguf_path ./DeepSeek-V2-Chat-0628-GGUF
chenxl's avatar
chenxl committed
246

Azure's avatar
Azure committed
247
  # If you see “OSError: We couldn't connect to 'https://huggingface.co' to load this file”, try:
chenxl's avatar
chenxl committed
248

Azure's avatar
Azure committed
249
  # GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat-0628
chenxl's avatar
chenxl committed
250

Azure's avatar
Azure committed
251
  # python -m ktransformers.local_chat --model_path ./DeepSeek-V2-Chat-0628 --gguf_path ./DeepSeek-V2-Chat-0628-GGUF
chenxl's avatar
chenxl committed
252

chenxl's avatar
chenxl committed
253
```
Azure's avatar
Azure committed
254
255
256
```

  ```
chenxl's avatar
chenxl committed
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277

| model name | weights download link |
|----------|----------|
| Qwen2-57B | [Qwen2-57B-A14B-gguf-Q4K-M](https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct-GGUF/tree/main) |
| DeepseekV2-coder |[DeepSeek-Coder-V2-Instruct-gguf-Q4K-M](https://huggingface.co/LoneStriker/DeepSeek-Coder-V2-Instruct-GGUF/tree/main) |
| DeepseekV2-chat |[DeepSeek-V2-Chat-gguf-Q4K-M](https://huggingface.co/bullerwins/DeepSeek-V2-Chat-0628-GGUF/tree/main) |
| DeepseekV2-lite | [DeepSeek-V2-Lite-Chat-GGUF-Q4K-M](https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/tree/main) |

</details>

<!-- pin block for jump -->
<span id='id_666'> 

<h3>RESTful API and Web UI</h3>


Start without website:

```sh
ktransformers --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path /path/to/DeepSeek-V2-Lite-Chat-GGUF --port 10002
```
chenxl's avatar
chenxl committed
278

chenxl's avatar
chenxl committed
279
Start with website:
chenxl's avatar
chenxl committed
280

chenxl's avatar
chenxl committed
281
282
283
```sh
ktransformers --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path /path/to/DeepSeek-V2-Lite-Chat-GGUF  --port 10002 --web True
```
chenxl's avatar
chenxl committed
284

chenxl's avatar
chenxl committed
285
Or you want to start server with transformers, the model_path should include safetensors
chenxl's avatar
chenxl committed
286

chenxl's avatar
chenxl committed
287
288
289
290
291
292
293
294
```bash
ktransformers --type transformers --model_path /mnt/data/model/Qwen2-0.5B-Instruct --port 10002 --web True
```

Access website with url [http://localhost:10002/web/index.html#/chat](http://localhost:10002/web/index.html#/chat) :

<p align="center">
  <picture>
UnicornChan's avatar
UnicornChan committed
295
    <img alt="Web UI" src="https://github.com/user-attachments/assets/615dca9b-a08c-4183-bbd3-ad1362680faf" width=90%>
chenxl's avatar
chenxl committed
296
297
298
299
300
301
302
  </picture>
</p>

More information about the RESTful API server can be found [here](doc/en/api/server/server.md). You can also find an example of integrating with Tabby [here](doc/en/api/server/tabby.md).

<h2 id="tutorial">📃 Brief Injection Tutorial</h2>
At the heart of KTransformers is a user-friendly, template-based injection framework. 
chenxl's avatar
chenxl committed
303
This allows researchers to easily replace original torch modules with optimized variants. It also simplifies the process of combining multiple optimizations, allowing the exploration of their synergistic effects.
chenxl's avatar
chenxl committed
304
305
306
307

</br>
<p align="center">
  <picture>
UnicornChan's avatar
UnicornChan committed
308
    <img alt="Inject-Struction" src="https://github.com/user-attachments/assets/6b4c1e54-9f6d-45c5-a3fc-8fa45e7d257e" width=65%>
chenxl's avatar
chenxl committed
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
  </picture>
</p>

Given that vLLM already serves as a great framework for large-scale deployment optimizations, KTransformers is particularly focused on local deployments that are constrained by limited resources. We pay special attention to heterogeneous computing opportunities, such as GPU/CPU offloading of quantized models. For example, we support the efficient <a herf="https://github.com/Mozilla-Ocho/llamafile/tree/main">Llamafile</a> and <a herf="https://github.com/IST-DASLab/marlin">Marlin</a> kernels for CPU and GPU, respectively. More details can be found <a herf="doc/en/operators/llamafile.md">here</a>.

<h3>Example Usage</h3>
To utilize the provided kernels, users only need to create a YAML-based injection template and add the call to `optimize_and_load_gguf` before using the Transformers model.

```python
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
optimize_and_load_gguf(model, optimize_rule_path, gguf_path, config)
...
generated = prefill_and_generate(model, tokenizer, input_tensor.cuda(), max_new_tokens=1000)
```

In this example, the AutoModel is first initialized on the meta device to avoid occupying any memory resources. Then, `optimize_and_load_gguf` iterates through all sub-modules of the model, matches rules specified in your YAML rule file, and replaces them with advanced modules as specified.

After injection, the original `generate` interface is available, but we also provide a compatible `prefill_and_generate` method, which enables further optimizations like CUDAGraph to improve generation speed.

329
330
331
332
<h3>How to custom your model</h3>

A detailed tutorial of the injection and multi-GPU using DeepSeek-V2 as an example is given [here](doc/en/injection_tutorial.md).

chenxl's avatar
chenxl committed
333
334
335
336
337
338
339
Below is an example of a YAML template for replacing all original Linear modules with Marlin, an advanced 4-bit quantization kernel.

```yaml
- match:
    name: "^model\\.layers\\..*$"  # regular expression 
    class: torch.nn.Linear  # only match modules matching name and class simultaneously
  replace:
340
    class: ktransformers.operators.linear.KTransformerLinear  # optimized Kernel on quantized data types
chenxl's avatar
chenxl committed
341
342
343
    device: "cpu"   # which devices to load this module when initializing
    kwargs:
      generate_device: "cuda"
344
      generate_linear_type: "QuantizedLinearMarlin"
chenxl's avatar
chenxl committed
345
346
347
348
349
350
```

Each rule in the YAML file has two parts: `match` and `replace`. The `match` part specifies which module should be replaced, and the `replace` part specifies the module to be injected into the model along with the initialization keywords.

You can find example rule templates for optimizing DeepSeek-V2 and Qwen2-57B-A14, two SOTA MoE models, in the [ktransformers/optimize/optimize_rules](ktransformers/optimize/optimize_rules) directory. These templates are used to power the `local_chat.py` demo.

351
If you are interested in our design principles and the implementation of the injection framework, please refer to the [design document](doc/en/deepseek-v2-injection.md).
chenxl's avatar
chenxl committed
352
353
354
355
356
357

<h2 id="ack">Acknowledgment and Contributors</h2>

The development of KTransformer is based on the flexible and versatile framework provided by Transformers. We also benefit from advanced kernels such as GGUF/GGML, Llamafile, and Marlin. We are planning to contribute back to the community by upstreaming our modifications.

KTransformer is actively maintained and developed by contributors from the <a href="https://madsys.cs.tsinghua.edu.cn/">MADSys group</a> at Tsinghua University and members from <a href="http://approaching.ai/">Approaching.AI</a>. We welcome new contributors to join us in making KTransformer faster and easier to use.