* **Enhanced Speed**: Reaches 16.91 tokens/s for generation with a 1M context using sparse attention, powered by llamafile kernels. This method is over 10 times faster than the full attention approach of llama.cpp.
* **Flexible Sparse Attention Framework**: Offers a flexible block sparse attention framework for CPU offloaded decoding. Compatible with SnapKV, Quest, and InfLLM. Further information is available [here](./doc/en/long_context_introduction.md).
-->
<strong>More advanced features are coming soon, so stay tuned!</strong>
<h2 id="quick-start">🚀 Quick Start</h2>
<h3>Preparation</h3>
Some preparation:
- CUDA 12.1 or above. If you don't have it yet, you can install it from [here](https://developer.nvidia.com/cuda-downloads).
- We recommend using [Conda](https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh) to create a virtual environment with Python=3.11 to run our program.
```sh
conda create --name ktransformers python=3.11
conda activate ktransformers # you may need to run 'conda init' and reopen the shell first
```
- Make sure that PyTorch, packaging, and ninja are installed.
1. Use a Docker image, see [documentation for Docker](./doc/en/Docker.md)
2. You can install using PyPI (for Linux):
```
pip install ktransformers --no-build-isolation
```
For Windows, we provide a precompiled whl package, [ktransformers-0.2.0+cu125torch24avx2-cp312-cp312-win_amd64.whl](https://github.com/kvcache-ai/ktransformers/releases/download/v0.2.0/ktransformers-0.2.0+cu125torch24avx2-cp312-cp312-win_amd64.whl), which requires cuda-12.5, torch-2.4, python-3.11. More precompiled packages are being produced.
- [Optional] If you want to run with the website, please [compile the website](./doc/en/api/server/website.md) before executing `bash install.sh`.
- Compile and install (for Linux)
```
bash install.sh
```
- Compile and install (for Windows)
```
install.bat
```
4. If you are a developer, you can use the makefile to compile and format the code. <br> The detailed usage of the makefile is described [here](./doc/en/makefile_usage.md).
<h3>Local Chat</h3>
We provide a simple command-line local chat Python script that you can run for testing.
> Note that this is a very simple test tool that only supports single-round chat, with no memory of previous inputs. If you want to try the full ability of the model, go to [RESTful API and Web UI](#id_666). We use the DeepSeek-V2-Lite-Chat-GGUF model as an example here, but other models are also supported; you can replace it with any other model that you want to test.
<h4>Run Example</h4>
```shell
# Begin from root of your cloned repo!
# Begin from root of your cloned repo!!
# Begin from root of your cloned repo!!!
# Download mzwing/DeepSeek-V2-Lite-Chat-GGUF from huggingface
```
- `--model_path` (required): Name of the model (such as "deepseek-ai/DeepSeek-V2-Lite-Chat", which will automatically download configs from [Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite)). If you already have local files, you can use that path directly to initialize the model.
  > Note: <strong>.safetensors</strong> files are not required in the directory. We only need config files to build the model and tokenizer.
- `--gguf_path` (required): Path of a directory containing GGUF files, which can be downloaded from [Hugging Face](https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/tree/main). Note that the directory should only contain the GGUF files of the current model, which means you need one separate directory for each model.
- `--optimize_rule_path` (required except for Qwen2Moe and DeepSeek-V2): Path of the YAML file containing optimize rules. There are two rule files pre-written in the [ktransformers/optimize/optimize_rules](ktransformers/optimize/optimize_rules) directory for optimizing DeepSeek-V2 and Qwen2-57B-A14, two SOTA MoE models.
- `--max_new_tokens`: Int (default=1000). Maximum number of new tokens to generate.
- `--cpu_infer`: Int (default=10). The number of CPUs used for inference. Ideally set to (total number of cores - 2).
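
As a rough illustration of the `--cpu_infer` guidance above, the following sketch derives a starting value from the machine's CPU count. The helper name `suggest_cpu_infer` is hypothetical and not part of KTransformers. Note also that `os.cpu_count()` reports logical CPUs, which may exceed the number of physical cores on machines with hyper-threading, so the result is only a starting point.

```python
import os


def suggest_cpu_infer(total_cores=None, reserve=2):
    """Return a starting value for --cpu_infer: total cores minus `reserve`, at least 1.

    Hypothetical helper for illustration; not a KTransformers API.
    """
    if total_cores is None:
        # os.cpu_count() counts logical CPUs and may return None.
        total_cores = os.cpu_count() or 1
    return max(1, total_cores - reserve)


if __name__ == "__main__":
    print(f"--cpu_infer {suggest_cpu_infer()}")
```

If you know the physical core count of your machine, pass it explicitly, for example `suggest_cpu_infer(32)`.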
Getting started with KTransformers is simple! Follow the steps below to set up and start using it.
<h3 id="suggested-model"> Suggested Model</h3>
### 📥 Installation
| Model Name | Model Size | VRAM | Minimum DRAM | Recommended DRAM |
To install KTransformers, follow the official [Installation Guide](https://kvcache-ai.github.io/ktransformers/).
More will come soon. Please let us know which models you are most interested in.
Be aware that your use of [DeepSeek](https://huggingface.co/deepseek-ai/DeepSeek-V2/blob/main/LICENSE) and [Qwen](https://huggingface.co/Qwen/Qwen2-72B-Instruct/blob/main/LICENSE) is subject to their corresponding model licenses.
<details>
<summary>Click To Show how to run other examples</summary>
More information about the RESTful API server can be found [here](doc/en/api/server/server.md). You can also find an example of integrating with Tabby [here](doc/en/api/server/tabby.md).
<h2 id="tutorial">📃 Brief Injection Tutorial</h2>
At the heart of KTransformers is a user-friendly, template-based injection framework.
This allows researchers to easily replace original torch modules with optimized variants. It also simplifies the process of combining multiple optimizations, allowing the exploration of their synergistic effects.
# GPT-4/o1-level Local VSCode Copilot on a Desktop with only 24GB VRAM
- [SUMMARY](#summary)
- [Prerequisites](#prerequisites)
- [Show Case Environment](#show-case-environment)
- [Bench Result](#bench-result)
- [V0.2](#v02)
- [Settings](#settings)
...
...
@@ -50,7 +50,7 @@ We also give our upcoming optimizations previews, including an Intel AMX-acceler
The binary distribution is available now and the source code will come ASAP! Check out the wheel package [here](https://github.com/kvcache-ai/ktransformers/releases/download/v0.1.4/ktransformers-0.3.0rc0+cu126torch26fancy-cp311-cp311-linux_x86_64.whl)
## Prerequisites
## Show Case Environment
We run our best performance tests (V0.2) on <br>
CPU: Intel (R) Xeon (R) Gold 6454S 1T DRAM (2 NUMA nodes) <br>
GPU: 4090D 24G VRAM <br>
...
...
@@ -110,10 +110,6 @@ is speed up which is inspiring. So our showcase makes use of this finding*
`<your gguf path>` can also be online, but as it's large we recommend that you download it and quantize the model to the format you want (note that it is the directory path). <br>
`--max_new_tokens 1000` is the maximum output token length. If you find the answer is truncated, you
can increase the number for a longer answer (but be aware of OOM, and note that increasing it will slow down the generation rate).
<br>
The command `numactl -N 1 -m 1` aims to avoid data transfer between NUMA nodes.<br>
Attention! If you are testing R1, it may skip thinking. In that case, you can add the argument `--force_think true`. This is explained in the [FAQ](#faq) section.
#### Dual socket version (64 cores)
Before you install (using install.sh or `make dev_install`), make sure to set the env var `USE_NUMA=1` by running `export USE_NUMA=1` (if already installed, reinstall with this env var set). You may check the doc [here](./install.md) for install details. <br>
Our local_chat test command is:
# Tutorial: Heterogeneous and Local MoE Inference
DeepSeek-(Code)-V2 is a series of strong mixture-of-experts (MoE) models, featuring a total of 236 billion parameters, with 21 billion parameters activated per token. This model has demonstrated remarkable reasoning capabilities across various benchmarks, positioning it as one of the SOTA open models and nearly comparable in performance to GPT-4. DeepSeek-R1 uses a similar architecture to DeepSeek-V2, but with a larger number of parameters.
In this document, we will show you how to run the local_chat.py script to test DeepSeek-R1's performance. There are two versions:
* V0.2 is the current main branch.
* V0.3 is a preview version that only provides a binary distribution for now.
* To reproduce our DeepSeek-R1/V3 results, please refer to the [Deepseek-R1/V3 Tutorial](./DeepseekR1_V3_tutorial.md) for more detailed settings after installation.
## Preparation
Some preparation:
- CUDA 12.1 or above. If you don't have it yet, you can install it from [here](https://developer.nvidia.com/cuda-downloads).
- We recommend using [Conda](https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh) to create a virtual environment with Python=3.11 to run our program.
```sh
conda create --name ktransformers python=3.11
conda activate ktransformers # you may need to run 'conda init' and reopen the shell first
```
- Make sure that PyTorch, packaging, and ninja are installed.
<!-- 1. ~~Use a Docker image, see [documentation for Docker](./doc/en/Docker.md)~~
>We are working on the latest docker image, please wait for a while.
2. ~~You can install using Pypi (for linux):~~
> We are working on the latest pypi package, please wait for a while.
```
pip install ktransformers --no-build-isolation
```
for windows we prepare a pre compiled whl package on [ktransformers-0.2.0+cu125torch24avx2-cp312-cp312-win_amd64.whl](https://github.com/kvcache-ai/ktransformers/releases/download/v0.2.0/ktransformers-0.2.0+cu125torch24avx2-cp312-cp312-win_amd64.whl), which require cuda-12.5, torch-2.4, python-3.11, more pre compiled package are being produced. -->
- [Optional] If you want to run with the website, please [compile the website](./doc/en/api/server/website.md) before executing `bash install.sh`.
- For Linux
  - For a simple install:
    ```shell
    bash install.sh
    ```
  - For those who have two CPUs and 1T RAM:
    ```shell
    # Make sure your system has dual sockets and RAM twice the model's size (e.g. 1T RAM for a 512G model)
    export USE_NUMA=1
    bash install.sh # or `make dev_install`
    ```
- For Windows
  ```shell
  install.bat
  ```
* If you are a developer, you can use the makefile to compile and format the code. <br> The detailed usage of the makefile is described [here](./doc/en/makefile_usage.md).
<h3>Local Chat</h3>
We provide a simple command-line local chat Python script that you can run for testing.
> Note: this is a very simple test tool that only supports single-round chat, with no memory of previous inputs. If you want to try the full ability of the model, go to [RESTful API and Web UI](#id_666).
<h4>Run Example</h4>
```shell
# Begin from root of your cloned repo!
# Begin from root of your cloned repo!!
# Begin from root of your cloned repo!!!
# Download mzwing/DeepSeek-V2-Lite-Chat-GGUF from huggingface
```
- `--model_path` (required): Name of the model (such as "deepseek-ai/DeepSeek-V2-Lite-Chat", which will automatically download configs from [Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite)). If you already have local files, you can use that path directly to initialize the model.
  > Note: <strong>.safetensors</strong> files are not required in the directory. We only need config files to build the model and tokenizer.
- `--gguf_path` (required): Path of a directory containing GGUF files, which can be downloaded from [Hugging Face](https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/tree/main). Note that the directory should only contain the GGUF files of the current model, which means you need one separate directory for each model.
- `--optimize_rule_path` (required except for Qwen2Moe and DeepSeek-V2): Path of the YAML file containing optimize rules. There are two rule files pre-written in the [ktransformers/optimize/optimize_rules](ktransformers/optimize/optimize_rules) directory for optimizing DeepSeek-V2 and Qwen2-57B-A14, two SOTA MoE models.
- `--max_new_tokens`: Int (default=1000). Maximum number of new tokens to generate.
- `--cpu_infer`: Int (default=10). The number of CPUs used for inference. Ideally set to (total number of cores - 2).
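
Since `--gguf_path` must point to a directory holding the GGUF files of a single model, a quick check before launching can catch a mixed directory. The sketch below is a heuristic under stated assumptions: the helper `check_gguf_dir` is hypothetical and not part of KTransformers, and it treats the file name without any `-00001-of-00004` style shard suffix as the model's name, which may not hold for every naming scheme.

```python
import re
from pathlib import Path

# Matches shard suffixes such as "-00001-of-00004" at the end of a file stem
# (an assumed naming scheme, not something KTransformers enforces).
_SHARD_SUFFIX = re.compile(r"-\d+-of-\d+$")


def check_gguf_dir(gguf_path):
    """Return the sorted GGUF base names found directly in `gguf_path`.

    Raises ValueError when the directory has no GGUF files or appears to
    mix several models. Hypothetical helper for illustration only.
    """
    directory = Path(gguf_path)
    files = sorted(p for p in directory.iterdir() if p.suffix == ".gguf")
    if not files:
        raise ValueError(f"no .gguf files found in {directory}")
    # Strip the shard suffix so all shards of one model share a base name.
    bases = sorted({_SHARD_SUFFIX.sub("", p.stem) for p in files})
    if len(bases) > 1:
        raise ValueError(f"{directory} seems to mix several models: {bases}")
    return bases
```

Calling `check_gguf_dir("<your gguf path>")` before starting local chat either returns a single base name or raises with the list of names it found.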
More will come soon. Please let us know which models you are most interested in.
Be aware that your use of [DeepSeek](https://huggingface.co/deepseek-ai/DeepSeek-V2/blob/main/LICENSE) and [Qwen](https://huggingface.co/Qwen/Qwen2-72B-Instruct/blob/main/LICENSE) is subject to their corresponding model licenses.
</details>
<details>
<summary>Click To Show how to run other examples</summary>
More information about the RESTful API server can be found [here](doc/en/api/server/server.md). You can also find an example of integrating with Tabby [here](doc/en/api/server/tabby.md).
We assume you have read the [Injection Tutorial](./injection_tutorial.md) and have a basic understanding of how to inject a model. In this tutorial, we will show you how to use KTransformers to run a model on multiple GPUs.
If you have multiple GPUs, you can set the device for each module to different GPUs.
DeepseekV2-Chat has 60 layers; with 2 GPUs, we can allocate 30 layers to each GPU. Complete multi-GPU rule examples are available [here](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml).
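
The even split described above can be sketched in a few lines. This only illustrates the arithmetic: the helper `split_layers` and the `cuda:<index>` device strings are assumptions for the example, not the format KTransformers expects in its rule files.

```python
def split_layers(num_layers, num_gpus):
    """Assign contiguous, near-equal layer ranges to each GPU.

    Returns a dict mapping a device string to a range of layer indices.
    Hypothetical helper for illustration only.
    """
    if num_gpus < 1 or num_layers < num_gpus:
        raise ValueError("need at least one layer per GPU")
    base, extra = divmod(num_layers, num_gpus)
    assignment = {}
    start = 0
    for gpu in range(num_gpus):
        # The first `extra` GPUs take one additional layer when the split is uneven.
        count = base + (1 if gpu < extra else 0)
        assignment[f"cuda:{gpu}"] = range(start, start + count)
        start += count
    return assignment


if __name__ == "__main__":
    for device, layers in split_layers(60, 2).items():
        print(f"{device}: layers {layers.start}-{layers.stop - 1}")
```

For 60 layers and 2 GPUs this assigns layers 0-29 to `cuda:0` and layers 30-59 to `cuda:1`.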
First of all, for multi-GPU, we have to inject a new operator, `KDeepseekV2Model`, and set the division of the layers across the GPUs. For our case, we have to set the `transfer_map` in the `KDeepseekV2Model` operator as follows:
* The loading speed will be significantly slower for each expert moved to the GPU.
* You have to disable CUDA Graph if you want to move the experts to the GPU.
* For DeepSeek-R1/V3, each expert moved to the GPU will consume approximately 6GB of VRAM.
* The first matched rule in the YAML file will be applied. For example, if you have two rules that match the same layer, only the first rule's replacement takes effect.
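
The first-match behaviour in the last point can be illustrated with a small sketch. It does not use the KTransformers rule loader; the rule list, the regular expressions, and the replacement labels below are made up for the example and simply mimic "the first matching rule wins".

```python
import re

# Hypothetical rules, checked in order; each is (name pattern, replacement label).
RULES = [
    (r"^model\.layers\.([0-9]|[12][0-9])\.mlp\.experts$", "experts-on-cuda:0"),
    (r"^model\.layers\..*\.mlp\.experts$", "experts-on-cpu"),
]


def first_match(module_name, rules=RULES):
    """Return the replacement of the first rule whose pattern matches, else None."""
    for pattern, replacement in rules:
        if re.search(pattern, module_name):
            return replacement
    return None


if __name__ == "__main__":
    # Layer 5 matches both rules, but only the first one is applied.
    print(first_match("model.layers.5.mlp.experts"))
    # Layer 45 matches only the second rule.
    print(first_match("model.layers.45.mlp.experts"))
```

Because order decides the outcome, placing the broader pattern first would shadow the more specific one, so narrower rules generally belong earlier in the list.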