<br/><br/>
Our vision for KTransformers is to serve as a flexible platform for experimenting with innovative LLM inference optimizations. Please let us know if you need any other features.
<h2 id="Updates">🔥 Updates</h2>
* **Aug 15, 2024**: Update detailed [TUTORIAL](doc/en/injection_tutorial.md) for injection and multi-GPU.
* **Aug 14, 2024**: Support llamafile as linear backend.
* **Aug 12, 2024**: Support multiple GPUs; Support new models: Mixtral 8\*7B and 8\*22B; Support q2k, q3k, q5k dequantization on GPU.
* **Aug 9, 2024**: Support Windows native.
<h2 id="show-cases">🔥 Show Cases</h2>
<h2 id="show-cases">🔥 Show Cases</h2>
<h3>GPT-4-level Local VSCode Copilot on a Desktop with only 24GB VRAM</h3>
## Multi-GPU
If you have multiple GPUs, you can set the device for each module to different GPUs.
DeepSeek-V2-Chat has 60 layers; with 2 GPUs, we can allocate 30 layers to each GPU. A complete multi-GPU rule example is available [here](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/optimize/optimize_rules/DeepSeek-V2-Chat-multi-gpu.yaml), and a rough sketch of such a split is shown below.
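
As an illustrative sketch (not the exact contents of the linked rule file), splitting the 60 layers evenly across two GPUs could look like the snippet below. The `match`/`replace` structure and the `generate_device`/`prefill_device` kwargs follow the YAML rule format described in the injection tutorial, while the regex ranges and the `class: "default"` placeholder are assumptions for illustration:

```yaml
# Hypothetical excerpt of a multi-GPU optimize rule file (illustrative, not verbatim).
# Layers 0-29 are placed on cuda:0, layers 30-59 on cuda:1.
- match:
    name: "^model\\.layers\\.([0-9]|[12][0-9])\\."   # matches layers 0-29
  replace:
    class: "default"          # keep the original module class, only override its device
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([345][0-9])\\."        # matches layers 30-59
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
```

Each rule only overrides the device of the layers its regex matches, so each card ends up holding roughly half of the model's weights.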