<h2 id="Updates">🔥 Updates</h2>
* **Feb 10, 2025**: Support DeepseekR1 and V3 on a single GPU (24GB VRAM) or multi-GPU setup with 382GB DRAM, up to 3~64x speedup. The detailed tutorial is [here](./doc/en/DeepseekR1_V3_tutorial.md).
* **Aug 28, 2024**: Support 1M context under the InternLM2.5-7B-Chat-1M model, utilizing 24GB of VRAM and 150GB of DRAM. The detailed tutorial is [here](./doc/en/long_context_tutorial.md).
* **Aug 28, 2024**: Decrease DeepseekV2's required VRAM from 21GB to 11GB.
* **Aug 15, 2024**: Update the detailed [TUTORIAL](doc/en/injection_tutorial.md) for injection and multi-GPU.
- Compared to 4.51 tokens/s in llama.cpp with 2×32 cores, achieving up to **3.03× speedup**.
- Upcoming Open Source Release:
- AMX optimizations and selective expert activation will be open-sourced in v0.3.
- Currently available only in a preview binary distribution, which can be found [here](xxx).
- **Local 236B DeepSeek-Coder-V2:** Running its Q4_K_M version uses only 21GB VRAM and 136GB DRAM and is attainable on a local desktop machine; it even scores better than GPT4-0613 on [BigCodeBench](https://huggingface.co/blog/leaderboard-bigcodebench).
The parameters have the same meaning as before, but since we use a dual-socket machine, we set `cpu_infer` to 65.
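For reference, a dual-socket launch might look like the sketch below. It assumes the `local_chat.py` entry point and flags described in the tutorial linked above; the model and GGUF paths are placeholders, so substitute your own.

```bash
# Sketch of a launch command on a dual-socket machine.
# Entry point and flags follow the DeepseekR1/V3 tutorial; paths are placeholders.
python ./ktransformers/local_chat.py \
  --model_path deepseek-ai/DeepSeek-R1 \
  --gguf_path /path/to/DeepSeek-R1-GGUF \
  --cpu_infer 65 \
  --max_new_tokens 1000
```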
## Some explanations
1. From our experiments on DeepSeekV2, DeepSeekV3 and DeepSeekR1, when we slightly decrease the number of activated experts during inference, the output quality barely changes (within a 1% accuracy drop), but decoding and prefill speed up by about 30%, which is encouraging. Our showcase therefore makes use of this finding and changes the number of activated experts of DeepSeekV3/R1 from 8 to 6. <br>
2. We also want to make further use of the two NUMA nodes on our Xeon Gold CPUs. To avoid the cost of data transfer between nodes, we "copy" the critical matrices onto both nodes, which costs more memory but accelerates both prefill and decoding. However, this method uses a lot of memory and is slow when loading weights, so be patient during loading and monitor the memory usage (see the monitoring sketch after this list). We are considering making this duplication optional and plan to optimize its memory overhead. Stay tuned~ <br>
3. The command argument `--cpu_infer 65` specifies how many cores to use (it is fine if it exceeds the physical core count, but more is not always better; adjust it to slightly below your actual number of cores). <br>
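To keep an eye on the NUMA layout and the memory usage mentioned in point 2, standard Linux tools are sufficient; the commands below are only a sketch and assume the `numactl` package is installed.

```bash
# Show the NUMA topology (node count and per-node memory) before loading.
numactl --hardware

# While the weights are loading, watch overall memory usage (refreshes every 2s).
watch -n 2 free -h

# Or inspect per-NUMA-node memory statistics to see the duplicated weights.
numastat -m
```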