Merge pull request #1029 from kvcache-ai/mian-update-doc

fix local_chat bug and update doc

72e8e16f · wang jiahao · GitHub · 9654bc1c · 1b767293 · 72e8e16f
Unverified commit 72e8e16f, authored Apr 03, 2025 by wang jiahao, committed by GitHub Apr 03, 2025
4 changed files
--- a/doc/README.md
+++ b/doc/README.md
@@ -22,6 +22,7 @@ Our vision for KTransformers is to serve as a flexible platform for experimentin

 <h2 id="Updates">🔥 Updates</h2>

+* **Apr 2, 2025**: Support Multi-concurrency. ([Tutorial](./en/balance-serve.md)).
 * **Mar 27, 2025**: Support Multi-concurrency.
 * **Mar 15, 2025**: Support ROCm on AMD GPU ([Tutorial](./en/ROCm.md)).
 * **Mar 5, 2025**: Support unsloth 1.58/2.51 bits weights and [IQ1_S/FP8 hybrid](./en/fp8_kernel.md) weights. Support 139K [Longer Context](./en/DeepseekR1_V3_tutorial.md#v022-longer-context) for DeepSeek-V3 and R1 in 24GB VRAM.

--- a/doc/en/balance-serve.md
+++ b/doc/en/balance-serve.md
@@ -41,6 +41,16 @@ Implemented **balance_serve** engine based on **FlashInfer** @qiyuxinlin @ovowei
 Implemented a **continuous batching** scheduler in C++ @ErvinXie
 release: bump version v0.2.4 by @Atream @Azure-Tang @ErvinXie  @qiyuxinlin @ovowei @KMSorSMS @SkqLiao

+## Download the Docker image for testing v0.2.4
+Visit the [link](https://hub.docker.com/r/approachingai/ktransformers/tags) to pull the image, using `v0.2.4-AVX512` as an example.
+
+```bash
+docker pull approachingai/ktransformers:v0.2.4-AVX512
+docker run -it --gpus all --privileged --shm-size 64g --name ktrans --network=host -v /mnt:/mnt approachingai/ktransformers:v0.2.4-AVX512 /bin/bash
+# Open a new terminal
+docker exec -it ktrans bash
+```
+
 ## Installation Guide

 ⚠️ Please note that installing this project will replace flashinfer in your environment. It is strongly recommended to create a new conda environment!!!
@@ -49,7 +59,7 @@ release: bump version v0.2.4 by @Atream @Azure-Tang @ErvinXie  @qiyuxinlin @ovow

 ⚠️ Please note that installing this project will replace flashinfer in your environment. It is strongly recommended to create a new conda environment!!!

-### 1. Set Up Conda Environment
+### 2. Set Up Conda Environment

 We recommend using Miniconda3/Anaconda3 for environment management:

@@ -74,6 +84,9 @@ strings ~/anaconda3/envs/ktransformers/lib/libstdc++.so.6 | grep GLIBCXX

 ```bash
 sudo apt install libtbb-dev libssl-dev libcurl4-openssl-dev libaio1 libaio-dev libfmt-dev libgflags-dev zlib1g-dev patchelf
+pip3 install packaging ninja cpufeature numpy openai
+pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
+
 ```

 ### 3. Build ktransformers
@@ -87,7 +100,7 @@ git submodule update --init --recursive

 # Install single NUMA dependencies
 USE_BALANCE_SERVE=1  bash ./install.sh
-# Or Install Dual NUMA dependencies
+# For those who have two cpu and 1T RAM（Dual NUMA）:
 USE_BALANCE_SERVE=1 USE_NUMA=1 bash ./install.sh
 ```


--- a/ktransformers/operators/attention.py
+++ b/ktransformers/operators/attention.py
@@ -422,6 +422,7 @@ class KDeepseekV2Attention(BaseInjectedModule, DeepseekV2Attention):
                if q_len == 1:
                    self.mla_wrapper.plan(None,None,None,
                                        position_ids.squeeze(1)+1,
+                                        None,
                                        self.num_heads,
                                        self.kv_lora_rank,
                                        self.qk_rope_head_dim,

--- a/ktransformers/util/utils.py
+++ b/ktransformers/util/utils.py
@@ -254,7 +254,7 @@ def prefill_and_generate(model, tokenizer, inputs, max_new_tokens=10000, use_cud
        start_time = time.time()
        for i in range(1, max_new_tokens):
            if use_flashinfer_mla:
-                MLAWrapperSingleton.plan_all(None,None,None,position_ids.squeeze(1)+1,
+                MLAWrapperSingleton.plan_all(None,None,None,position_ids.squeeze(1)+1,None,
                                             num_heads, head_dim_ckv, head_dim_kpe, past_key_values.page_size,
                                             model.model.layers[0].self_attn.softmax_scale, torch.bfloat16, torch.bfloat16)
            global warm_uped
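
Note on the two Python hunks: both pass an extra `None` as the fifth positional argument to `plan` / `plan_all`. The sketch below illustrates why a call site written against the old argument order fails once a parameter is inserted mid-signature. The `plan` function here is a hypothetical stand-in; the parameter name `extra` and the trimmed argument list are assumptions, not the real wrapper's signature.

```python
def plan(qo_indptr, kv_indptr, kv_indices, kv_len_arr, extra, num_heads, head_dim_ckv):
    """Hypothetical planner whose signature gained `extra` in fifth position."""
    return {
        "kv_len_arr": kv_len_arr,
        "extra": extra,
        "num_heads": num_heads,
        "head_dim_ckv": head_dim_ckv,
    }


def call_old_style():
    """Call with the pre-change argument list; returns the error text, if any."""
    try:
        # One argument short: 128 lands in `extra`, 512 in `num_heads`,
        # and `head_dim_ckv` is left unfilled.
        plan(None, None, None, [5], 128, 512)
    except TypeError as exc:
        return str(exc)
    return None


def call_fixed():
    """Call with the extra positional `None`, mirroring the change in the diff."""
    return plan(None, None, None, [5], None, 128, 512)


def call_keywords():
    """Keyword arguments keep a call site correct if the parameter order shifts."""
    return plan(
        qo_indptr=None,
        kv_indptr=None,
        kv_indices=None,
        kv_len_arr=[5],
        extra=None,
        num_heads=128,
        head_dim_ckv=512,
    )


if __name__ == "__main__":
    print("old-style call:", call_old_style())
    print("fixed call:", call_fixed())
    print("keyword call matches:", call_keywords() == call_fixed())
```

Passing the arguments by keyword, where the callee allows it, is one way to make call sites like these less sensitive to future signature changes.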