Commit 3a5330b2 authored by Azure-Tang

Merge branch 'main' into work-concurrent

parents 80c5cbec f142f4df
name: Human Eval Score
run-name: Human Eval Score
on: workflow_dispatch
jobs:
  Human-Eval-Score:
    runs-on: self-hosted
    steps:
      - run: echo "🎉 The job was automatically triggered by a ${{ github.event_name }} event."
      - run: echo "🔎 The name of your branch is ${{ github.ref }} and your repository is ${{ github.repository }}."
      - name: Check out repository code
        uses: actions/checkout@v4
      - run: echo "💡 The ${{ github.repository }} repository has been cloned to the runner."
      - name: Human Eval Run
        run: |
          set -e
          source /home/qujing3/anaconda3/etc/profile.d/conda.sh
          conda activate ktransformers-dev
          export PATH=/usr/local/cuda-12.4/bin:$PATH
          export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH
          export CUDA_HOME=/usr/local/cuda-12.4
          cd ${{ github.workspace }}
          python ktransformers/tests/score.py
      - run: echo "This job's status is ${{ job.status }}."
......@@ -25,7 +25,7 @@ Our vision for KTransformers is to serve as a flexible platform for experimentin
* **Mar 27, 2025**: Support Multi-concurrency.
* **Mar 15, 2025**: Support ROCm on AMD GPU ([Tutorial](./doc/en/ROCm.md)).
* **Mar 5, 2025**: Support unsloth 1.58/2.51 bits weights and [IQ1_S/FP8 hybrid](./doc/en/fp8_kernel.md) weights. Support 139K [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022-longer-context) for DeepSeek-V3 and R1 in 24GB VRAM.
* **Mar 5, 2025**: Support unsloth 1.58/2.51 bits weights and [IQ1_S/FP8 hybrid](./doc/en/fp8_kernel.md) weights. Support 139K [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022--v023-longer-context--fp8-kernel) for DeepSeek-V3 and R1 in 24GB VRAM.
* **Feb 25, 2025**: Support [FP8 GPU kernel](./doc/en/fp8_kernel.md) for DeepSeek-V3 and R1; [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022-longer-context).
* **Feb 15, 2025**: Longer Context (from 4K to 8K for 24GB VRAM) & Slightly Faster Speed (+15%, up to 16 Tokens/s), update [docs](./doc/en/DeepseekR1_V3_tutorial.md) and [online books](https://kvcache-ai.github.io/ktransformers/).
* **Feb 10, 2025**: Support DeepSeek-R1 and V3 on single (24GB VRAM)/multi-GPU and 382GB DRAM, up to 3~28x speedup. For a detailed showcase and reproduction tutorial, see [here](./doc/en/DeepseekR1_V3_tutorial.md).
......@@ -163,9 +163,9 @@ If you are interested in our design principles and the implementation of the inj
<h2 id="ack">Acknowledgment and Contributors</h2>
The development of KTransformer is based on the flexible and versatile framework provided by Transformers. We also benefit from advanced kernels such as GGUF/GGML, Llamafile, Marlin, sglang and flashinfer. We are planning to contribute back to the community by upstreaming our modifications.
The development of KTransformers is based on the flexible and versatile framework provided by Transformers. We also benefit from advanced kernels such as GGUF/GGML, Llamafile, Marlin, sglang and flashinfer. We are planning to contribute back to the community by upstreaming our modifications.
KTransformer is actively maintained and developed by contributors from the <a href="https://madsys.cs.tsinghua.edu.cn/">MADSys group</a> at Tsinghua University and members from <a href="http://approaching.ai/">Approaching.AI</a>. We welcome new contributors to join us in making KTransformer faster and easier to use.
KTransformers is actively maintained and developed by contributors from the <a href="https://madsys.cs.tsinghua.edu.cn/">MADSys group</a> at Tsinghua University and members from <a href="http://approaching.ai/">Approaching.AI</a>. We welcome new contributors to join us in making KTransformers faster and easier to use.
<h2 id="ack">Discussion</h2>
......
......@@ -152,9 +152,9 @@ Each rule in the YAML file has two parts: `match` and `replace`. `match`
<h2 id="ack">致谢和贡献者</h2>
The development of KTransformer is based on the flexible and versatile framework provided by Transformers. We also benefit from advanced kernels such as GGUF/GGML, Llamafile, Marlin, sglang, and flashinfer. We plan to give back to the community by contributing our modifications upstream.
The development of KTransformers is based on the flexible and versatile framework provided by Transformers. We also benefit from advanced kernels such as GGUF/GGML, Llamafile, Marlin, sglang, and flashinfer. We plan to give back to the community by contributing our modifications upstream.
KTransformer is actively maintained and developed by members of the <a href="https://madsys.cs.tsinghua.edu.cn/">MADSys group</a> at Tsinghua University and members from <a href="http://approaching.ai/">Approaching.AI</a>. We welcome new contributors to join us in making KTransformer faster and easier to use.
KTransformers is actively maintained and developed by members of the <a href="https://madsys.cs.tsinghua.edu.cn/">MADSys group</a> at Tsinghua University and members from <a href="http://approaching.ai/">Approaching.AI</a>. We welcome new contributors to join us in making KTransformers faster and easier to use.
<h2 id="ack">讨论</h2>
......
WeChatGroup.png replaced (258 KB → 420 KB).
# Ktransformer
# Ktransformers
[Introduction](./README.md)
# Install
......
......@@ -9,7 +9,7 @@ There is a Docker image available for our project, you can pull the docker image
```
docker pull approachingai/ktransformers:0.2.1
```
**Notice**: In this image, we compile ktransformers for CPUs with AVX512 instructions. If your CPU does not support AVX512, it is suggested to recompile and install ktransformer in the /workspace/ktransformers directory within the container.
**Notice**: In this image, we compile ktransformers for CPUs with AVX512 instructions. If your CPU does not support AVX512, it is suggested to recompile and install ktransformers in the /workspace/ktransformers directory within the container.
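Whether a rebuild is needed depends on the host CPU. Below is a minimal sketch for checking AVX-512 support on Linux; the helper is not part of the repository and simply scans /proc/cpuinfo for the `avx512f` flag.

```python
# Hypothetical helper (not in the repo): check whether the host CPU advertises AVX-512 on Linux.
def has_avx512() -> bool:
    try:
        with open("/proc/cpuinfo") as f:
            return "avx512f" in f.read()
    except OSError:
        return False  # /proc unavailable (e.g. non-Linux); assume unsupported


if __name__ == "__main__":
    if has_avx512():
        print("AVX-512 supported: the prebuilt image should work as-is.")
    else:
        print("AVX-512 not supported: rebuild ktransformers in /workspace/ktransformers inside the container.")
```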
## Building docker image locally
- Download Dockerfile in [there](../../Dockerfile)
......
......@@ -118,7 +118,7 @@ From: https://github.com/kvcache-ai/ktransformers/issues/374
1. First, download the latest source code using git.
2. Then, modify DeepSeek-V3-Chat-multi-gpu-4.yaml in the source code and all related YAML files, replacing all instances of KLinearMarlin with KLinearTorch (a helper sketch for this substitution follows the list).
3. Next, you need to build from the ktransformer source code until it compiles successfully on your local machine.
3. Next, you need to build from the ktransformers source code until it compiles successfully on your local machine.
4. Then, install flash-attn. It won't be used, but not installing it will cause an error.
5. Then, modify local_chat.py, replacing all instances of flash_attention_2 with eager.
6. Then, run local_chat.py. Be sure to follow the official tutorial's commands and adjust according to your local machine's parameters.
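Steps 2 and 5 are plain text substitutions. The sketch below applies them from the repository root; the glob pattern and file locations are assumptions about the checkout layout, so verify them before running.

```python
# Hypothetical helper for steps 2 and 5: swap KLinearMarlin -> KLinearTorch in the
# DeepSeek-V3 optimize-rule YAML files and flash_attention_2 -> eager in local_chat.py.
# The glob and paths are illustrative; adjust to your checkout.
from pathlib import Path

for yaml_path in Path("ktransformers/optimize/optimize_rules").glob("DeepSeek-V3-Chat*.yaml"):
    text = yaml_path.read_text()
    yaml_path.write_text(text.replace("KLinearMarlin", "KLinearTorch"))
    print(f"patched {yaml_path}")

chat = Path("ktransformers/local_chat.py")
chat.write_text(chat.read_text().replace("flash_attention_2", "eager"))
print(f"patched {chat}")
```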
......
......@@ -132,6 +132,7 @@ def grouped_topk(hidden_states: torch.Tensor,
renormalize: bool,
num_expert_group: int = 0,
topk_group: int = 0,
routed_scaling_factor: float = 1.0,
scoring_func: str = "sigmoid",
e_score_correction_bias: Optional[torch.Tensor] = None):
......@@ -163,8 +164,8 @@ def grouped_topk(hidden_states: torch.Tensor,
score_mask = group_mask.unsqueeze(-1).expand(
num_token, num_expert_group,
scores.shape[-1] // num_expert_group).reshape(num_token, -1) # [n, e]
tmp_scores = scores.masked_fill(~score_mask.bool(),
float("-inf")) # [n, e]
tmp_scores = scores.masked_fill(~score_mask.bool(), 0.0)
#float("-inf")) # [n, e]
if e_score_correction_bias is not None:
topk_ids = torch.topk(tmp_scores, k=topk, dim=-1, sorted=False)[1]
......@@ -176,9 +177,10 @@ def grouped_topk(hidden_states: torch.Tensor,
dim=-1,
sorted=False)
if renormalize:
topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
if topk > 1 and renormalize:
denominator = topk_weights.sum(dim=-1, keepdim=True) + 1e-20
topk_weights = topk_weights / denominator
topk_weights = topk_weights * routed_scaling_factor # must multiply the scaling factor
return topk_ids.to(torch.long), topk_weights.to(torch.float32)
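Taken together, the hunks above change the post-selection weighting: masked-out experts are filled with 0.0 instead of -inf, renormalization is skipped when only one expert is selected, and the routed scaling factor is always applied. A minimal sketch of just that weighting step (illustrative names, not the full grouped_topk):

```python
# Sketch of the updated weighting step; topk_weights has shape [n_tokens, topk].
import torch

def scale_topk_weights(topk_weights: torch.Tensor, topk: int, renormalize: bool,
                       routed_scaling_factor: float = 1.0) -> torch.Tensor:
    if topk > 1 and renormalize:
        denominator = topk_weights.sum(dim=-1, keepdim=True) + 1e-20  # avoid divide-by-zero
        topk_weights = topk_weights / denominator
    return topk_weights * routed_scaling_factor  # scaling factor is always applied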
class KMoEGateDeepSeekV3(BaseInjectedModule, KMoEGateBase):
......@@ -204,6 +206,7 @@ class KMoEGateDeepSeekV3(BaseInjectedModule, KMoEGateBase):
self.is_windows = os.name == 'nt'
self.use_quant = use_quant
if not self.is_windows and use_quant:
print("injecting gate_linear")
self.gate_linear = nn.Linear(self.gating_dim, self.n_routed_experts, device=generate_device)
self.gate_linear = KTransformersLinear(key + ".ffn_gate_inp",
gguf_loader, config, self.gate_linear, #orig_module
......@@ -212,22 +215,20 @@ class KMoEGateDeepSeekV3(BaseInjectedModule, KMoEGateBase):
self.gate_linear = None
def forward(self, hidden_states) -> torch.Tensor:
if self.is_windows:
if True or self.is_windows:
return self.orig_module.forward(hidden_states)
bsz, seq_len, h = hidden_states.shape
### compute gating score
hidden_states = hidden_states.view(-1, h)
if self.use_quant:
logits = self.gate_linear.forward(logits)
logits = self.gate_linear.forward(hidden_states)
else:
logits = F.linear(
hidden_states.type(torch.float32), self.weight.type(torch.float32), None
)
return grouped_topk(hidden_states, logits,
self.top_k, self.norm_topk_prob,
self.n_group, self.topk_group)
return grouped_topk(hidden_states, logits, self.top_k, self.norm_topk_prob, self.n_group,
self.topk_group, self.routed_scaling_factor, "sigmoid", self.e_score_correction_bias)
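For context, the corrected quantized path feeds `hidden_states` (rather than the previously uninitialized `logits`) into the injected gate linear before calling `grouped_topk` with the new arguments. A rough sketch of that flow with placeholder names (it assumes the module-level `grouped_topk` shown above; it is not the class implementation):

```python
# Rough sketch of the fixed gating flow (placeholder function; assumes grouped_topk from this module).
import torch
import torch.nn.functional as F

def gate_forward(hidden_states, gate_linear, weight, use_quant, top_k, norm_topk_prob,
                 n_group, topk_group, routed_scaling_factor, e_score_correction_bias):
    hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
    if use_quant:
        logits = gate_linear.forward(hidden_states)  # bug fix: pass hidden_states, not logits
    else:
        logits = F.linear(hidden_states.type(torch.float32), weight.type(torch.float32), None)
    return grouped_topk(hidden_states, logits, top_k, norm_topk_prob, n_group,
                        topk_group, routed_scaling_factor, "sigmoid", e_score_correction_bias)
```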
def load(self, w: dict | nn.Parameter | tuple | None = None, device: str|None = None):
if device is None: device = self.device
......
import subprocess
import time
import requests
import sys
import os


def wait_for_server(base_url: str, timeout: int = None) -> None:
    start_time = time.time()
    while True:
        try:
            response = requests.get(
                f"{base_url}/v1/models",
                headers={"Authorization": "Bearer None"},
            )
            if response.status_code == 200:
                print("Server is ready.")
                break
        except requests.exceptions.RequestException:
            time.sleep(1)
            if timeout and time.time() - start_time > timeout:
                raise TimeoutError("Server did not become ready within timeout period")


server_cmd = [
    "numactl", "-N", "1", "-m", "1",
    "/home/qujing3/anaconda3/envs/ktransformers-dev/bin/ktransformers",
    "--model_path", "/home/qujing3/models/DeepSeek-R1-Q4_K_M/config",
    "--gguf_path", "/home/qujing3/models/DeepSeek-V3-GGUF/DeepSeek-V3-Q4_K_M",
    "--port", "10002",
    "--cpu_infer", "48",
    "--optimize_config_path", "ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml",
    "--max_new_tokens", "3000",
    "--cache_lens", "6000"
]

print("Starting ktransformers server...")
print(" ".join(server_cmd))

with open("/tmp/server_log.txt", "w") as f:
    server_process = subprocess.Popen(server_cmd, stdout=f, stderr=f, text=True)

try:
    wait_for_server("http://localhost:10002", timeout=600)

    eval_cmd = ["python", "ktransformers/tests/humaneval/eval_api.py"]
    print("Running eval_api.py...")
    print(f"Command: {' '.join(eval_cmd)}")

    env = os.environ.copy()
    env["PYTHONUNBUFFERED"] = "1"

    eval_process = subprocess.Popen(
        eval_cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        bufsize=1,
        env=env,
        universal_newlines=True
    )

    import threading
    import queue

    def enqueue_output(out, queue):
        for line in iter(out.readline, ''):
            queue.put(line)
        out.close()

    stdout_queue = queue.Queue()
    stderr_queue = queue.Queue()
    stdout_thread = threading.Thread(target=enqueue_output, args=(eval_process.stdout, stdout_queue))
    stderr_thread = threading.Thread(target=enqueue_output, args=(eval_process.stderr, stderr_queue))
    stdout_thread.daemon = True
    stderr_thread.daemon = True
    stdout_thread.start()
    stderr_thread.start()

    while eval_process.poll() is None:
        try:
            line = stdout_queue.get_nowait()
            print(line, end='', flush=True)
        except queue.Empty:
            pass
        try:
            line = stderr_queue.get_nowait()
            print(line, end='', file=sys.stderr, flush=True)
        except queue.Empty:
            pass
        time.sleep(1)

    while not stdout_queue.empty():
        print(stdout_queue.get(), end='', flush=True)
    while not stderr_queue.empty():
        print(stderr_queue.get(), end='', file=sys.stderr, flush=True)

    eval_process.wait()
    print(f"eval_api.py completed with exit code: {eval_process.returncode}")

    evaluate_cmd = [
        "evaluate_functional_correctness",
        "ktransformers/tests/humaneval/results/api/eval_b.jsonl"
    ]
    print("Running evaluate_functional_correctness...")
    print(f"Command: {' '.join(evaluate_cmd)}")

    evaluate_process = subprocess.Popen(
        evaluate_cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        bufsize=1,
        universal_newlines=True
    )

    for line in evaluate_process.stdout:
        print(line, end='', flush=True)
    for line in evaluate_process.stderr:
        print(line, end='', file=sys.stderr, flush=True)

    evaluate_process.wait()
    print(f"evaluate_functional_correctness completed with exit code: {evaluate_process.returncode}")
    if evaluate_process.returncode != 0:
        print(f"evaluate_functional_correctness exited with code {evaluate_process.returncode}")
        sys.exit(evaluate_process.returncode)
finally:
    print("Stopping ktransformers server...")
    server_process.terminate()
    try:
        server_process.wait(timeout=30)
    except subprocess.TimeoutExpired:
        print("Server did not terminate gracefully, forcing...")
        server_process.kill()
\ No newline at end of file