Commit 3a5330b2 authored by Azure-Tang

Merge branch 'main' into work-concurrent

parents 80c5cbec f142f4df
name: Human Eval Score
run-name: Human Eval Score
on: workflow_dispatch
jobs:
  Human-Eval-Score:
    runs-on: self-hosted
    steps:
      - run: echo "🎉 The job was automatically triggered by a ${{ github.event_name }} event."
      - run: echo "🔎 The name of your branch is ${{ github.ref }} and your repository is ${{ github.repository }}."
      - name: Check out repository code
        uses: actions/checkout@v4
      - run: echo "💡 The ${{ github.repository }} repository has been cloned to the runner."
      - name: Human Eval Run
        run: |
          set -e
          source /home/qujing3/anaconda3/etc/profile.d/conda.sh
          conda activate ktransformers-dev
          export PATH=/usr/local/cuda-12.4/bin:$PATH
          export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH
          export CUDA_HOME=/usr/local/cuda-12.4
          cd ${{ github.workspace }}
          python ktransformers/tests/score.py
      - run: echo "This job's status is ${{ job.status }}."
......@@ -25,7 +25,7 @@ Our vision for KTransformers is to serve as a flexible platform for experimentin
* **Mar 27, 2025**: Support Multi-concurrency.
* **Mar 15, 2025**: Support ROCm on AMD GPU ([Tutorial](./doc/en/ROCm.md)).
* **Mar 5, 2025**: Support unsloth 1.58/2.51 bits weights and [IQ1_S/FP8 hybrid](./doc/en/fp8_kernel.md) weights. Support 139K [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022-longer-context) for DeepSeek-V3 and R1 in 24GB VRAM.
* **Mar 5, 2025**: Support unsloth 1.58/2.51 bits weights and [IQ1_S/FP8 hybrid](./doc/en/fp8_kernel.md) weights. Support 139K [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022--v023-longer-context--fp8-kernel) for DeepSeek-V3 and R1 in 24GB VRAM.
* **Feb 25, 2025**: Support [FP8 GPU kernel](./doc/en/fp8_kernel.md) for DeepSeek-V3 and R1; [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022-longer-context).
* **Feb 15, 2025**: Longer Context (from 4K to 8K for 24GB VRAM) & Slightly Faster Speed (+15%, up to 16 Tokens/s), update [docs](./doc/en/DeepseekR1_V3_tutorial.md) and [online books](https://kvcache-ai.github.io/ktransformers/).
* **Feb 10, 2025**: Support DeepSeek-R1 and V3 on single (24GB VRAM)/multi-GPU and 382GB DRAM, up to 3~28x speedup. For a detailed showcase and reproduction tutorial, see [here](./doc/en/DeepseekR1_V3_tutorial.md).
......@@ -163,9 +163,9 @@ If you are interested in our design principles and the implementation of the inj
<h2 id="ack">Acknowledgment and Contributors</h2>
The development of KTransformer is based on the flexible and versatile framework provided by Transformers. We also benefit from advanced kernels such as GGUF/GGML, Llamafile, Marlin, sglang and flashinfer. We are planning to contribute back to the community by upstreaming our modifications.
The development of KTransformers is based on the flexible and versatile framework provided by Transformers. We also benefit from advanced kernels such as GGUF/GGML, Llamafile, Marlin, sglang and flashinfer. We are planning to contribute back to the community by upstreaming our modifications.
KTransformer is actively maintained and developed by contributors from the <a href="https://madsys.cs.tsinghua.edu.cn/">MADSys group</a> at Tsinghua University and members from <a href="http://approaching.ai/">Approaching.AI</a>. We welcome new contributors to join us in making KTransformer faster and easier to use.
KTransformers is actively maintained and developed by contributors from the <a href="https://madsys.cs.tsinghua.edu.cn/">MADSys group</a> at Tsinghua University and members from <a href="http://approaching.ai/">Approaching.AI</a>. We welcome new contributors to join us in making KTransformers faster and easier to use.
<h2 id="ack">Discussion</h2>
......
......@@ -152,9 +152,9 @@ Each rule in the YAML file has two parts: `match` and `replace`. `match`
<h2 id="ack">致谢和贡献者</h2>
The development of KTransformer is based on the flexible and versatile framework provided by Transformers. We also benefit from advanced kernels such as GGUF/GGML, Llamafile, Marlin, sglang, and flashinfer. We plan to give back to the community by contributing our modifications upstream.
The development of KTransformers is based on the flexible and versatile framework provided by Transformers. We also benefit from advanced kernels such as GGUF/GGML, Llamafile, Marlin, sglang, and flashinfer. We plan to give back to the community by contributing our modifications upstream.
KTransformer is actively maintained and developed by members of the <a href="https://madsys.cs.tsinghua.edu.cn/">MADSys group</a> at Tsinghua University and members from <a href="http://approaching.ai/">Approaching.AI</a>. We welcome new contributors to join us in making KTransformer faster and easier to use.
KTransformers is actively maintained and developed by members of the <a href="https://madsys.cs.tsinghua.edu.cn/">MADSys group</a> at Tsinghua University and members from <a href="http://approaching.ai/">Approaching.AI</a>. We welcome new contributors to join us in making KTransformers faster and easier to use.
<h2 id="ack">讨论</h2>
......
WeChatGroup.png replaced (258 KB → 420 KB).
# Ktransformer
# Ktransformers
[Introduction](./README.md)
# Install
......
......@@ -9,7 +9,7 @@ There is a Docker image available for our project, you can pull the docker image
```
docker pull approachingai/ktransformers:0.2.1
```
**Notice**: In this image, we compile ktransformers for CPUs with AVX512 instructions. If your CPU does not support AVX512, it is suggested to recompile and install ktransformer in the /workspace/ktransformers directory within the container.
**Notice**: In this image, we compile ktransformers for CPUs with AVX512 instructions. If your CPU does not support AVX512, it is suggested to recompile and install ktransformers in the /workspace/ktransformers directory within the container.
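Whether a rebuild is needed depends on the host CPU. Below is a minimal sketch for checking AVX-512 support on Linux; the helper is not part of the repository and simply scans /proc/cpuinfo for the `avx512f` flag.

```python
# Hypothetical helper (not in the repo): check whether the host CPU advertises AVX-512 on Linux.
def has_avx512() -> bool:
    try:
        with open("/proc/cpuinfo") as f:
            return "avx512f" in f.read()
    except OSError:
        return False  # /proc unavailable (e.g. non-Linux); assume unsupported


if __name__ == "__main__":
    if has_avx512():
        print("AVX-512 supported: the prebuilt image should work as-is.")
    else:
        print("AVX-512 not supported: rebuild ktransformers in /workspace/ktransformers inside the container.")
```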
## Building docker image locally
- Download Dockerfile in [there](../../Dockerfile)
......
......@@ -118,7 +118,7 @@ From: https://github.com/kvcache-ai/ktransformers/issues/374
1. First, download the latest source code using git.
2. Then, modify DeepSeek-V3-Chat-multi-gpu-4.yaml in the source code and all related YAML files, replacing all instances of KLinearMarlin with KLinearTorch (a helper sketch for this substitution follows the list).
3. Next, you need to build from the ktransformer source code until it compiles successfully on your local machine.
3. Next, you need to build from the ktransformers source code until it compiles successfully on your local machine.
4. Then, install flash-attn. It won't be used, but not installing it will cause an error.
5. Then, modify local_chat.py, replacing all instances of flash_attention_2 with eager.
6. Then, run local_chat.py. Be sure to follow the official tutorial's commands and adjust according to your local machine's parameters.
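Steps 2 and 5 are plain text substitutions. The sketch below applies them from the repository root; the glob pattern and file locations are assumptions about the checkout layout, so verify them before running.

```python
# Hypothetical helper for steps 2 and 5: swap KLinearMarlin -> KLinearTorch in the
# DeepSeek-V3 optimize-rule YAML files and flash_attention_2 -> eager in local_chat.py.
# The glob and paths are illustrative; adjust to your checkout.
from pathlib import Path

for yaml_path in Path("ktransformers/optimize/optimize_rules").glob("DeepSeek-V3-Chat*.yaml"):
    text = yaml_path.read_text()
    yaml_path.write_text(text.replace("KLinearMarlin", "KLinearTorch"))
    print(f"patched {yaml_path}")

chat = Path("ktransformers/local_chat.py")
chat.write_text(chat.read_text().replace("flash_attention_2", "eager"))
print(f"patched {chat}")
```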
......
......@@ -132,6 +132,7 @@ def grouped_topk(hidden_states: torch.Tensor,
renormalize: bool,
num_expert_group: int = 0,
topk_group: int = 0,
routed_scaling_factor: float = 1.0,
scoring_func: str = "sigmoid",
e_score_correction_bias: Optional[torch.Tensor] = None):
......@@ -163,8 +164,8 @@ def grouped_topk(hidden_states: torch.Tensor,
score_mask = group_mask.unsqueeze(-1).expand(
num_token, num_expert_group,
scores.shape[-1] // num_expert_group).reshape(num_token, -1) # [n, e]
tmp_scores = scores.masked_fill(~score_mask.bool(),
float("-inf")) # [n, e]
tmp_scores = scores.masked_fill(~score_mask.bool(), 0.0)
#float("-inf")) # [n, e]
if e_score_correction_bias is not None:
topk_ids = torch.topk(tmp_scores, k=topk, dim=-1, sorted=False)[1]
......@@ -176,9 +177,10 @@ def grouped_topk(hidden_states: torch.Tensor,
dim=-1,
sorted=False)
if renormalize:
topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
if topk > 1 and renormalize:
denominator = topk_weights.sum(dim=-1, keepdim=True) + 1e-20
topk_weights = topk_weights / denominator
topk_weights = topk_weights * routed_scaling_factor # must multiply the scaling factor
return topk_ids.to(torch.long), topk_weights.to(torch.float32)
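Taken together, the hunks above change the post-selection weighting: masked-out experts are filled with 0.0 instead of -inf, renormalization is skipped when only one expert is selected, and the routed scaling factor is always applied. A minimal sketch of just that weighting step (illustrative names, not the full grouped_topk):

```python
# Sketch of the updated weighting step; topk_weights has shape [n_tokens, topk].
import torch

def scale_topk_weights(topk_weights: torch.Tensor, topk: int, renormalize: bool,
                       routed_scaling_factor: float = 1.0) -> torch.Tensor:
    if topk > 1 and renormalize:
        denominator = topk_weights.sum(dim=-1, keepdim=True) + 1e-20  # avoid divide-by-zero
        topk_weights = topk_weights / denominator
    return topk_weights * routed_scaling_factor  # scaling factor is always applied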
class KMoEGateDeepSeekV3(BaseInjectedModule, KMoEGateBase):
......@@ -204,6 +206,7 @@ class KMoEGateDeepSeekV3(BaseInjectedModule, KMoEGateBase):
self.is_windows = os.name == 'nt'
self.use_quant = use_quant
if not self.is_windows and use_quant:
print("injecting gate_linear")
self.gate_linear = nn.Linear(self.gating_dim, self.n_routed_experts, device=generate_device)
self.gate_linear = KTransformersLinear(key + ".ffn_gate_inp",
gguf_loader, config, self.gate_linear, #orig_module
......@@ -212,22 +215,20 @@ class KMoEGateDeepSeekV3(BaseInjectedModule, KMoEGateBase):
self.gate_linear = None
def forward(self, hidden_states) -> torch.Tensor:
if self.is_windows:
if True or self.is_windows:
return self.orig_module.forward(hidden_states)
bsz, seq_len, h = hidden_states.shape
### compute gating score
hidden_states = hidden_states.view(-1, h)
if self.use_quant:
logits = self.gate_linear.forward(logits)
logits = self.gate_linear.forward(hidden_states)
else:
logits = F.linear(
hidden_states.type(torch.float32), self.weight.type(torch.float32), None
)
return grouped_topk(hidden_states, logits,
self.top_k, self.norm_topk_prob,
self.n_group, self.topk_group)
return grouped_topk(hidden_states, logits, self.top_k, self.norm_topk_prob, self.n_group,
self.topk_group, self.routed_scaling_factor, "sigmoid", self.e_score_correction_bias)
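For context, the corrected quantized path feeds `hidden_states` (rather than the previously uninitialized `logits`) into the injected gate linear before calling `grouped_topk` with the new arguments. A rough sketch of that flow with placeholder names (it assumes the module-level `grouped_topk` shown above; it is not the class implementation):

```python
# Rough sketch of the fixed gating flow (placeholder function; assumes grouped_topk from this module).
import torch
import torch.nn.functional as F

def gate_forward(hidden_states, gate_linear, weight, use_quant, top_k, norm_topk_prob,
                 n_group, topk_group, routed_scaling_factor, e_score_correction_bias):
    hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
    if use_quant:
        logits = gate_linear.forward(hidden_states)  # bug fix: pass hidden_states, not logits
    else:
        logits = F.linear(hidden_states.type(torch.float32), weight.type(torch.float32), None)
    return grouped_topk(hidden_states, logits, top_k, norm_topk_prob, n_group,
                        topk_group, routed_scaling_factor, "sigmoid", e_score_correction_bias)
```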
def load(self, w: dict | nn.Parameter | tuple | None = None, device: str|None = None):
if device is None: device = self.device
......
import subprocess
import time
import requests
import sys
import os


def wait_for_server(base_url: str, timeout: int = None) -> None:
    start_time = time.time()
    while True:
        try:
            response = requests.get(
                f"{base_url}/v1/models",
                headers={"Authorization": "Bearer None"},
            )
            if response.status_code == 200:
                print("Server is ready.")
                break
        except requests.exceptions.RequestException:
            time.sleep(1)
            if timeout and time.time() - start_time > timeout:
                raise TimeoutError("Server did not become ready within timeout period")


server_cmd = [
    "numactl", "-N", "1", "-m", "1",
    "/home/qujing3/anaconda3/envs/ktransformers-dev/bin/ktransformers",
    "--model_path", "/home/qujing3/models/DeepSeek-R1-Q4_K_M/config",
    "--gguf_path", "/home/qujing3/models/DeepSeek-V3-GGUF/DeepSeek-V3-Q4_K_M",
    "--port", "10002",
    "--cpu_infer", "48",
    "--optimize_config_path", "ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml",
    "--max_new_tokens", "3000",
    "--cache_lens", "6000"
]

print("Starting ktransformers server...")
print(" ".join(server_cmd))

with open("/tmp/server_log.txt", "w") as f:
    server_process = subprocess.Popen(server_cmd, stdout=f, stderr=f, text=True)

try:
    wait_for_server("http://localhost:10002", timeout=600)

    eval_cmd = ["python", "ktransformers/tests/humaneval/eval_api.py"]
    print("Running eval_api.py...")
    print(f"Command: {' '.join(eval_cmd)}")

    env = os.environ.copy()
    env["PYTHONUNBUFFERED"] = "1"

    eval_process = subprocess.Popen(
        eval_cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        bufsize=1,
        env=env,
        universal_newlines=True
    )

    import threading
    import queue

    def enqueue_output(out, queue):
        for line in iter(out.readline, ''):
            queue.put(line)
        out.close()

    stdout_queue = queue.Queue()
    stderr_queue = queue.Queue()
    stdout_thread = threading.Thread(target=enqueue_output, args=(eval_process.stdout, stdout_queue))
    stderr_thread = threading.Thread(target=enqueue_output, args=(eval_process.stderr, stderr_queue))
    stdout_thread.daemon = True
    stderr_thread.daemon = True
    stdout_thread.start()
    stderr_thread.start()

    while eval_process.poll() is None:
        try:
            line = stdout_queue.get_nowait()
            print(line, end='', flush=True)
        except queue.Empty:
            pass
        try:
            line = stderr_queue.get_nowait()
            print(line, end='', file=sys.stderr, flush=True)
        except queue.Empty:
            pass
        time.sleep(1)

    while not stdout_queue.empty():
        print(stdout_queue.get(), end='', flush=True)
    while not stderr_queue.empty():
        print(stderr_queue.get(), end='', file=sys.stderr, flush=True)

    eval_process.wait()
    print(f"eval_api.py completed with exit code: {eval_process.returncode}")

    evaluate_cmd = [
        "evaluate_functional_correctness",
        "ktransformers/tests/humaneval/results/api/eval_b.jsonl"
    ]
    print("Running evaluate_functional_correctness...")
    print(f"Command: {' '.join(evaluate_cmd)}")

    evaluate_process = subprocess.Popen(
        evaluate_cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        bufsize=1,
        universal_newlines=True
    )

    for line in evaluate_process.stdout:
        print(line, end='', flush=True)
    for line in evaluate_process.stderr:
        print(line, end='', file=sys.stderr, flush=True)

    evaluate_process.wait()
    print(f"evaluate_functional_correctness completed with exit code: {evaluate_process.returncode}")
    if evaluate_process.returncode != 0:
        print(f"evaluate_functional_correctness exited with code {evaluate_process.returncode}")
        sys.exit(evaluate_process.returncode)
finally:
    print("Stopping ktransformers server...")
    server_process.terminate()
    try:
        server_process.wait(timeout=30)
    except subprocess.TimeoutExpired:
        print("Server did not terminate gracefully, forcing...")
        server_process.kill()
\ No newline at end of file