Commit 0bc665e9 authored by fengchao

Merge dcu patch to 0.3.1

parents 32f3d7be b8596252
File added
<div align="center">
<!-- <h1>KTransformers</h1> -->
<p align="center">
# <div align="center"><strong>KTransformers</strong></div>
## Introduction
🤗 Transformers provides thousands of pretrained models for text classification, information extraction, question answering, summarization, translation, and text generation in more than 100 languages. Its aim is to make state-of-the-art NLP easy for everyone to use.
<picture>
<img alt="KTransformers" src="https://github.com/user-attachments/assets/d5a2492f-a415-4456-af99-4ab102f13f8b" width=50%>
🤗 Transformers provides APIs to quickly download and use pretrained models on a given text, fine-tune them on your own datasets, and share them with the community via the [model hub](https://huggingface.co/models). At the same time, each defined Python module is fully standalone, making it easy to modify and to run quick research experiments.
</picture>
🤗 Transformers is backed by the three most popular deep learning libraries, [Jax](https://jax.readthedocs.io/en/latest/), [PyTorch](https://pytorch.org/), and [TensorFlow](https://www.tensorflow.org/), with seamless integration between them. You can train your model with one framework and then load it for inference with another.
</p>
<h3>A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations</h3>
<strong><a href="#show-cases">🌟 Show Cases</a> | <a href="#quick-start">🚀 Quick Start</a> | <a href="#tutorial">📃 Tutorial</a> | <a href="https://github.com/kvcache-ai/ktransformers/discussions">💬 Discussion </a>|<a href="#FAQ"> 🙋 FAQ</a> </strong>
</div>
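The 🤗 Transformers description above amounts to a few lines of code in practice; a minimal sketch using the standard `pipeline` API (the task and input text are illustrative):

```python
from transformers import pipeline

# Download a pretrained model and run it on a given text.
classifier = pipeline("sentiment-analysis")
print(classifier("KTransformers makes local LLM inference pleasantly fast."))
```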
## Installation
Supported component combinations
<h2 id="intro">🎉 Introduction</h2>
KTransformers, pronounced as Quick Transformers, is designed to enhance your 🤗 <a href="https://github.com/huggingface/transformers">Transformers</a> experience with advanced kernel optimizations and placement/parallelism strategies.
<br/><br/>
KTransformers is a flexible, Python-centric framework designed with extensibility at its core.
By implementing and injecting an optimized module with a single line of code, users gain access to a Transformers-compatible
interface, RESTful APIs compliant with OpenAI and Ollama, and even a simplified ChatGPT-like web UI.
<br/><br/>
Our vision for KTransformers is to serve as a flexible platform for experimenting with innovative LLM inference optimizations. Please let us know if you need any other features.
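For example, because the RESTful server follows the OpenAI API shape, a stock OpenAI client can talk to it directly. The sketch below assumes a locally running server; the base URL, port, and model name are placeholders rather than documented defaults:

```python
from openai import OpenAI

# Point a standard OpenAI client at a local KTransformers server
# (adjust base_url and model to match your own deployment).
client = OpenAI(base_url="http://localhost:10002/v1", api_key="not-needed")
reply = client.chat.completions.create(
    model="DeepSeek-V3",
    messages=[{"role": "user", "content": "Hello from KTransformers!"}],
)
print(reply.choices[0].message.content)
```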
| PyTorch version | fastpt version | KTransformers version | DTK version | Python version | Recommended build method |
| --------------- | -------------- | --------------------- | ----------- | -------------- | ------------------------ |
| 2.4.1 | 2.0.1 | 0.2.3 | >= 25.04 | 3.8, 3.10, 3.11 | fastpt without transcoding |
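A quick way to confirm that an installed environment matches a row of the table above (a sketch; it only assumes the three packages are importable and expose `__version__`):

```python
import torch
import ktransformers

print("PyTorch       :", torch.__version__)
print("KTransformers :", ktransformers.__version__)
try:
    import fastpt
    print("fastpt        :", getattr(fastpt, "__version__", "unknown"))
except ImportError:
    print("fastpt is not installed")
```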
<h2 id="Updates">🔥 Updates</h2>
+ For PyTorch versions above 2.4.1 with DTK versions above 25.04, the fastpt no-transcoding build is recommended.
* **May 14, 2025**: Support Intel Arc GPU ([Tutorial](./doc/en/xpu.md)).
* **Apr 29, 2025**: Support AMX-Int8, AMX-BF16, and Qwen3MoE ([Tutorial](./doc/en/AMX.md)).
https://github.com/user-attachments/assets/fafe8aec-4e22-49a8-8553-59fb5c6b00a2
* **Apr 9, 2025**: Experimental support for LLaMA 4 models ([Tutorial](./doc/en/llama4.md)).
* **Apr 2, 2025**: Support Multi-concurrency. ([Tutorial](./doc/en/balance-serve.md)).
https://github.com/user-attachments/assets/faa3bda2-928b-45a7-b44f-21e12ec84b8a
* **Mar 15, 2025**: Support ROCm on AMD GPU ([Tutorial](./doc/en/ROCm.md)).
* **Mar 5, 2025**: Support unsloth 1.58/2.51 bits weights and [IQ1_S/FP8 hybrid](./doc/en/fp8_kernel.md) weights. Support 139K [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022--v023-longer-context--fp8-kernel) for DeepSeek-V3 and R1 in 24GB VRAM.
* **Feb 25, 2025**: Support [FP8 GPU kernel](./doc/en/fp8_kernel.md) for DeepSeek-V3 and R1; [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022-longer-context).
* **Feb 15, 2025**: Longer Context (from 4K to 8K for 24GB VRAM) & Slightly Faster Speed (+15%, up to 16 Tokens/s), update [docs](./doc/en/DeepseekR1_V3_tutorial.md) and [online books](https://kvcache-ai.github.io/ktransformers/).
* **Feb 10, 2025**: Support Deepseek-R1 and V3 on single (24GB VRAM)/multi gpu and 382G DRAM, up to 3~28x speedup. For detailed show case and reproduction tutorial, see [here](./doc/en/DeepseekR1_V3_tutorial.md).
* **Aug 28, 2024**: Decrease DeepseekV2's required VRAM from 21G to 11G.
* **Aug 15, 2024**: Update detailed [tutorial](doc/en/injection_tutorial.md) for injection and multi-GPU.
* **Aug 14, 2024**: Support llamafile as a linear backend.
* **Aug 12, 2024**: Support multiple GPUs; support new models: Mixtral 8\*7B and 8\*22B; support q2k, q3k, q5k dequantization on GPU.
* **Aug 9, 2024**: Support Windows natively.
<!-- * **Aug 28, 2024**: Support 1M context under the InternLM2.5-7B-Chat-1M model, utilizing 24GB of VRAM and 150GB of DRAM. The detailed tutorial is [here](./doc/en/long_context_tutorial.md). -->
<h2 id="show-cases">🌟 Show Cases</h2>
<div>
<h3>GPT-4/o1-level Local VSCode Copilot on a Desktop with only 24GB VRAM</h3>
</div>
https://github.com/user-attachments/assets/ebd70bfa-b2c1-4abb-ae3b-296ed38aa285
</p>
- **[NEW!!!] Local 671B DeepSeek-Coder-V3/R1:** Running its Q4_K_M version using only 14GB VRAM and 382GB DRAM ([Tutorial](./doc/en/DeepseekR1_V3_tutorial.md)).
- Prefill Speed (tokens/s):
- KTransformers: 54.21 (32 cores) → 74.362 (dual-socket, 2×32 cores) → 255.26 (optimized AMX-based MoE kernel, V0.3 only) → 286.55 (selectively using 6 experts, V0.3 only)
- Compared to 10.31 tokens/s in llama.cpp with 2×32 cores, achieving up to **27.79× speedup**.
- Decode Speed (tokens/s):
- KTransformers: 8.73 (32 cores) → 11.26 (dual-socket, 2×32 cores) → 13.69 (selectively using 6 experts, V0.3 only)
- Compared to 4.51 tokens/s in llama.cpp with 2×32 cores, achieving up to **3.03× speedup**.
- Upcoming Open Source Release:
- AMX optimizations and selective expert activation will be open-sourced in V0.3.
- Currently available only in preview binary distribution, which can be downloaded [here](./doc/en/DeepseekR1_V3_tutorial.md).
- **Local 236B DeepSeek-Coder-V2:** Running its Q4_K_M version using only 21GB VRAM and 136GB DRAM, attainable on a local desktop machine, which scores even better than GPT4-0613 in [BigCodeBench](https://huggingface.co/blog/leaderboard-bigcodebench).
<p align="center">
<picture>
<img alt="DeepSeek-Coder-V2 Score" src="https://github.com/user-attachments/assets/d052924e-8631-44de-aad2-97c54b965693" width=100%>
</picture>
</p>
- **Faster Speed:** Achieving 126 tokens/s for 2K prompt prefill and 13.6 tokens/s for generation through MoE offloading and injecting advanced kernels from [Llamafile](https://github.com/Mozilla-Ocho/llamafile/tree/main) and [Marlin](https://github.com/IST-DASLab/marlin).
- **VSCode Integration:** Wrapped into an OpenAI and Ollama compatible API for seamless integration as a backend for [Tabby](https://github.com/TabbyML/tabby) and various other frontends.
<p align="center">
https://github.com/user-attachments/assets/4c6a8a38-05aa-497d-8eb1-3a5b3918429c
</p>
<!-- <h3>1M Context Local Inference on a Desktop with Only 24GB VRAM</h3>
<p align="center">
https://github.com/user-attachments/assets/a865e5e4-bca3-401e-94b8-af3c080e6c12
* **1M Context InternLM 2.5 7B**: Operates at full bf16 precision, utilizing 24GB VRAM and 150GB DRAM, which is feasible on a local desktop setup. It achieves a 92.88% success rate on the 1M "Needle In a Haystack" test and 100% on the 128K NIAH test.
<p align="center">
<picture>
<img alt="Single Needle Retrieval 128K" src="./doc/assets/needle_128K.png" width=100%>
</picture>
</p>
<p align="center">
<picture>
<img alt="Single Needle Retrieval 1000K" src="./doc/assets/needle_1M.png" width=100%>
</picture>
</p>
* **Enhanced Speed**: Reaches 16.91 tokens/s for generation with a 1M context using sparse attention, powered by llamafile kernels. This method is over 10 times faster than full attention approach of llama.cpp.
* **Flexible Sparse Attention Framework**: Offers a flexible block sparse attention framework for CPU offloaded decoding. Compatible with SnapKV, Quest, and InfLLm. Further information is available [here](./doc/en/long_context_introduction.md).
-->
<strong>More advanced features are coming soon, so stay tuned!</strong>
<h2 id="quick-start">🚀 Quick Start</h2>
Getting started with KTransformers is simple! Follow the steps below to set up and start using it.
### 📥 Installation
To install KTransformers, follow the official [Installation Guide](https://kvcache-ai.github.io/ktransformers/en/install.html).
<h2 id="tutorial">📃 Brief Injection Tutorial</h2>
At the heart of KTransformers is a user-friendly, template-based injection framework.
This allows researchers to easily replace original torch modules with optimized variants. It also simplifies the process of combining multiple optimizations, allowing the exploration of their synergistic effects.
</br>
<p align="center">
<picture>
<img alt="Inject-Struction" src="https://github.com/user-attachments/assets/6b4c1e54-9f6d-45c5-a3fc-8fa45e7d257e" width=65%>
</picture>
</p>
Given that vLLM already serves as a great framework for large-scale deployment optimizations, KTransformers is particularly focused on local deployments that are constrained by limited resources. We pay special attention to heterogeneous computing opportunities, such as GPU/CPU offloading of quantized models. For example, we support the efficient <a href="https://github.com/Mozilla-Ocho/llamafile/tree/main">Llamafile</a> and <a href="https://github.com/IST-DASLab/marlin">Marlin</a> kernels for CPU and GPU, respectively. More details can be found <a href="doc/en/operators/llamafile.md">here</a>.
<h3>Example Usage</h3>
To utilize the provided kernels, users only need to create a YAML-based injection template and add the call to `optimize_and_load_gguf` before using the Transformers model.
```python
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
optimize_and_load_gguf(model, optimize_config_path, gguf_path, config)
...
generated = prefill_and_generate(model, tokenizer, input_tensor.cuda(), max_new_tokens=1000)
```
### 1. Installing with pip
KTransformers wheel download directory: [光合开发者社区](https://download.sourcefind.cn:65024/4/main). Choose the KTransformers wheel that matches your PyTorch and Python versions.
```shell
pip install torch*  # the downloaded torch wheel
pip install fastpt* --no-deps  # the downloaded fastpt wheel
source /usr/local/bin/fastpt -E
pip install ktransformers* --no-deps  # the downloaded ktransformers-fastpt wheel
```
### 2. Installing from source
In this example, the AutoModel is first initialized on the meta device to avoid occupying any memory resources. Then, `optimize_and_load_gguf` iterates through all sub-modules of the model, matches rules specified in your YAML rule file, and replaces them with advanced modules as specified.
After injection, the original `generate` interface is available, but we also provide a compatible `prefill_and_generate` method, which enables further optimizations like CUDAGraph to improve generation speed.
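In other words, the injected model still behaves like a regular 🤗 Transformers model. A minimal sketch of falling back to the stock `generate` call (it reuses the `model` and `tokenizer` from the snippet above; the prompt is illustrative):

```python
# The injected model keeps the standard Transformers interface,
# so the usual generate() path works alongside prefill_and_generate().
inputs = tokenizer("Write a quicksort function in Python.", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```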
#### Preparing the build environment
A fastpt no-transcoding build is provided:
<h3>How to customize your model</h3>
1. Using the 光源 PyTorch base image: image download address: [光合开发者社区](https://sourcefind.cn/#/image/dcu/pytorch); download the image that matches your PyTorch, Python, DTK, and OS versions.
A detailed tutorial of the injection and multi-GPU using DeepSeek-V2 as an example is given [here](doc/en/injection_tutorial.md).
Below is an example of a YAML template for replacing all original Linear modules with Marlin, an advanced 4-bit quantization kernel.
```yaml
- match:
    name: "^model\\.layers\\..*$"  # regular expression
    class: torch.nn.Linear  # only match modules matching name and class simultaneously
  replace:
    class: ktransformers.operators.linear.KTransformerLinear  # optimized Kernel on quantized data types
    device: "cpu"  # which devices to load this module when initializing
    kwargs:
      generate_device: "cuda"
      generate_linear_type: "QuantizedLinearMarlin"
```
2. Using an existing Python environment: install the PyTorch and fastpt wheels. Wheel download directory: [光合开发者社区](https://sourcefind.cn/#/image/dcu/pytorch); download the PyTorch wheel that matches your Python and DTK versions. Install with the following commands:
```shell
pip install cpufeature
pip install torch*  # the downloaded torch wheel
pip install fastpt* --no-deps  # the downloaded fastpt wheel; install torch first, then fastpt
pip install setuptools==59.5.0 wheel
```
Each rule in the YAML file has two parts: `match` and `replace`. The `match` part specifies which module should be replaced, and the `replace` part specifies the module to be injected into the model along with the initialization keywords.
You can find example rule templates for optimizing DeepSeek-V2 and Qwen2-57B-A14, two SOTA MoE models, in the [ktransformers/optimize/optimize_rules](ktransformers/optimize/optimize_rules) directory. These templates are used to power the `local_chat.py` demo.
If you are interested in our design principles and the implementation of the injection framework, please refer to the [design document](doc/en/deepseek-v2-injection.md).
<h2 id="ack">Acknowledgment and Contributors</h2>
The development of KTransformers is based on the flexible and versatile framework provided by Transformers. We also benefit from advanced kernels such as GGUF/GGML, Llamafile, Marlin, sglang and flashinfer. We are planning to contribute back to the community by upstreaming our modifications.
KTransformers is actively maintained and developed by contributors from the <a href="https://madsys.cs.tsinghua.edu.cn/">MADSys group</a> at Tsinghua University and members from <a href="http://approaching.ai/">Approaching.AI</a>. We welcome new contributors to join us in making KTransformers faster and easier to use.
#### Building and installing from source
- Download the code
```shell
git clone https://developer.sourcefind.cn/codes/OpenDAS/ktransformers.git  # switch branches as needed for your build
```
- Two source build options are provided (run the following inside the ktransformers directory):
```
1. Set the environment variable for the no-transcoding build
source /usr/local/bin/fastpt -C
2. Build the wheel package and install it
bash install_dcu.sh
pip3 install dist/ktransformers*.whl --no-deps
```
#### Notes
+ If pip install downloads are slow, add the Tsinghua PyPI mirror: -i https://pypi.tuna.tsinghua.edu.cn/simple/
+ ROCM_PATH is the DTK installation path; it defaults to /opt/dtk
<h2 id="ack">Discussion</h2>
If you have any questions, feel free to open an issue. Alternatively, you can join our WeChat group for further discussion. QR Code: [WeChat Group](WeChatGroup.png)
## Verification
```
python3
Python 3.10.12 (main, Feb 4 2025, 14:57:36) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ktransformers
>>> ktransformers.__version__
'0.2.3post1'
>>>
```
The version number is kept in sync with the official release; querying the package version returns, for example, 0.2.3post1.
<h2 id="FAQ">🙋 FAQ</h2>
Some common questions are answered in the [FAQ](doc/en/FAQ.md).
## Known Issues
-
## References
- [README_ORIGIN](README_ORIGIN.md)
- [README_zh-CN](README_zh-CN.md)
- [https://github.com/kvcache-ai/ktransformers](https://github.com/kvcache-ai/ktransformers)
......
@@ -60,7 +60,7 @@ class ScalarType<nv_bfloat16> {
using FragC = Vec<float, 4>;
using FragS = Vec<nv_bfloat162, 1>;
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800 || KTRANSFORMERS_USE_DTK
static __device__ float inline num2float(const nv_bfloat16 x) {
return __bfloat162float(x);
}
......
#!/bin/bash
set -e
# Clean the build directories and old distribution files
rm -rf build
rm -rf dist
rm -rf *.egg-info
rm -rf ktransformers/ktransformers_ext/build
rm -rf ktransformers/ktransformers_ext/cuda/build
rm -rf ktransformers/ktransformers_ext/cuda/dist
rm -rf ktransformers/ktransformers_ext/cuda/*.egg-info
echo "初始化Git子模块..."
git submodule update --init --recursive
export CMAKE_BUILD_PARALLEL_LEVEL=32
echo "构建ktransformers wheel包"
mkdir -p dist
KTRANSFORMERS_FORCE_BUILD=TRUE pip wheel . -w dist --no-build-isolation --no-deps
echo "生成的wheel包位于:"
ls -l dist/*.whl
echo "构建成功!wheel包已生成在dist目录"
#!/usr/bin/env python
# coding=utf-8
'''
Description :
Author : kkk1nak0
Date : 2024-08-15 07:34:46
Version : 1.0.0
LastEditors : chenxl
LastEditTime : 2025-02-15 03:53:02
'''
__version__ = "0.3"
__hcu_version__ = '0.3+das.dtk2504'
......
@@ -28,7 +28,8 @@ from ktransformers.models.modeling_qwen2_moe import Qwen2MoeForCausalLM
from ktransformers.models.modeling_deepseek_v3 import DeepseekV3ForCausalLM
from ktransformers.models.modeling_llama import LlamaForCausalLM
from ktransformers.models.modeling_mixtral import MixtralForCausalLM
from ktransformers.util.utils import prefill_and_generate, get_compute_capability
#from ktransformers.util.utils import prefill_and_generate, get_compute_capability
from ktransformers.util.utils import prefill_and_generate, get_compute_capability, get_device_name
from ktransformers.server.config.config import Config
from ktransformers.operators.flashinfer_wrapper import flashinfer_enabled
from ktransformers.util.vendors import device_manager, get_device, to_device, GPUVendor
......
@@ -175,7 +176,7 @@ def local_chat(
assert Config().long_context_config['max_seq_len'] > input_tensor.shape[1] + max_new_tokens, \
"please change max_seq_len in ~/.ktransformers/config.yaml"
if system != "Windows" and (config.architectures[0] == "DeepseekV2ForCausalLM" or config.architectures[0] == "DeepseekV3ForCausalLM") and flashinfer_enabled and get_compute_capability() >= 8 and device_manager.gpu_vendor == GPUVendor.NVIDIA:
if system != "Windows" and (config.architectures[0] == "DeepseekV2ForCausalLM" or config.architectures[0] == "DeepseekV3ForCausalLM") and flashinfer_enabled and get_compute_capability() >= 8 and device_manager.gpu_vendor == GPUVendor.NVIDIA or ("Z100" in get_device_name()) or ("Z100L" in get_device_name()) or ("K100" in get_device_name()):
generated = prefill_and_generate(
model, tokenizer, input_tensor.to(device), max_new_tokens, use_cuda_graph, mode = mode, force_think = force_think, chunk_size = chunk_size,
use_flashinfer_mla = True, num_heads = config.num_attention_heads, head_dim_ckv = config.kv_lora_rank, head_dim_kpe = config.qk_rope_head_dim, q_head_dim = config.qk_rope_head_dim + config.qk_nope_head_dim
......
......
@@ -16,7 +16,8 @@ from ktransformers.models.modeling_deepseek import DeepseekV2Attention, apply_ro
from typing import Optional, Tuple
from ktransformers.operators.base_operator import BaseInjectedModule
from ktransformers.util.custom_loader import GGUFLoader
from ktransformers.util.utils import get_compute_capability
#from ktransformers.util.utils import get_compute_capability
from ktransformers.util.utils import get_compute_capability, get_device_name
import logging
from transformers.configuration_utils import PretrainedConfig
from transformers.cache_utils import Cache
......
@@ -703,10 +704,12 @@ class KDeepseekV2Attention(BaseInjectedModule, DeepseekV2Attention):
cache_position,
**kwargs,
)
elif (os.name == 'nt'
or get_compute_capability() < 8
elif (os.name == 'nt' or get_compute_capability() < 8
or hidden_states.device.type == 'cpu'
or device_manager.gpu_vendor != GPUVendor.NVIDIA):
or device_manager.gpu_vendor != GPUVendor.NVIDIA
or ("Z100" in get_device_name())
or ("Z100L" in get_device_name()) or ("K100" in get_device_name())):
print("for Windows or GPU before ampere or Z100/Z100L or K100, use forward_windows")
return self.forward_windows(
hidden_states,
attention_mask,
......
......
@@ -57,8 +57,9 @@ from ktransformers.util.vendors import device_manager, get_device, to_device, GP
from transformers.models.qwen2_moe.configuration_qwen2_moe import Qwen2MoeConfig
from ktransformers.models.configuration_llama import LlamaConfig
from ktransformers.operators.base_operator import BaseInjectedModule
from ktransformers.util.utils import InferenceState, get_compute_capability
from ktransformers.util.custom_loader import GGUFLoader
#from ktransformers.util.utils import InferenceState, get_compute_capability
from ktransformers.util.utils import InferenceState, get_compute_capability, get_device_name
from transformers.configuration_utils import PretrainedConfig
from ktransformers.models.modeling_llama import (
LlamaDecoderLayer,
......
@@ -657,11 +658,15 @@ class KDeepseekV2Model(BaseInjectedModule):
if per_layer_prefill_flag:
causal_mask = None
else:
if (os.name == 'nt'
or get_compute_capability() < 8
#if os.name == 'nt' or get_compute_capability()<8:
# print("for Windows or GPU before ampere, use forward_windows")
if (os.name == 'nt' or get_compute_capability() < 8
or (self.transfer_map is not None and 'cpu' in self.transfer_map.values())
or device_manager.gpu_vendor != GPUVendor.NVIDIA):
# print("for Windows or GPU before ampere, use forward_windows")
or device_manager.gpu_vendor != GPUVendor.NVIDIA
or ("Z100" in get_device_name()) or ("Z100L" in get_device_name()) or ("K100" in get_device_name())):
print("for Windows or GPU before ampere or Z100/Z100L or K100, use forward_windows")
# only use mask in forward windows or can't flash attn
causal_mask = self._update_causal_mask(
attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions
......
- match:
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
    kwargs:
      generate_device: "cuda"
      prefill_device: "cuda"
- match:
    name: "^lm_head$"  # regular expression
    class: torch.nn.Linear  # only match modules matching name and class simultaneously
  replace:
    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
    kwargs:
      generate_device: "cuda"
      prefill_device: "cuda"
      generate_op: "KLinearTorch"
      prefill_op: "KLinearTorch"
- match:
    name: "^model\\.layers\\.(?!.*self_attn\\.kv_b_proj).*$"  # regular expression
    class: torch.nn.Linear  # only match modules matching name and class simultaneously
  replace:
    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
    kwargs:
      generate_device: "cuda"
      prefill_device: "cuda"
      generate_op: "KLinearTorch"
      prefill_op: "KLinearTorch"
- match:
    name: "^model\\.layers\\..*\\.mlp$"
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV3MoE  # mlp module with custom forward function
    kwargs:
      generate_device: "cuda"
      prefill_device: "cuda"
- match:
    class: ktransformers.models.modeling_deepseek_v3.MoEGate
  replace:
    class: ktransformers.operators.gate.KMoEGate
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\..*\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts  # custom MoE Kernel with expert parallelism
    kwargs:
      prefill_device: "cuda"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda"
  recursive: False  # don't recursively inject submodules of this module
- match:
    name: "^model\\.layers\\..*\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention  # optimized MLA implementation
    kwargs:
      generate_device: "cuda"
      prefill_device: "cuda"
      absorb_for_prefill: False  # change this to True to enable long context (prefill may be slower)
- match:
    name: "^model$"
  replace:
    class: "ktransformers.operators.models.KDeepseekV2Model"
    kwargs:
      per_layer_prefill_intput_threshold: 0  # 0 disables layer-wise prefill
- match:
    name: "^model.embed_tokens"
  replace:
    class: "default"
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"
......
@@ -63,6 +63,18 @@ def get_compute_capability(device:torch.device = None):
else:
return 0
def get_device_name(device:torch.device = None):
    if torch.cuda.is_available():
        if device is None:
            num_gpus = torch.cuda.device_count()
            gpu_name = []
            for gpu_id in range(num_gpus):
                gpu_name.append(torch.cuda.get_device_name(gpu_id))
            return gpu_name
        else:
            return torch.cuda.get_device_name(device)
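# Usage sketch (illustrative, not part of this patch):
#   names = get_device_name()     # list of names for all visible devices
#   name0 = get_device_name(0)    # name of one specific device
# The DCU checks added elsewhere in this patch look for entries such as "Z100",
# "Z100L", or "K100" in the returned list.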
def set_module(model, submodule_key, module):
tokens = submodule_key.split('.')
sub_tokens = tokens[:-1]
......
......
@@ -27,7 +27,9 @@ from typing import List, Optional, Literal
import http.client
import urllib.request
import urllib.error
import importlib
from pathlib import Path
from packaging import version
from packaging.version import parse
import torch
import torch.version
......
@@ -595,22 +597,59 @@ class CMakeBuild(BuildExtension):
["cmake", "--build", build_temp, "--verbose", *build_args], cwd=build_temp
)
if CUDA_HOME is not None or ROCM_HOME is not None:
def check_fastpt_version():
    try:
        # Try to import the fastpt module
        fastpt = importlib.import_module('fastpt')
        # Get version number
        fastpt_version = getattr(fastpt, '__version__', None)
        if fastpt_version is None:
            raise ImportError("fastpt module doesn't have __version__ attribute, cannot determine version")
        print(f"Detected fastpt installation, version: {fastpt_version}")
        # Compare version numbers
        if version.parse(fastpt_version) >= version.parse('2.0.1'):
            print("fastpt version ≥ 2.0.1")
            return True
        else:
            print(f"fastpt version {fastpt_version} < 2.0.1")
            return False
    except ImportError as e:
        print(f"Error: fastpt not installed or import failed - {str(e)}")
        raise

try:
    if check_fastpt_version():
        USE_FASTPT_CUDA = os.getenv('USE_FASTPT_CUDA', '0') == '1'
    else:
        USE_FASTPT_CUDA = os.getenv('USE_FASTPT_CUDA', 'False').lower() == 'true'
except Exception as e:
    print(f"Program terminated: {str(e)}")
if CUDA_HOME is not None:
    extra_nvcc_flags = [
        '-O3',
        # '--use_fast_math',
        '-Xcompiler', '-fPIC',
        '-DKTRANSFORMERS_USE_CUDA',
    ]
    if USE_FASTPT_CUDA:
        extra_nvcc_flags.append('-DKTRANSFORMERS_USE_DTK')
ops_module = CUDAExtension('KTransformersOps', [
'csrc/ktransformers_ext/cuda/custom_gguf/dequant.cu',
'csrc/ktransformers_ext/cuda/binding.cpp',
'csrc/ktransformers_ext/cuda/gptq_marlin/gptq_marlin.cu'
],
extra_compile_args={
'cxx': ['-O3', '-DKTRANSFORMERS_USE_CUDA'],
'nvcc': [
'-O3',
# '--use_fast_math',
'-Xcompiler', '-fPIC',
'-DKTRANSFORMERS_USE_CUDA',
]
}
)
'cxx': ['-O3', '-DKTRANSFORMERS_USE_CUDA'],
'nvcc': extra_nvcc_flags
}
)
elif MUSA_HOME is not None:
SimplePorting(cuda_dir_path="csrc/ktransformers_ext/cuda", mapping_rule={
# Common rules
......
@@ -665,9 +704,50 @@ else:
CMakeExtension("cpuinfer_ext", os.fspath(Path("").resolve() / "csrc" / "ktransformers_ext")),
]
ROCM_PATH = os.getenv('ROCM_PATH')
dtk_path = ROCM_PATH + '/.info/rocm_version'
with open(dtk_path, 'r') as file:
    content = file.read().strip()
    dtk_version = content.replace('.', '')
    print(dtk_version)

cwd = os.path.dirname(os.path.abspath(__file__))
ver_path = os.path.join(cwd, "ktransformers", "__init__.py")
with open(ver_path, "r", encoding="utf-8") as file:
    for line in file:
        match = re.search(r'^__version__\s*=\s*["\'](.*?)["\']', line)
        if match:
            k_version = match.group(1)
            break
    else:
        raise RuntimeError("__version__ not found")

with open(ver_path, 'r') as f:
    lines = f.readlines()

# Check whether __hcu_version__ already exists
found = False
new_lines = []
for line in lines:
    if line.startswith("__hcu_version__"):
        # Replace the existing __hcu_version__
        version = k_version + '+das.dtk' + dtk_version
        new_lines.append(f"__hcu_version__ = '{version}'\n")
        found = True
    else:
        new_lines.append(line)

# If __hcu_version__ was not found, append it to the end of the file
if not found:
    version = k_version + '+das.dtk' + dtk_version
    new_lines.append(f"__hcu_version__ = '{version}'\n")

# Write the file back
with open(ver_path, 'w') as f:
    f.writelines(new_lines)
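# Worked example: with __version__ = "0.3" and DTK 25.04 installed, dtk_version becomes
# "2504", so the tag written above and used below is '0.3+das.dtk2504' (matching the
# __hcu_version__ value shown earlier in this diff).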
setup(
name=VersionInfo.PACKAGE_NAME,
version=VersionInfo().get_package_version(),
version=k_version + '+das.dtk' + dtk_version,
install_requires=triton_dep,
cmdclass={"bdist_wheel":BuildWheelsCommand ,"build_ext": CMakeBuild},
ext_modules=ext_modules
......