Commit 0bc665e9 authored by fengchao

Merge dcu patch to 0.3.1

parents 32f3d7be b8596252
File added
<div align="center">
<!-- <h1>KTransformers</h1> -->
<p align="center">
# <div align="center"><strong>KTransformers</strong></div>
## Introduction
🤗 Transformers provides thousands of pretrained models for text classification, information extraction, question answering, summarization, translation, and text generation in more than 100 languages. Its aim is to make state-of-the-art NLP easy for everyone to use.
<picture>
<img alt="KTransformers" src="https://github.com/user-attachments/assets/d5a2492f-a415-4456-af99-4ab102f13f8b" width=50%>
🤗 Transformers provides APIs to quickly download and use pretrained models on a given text, fine-tune them on your own datasets, and share them with the community via the [model hub](https://huggingface.co/models). At the same time, each defined Python module is fully standalone, making it easy to modify and to run quick research experiments.
</picture>
🤗 Transformers is backed by the three most popular deep learning libraries, [Jax](https://jax.readthedocs.io/en/latest/), [PyTorch](https://pytorch.org/), and [TensorFlow](https://www.tensorflow.org/), with seamless integration between them. You can train your model with one framework and then load it for inference with another.
</p>
<h3>A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations</h3>
<strong><a href="#show-cases">🌟 Show Cases</a> | <a href="#quick-start">🚀 Quick Start</a> | <a href="#tutorial">📃 Tutorial</a> | <a href="https://github.com/kvcache-ai/ktransformers/discussions">💬 Discussion </a>|<a href="#FAQ"> 🙋 FAQ</a> </strong>
</div>
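The 🤗 Transformers description above amounts to a few lines of code in practice; a minimal sketch using the standard `pipeline` API (the task and input text are illustrative):

```python
from transformers import pipeline

# Download a pretrained model and run it on a given text.
classifier = pipeline("sentiment-analysis")
print(classifier("KTransformers makes local LLM inference pleasantly fast."))
```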
## Installation
Supported component combinations
<h2 id="intro">🎉 Introduction</h2>
KTransformers, pronounced as Quick Transformers, is designed to enhance your 🤗 <a href="https://github.com/huggingface/transformers">Transformers</a> experience with advanced kernel optimizations and placement/parallelism strategies.
<br/><br/>
KTransformers is a flexible, Python-centric framework designed with extensibility at its core.
By implementing and injecting an optimized module with a single line of code, users gain access to a Transformers-compatible
interface, RESTful APIs compliant with OpenAI and Ollama, and even a simplified ChatGPT-like web UI.
<br/><br/>
Our vision for KTransformers is to serve as a flexible platform for experimenting with innovative LLM inference optimizations. Please let us know if you need any other features.
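For example, because the RESTful server follows the OpenAI API shape, a stock OpenAI client can talk to it directly. The sketch below assumes a locally running server; the base URL, port, and model name are placeholders rather than documented defaults:

```python
from openai import OpenAI

# Point a standard OpenAI client at a local KTransformers server
# (adjust base_url and model to match your own deployment).
client = OpenAI(base_url="http://localhost:10002/v1", api_key="not-needed")
reply = client.chat.completions.create(
    model="DeepSeek-V3",
    messages=[{"role": "user", "content": "Hello from KTransformers!"}],
)
print(reply.choices[0].message.content)
```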
| PyTorch version | fastpt version | KTransformers version | DTK version | Python version | Recommended build method |
| --------------- | -------------- | --------------------- | ----------- | -------------- | ------------------------ |
| 2.4.1 | 2.0.1 | 0.2.3 | >= 25.04 | 3.8, 3.10, 3.11 | fastpt without transcoding |
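A quick way to confirm that an installed environment matches a row of the table above (a sketch; it only assumes the three packages are importable and expose `__version__`):

```python
import torch
import ktransformers

print("PyTorch       :", torch.__version__)
print("KTransformers :", ktransformers.__version__)
try:
    import fastpt
    print("fastpt        :", getattr(fastpt, "__version__", "unknown"))
except ImportError:
    print("fastpt is not installed")
```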
<h2 id="Updates">🔥 Updates</h2>
+ For PyTorch versions above 2.4.1 with DTK versions above 25.04, the fastpt no-transcoding build is recommended.
* **May 14, 2025**: Support Intel Arc GPU ([Tutorial](./doc/en/xpu.md)).
* **Apr 29, 2025**: Support AMX-Int8, AMX-BF16, and Qwen3MoE ([Tutorial](./doc/en/AMX.md)).
https://github.com/user-attachments/assets/fafe8aec-4e22-49a8-8553-59fb5c6b00a2
* **Apr 9, 2025**: Experimental support for LLaMA 4 models ([Tutorial](./doc/en/llama4.md)).
* **Apr 2, 2025**: Support Multi-concurrency. ([Tutorial](./doc/en/balance-serve.md)).
https://github.com/user-attachments/assets/faa3bda2-928b-45a7-b44f-21e12ec84b8a
* **Mar 15, 2025**: Support ROCm on AMD GPU ([Tutorial](./doc/en/ROCm.md)).
* **Mar 5, 2025**: Support unsloth 1.58/2.51 bits weights and [IQ1_S/FP8 hybrid](./doc/en/fp8_kernel.md) weights. Support 139K [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022--v023-longer-context--fp8-kernel) for DeepSeek-V3 and R1 in 24GB VRAM.
* **Feb 25, 2025**: Support [FP8 GPU kernel](./doc/en/fp8_kernel.md) for DeepSeek-V3 and R1; [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022-longer-context).
* **Feb 15, 2025**: Longer Context (from 4K to 8K for 24GB VRAM) & Slightly Faster Speed (+15%, up to 16 Tokens/s), update [docs](./doc/en/DeepseekR1_V3_tutorial.md) and [online books](https://kvcache-ai.github.io/ktransformers/).
* **Feb 10, 2025**: Support Deepseek-R1 and V3 on single (24GB VRAM)/multi gpu and 382G DRAM, up to 3~28x speedup. For detailed show case and reproduction tutorial, see [here](./doc/en/DeepseekR1_V3_tutorial.md).
* **Aug 28, 2024**: Decrease DeepseekV2's required VRAM from 21G to 11G.
* **Aug 15, 2024**: Update detailed [tutorial](doc/en/injection_tutorial.md) for injection and multi-GPU.
* **Aug 14, 2024**: Support llamafile as a linear backend.
* **Aug 12, 2024**: Support multiple GPUs; support new models: Mixtral 8\*7B and 8\*22B; support q2k, q3k, q5k dequantization on GPU.
* **Aug 9, 2024**: Support Windows natively.
<!-- * **Aug 28, 2024**: Support 1M context under the InternLM2.5-7B-Chat-1M model, utilizing 24GB of VRAM and 150GB of DRAM. The detailed tutorial is [here](./doc/en/long_context_tutorial.md). -->
<h2 id="show-cases">🌟 Show Cases</h2>
<div>
<h3>GPT-4/o1-level Local VSCode Copilot on a Desktop with only 24GB VRAM</h3>
</div>
https://github.com/user-attachments/assets/ebd70bfa-b2c1-4abb-ae3b-296ed38aa285
</p>
- **[NEW!!!] Local 671B DeepSeek-Coder-V3/R1:** Running its Q4_K_M version using only 14GB VRAM and 382GB DRAM ([Tutorial](./doc/en/DeepseekR1_V3_tutorial.md)).
- Prefill Speed (tokens/s):
- KTransformers: 54.21 (32 cores) → 74.362 (dual-socket, 2×32 cores) → 255.26 (optimized AMX-based MoE kernel, V0.3 only) → 286.55 (selectively using 6 experts, V0.3 only)
- Compared to 10.31 tokens/s in llama.cpp with 2×32 cores, achieving up to **27.79× speedup**.
- Decode Speed (tokens/s):
- KTransformers: 8.73 (32 cores) → 11.26 (dual-socket, 2×32 cores) → 13.69 (selectively using 6 experts, V0.3 only)
- Compared to 4.51 tokens/s in llama.cpp with 2×32 cores, achieving up to **3.03× speedup**.
- Upcoming Open Source Release:
- AMX optimizations and selective expert activation will be open-sourced in V0.3.
- Currently available only in preview binary distribution, which can be downloaded [here](./doc/en/DeepseekR1_V3_tutorial.md).
- **Local 236B DeepSeek-Coder-V2:** Running its Q4_K_M version using only 21GB VRAM and 136GB DRAM, attainable on a local desktop machine, which scores even better than GPT4-0613 in [BigCodeBench](https://huggingface.co/blog/leaderboard-bigcodebench).
<p align="center">
<picture>
<img alt="DeepSeek-Coder-V2 Score" src="https://github.com/user-attachments/assets/d052924e-8631-44de-aad2-97c54b965693" width=100%>
</picture>
</p>
- **Faster Speed:** Achieving 126 tokens/s for 2K prompt prefill and 13.6 tokens/s for generation through MoE offloading and injecting advanced kernels from [Llamafile](https://github.com/Mozilla-Ocho/llamafile/tree/main) and [Marlin](https://github.com/IST-DASLab/marlin).
- **VSCode Integration:** Wrapped into an OpenAI and Ollama compatible API for seamless integration as a backend for [Tabby](https://github.com/TabbyML/tabby) and various other frontends.
<p align="center">
https://github.com/user-attachments/assets/4c6a8a38-05aa-497d-8eb1-3a5b3918429c
</p>
<!-- <h3>1M Context Local Inference on a Desktop with Only 24GB VRAM</h3>
<p align="center">
https://github.com/user-attachments/assets/a865e5e4-bca3-401e-94b8-af3c080e6c12
* **1M Context InternLM 2.5 7B**: Operates at full bf16 precision, utilizing 24GB VRAM and 150GB DRAM, which is feasible on a local desktop setup. It achieves a 92.88% success rate on the 1M "Needle In a Haystack" test and 100% on the 128K NIAH test.
<p align="center">
<picture>
<img alt="Single Needle Retrieval 128K" src="./doc/assets/needle_128K.png" width=100%>
</picture>
</p>
<p align="center">
<picture>
<img alt="Single Needle Retrieval 1000K" src="./doc/assets/needle_1M.png" width=100%>
</picture>
</p>
* **Enhanced Speed**: Reaches 16.91 tokens/s for generation with a 1M context using sparse attention, powered by llamafile kernels. This method is over 10 times faster than full attention approach of llama.cpp.
* **Flexible Sparse Attention Framework**: Offers a flexible block sparse attention framework for CPU offloaded decoding. Compatible with SnapKV, Quest, and InfLLm. Further information is available [here](./doc/en/long_context_introduction.md).
-->
<strong>More advanced features are coming soon, so stay tuned!</strong>
<h2 id="quick-start">🚀 Quick Start</h2>
Getting started with KTransformers is simple! Follow the steps below to set up and start using it.
### 📥 Installation
To install KTransformers, follow the official [Installation Guide](https://kvcache-ai.github.io/ktransformers/en/install.html).
<h2 id="tutorial">📃 Brief Injection Tutorial</h2>
At the heart of KTransformers is a user-friendly, template-based injection framework.
This allows researchers to easily replace original torch modules with optimized variants. It also simplifies the process of combining multiple optimizations, allowing the exploration of their synergistic effects.
</br>
<p align="center">
<picture>
<img alt="Inject-Struction" src="https://github.com/user-attachments/assets/6b4c1e54-9f6d-45c5-a3fc-8fa45e7d257e" width=65%>
</picture>
</p>
Given that vLLM already serves as a great framework for large-scale deployment optimizations, KTransformers is particularly focused on local deployments that are constrained by limited resources. We pay special attention to heterogeneous computing opportunities, such as GPU/CPU offloading of quantized models. For example, we support the efficient <a href="https://github.com/Mozilla-Ocho/llamafile/tree/main">Llamafile</a> and <a href="https://github.com/IST-DASLab/marlin">Marlin</a> kernels for CPU and GPU, respectively. More details can be found <a href="doc/en/operators/llamafile.md">here</a>.
<h3>Example Usage</h3>
To utilize the provided kernels, users only need to create a YAML-based injection template and add the call to `optimize_and_load_gguf` before using the Transformers model.
```python
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
optimize_and_load_gguf(model, optimize_config_path, gguf_path, config)
...
generated = prefill_and_generate(model, tokenizer, input_tensor.cuda(), max_new_tokens=1000)
```
### 1. Installing with pip
KTransformers wheel download directory: [光合开发者社区](https://download.sourcefind.cn:65024/4/main). Choose the KTransformers wheel that matches your PyTorch and Python versions.
```shell
pip install torch*  # the downloaded torch wheel
pip install fastpt* --no-deps  # the downloaded fastpt wheel
source /usr/local/bin/fastpt -E
pip install ktransformers* --no-deps  # the downloaded ktransformers-fastpt wheel
```
### 2. Installing from source
In this example, the AutoModel is first initialized on the meta device to avoid occupying any memory resources. Then, `optimize_and_load_gguf` iterates through all sub-modules of the model, matches rules specified in your YAML rule file, and replaces them with advanced modules as specified.
After injection, the original `generate` interface is available, but we also provide a compatible `prefill_and_generate` method, which enables further optimizations like CUDAGraph to improve generation speed.
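In other words, the injected model still behaves like a regular 🤗 Transformers model. A minimal sketch of falling back to the stock `generate` call (it reuses the `model` and `tokenizer` from the snippet above; the prompt is illustrative):

```python
# The injected model keeps the standard Transformers interface,
# so the usual generate() path works alongside prefill_and_generate().
inputs = tokenizer("Write a quicksort function in Python.", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```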
#### Preparing the build environment
A fastpt no-transcoding build is provided:
<h3>How to customize your model</h3>
1. Using the 光源 PyTorch base image: image download address: [光合开发者社区](https://sourcefind.cn/#/image/dcu/pytorch); download the image that matches your PyTorch, Python, DTK, and OS versions.
A detailed tutorial of the injection and multi-GPU using DeepSeek-V2 as an example is given [here](doc/en/injection_tutorial.md).
Below is an example of a YAML template for replacing all original Linear modules with Marlin, an advanced 4-bit quantization kernel.
```yaml
- match:
    name: "^model\\.layers\\..*$"  # regular expression
    class: torch.nn.Linear  # only match modules matching name and class simultaneously
  replace:
    class: ktransformers.operators.linear.KTransformerLinear  # optimized Kernel on quantized data types
    device: "cpu"  # which devices to load this module when initializing
    kwargs:
      generate_device: "cuda"
      generate_linear_type: "QuantizedLinearMarlin"
```
2. Using an existing Python environment: install the PyTorch and fastpt wheels. Wheel download directory: [光合开发者社区](https://sourcefind.cn/#/image/dcu/pytorch); download the PyTorch wheel that matches your Python and DTK versions. Install with the following commands:
```shell
pip install cpufeature
pip install torch*  # the downloaded torch wheel
pip install fastpt* --no-deps  # the downloaded fastpt wheel; install torch first, then fastpt
pip install setuptools==59.5.0 wheel
```
Each rule in the YAML file has two parts: `match` and `replace`. The `match` part specifies which module should be replaced, and the `replace` part specifies the module to be injected into the model along with the initialization keywords.
You can find example rule templates for optimizing DeepSeek-V2 and Qwen2-57B-A14, two SOTA MoE models, in the [ktransformers/optimize/optimize_rules](ktransformers/optimize/optimize_rules) directory. These templates are used to power the `local_chat.py` demo.
If you are interested in our design principles and the implementation of the injection framework, please refer to the [design document](doc/en/deepseek-v2-injection.md).
<h2 id="ack">Acknowledgment and Contributors</h2>
The development of KTransformers is based on the flexible and versatile framework provided by Transformers. We also benefit from advanced kernels such as GGUF/GGML, Llamafile, Marlin, sglang and flashinfer. We are planning to contribute back to the community by upstreaming our modifications.
KTransformers is actively maintained and developed by contributors from the <a href="https://madsys.cs.tsinghua.edu.cn/">MADSys group</a> at Tsinghua University and members from <a href="http://approaching.ai/">Approaching.AI</a>. We welcome new contributors to join us in making KTransformers faster and easier to use.
#### Building and installing from source
- Download the code
```shell
git clone https://developer.sourcefind.cn/codes/OpenDAS/ktransformers.git  # switch branches as needed for your build
```
- Two source build options are provided (run the following inside the ktransformers directory):
```
1. Set the environment variable for the no-transcoding build
source /usr/local/bin/fastpt -C
2. Build the wheel package and install it
bash install_dcu.sh
pip3 install dist/ktransformers*.whl --no-deps
```
#### Notes
+ If pip install downloads are slow, add the Tsinghua PyPI mirror: -i https://pypi.tuna.tsinghua.edu.cn/simple/
+ ROCM_PATH is the DTK installation path; it defaults to /opt/dtk
<h2 id="ack">Discussion</h2>
If you have any questions, feel free to open an issue. Alternatively, you can join our WeChat group for further discussion. QR Code: [WeChat Group](WeChatGroup.png)
## Verification
```
python3
Python 3.10.12 (main, Feb 4 2025, 14:57:36) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ktransformers
>>> ktransformers.__version__
'0.2.3post1'
>>>
```
The version number is kept in sync with the official release; querying the package version returns, for example, 0.2.3post1.
<h2 id="FAQ">🙋 FAQ</h2>
Some common questions are answered in the [FAQ](doc/en/FAQ.md).
## Known Issues
-
## References
- [README_ORIGIN](README_ORIGIN.md)
- [README_zh-CN](README_zh-CN.md)
- [https://github.com/kvcache-ai/ktransformers](https://github.com/kvcache-ai/ktransformers)
......
@@ -60,7 +60,7 @@ class ScalarType<nv_bfloat16> {
using FragC = Vec<float, 4>;
using FragS = Vec<nv_bfloat162, 1>;
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800 || KTRANSFORMERS_USE_DTK
static __device__ float inline num2float(const nv_bfloat16 x) {
return __bfloat162float(x);
}
......
#!/bin/bash
set -e
# Clean the build directories and old distribution files
rm -rf build
rm -rf dist
rm -rf *.egg-info
rm -rf ktransformers/ktransformers_ext/build
rm -rf ktransformers/ktransformers_ext/cuda/build
rm -rf ktransformers/ktransformers_ext/cuda/dist
rm -rf ktransformers/ktransformers_ext/cuda/*.egg-info
echo "初始化Git子模块..."
git submodule update --init --recursive
export CMAKE_BUILD_PARALLEL_LEVEL=32
echo "构建ktransformers wheel包"
mkdir -p dist
KTRANSFORMERS_FORCE_BUILD=TRUE pip wheel . -w dist --no-build-isolation --no-deps
echo "生成的wheel包位于:"
ls -l dist/*.whl
echo "构建成功!wheel包已生成在dist目录"
#!/usr/bin/env python
# coding=utf-8
'''
Description :
Author : kkk1nak0
Date : 2024-08-15 07:34:46
Version : 1.0.0
LastEditors : chenxl
LastEditTime : 2025-02-15 03:53:02
'''
__version__ = "0.3"
__hcu_version__ = '0.3+das.dtk2504'
......
@@ -28,7 +28,8 @@ from ktransformers.models.modeling_qwen2_moe import Qwen2MoeForCausalLM
from ktransformers.models.modeling_deepseek_v3 import DeepseekV3ForCausalLM
from ktransformers.models.modeling_llama import LlamaForCausalLM
from ktransformers.models.modeling_mixtral import MixtralForCausalLM
from ktransformers.util.utils import prefill_and_generate, get_compute_capability
#from ktransformers.util.utils import prefill_and_generate, get_compute_capability
from ktransformers.util.utils import prefill_and_generate, get_compute_capability, get_device_name
from ktransformers.server.config.config import Config
from ktransformers.operators.flashinfer_wrapper import flashinfer_enabled
from ktransformers.util.vendors import device_manager, get_device, to_device, GPUVendor
......
@@ -175,7 +176,7 @@ def local_chat(
assert Config().long_context_config['max_seq_len'] > input_tensor.shape[1] + max_new_tokens, \
"please change max_seq_len in ~/.ktransformers/config.yaml"
if system != "Windows" and (config.architectures[0] == "DeepseekV2ForCausalLM" or config.architectures[0] == "DeepseekV3ForCausalLM") and flashinfer_enabled and get_compute_capability() >= 8 and device_manager.gpu_vendor == GPUVendor.NVIDIA:
if system != "Windows" and (config.architectures[0] == "DeepseekV2ForCausalLM" or config.architectures[0] == "DeepseekV3ForCausalLM") and flashinfer_enabled and get_compute_capability() >= 8 and device_manager.gpu_vendor == GPUVendor.NVIDIA or ("Z100" in get_device_name()) or ("Z100L" in get_device_name()) or ("K100" in get_device_name()):
generated = prefill_and_generate(
model, tokenizer, input_tensor.to(device), max_new_tokens, use_cuda_graph, mode = mode, force_think = force_think, chunk_size = chunk_size,
use_flashinfer_mla = True, num_heads = config.num_attention_heads, head_dim_ckv = config.kv_lora_rank, head_dim_kpe = config.qk_rope_head_dim, q_head_dim = config.qk_rope_head_dim + config.qk_nope_head_dim
......
......
@@ -16,7 +16,8 @@ from ktransformers.models.modeling_deepseek import DeepseekV2Attention, apply_ro
from typing import Optional, Tuple
from ktransformers.operators.base_operator import BaseInjectedModule
from ktransformers.util.custom_loader import GGUFLoader
from ktransformers.util.utils import get_compute_capability
#from ktransformers.util.utils import get_compute_capability
from ktransformers.util.utils import get_compute_capability, get_device_name
import logging
from transformers.configuration_utils import PretrainedConfig
from transformers.cache_utils import Cache
......
@@ -703,10 +704,12 @@ class KDeepseekV2Attention(BaseInjectedModule, DeepseekV2Attention):
cache_position,
**kwargs,
)
elif (os.name == 'nt'
or get_compute_capability() < 8
elif (os.name == 'nt' or get_compute_capability() < 8
or hidden_states.device.type == 'cpu'
or device_manager.gpu_vendor != GPUVendor.NVIDIA):
or device_manager.gpu_vendor != GPUVendor.NVIDIA
or ("Z100" in get_device_name())
or ("Z100L" in get_device_name()) or ("K100" in get_device_name())):
print("for Windows or GPU before ampere or Z100/Z100L or K100, use forward_windows")
return self.forward_windows(
hidden_states,
attention_mask,
......
......
@@ -57,8 +57,9 @@ from ktransformers.util.vendors import device_manager, get_device, to_device, GP
from transformers.models.qwen2_moe.configuration_qwen2_moe import Qwen2MoeConfig
from ktransformers.models.configuration_llama import LlamaConfig
from ktransformers.operators.base_operator import BaseInjectedModule
from ktransformers.util.utils import InferenceState, get_compute_capability
from ktransformers.util.custom_loader import GGUFLoader
#from ktransformers.util.utils import InferenceState, get_compute_capability
from ktransformers.util.utils import InferenceState, get_compute_capability, get_device_name
from transformers.configuration_utils import PretrainedConfig
from ktransformers.models.modeling_llama import (
LlamaDecoderLayer,
......
@@ -657,11 +658,15 @@ class KDeepseekV2Model(BaseInjectedModule):
if per_layer_prefill_flag:
causal_mask = None
else:
if (os.name == 'nt'
or get_compute_capability() < 8
#if os.name == 'nt' or get_compute_capability()<8:
# print("for Windows or GPU before ampere, use forward_windows")
if (os.name == 'nt' or get_compute_capability() < 8
or (self.transfer_map is not None and 'cpu' in self.transfer_map.values())
or device_manager.gpu_vendor != GPUVendor.NVIDIA):
# print("for Windows or GPU before ampere, use forward_windows")
or device_manager.gpu_vendor != GPUVendor.NVIDIA
or ("Z100" in get_device_name()) or ("Z100L" in get_device_name()) or ("K100" in get_device_name())):
print("for Windows or GPU before ampere or Z100/Z100L or K100, use forward_windows")
# only use mask in forward windows or can't flash attn
causal_mask = self._update_causal_mask(
attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions
......
- match:
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
    kwargs:
      generate_device: "cuda"
      prefill_device: "cuda"
- match:
    name: "^lm_head$"  # regular expression
    class: torch.nn.Linear  # only match modules matching name and class simultaneously
  replace:
    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
    kwargs:
      generate_device: "cuda"
      prefill_device: "cuda"
      generate_op: "KLinearTorch"
      prefill_op: "KLinearTorch"
- match:
    name: "^model\\.layers\\.(?!.*self_attn\\.kv_b_proj).*$"  # regular expression
    class: torch.nn.Linear  # only match modules matching name and class simultaneously
  replace:
    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
    kwargs:
      generate_device: "cuda"
      prefill_device: "cuda"
      generate_op: "KLinearTorch"
      prefill_op: "KLinearTorch"
- match:
    name: "^model\\.layers\\..*\\.mlp$"
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV3MoE  # mlp module with custom forward function
    kwargs:
      generate_device: "cuda"
      prefill_device: "cuda"
- match:
    class: ktransformers.models.modeling_deepseek_v3.MoEGate
  replace:
    class: ktransformers.operators.gate.KMoEGate
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\..*\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts  # custom MoE Kernel with expert parallelism
    kwargs:
      prefill_device: "cuda"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda"
  recursive: False  # don't recursively inject submodules of this module
- match:
    name: "^model\\.layers\\..*\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention  # optimized MLA implementation
    kwargs:
      generate_device: "cuda"
      prefill_device: "cuda"
      absorb_for_prefill: False  # change this to True to enable long context (prefill may be slower)
- match:
    name: "^model$"
  replace:
    class: "ktransformers.operators.models.KDeepseekV2Model"
    kwargs:
      per_layer_prefill_intput_threshold: 0  # 0 disables layer-wise prefill
- match:
    name: "^model.embed_tokens"
  replace:
    class: "default"
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"
......
@@ -63,6 +63,18 @@ def get_compute_capability(device:torch.device = None):
else:
return 0
def get_device_name(device:torch.device = None):
    if torch.cuda.is_available():
        if device is None:
            num_gpus = torch.cuda.device_count()
            gpu_name = []
            for gpu_id in range(num_gpus):
                gpu_name.append(torch.cuda.get_device_name(gpu_id))
            return gpu_name
        else:
            return torch.cuda.get_device_name(device)
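# Usage sketch (illustrative, not part of this patch):
#   names = get_device_name()     # list of names for all visible devices
#   name0 = get_device_name(0)    # name of one specific device
# The DCU checks added elsewhere in this patch look for entries such as "Z100",
# "Z100L", or "K100" in the returned list.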
def set_module(model, submodule_key, module):
tokens = submodule_key.split('.')
sub_tokens = tokens[:-1]
......
......
@@ -27,7 +27,9 @@ from typing import List, Optional, Literal
import http.client
import urllib.request
import urllib.error
import importlib
from pathlib import Path
from packaging import version
from packaging.version import parse
import torch
import torch.version
......
@@ -595,22 +597,59 @@ class CMakeBuild(BuildExtension):
["cmake", "--build", build_temp, "--verbose", *build_args], cwd=build_temp
)
if CUDA_HOME is not None or ROCM_HOME is not None:
def check_fastpt_version():
    try:
        # Try to import the fastpt module
        fastpt = importlib.import_module('fastpt')
        # Get version number
        fastpt_version = getattr(fastpt, '__version__', None)
        if fastpt_version is None:
            raise ImportError("fastpt module doesn't have __version__ attribute, cannot determine version")
        print(f"Detected fastpt installation, version: {fastpt_version}")
        # Compare version numbers
        if version.parse(fastpt_version) >= version.parse('2.0.1'):
            print("fastpt version ≥ 2.0.1")
            return True
        else:
            print(f"fastpt version {fastpt_version} < 2.0.1")
            return False
    except ImportError as e:
        print(f"Error: fastpt not installed or import failed - {str(e)}")
        raise

try:
    if check_fastpt_version():
        USE_FASTPT_CUDA = os.getenv('USE_FASTPT_CUDA', '0') == '1'
    else:
        USE_FASTPT_CUDA = os.getenv('USE_FASTPT_CUDA', 'False').lower() == 'true'
except Exception as e:
    print(f"Program terminated: {str(e)}")
if CUDA_HOME is not None:
    extra_nvcc_flags = [
        '-O3',
        # '--use_fast_math',
        '-Xcompiler', '-fPIC',
        '-DKTRANSFORMERS_USE_CUDA',
    ]
    if USE_FASTPT_CUDA:
        extra_nvcc_flags.append('-DKTRANSFORMERS_USE_DTK')
ops_module = CUDAExtension('KTransformersOps', [
'csrc/ktransformers_ext/cuda/custom_gguf/dequant.cu',
'csrc/ktransformers_ext/cuda/binding.cpp',
'csrc/ktransformers_ext/cuda/gptq_marlin/gptq_marlin.cu'
],
extra_compile_args={
'cxx': ['-O3', '-DKTRANSFORMERS_USE_CUDA'],
'nvcc': [
'-O3',
# '--use_fast_math',
'-Xcompiler', '-fPIC',
'-DKTRANSFORMERS_USE_CUDA',
]
}
)
'cxx': ['-O3', '-DKTRANSFORMERS_USE_CUDA'],
'nvcc': extra_nvcc_flags
}
)
elif MUSA_HOME is not None:
SimplePorting(cuda_dir_path="csrc/ktransformers_ext/cuda", mapping_rule={
# Common rules
......
@@ -665,9 +704,50 @@ else:
CMakeExtension("cpuinfer_ext", os.fspath(Path("").resolve() / "csrc" / "ktransformers_ext")),
]
ROCM_PATH = os.getenv('ROCM_PATH')
dtk_path = ROCM_PATH + '/.info/rocm_version'
with open(dtk_path, 'r') as file:
    content = file.read().strip()
    dtk_version = content.replace('.', '')
    print(dtk_version)

cwd = os.path.dirname(os.path.abspath(__file__))
ver_path = os.path.join(cwd, "ktransformers", "__init__.py")
with open(ver_path, "r", encoding="utf-8") as file:
    for line in file:
        match = re.search(r'^__version__\s*=\s*["\'](.*?)["\']', line)
        if match:
            k_version = match.group(1)
            break
    else:
        raise RuntimeError("__version__ not found")

with open(ver_path, 'r') as f:
    lines = f.readlines()

# Check whether __hcu_version__ already exists
found = False
new_lines = []
for line in lines:
    if line.startswith("__hcu_version__"):
        # Replace the existing __hcu_version__
        version = k_version + '+das.dtk' + dtk_version
        new_lines.append(f"__hcu_version__ = '{version}'\n")
        found = True
    else:
        new_lines.append(line)

# If __hcu_version__ was not found, append it to the end of the file
if not found:
    version = k_version + '+das.dtk' + dtk_version
    new_lines.append(f"__hcu_version__ = '{version}'\n")

# Write the file back
with open(ver_path, 'w') as f:
    f.writelines(new_lines)
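# Worked example: with __version__ = "0.3" and DTK 25.04 installed, dtk_version becomes
# "2504", so the tag written above and used below is '0.3+das.dtk2504' (matching the
# __hcu_version__ value shown earlier in this diff).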
setup(
name=VersionInfo.PACKAGE_NAME,
version=VersionInfo().get_package_version(),
version=k_version + '+das.dtk' + dtk_version,
install_requires=triton_dep,
cmdclass={"bdist_wheel":BuildWheelsCommand ,"build_ext": CMakeBuild},
ext_modules=ext_modules
......