Unverified commit ca1dc1e7 authored by Atream, committed by GitHub

Merge branch 'main' into main

parents d3b45d57 505f4e2c
...@@ -59,6 +59,7 @@ Supported operators and their corresponding classes are as follows:
| Linear | KTransformersLinear | KLinearMarlin | Marlin as backend |
| | | KLinearTorch | PyTorch as backend |
| | | KLinearCPUInfer | llamafile as backend |
| | | KLinearFP8 | Triton fp8_gemm kernel. Requires a GPU that can compute fp8 data |
| experts | KTransformersExperts | KExpertsTorch | PyTorch as backend |
| | | KExpertsMarlin | Marlin as backend |
| | | KExpertsCPU | llamafile as backend |
......
...@@ -11,31 +11,50 @@ Some preparation:
```sh
# Adding CUDA to PATH
if [ -d "/usr/local/cuda/bin" ]; then
    export PATH=$PATH:/usr/local/cuda/bin
fi

if [ -d "/usr/local/cuda/lib64" ]; then
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
    # Or you can add it to /etc/ld.so.conf and run ldconfig as root:
    # echo "/usr/local/cuda-12.x/lib64" | sudo tee -a /etc/ld.so.conf
    # sudo ldconfig
fi

if [ -d "/usr/local/cuda" ]; then
    export CUDA_PATH=$CUDA_PATH:/usr/local/cuda
fi
```
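A quick way to confirm the toolkit is actually visible to the shell after sourcing the lines above (a minimal sanity check, assuming the toolkit is installed under `/usr/local/cuda`):
```sh
# Sanity-check the CUDA setup before building
nvcc --version                 # should report the installed toolkit release
echo "$CUDA_PATH"              # should contain /usr/local/cuda
ldconfig -p | grep libcudart   # confirms the runtime library is resolvable
```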
- Linux-x86_64 with gcc, g++ and cmake (using Ubuntu as an example)
```sh
sudo apt-get update
sudo apt-get install build-essential cmake ninja-build
```
- We recommend using [Miniconda3](https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh) or [Anaconda3](https://repo.anaconda.com/archive/Anaconda3-2024.10-1-Linux-x86_64.sh) to create a virtual environment with Python=3.11 to run our program. Assuming your Anaconda installation directory is `~/anaconda3`, you should ensure that the version identifiers of the GNU C++ standard library used by Anaconda include `GLIBCXX_3.4.32`.
```sh
conda create --name ktransformers python=3.11
conda activate ktransformers # you may need to run 'conda init' and reopen the shell first
conda install -c conda-forge libstdcxx-ng # conda-forge's `libstdcxx-ng` package provides a newer `libstdc++` than the one bundled with Anaconda
strings ~/anaconda3/envs/ktransformers/lib/libstdc++.so.6 | grep GLIBCXX
```
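A variant of the same check that avoids hard-coding the Anaconda path is to rely on `$CONDA_PREFIX`, which conda sets for the active environment; this is only a convenience sketch:
```sh
# Run inside the activated environment
conda activate ktransformers
strings "$CONDA_PREFIX/lib/libstdc++.so.6" | grep -q GLIBCXX_3.4.32 \
  && echo "libstdc++ is new enough" \
  || echo "install libstdcxx-ng from conda-forge first"
```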
- Make sure that PyTorch, packaging and ninja are installed. You can also [install previous versions of PyTorch](https://pytorch.org/get-started/previous-versions/).
```
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
pip3 install packaging ninja cpufeature numpy
```
- You should also download and install the corresponding version of flash-attention from https://github.com/Dao-AILab/flash-attention/releases.
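A short way to verify the result before building ktransformers (the `cu126` index above is just one option; pick the wheel matching your CUDA toolkit):
```sh
# Confirm the installed torch build can see the GPU
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
# Optionally confirm flash-attention imports cleanly once installed
python -c "import flash_attn; print(flash_attn.__version__)"
```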
## Installation
<!-- 1. ~~Use a Docker image, see [documentation for Docker](./doc/en/Docker.md)~~
...@@ -62,7 +81,7 @@ Some preparation:
git submodule update
```
- [Optional] If you want to run with the website, please [compile the website](./api/server/website.md) before executing ```bash install.sh```
- For Linux
  - For simple install:
...@@ -84,7 +103,7 @@ Some preparation:
install.bat
```
* If you are a developer, you can make use of the Makefile to compile and format the code. <br> The detailed usage of the Makefile is [here](./makefile_usage.md)
<h3>Local Chat</h3>
We provide a simple command-line local chat Python script that you can run for testing.
...@@ -102,7 +121,7 @@ We provide a simple command-line local chat Python script that you can run for testing.
mkdir DeepSeek-V2-Lite-Chat-GGUF
cd DeepSeek-V2-Lite-Chat-GGUF
wget https://huggingface.co/mradermacher/DeepSeek-V2-Lite-GGUF/resolve/main/DeepSeek-V2-Lite.Q4_K_M.gguf -O DeepSeek-V2-Lite-Chat.Q4_K_M.gguf
cd .. # Move to repo's root dir
...@@ -122,7 +141,7 @@ It features the following arguments:
- `--gguf_path` (required): Path of a directory containing GGUF files, which can be downloaded from [Hugging Face](https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/tree/main). Note that the directory should only contain the GGUF files of the current model, which means you need one separate directory for each model.
- `--optimize_config_path` (required except for Qwen2Moe and DeepSeek-V2): Path of the YAML file containing optimize rules. There are two rule files pre-written in the [ktransformers/optimize/optimize_rules](ktransformers/optimize/optimize_rules) directory for optimizing DeepSeek-V2 and Qwen2-57B-A14, two SOTA MoE models.
- `--max_new_tokens`: Int (default=1000). Maximum number of new tokens to generate.
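Putting the arguments together, a typical invocation might look like the following; note that the module entry point `ktransformers.local_chat` and the `--model_path` argument are assumed here for illustration and are not part of the diff above:
```sh
# Hypothetical local chat run against the GGUF directory prepared earlier
python -m ktransformers.local_chat \
  --model_path deepseek-ai/DeepSeek-V2-Lite-Chat \
  --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF \
  --max_new_tokens 1000
```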
...@@ -235,7 +254,7 @@ Be aware that you need to be subject to their corresponding model licenses when
<!-- pin block for jump -->
<span id='id_666'>
<h3>RESTful API and Web UI </h3>
Start without website:
......
...@@ -160,9 +160,14 @@ DeepSeek's MLA operator is compute-intensive. While running it entirely on the CPU is feasible
5. Why Intel CPUs?
Intel is currently the only CPU vendor that supports AMX-like instructions, which deliver significantly better performance than AVX-only alternatives.
## FAQ
### R1 does not return the thinking process
Note! If you are testing R1, it may skip the thinking step. You can therefore add the argument `--force_think true`. Details are in the [FAQ](./FAQ.md) section. <br>
## Issues
* Fix the server integration feature to support network API access
* Fix the issue that local chat only supports single-line prompt input (currently, entering a newline (\n) immediately starts generation)
### More FAQ
[See details](./FAQ.md)
...@@ -2,6 +2,8 @@
set -e
# clear build dirs
rm -rf build
rm -rf *.egg-info
rm -rf ktransformers/ktransformers_ext/build
rm -rf ktransformers/ktransformers_ext/cuda/build
rm -rf ktransformers/ktransformers_ext/cuda/dist
......
...@@ -8,4 +8,4 @@ Version : 1.0.0
LastEditors : chenxl
LastEditTime : 2025-02-15 03:53:02
'''
__version__ = "0.2.2rc1"
\ No newline at end of file \ No newline at end of file
...@@ -30,6 +30,8 @@ if (NOT MSVC)
option(LLAMA_F16C "llama: enable F16C" OFF)
endif()
option(LLAMA_AVX512_FANCY_SIMD "llama: enable AVX512-VL, AVX512-BW, AVX512-DQ, AVX512-VNNI" OFF)
option(KTRANSFORMERS_USE_CUDA "ktransformers: use CUDA" OFF)
option(KTRANSFORMERS_USE_MUSA "ktransformers: use MUSA" OFF)
# Architecture specific
# TODO: probably these flags need to be tweaked on some architectures
...@@ -207,9 +209,33 @@ add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/../../third_party/llama.cpp ${CMAKE
include_directories(${CMAKE_CURRENT_SOURCE_DIR}/../../third_party)
if (WIN32)
include_directories("$ENV{CUDA_PATH}/include")
add_compile_definitions(KTRANSFORMERS_USE_CUDA=1)
elseif (UNIX)
if (KTRANSFORMERS_USE_CUDA)
find_package(CUDA REQUIRED)
include_directories("${CUDA_INCLUDE_DIRS}")
add_compile_definitions(KTRANSFORMERS_USE_CUDA=1)
endif()
if (KTRANSFORMERS_USE_MUSA)
if (NOT EXISTS $ENV{MUSA_PATH})
if (NOT EXISTS /opt/musa)
set(MUSA_PATH /usr/local/musa)
else()
set(MUSA_PATH /opt/musa)
endif()
else()
set(MUSA_PATH $ENV{MUSA_PATH})
endif()
list(APPEND CMAKE_MODULE_PATH "${MUSA_PATH}/cmake")
find_package(MUSAToolkit)
if (MUSAToolkit_FOUND)
message(STATUS "MUSA Toolkit found")
add_compile_definitions(KTRANSFORMERS_USE_MUSA=1)
endif()
endif()
endif()
aux_source_directory(${CMAKE_CURRENT_SOURCE_DIR} SOURCE_DIR1)
...@@ -225,10 +251,15 @@ target_link_libraries(${PROJECT_NAME} PRIVATE llama)
if(WIN32)
target_link_libraries(${PROJECT_NAME} PRIVATE "$ENV{CUDA_PATH}/lib/x64/cudart.lib")#CUDA::cudart
elseif(UNIX)
if(KTRANSFORMERS_USE_CUDA)
if(NOT DEFINED ENV{CUDA_HOME} OR "$ENV{CUDA_HOME}" STREQUAL "")
set(ENV{CUDA_HOME} "/usr/local/cuda")
endif()
target_link_libraries(${PROJECT_NAME} PRIVATE "$ENV{CUDA_HOME}/lib64/libcudart.so")
endif()
if(KTRANSFORMERS_USE_MUSA)
target_link_libraries(${PROJECT_NAME} PRIVATE MUSA::musart)
endif()
endif()
# Define the USE_NUMA option
......
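For anyone configuring the extension by hand rather than through `install.sh`, the new switches can be toggled at configure time. This is a minimal sketch assuming the usual out-of-tree CMake workflow and that the CMakeLists.txt above lives in `ktransformers/ktransformers_ext`:
```sh
# Hand-configure the C++ extension with the new backend options
cmake -S ktransformers/ktransformers_ext -B ktransformers/ktransformers_ext/build \
      -DKTRANSFORMERS_USE_CUDA=ON   # or -DKTRANSFORMERS_USE_MUSA=ON for Moore Threads GPUs
cmake --build ktransformers/ktransformers_ext/build -j
```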
...@@ -54,7 +54,12 @@ void Backend::do_work_stealing_job(int task_num,
init_func_ = init_func;
compute_func_ = compute_func;
finalize_func_ = finalize_func;
#ifdef USE_NUMA
// numa node location will be calculated based on the number of threads
thread_num_ = max_thread_num_;
#else
thread_num_ = std::min(max_thread_num_, task_num);
#endif
int base = task_num / thread_num_;
int remain = task_num % thread_num_;
thread_state_[0].end = base + (0 < remain);
...@@ -146,4 +151,4 @@ void Backend::worker_thread(int thread_id) {
return;
}
}
}
\ No newline at end of file
...@@ -17,7 +17,11 @@
#include <queue>
#include <thread>
#include <vector>
#ifdef KTRANSFORMERS_USE_CUDA
#include "vendors/cuda.h"
#elif KTRANSFORMERS_USE_MUSA
#include "vendors/musa.h"
#endif
#include "backend.h" #include "backend.h"
#include "task_queue.h" #include "task_queue.h"
......
## TODO
This directory can be removed after updating the version of `llama.cpp`.
\ No newline at end of file
#pragma once
#include <cuda_runtime.h>
\ No newline at end of file
#pragma once
#include <musa_runtime.h>
#include <musa_bf16.h>
#define cudaLaunchHostFunc musaLaunchHostFunc
#define cudaStream_t musaStream_t
#define cudaHostFn_t musaHostFn_t
#define nv_bfloat16 mt_bfloat16
\ No newline at end of file
/**
 * @Description :
 * @Author : Azure-Tang, Boxin Zhang
 * @Date : 2024-07-25 13:38:30
 * @Version : 0.2.2
 * @Copyright (c) 2024 by KVCache.AI, All Rights Reserved.
 **/
#include "custom_gguf/ops.h"
#ifdef KTRANSFORMERS_USE_CUDA
#include "gptq_marlin/ops.h" #include "gptq_marlin/ops.h"
#endif
// Python bindings
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
...@@ -19,22 +19,53 @@
// namespace py = pybind11;
PYBIND11_MODULE(KTransformersOps, m) {
m.def("dequantize_q8_0", &dequantize_q8_0, "Function to dequantize q8_0 data.",
py::arg("data"), py::arg("blk_size"), py::arg("device")); m.def("dequantize_q8_0", [](const intptr_t data, int num_bytes, int blk_size, const int ele_per_blk, torch::Device device, py::object target_dtype) {
m.def("dequantize_q6_k", &dequantize_q6_k, "Function to dequantize q6_k data.", torch::Dtype dtype = torch::python::detail::py_object_to_dtype(target_dtype);
py::arg("data"), py::arg("blk_size"), py::arg("device")); return dequantize_q8_0((int8_t*)data, num_bytes, blk_size, ele_per_blk, device, dtype);
m.def("dequantize_q5_k", &dequantize_q5_k, "Function to dequantize q5_k data.", }, "Function to dequantize q8_0 data.",
py::arg("data"), py::arg("blk_size"), py::arg("device")); py::arg("data"), py::arg("num_bytes"), py::arg("blk_size"), py::arg("ele_per_blk"), py::arg("device"), py::arg("target_dtype"));
m.def("dequantize_q4_k", &dequantize_q4_k, "Function to dequantize q4_k data.",
py::arg("data"), py::arg("blk_size"), py::arg("device")); m.def("dequantize_q6_k", [](const intptr_t data, int num_bytes, int blk_size, const int ele_per_blk, torch::Device device, py::object target_dtype) {
m.def("dequantize_q3_k", &dequantize_q3_k, "Function to dequantize q3_k data.", torch::Dtype dtype = torch::python::detail::py_object_to_dtype(target_dtype);
py::arg("data"), py::arg("blk_size"), py::arg("device")); return dequantize_q6_k((int8_t*)data, num_bytes, blk_size, ele_per_blk, device, dtype);
m.def("dequantize_q2_k", &dequantize_q2_k, "Function to dequantize q2_k data.", }, "Function to dequantize q6_k data.",
py::arg("data"), py::arg("blk_size"), py::arg("device")); py::arg("data"), py::arg("num_bytes"), py::arg("blk_size"), py::arg("ele_per_blk"), py::arg("device"), py::arg("target_dtype"));
m.def("dequantize_iq4_xs", &dequantize_iq4_xs, "Function to dequantize iq4_xs data.",
py::arg("data"), py::arg("blk_size"), py::arg("device")); m.def("dequantize_q5_k", [](const intptr_t data, int num_bytes, int blk_size, const int ele_per_blk, torch::Device device, py::object target_dtype) {
m.def("gptq_marlin_gemm", &gptq_marlin_gemm, "Function to perform GEMM using Marlin quantization.", torch::Dtype dtype = torch::python::detail::py_object_to_dtype(target_dtype);
py::arg("a"), py::arg("b_q_weight"), py::arg("b_scales"), py::arg("g_idx"), return dequantize_q5_k((int8_t*)data, num_bytes, blk_size, ele_per_blk, device, dtype);
py::arg("perm"), py::arg("workspace"), py::arg("num_bits"), py::arg("size_m"), }, "Function to dequantize q5_k data.",
py::arg("size_n"), py::arg("size_k"), py::arg("is_k_full")); py::arg("data"), py::arg("num_bytes"), py::arg("blk_size"), py::arg("ele_per_blk"), py::arg("device"), py::arg("target_dtype"));
m.def("dequantize_q4_k", [](const intptr_t data, int num_bytes, int blk_size, const int ele_per_blk, torch::Device device, py::object target_dtype) {
torch::Dtype dtype = torch::python::detail::py_object_to_dtype(target_dtype);
return dequantize_q4_k((int8_t*)data, num_bytes, blk_size, ele_per_blk, device, dtype);
}, "Function to dequantize q4_k data.",
py::arg("data"), py::arg("num_bytes"), py::arg("blk_size"), py::arg("ele_per_blk"), py::arg("device"), py::arg("target_dtype"));
m.def("dequantize_q3_k", [](const intptr_t data, int num_bytes, int blk_size, const int ele_per_blk, torch::Device device, py::object target_dtype) {
torch::Dtype dtype = torch::python::detail::py_object_to_dtype(target_dtype);
return dequantize_q3_k((int8_t*)data, num_bytes, blk_size, ele_per_blk, device, dtype);
}, "Function to dequantize q3_k data.",
py::arg("data"), py::arg("num_bytes"), py::arg("blk_size"), py::arg("ele_per_blk"), py::arg("device"), py::arg("target_dtype"));
m.def("dequantize_q2_k", [](const intptr_t data, int num_bytes, int blk_size, const int ele_per_blk, torch::Device device, py::object target_dtype) {
torch::Dtype dtype = torch::python::detail::py_object_to_dtype(target_dtype);
return dequantize_q2_k((int8_t*)data, num_bytes, blk_size, ele_per_blk, device, dtype);
}, "Function to dequantize q2_k data.",
py::arg("data"), py::arg("num_bytes"), py::arg("blk_size"), py::arg("ele_per_blk"), py::arg("device"), py::arg("target_dtype"));
m.def("dequantize_iq4_xs", [](const intptr_t data, int num_bytes, int blk_size, const int ele_per_blk, torch::Device device, py::object target_dtype) {
torch::Dtype dtype = torch::python::detail::py_object_to_dtype(target_dtype);
return dequantize_iq4_xs((int8_t*)data, num_bytes, blk_size, ele_per_blk, device, dtype);
}, "Function to dequantize iq4_xs data.",
py::arg("data"), py::arg("num_bytes"), py::arg("blk_size"), py::arg("ele_per_blk"), py::arg("device"), py::arg("target_dtype"));
#ifdef KTRANSFORMERS_USE_CUDA
m.def("gptq_marlin_gemm", &gptq_marlin_gemm, "Function to perform GEMM using Marlin quantization.",
py::arg("a"), py::arg("b_q_weight"), py::arg("b_scales"), py::arg("g_idx"),
py::arg("perm"), py::arg("workspace"), py::arg("num_bits"), py::arg("size_m"),
py::arg("size_n"), py::arg("size_k"), py::arg("is_k_full"));
#endif
}
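After rebuilding the extension, one way to confirm that the Python-facing signatures now carry the extra `num_bytes`, `ele_per_blk` and `target_dtype` arguments is to ask pybind11 for the generated docstring (module name taken from the `PYBIND11_MODULE` line above; the module must already be importable):
```sh
# Inspect the rebuilt module's signatures
python -c "import KTransformersOps; help(KTransformersOps.dequantize_q4_k)"
```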
#include "ops.h"
// Python bindings
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include <torch/library.h>
#include <torch/extension.h>
#include <torch/torch.h>
// namespace py = pybind11;
int test(){
return 5;
}
torch::Tensor dequantize_q6_k(torch::Tensor data, int blk_size, torch::Device device);
torch::Tensor dequantize_q5_k(torch::Tensor data, int blk_size, torch::Device device);
torch::Tensor dequantize_q2_k(torch::Tensor data, int blk_size, torch::Device device);
PYBIND11_MODULE(cudaops, m) {
m.def("dequantize_q8_0", &dequantize_q8_0, "Function to dequantize q8_0 data.",
py::arg("data"), py::arg("blk_size"), py::arg("device"));
m.def("dequantize_q6_k", &dequantize_q6_k, "Function to dequantize q6_k data.",
py::arg("data"), py::arg("blk_size"), py::arg("device"));
m.def("dequantize_q5_k", &dequantize_q5_k, "Function to dequantize q5_k data.",
py::arg("data"), py::arg("blk_size"), py::arg("device"));
m.def("dequantize_q4_k", &dequantize_q4_k, "Function to dequantize q4_k data.",
py::arg("data"), py::arg("blk_size"), py::arg("device"));
m.def("dequantize_q3_k", &dequantize_q3_k, "Function to dequantize q3_k data.",
py::arg("data"), py::arg("blk_size"), py::arg("device"));
m.def("dequantize_q2_k", &dequantize_q2_k, "Function to dequantize q2_k data.",
py::arg("data"), py::arg("blk_size"), py::arg("device"));
m.def("dequantize_iq4_xs", &dequantize_iq4_xs, "Function to dequantize iq4_xs data.",
py::arg("data"), py::arg("blk_size"), py::arg("device"));
m.def("test", &test, "Function to test.");
}
/**
 * @Description :
 * @Author : Azure-Tang
 * @Date : 2024-07-22 09:27:55
 * @Version : 1.0.0
 * @LastEditors : kkk1nak0
 * @LastEditTime : 2024-08-12 03:48:46
 * @Copyright (c) 2024 by KVCache.AI, All Rights Reserved.
 **/
#pragma once
...@@ -13,10 +13,10 @@
#include <torch/extension.h>
#include <torch/torch.h>
torch::Tensor dequantize_q8_0(const int8_t* data, const int num_bytes, const int blk_size, const int ele_per_blk, const torch::Device device, const torch::Dtype target_dtype);
torch::Tensor dequantize_q6_k(const int8_t* data, const int num_bytes, const int blk_size, const int ele_per_blk, const torch::Device device, const torch::Dtype target_dtype);
torch::Tensor dequantize_q5_k(const int8_t* data, const int num_bytes, const int blk_size, const int ele_per_blk, const torch::Device device, const torch::Dtype target_dtype);
torch::Tensor dequantize_q4_k(const int8_t* data, const int num_bytes, const int blk_size, const int ele_per_blk, const torch::Device device, const torch::Dtype target_dtype);
torch::Tensor dequantize_q3_k(const int8_t* data, const int num_bytes, const int blk_size, const int ele_per_blk, const torch::Device device, const torch::Dtype target_dtype);
torch::Tensor dequantize_q2_k(const int8_t* data, const int num_bytes, const int blk_size, const int ele_per_blk, const torch::Device device, const torch::Dtype target_dtype);
torch::Tensor dequantize_iq4_xs(const int8_t* data, const int num_bytes, const int blk_size, const int ele_per_blk, const torch::Device device, const torch::Dtype target_dtype);
import os
import sys
sys.path.insert(0,"/home/zbx/ktransformers")
from ktransformers.util.custom_gguf import GGUFLoader
import torch
gguf_loader_1 = GGUFLoader("/mnt/data/model/DeepseekV3-q4km-gguf")
gguf_loader_2 = GGUFLoader("/mnt/data/chenht/model/gguf_for_ktransformers/DeepSeek-V3-bf16/")
torch.set_default_dtype(torch.bfloat16)
tensor_1 = gguf_loader_1.load_gguf_tensor("blk.0.attn_kv_a_mqa.weight", "cuda")
tensor_2 = gguf_loader_2.load_gguf_tensor("blk.0.attn_kv_a_mqa.weight", "cuda")
print(tensor_1[0, -64:])
print(tensor_2[0, -64:])
\ No newline at end of file
...@@ -90,7 +90,7 @@ def marlin_quantize(
assert group_size <= size_k
# Quantize (and apply act_order if provided)
q_w, s, g_idx, rand_perm = quantize_weights(w, num_bits, group_size,
act_order)
# For act_order, sort the "weights" and "g_idx" so that group ids are
...@@ -107,7 +107,7 @@ def marlin_quantize(
marlin_scale_perm_single[num_bits])
# Create result
res_list = [marlin_q_w, marlin_s, g_idx, sort_indices, rand_perm]
for i in range(len(res_list)):
res_list[i] = res_list[i].to(w.device)
......
...@@ -11,8 +11,7 @@ def get_pack_factor(num_bits):
return 32 // num_bits
def permute_rows(q_w: torch.Tensor, group_size: int):
orig_device = q_w.device
k_size, _ = q_w.shape
...@@ -26,10 +25,8 @@ def permute_rows(q_w: torch.Tensor, w_ref: torch.Tensor, group_size: int):
g_idx = g_idx[rand_perm].contiguous()
q_w = q_w[rand_perm, :].contiguous()
return (
q_w.to(device=orig_device),
g_idx.to(device=orig_device),
rand_perm.to(device=orig_device),
...@@ -69,9 +66,6 @@ def quantize_weights(w: torch.Tensor, num_bits: int, group_size: int,
q_w += half_q_val
q_w = torch.clamp(q_w, 0, max_q_val)
# Restore original shapes
if group_size < size_k:
...@@ -82,7 +76,6 @@ def quantize_weights(w: torch.Tensor, num_bits: int, group_size: int,
return w
q_w = reshape_w(q_w)
s = s.reshape((-1, size_n)).contiguous()
...@@ -95,10 +88,9 @@ def quantize_weights(w: torch.Tensor, num_bits: int, group_size: int,
), "For act_order, groupsize = {} must be less than size_k = {}".format(
group_size, size_k)
q_w, g_idx, rand_perm = permute_rows(q_w, group_size)
return (
q_w.to(device=orig_device),
s.to(device=orig_device),
g_idx.to(device=orig_device),
......
...@@ -10,6 +10,8 @@
#include "kvcache.h"
#include <chrono>
void KVCache::attention_kvhead_(const uint16_t *q_in_data, ggml_fp16_t *output,
float *attn_lse, int batch_size,
Backend *backend) {
......
...@@ -9,6 +9,9 @@
**/
#include "kvcache.h"
#include <chrono>
void KVCache::load_kvcache(std::string tensor_file_path, Backend *backend) {
// Timer start
auto start = std::chrono::high_resolution_clock::now();
......