# Llamafile Operators Documentation
## Llamafile Sgemm
The Llamafile Sgemm module is an efficient implementation of general matrix multiplication (GEMM) extracted from the great [Llamafile project](https://github.com/Mozilla-Ocho/llamafile/blob/main/llamafile/sgemm.cpp).
This module optimizes performance by utilizing various processor-specific instruction sets. For instance, it checks for different x86 instruction sets such as AVX, FMA, and AVX512, leveraging these advanced instructions to accelerate computation.
Additionally, the Llamafile Sgemm module supports multiple quantization types, including q8_0, q6_k, and q5_k, among others. This adaptability to different hardware capabilities ensures the most advanced instructions are used in any given computing environment, achieving high computational efficiency. For more information, you can view the [Llamafile Sgemm module](https://github.com/Mozilla-Ocho/llamafile/blob/main/llamafile/sgemm.cpp) on GitHub.
## CPUInfer
To power Llamafile and many future CPU kernels without the original GGML framework, we developed a simple CPUInfer multi-threaded execution framework. It currently leverages the Llamafile Sgemm module to implement key operators such as linear layers, MLP, and MoE, and will be extended to support many other operators. These operators are fundamental components for building large models. CPUInfer features a backend work-stealing thread pool and asynchronous task queue execution logic to efficiently offload parts of model parameters to the CPU, thereby maintaining high inference performance. It supports adjustments based on hardware capabilities or user configurations, providing enhanced inference performance and making it an ideal tool for running deep learning models on CPUs.
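As a quick illustration, the following is a minimal sketch of the CPUInfer usage pattern, mirroring the benchmark scripts included later in this document; the `cpuinfer_ext` module name, the `LinearConfig` arguments, and the ggml type codes (0 = `GGML_TYPE_F32`, 30 = `GGML_TYPE_BF16`) are taken from those scripts and may differ in other builds.
```python
import torch
import cpuinfer_ext  # built from ktransformers/ktransformers_ext (see the build steps below)

# Minimal sketch: create the backend thread pool, describe a linear operator,
# then submit work asynchronously and synchronize.
cpu_infer = cpuinfer_ext.CPUInfer(64)          # work-stealing thread pool with 64 workers
input_size, output_size, stride = 16384, 5120, 16

weight = torch.randn((output_size, input_size), dtype=torch.float32).contiguous()
config = cpuinfer_ext.linear.LinearConfig(
    input_size, output_size, stride, weight.data_ptr(), 0, 30)  # fp32 weights, bf16 activations
linear = cpuinfer_ext.linear.Linear(config)

x = torch.randn((1, input_size), dtype=torch.bfloat16).contiguous()
y = torch.empty((1, output_size), dtype=torch.bfloat16).contiguous()
cpu_infer.submit(linear.forward, x.data_ptr(), y.data_ptr())  # enqueue the task asynchronously
cpu_infer.sync()                                              # block until the task completes
```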
## Expert-Parallel MoE
The MoE module's performance can be enhanced by using custom kernels that utilize **expert parallelism**. Since the routed experts are independently computable, we can utilize this inherent parallelism to speed up MoE computations. Specifically, we can allocate each expert MLP to a separate thread group, allowing for the simultaneous computation of all routed experts. This approach of expert parallelism significantly boosts MoE performance by minimizing the frequency of global synchronizations and reducing kernel launch overhead compared to sequential expert computation.
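The actual kernels implement this in C++ inside CPUInfer with per-expert thread groups; the short torch sketch below only illustrates the idea (independently routed experts dispatched to concurrent workers with a single synchronization point) and is not the real implementation.
```python
from concurrent.futures import ThreadPoolExecutor
import torch

def moe_forward_parallel(x, experts, expert_ids, routing_weights, pool):
    # Each selected expert runs in its own worker; torch releases the GIL inside
    # the matrix multiplications, so the experts genuinely overlap on CPU.
    futures = [pool.submit(experts[e], x) for e in expert_ids]
    outputs = [f.result() for f in futures]            # one global synchronization
    return sum(w * y for w, y in zip(routing_weights, outputs))

# Toy usage: 6 routed experts out of 160, hidden size 5120 (DeepSeek-Coder-V2-like shapes).
hidden = 5120
experts = [torch.nn.Linear(hidden, hidden, bias=False) for _ in range(160)]
pool = ThreadPoolExecutor(max_workers=6)
x = torch.randn(1, hidden)
y = moe_forward_parallel(x, experts, [3, 17, 42, 58, 99, 140], [1 / 6] * 6, pool)
```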
## Microbenchmark
Our evaluations were conducted on an Intel(R) Xeon(R) Gold 6454S processor, utilizing real parameters from the DeepSeek-Coder-V2-Instruct model.
### Linear Projection
The performance of the linear layer was assessed using an Attention Output Projection with dimensions of [5120, 16384]. Here, the input was a vector of 16384 dimensions, and the output was a vector of 5120 dimensions.
![Linear_projection_time](Linear_projection_time.png)
As we can see, in half-precision floating-point formats (fp16 and bf16), CPUInfer's performance exceeded that of Torch by 1.7 and 1.5 times, respectively. For 8-bit quantization, CPUInfer (supporting q8_0) and Torch (supporting qint8) demonstrated nearly equivalent performance. However, CPUInfer employs a more refined scaling approach, using different factors for each group (in q8_0 quantization, every 32 numbers form one group), whereas Torch uses a basic per-tensor quantization, potentially leading to significant precision loss. Furthermore, CPUInfer’s capability to use lower-bit quantization enhances inference speed in specific scenarios.
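To make the scaling difference concrete, here is a small self-contained sketch (not the actual q8_0 kernel) that compares the round-trip error of per-tensor versus per-32-element-group symmetric int8 quantization on a synthetic weight vector:
```python
import torch

def int8_roundtrip(x, group_size):
    # Symmetric int8 quantization with one scale per group of `group_size` values;
    # group_size == x.numel() degenerates to per-tensor scaling.
    groups = x.reshape(-1, group_size)
    scales = groups.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(groups / scales), -127, 127)
    return (q * scales).reshape(x.shape)

x = torch.randn(16384) * (torch.rand(16384) * 10)      # weights with uneven magnitudes
per_tensor_err = (int8_roundtrip(x, x.numel()) - x).abs().mean()
per_group_err = (int8_roundtrip(x, 32) - x).abs().mean()
print(f"per-tensor MAE: {per_tensor_err:.4f}, per-32-group MAE: {per_group_err:.4f}")
```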
### MoE
In the MoE module, each token selected 6 experts out of 160 for computation, with input and output dimensions of 5120, and an intermediate dimension of 1536.
![Combined_MoE_time_per_layer](Combined_MoE_time_per_layer.png)
For half-precision floating points and 8-bit quantization formats, CPUInfer's generation performance was 2.5 and 3.2 times better than Torch, respectively. Moreover, using the 8-bit quantization format, CPUInfer achieved faster prefill speeds compared to Torch, with shorter prompts highlighting a more pronounced performance difference.
# API
- [OpenAI ChatCompletion](#openai-chatcompletion)
- [Ollama ChatCompletion](#ollama-chatcompletion)
- [OpenAI Assistant](#openai-assistant)
## OpenAI ChatCompletion
```bash
POST /v1/chat/completions
```
Generates a reply using the selected model.
### Parameters
- `messages`: an array of `message` objects holding the full conversation history. Each `message` represents a message from the user (`user`) or the model (`assistant`) and contains:
  - `role`: either `user` or `assistant`, identifying who created this message.
  - `content`: the text of the user's or the model's message.
- `model`: the name of the selected model.
- `stream`: true or false. Whether to stream the response. If true, the inference result is returned as an HTTP event stream.
### Response
- Streaming: an event stream in which each event carries a `chat.completion.chunk`; `chunk.choices[0].delta.content` is the incremental output produced by the model.
- Non-streaming: not yet supported.
### Example
```bash
curl -X 'POST' \
'http://localhost:9112/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"messages": [
{
"content": "tell a joke",
"role": "user"
}
],
"model": "Meta-Llama-3-8B-Instruct",
"stream": true
}'
```
```bash
data:{"id":"c30445e8-1061-4149-a101-39b8222e79e1","object":"chat.completion.chunk","created":1720511671,"model":"not implmented","system_fingerprint":"not implmented","usage":null,"choices":[{"index":0,"delta":{"content":"Why ","role":"assistant","name":null},"logprobs":null,"finish_reason":null}]}
data:{"id":"c30445e8-1061-4149-a101-39b8222e79e1","object":"chat.completion.chunk","created":1720511671,"model":"not implmented","system_fingerprint":"not implmented","usage":null,"choices":[{"index":0,"delta":{"content":"","role":"assistant","name":null},"logprobs":null,"finish_reason":null}]}
data:{"id":"c30445e8-1061-4149-a101-39b8222e79e1","object":"chat.completion.chunk","created":1720511671,"model":"not implmented","system_fingerprint":"not implmented","usage":null,"choices":[{"index":0,"delta":{"content":"couldn't ","role":"assistant","name":null},"logprobs":null,"finish_reason":null}]}
...
data:{"id":"c30445e8-1061-4149-a101-39b8222e79e1","object":"chat.completion.chunk","created":1720511671,"model":"not implmented","system_fingerprint":"not implmented","usage":null,"choices":[{"index":0,"delta":{"content":"two-tired!","role":"assistant","name":null},"logprobs":null,"finish_reason":null}]}
event: done
data: [DONE]
```
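For programmatic access, the sketch below consumes the same streaming endpoint with the `requests` library; it assumes the server from the example above is listening on `localhost:9112` and follows the event-stream format shown in the response.
```python
import json
import requests

payload = {
    "messages": [{"role": "user", "content": "tell a joke"}],
    "model": "Meta-Llama-3-8B-Instruct",
    "stream": True,
}
with requests.post("http://localhost:9112/v1/chat/completions",
                   json=payload, stream=True) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data:"):
            continue                      # skip empty keep-alives and "event:" lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        print(chunk["choices"][0]["delta"]["content"] or "", end="", flush=True)
```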
## Ollama ChatCompletion
```bash
POST /api/generate
```
Generates a reply using the selected model.
### Parameters
- `prompt`: a string containing the input prompt.
- `model`: the name of the selected model.
- `stream`: true or false. Whether to stream the response. If true, the inference result is returned as an HTTP event stream.
### Response
- Streaming: a streaming JSON response, one JSON object per line.
  - `response`: the incremental completion produced by the model.
  - `done`: whether inference has finished.
- Non-streaming: not yet supported.
### Example
```bash
curl -X 'POST' \
'http://localhost:9112/api/generate' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "Meta-Llama-3-8B-Instruct",
"prompt": "tell me a joke",
"stream": true
}'
```
```bash
{"model":"Meta-Llama-3-8B-Instruct","created_at":"2024-07-09 08:13:11.686513","response":"I'll ","done":false}
{"model":"Meta-Llama-3-8B-Instruct","created_at":"2024-07-09 08:13:11.729214","response":"give ","done":false}
...
{"model":"Meta-Llama-3-8B-Instruct","created_at":"2024-07-09 08:13:33.955475","response":"for","done":false}
{"model":"Meta-Llama-3-8B-Instruct","created_at":"2024-07-09 08:13:33.956795","response":"","done":true}
```
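As with the ChatCompletion endpoint, the stream can be consumed programmatically; the sketch below uses `requests` and assumes the server from the example above is listening on `localhost:9112`.
```python
import json
import requests

payload = {
    "model": "Meta-Llama-3-8B-Instruct",
    "prompt": "tell me a joke",
    "stream": True,
}
with requests.post("http://localhost:9112/api/generate",
                   json=payload, stream=True) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk["response"], end="", flush=True)
        if chunk["done"]:
            break
```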
# Backend Service (Server)
The Server exposes ktransformers' fast heterogeneous inference capability to external callers through an API.
<img src="server-arch.png" height="600" alt="Server architecture">
## API
The Server offers model inference as a RESTful API with two kinds of interfaces: ChatCompletion and Assistant.
- The ChatCompletion interface requires the caller to supply the entire conversation history in one request and returns the model's reply. Both AI service providers (e.g. [OpenAI](https://platform.openai.com/docs/api-reference/chat/create)) and local inference frameworks (e.g. [Ollama](https://github.com/ollama/ollama/blob/main/docs/api.md)) offer ChatCompletion interfaces. To stay compatible with both, the Server provides APIs consistent with each of them, so applications that currently use OpenAI or Ollama can switch to our Server seamlessly. For example: [How to use Tabby and ktransformers to run a 236B model locally for code completion?](tabby.md)
- Assistant suits applications that need to reuse a set of resources across model calls. For example, in an education scenario, an application developer can create an Assistant called "second-grade math teacher", set its initial prompt ("You are an experienced second-grade math teacher..."), and upload related materials (a second-grade math textbook). After creating the Assistant, the application creates a Thread to store the Messages exchanged between the user and the model, and creates a Run to obtain the Assistant's reply when it wants to invoke the model. Compared with ChatCompletion, a Server that implements Assistant handles conversation-context reuse and multi-turn dialogue on behalf of the application, which makes model calls in complex scenarios much more convenient. The [OpenAI Assistant API](https://platform.openai.com/docs/api-reference/assistants/createAssistant) defines such an Assistant interface, and the Server provides an API consistent with it.
These APIs are defined in `server/api`; see [here](api.md) for how to use them.
## Integrating Model Inference Frameworks
The Server runs model inference through ktransformers. It also supports other inference frameworks, such as the already supported [transformers](https://huggingface.co/docs/transformers/index), and support for [exllamav2](https://github.com/turboderp/exllamav2) is planned. These features are implemented in `server/backend`.
The Server abstracts an inference framework's inference capability into a base class, `BackendInterfaceBase`. This base class has a single function, `inference`: its input is the conversation history `messages`, and its output is the text returned by the model. The `inference` function is designed as an async generator, which lets the Server stream the model's reply.
```python
class BackendInterfaceBase:
    async def inference(self, messages, **kwargs) -> AsyncIterator[str]:
        ...
```
Because its input and output are exactly a conversation history and a model reply, the `inference` function naturally implements ChatCompletion, so the ChatCompletion API can call `inference` directly to run the model.
Assistant is considerably more complex than ChatCompletion: the Server must store the Assistant's state and call `inference` in the appropriate way. The Server maintains the Assistant logic in a database, storing the Assistants, Threads, and Messages created by applications. In memory, the Server keeps a `ThreadContext` for each Thread, which gathers the information related to that Thread, such as its Assistant. When the user sends a new Message, the Server calls the ThreadContext's `get_local_messages` function to obtain the messages and then calls `inference` to get the inference result.
```python
class MyThreadContext(ThreadContext):
    def get_local_messages(self):
        ...
```
Because different inference frameworks expect different input formats for the conversation history, a `ThreadContext` and a `BackendInterface` must be used as a matching pair. Besides its own ktransformers backend, the Server also supports transformers. To integrate another inference framework, refer to the implementation of `TransformersInterface` and `TransformersThreadContext` in [transformers.py](https://github.com/kvcache-ai/ktransformers-dev/blob/main/ktransformers/server/backend/interfaces/transformers.py); a hypothetical sketch of such a pair follows.
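The following sketch is based only on the two base-class signatures shown above; import paths are omitted, and the assumption that a `ThreadContext` exposes its stored Messages as `self.messages` with `role`/`content` attributes is mine. The authoritative reference remains `TransformersInterface` and `TransformersThreadContext`.
```python
from typing import AsyncIterator

class EchoInterface(BackendInterfaceBase):
    # A trivial "model" that streams the last user message back word by word.
    async def inference(self, messages, **kwargs) -> AsyncIterator[str]:
        for word in messages[-1]["content"].split():
            yield word + " "

class EchoThreadContext(ThreadContext):
    # Convert the Thread's stored Messages into the format EchoInterface expects
    # (assumes self.messages holds objects with .role and .content).
    def get_local_messages(self):
        return [{"role": m.role, "content": m.content} for m in self.messages]
```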
# How to use Tabby and ktransformers to run a 236B model locally for code completion?
[Tabby](https://tabby.tabbyml.com/docs/welcome/) is an open-source coding assistant. Users can configure which backend framework and model it uses and then use it from multiple IDEs/editors, such as VSCode and IntelliJ. Because Tabby can talk to Ollama on the framework side, and the ktransformers server provides an API consistent with Ollama's, we can connect Tabby to the ktransformers server and enjoy ktransformers' fast heterogeneous inference for code completion.
1. Start ktransformers.
```bash
./ktransformers --port 9112
```
2. Install Tabby: follow the official tutorial to [install Tabby](https://tabby.tabbyml.com/docs/quick-start/installation/linux/) on a Linux server or a Windows PC with an NVIDIA GPU.
3. Configure Tabby: create `~/.tabby/config.toml` and add the following configuration.
```toml
[model.completion.http]
kind = "ollama/completion"
api_endpoint = "http://127.0.0.1:9112/"
model_name = "DeepSeek-Coder-V2-Instruct"
prompt_template = "<|fim▁begin|>{prefix}<|fim▁hole|>{suffix}<|fim▁end|>" # Prompt Template
```
In this configuration, `kind` indicates that ktransformers serves Tabby through Ollama's standard API; `api_endpoint` must match the address that ktransformers binds to at startup; `model_name` is the model used by ktransformers, here `DeepSeek-Coder-V2-Instruct` as the backend inference model; and `prompt_template` is the model's prompt template, which must match the model for Fill In the Middle to work correctly.
This covers the configuration Tabby needs to provide completion through the Ollama API; for Tabby's other optional features, refer to [here](https://tabby.tabbyml.com/docs/administration/model/).
4. Start the Tabby service: `./tabby serve`
<img src="run-tabby.png" alt="image-20240709112329577" style="zoom:50%;" />
​ 启动之后,期望会在 ktransformers 的命令行界面看到对 `/api/tags` 接口的访问(在 Tabby 新版本 v0.13.0 中变为对 `/api/show/` 接口的访问)。
<img src="visit-api-tags.png" alt="image-20240709111648215" style="zoom:67%;" />
5. Register a Tabby account and obtain a token: after starting the Tabby service, open the corresponding link in a browser (0.0.0.0:8080 in the figure above) and follow the [tutorial](https://tabby.tabbyml.com/docs/quick-start/register-account/) to create a user and obtain a token.
6. Start VSCode, install the Tabby extension, and when prompted connect to the Tabby Server with the token obtained in the previous step; see [here](https://tabby.tabbyml.com/docs/extensions/installation/vscode/).
7. Open any code file and experience ktransformers' fast heterogeneous inference.
# Start with website
This document provides the necessary steps to set up and run the web service for this project.
## 1. Starting the Web Service
### 1.1. Compiling the Web Code
Before you can compile the web code, make sure you have installed [Node.js](https://nodejs.org) version 18.3 or higher.
Once Node.js and npm are installed, navigate to the `ktransformers/website` directory:
```bash
cd ktransformers/website
```
Next, install the Vue CLI with the following command:
```bash
npm install @vue/cli
```
Now you can build the project:
```bash
npm run build
```
Finally, you can build ktransformers together with the website:
```bash
cd ../../
pip install .
```
#!/bin/bash
set -e
# clear build dirs
rm -rf ktransformers/ktransformers_ext/build
rm -rf ktransformers/ktransformers_ext/cuda/build
rm -rf ktransformers/ktransformers_ext/cuda/dist
rm -rf ktransformers/ktransformers_ext/cuda/*.egg-info
echo "Installing python dependencies from requirements.txt"
pip install -r requirements-local_chat.txt
echo "Installing ktransformers cpuinfer"
mkdir -p ktransformers/ktransformers_ext/build
cd ktransformers/ktransformers_ext/build
cmake ..
cmake --build . --config Release
echo "Installing ktransformers gpu kernel, this may take for a while, please wait"
sleep 3
cd ../cuda
python setup.py install
cd ../../..
echo "Installation completed successfully"
__version__ = "0.1.0"
log:
dir: "logs"
file: "lexllama.log"
#log level: debug, info, warn, error, crit
level: "debug"
backup_count: -1
server:
ip: 0.0.0.0
port: 12456
db:
type: "sqllite"
database: "server.db"
host: "./"
pool_size: 10
user:
secret_key: "981f1dd2a44e27d68759d0252a486568ed43480b4e616a26e3af3709c3a7ce73"
algorithm: "HS256"
model:
# type: transformers
type: ktransformers
name: DeepSeek-Coder-V2-Instruct
path: /mnt/data/model/DeepSeek-Coder-V2-Instruct/
gguf_path: /mnt/data/model/DeepSeek-Coder-V2-GGUF-WJH/
device: cuda:0
web:
mount: False
open_cross_domain: True
ext:
cpu_infer: 10
[loggers]
keys=root,uvicorn,uvicornError,uvicornAccess
[handlers]
keys=consoleHandler,fileHandler
[formatters]
keys=detailedFormatter
[logger_root]
level=INFO
handlers=consoleHandler
[logger_uvicorn]
level=INFO
handlers=consoleHandler,fileHandler
qualname=uvicorn
propagate=0
[logger_uvicornError]
level=ERROR
handlers=consoleHandler,fileHandler
qualname=uvicorn.error
propagate=0
[logger_uvicornAccess]
level=INFO
handlers=consoleHandler,fileHandler
qualname=uvicorn.access
propagate=0
[handler_consoleHandler]
class=StreamHandler
level=INFO
formatter=detailedFormatter
args=(sys.stdout,)
[handler_fileHandler]
class=logging.FileHandler
level=INFO
formatter=detailedFormatter
args=('uvicorn_logs.log', 'a')
[formatter_detailedFormatter]
format=%(asctime)s - %(name)s - %(levelname)s - [%(filename)s:%(lineno)d] - %(message)s
datefmt=%Y-%m-%d %H:%M:%S
cmake_minimum_required(VERSION 3.16)
project(cpuinfer_ext VERSION 0.1.0)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3")
set(CMAKE_BUILD_TYPE "Release")
include(CheckCXXCompilerFlag)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
option(LLAMA_NATIVE "llama: enable -march=native flag" ON)
# Architecture specific
# TODO: probably these flags need to be tweaked on some architectures
# feel free to update the Makefile for your architecture and send a pull request or issue
message(STATUS "CMAKE_SYSTEM_PROCESSOR: ${CMAKE_SYSTEM_PROCESSOR}")
if (MSVC)
string(TOLOWER "${CMAKE_GENERATOR_PLATFORM}" CMAKE_GENERATOR_PLATFORM_LWR)
message(STATUS "CMAKE_GENERATOR_PLATFORM: ${CMAKE_GENERATOR_PLATFORM}")
else ()
set(CMAKE_GENERATOR_PLATFORM_LWR "")
endif ()
if (NOT MSVC)
if (LLAMA_STATIC)
add_link_options(-static)
if (MINGW)
add_link_options(-static-libgcc -static-libstdc++)
endif()
endif()
if (LLAMA_GPROF)
add_compile_options(-pg)
endif()
endif()
set(ARCH_FLAGS "")
if (CMAKE_OSX_ARCHITECTURES STREQUAL "arm64" OR CMAKE_GENERATOR_PLATFORM_LWR STREQUAL "arm64" OR
(NOT CMAKE_OSX_ARCHITECTURES AND NOT CMAKE_GENERATOR_PLATFORM_LWR AND
CMAKE_SYSTEM_PROCESSOR MATCHES "^(aarch64|arm.*|ARM64)$"))
message(STATUS "ARM detected")
if (MSVC)
add_compile_definitions(__aarch64__) # MSVC defines _M_ARM64 instead
add_compile_definitions(__ARM_NEON)
add_compile_definitions(__ARM_FEATURE_FMA)
set(CMAKE_REQUIRED_FLAGS_PREV ${CMAKE_REQUIRED_FLAGS})
string(JOIN " " CMAKE_REQUIRED_FLAGS ${CMAKE_REQUIRED_FLAGS} "/arch:armv8.2")
check_cxx_source_compiles("#include <arm_neon.h>\nint main() { int8x16_t _a, _b; int32x4_t _s = vdotq_s32(_s, _a, _b); return 0; }" GGML_COMPILER_SUPPORT_DOTPROD)
if (GGML_COMPILER_SUPPORT_DOTPROD)
add_compile_definitions(__ARM_FEATURE_DOTPROD)
endif ()
check_cxx_source_compiles("#include <arm_neon.h>\nint main() { float16_t _a; float16x8_t _s = vdupq_n_f16(_a); return 0; }" GGML_COMPILER_SUPPORT_FP16_VECTOR_ARITHMETIC)
if (GGML_COMPILER_SUPPORT_FP16_VECTOR_ARITHMETIC)
add_compile_definitions(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
endif ()
set(CMAKE_REQUIRED_FLAGS ${CMAKE_REQUIRED_FLAGS_PREV})
else()
check_cxx_compiler_flag(-mfp16-format=ieee COMPILER_SUPPORTS_FP16_FORMAT_I3E)
if (NOT "${COMPILER_SUPPORTS_FP16_FORMAT_I3E}" STREQUAL "")
list(APPEND ARCH_FLAGS -mfp16-format=ieee)
endif()
if (${CMAKE_SYSTEM_PROCESSOR} MATCHES "armv6")
# Raspberry Pi 1, Zero
list(APPEND ARCH_FLAGS -mfpu=neon-fp-armv8 -mno-unaligned-access)
endif()
if (${CMAKE_SYSTEM_PROCESSOR} MATCHES "armv7")
if ("${CMAKE_SYSTEM_NAME}" STREQUAL "Android")
# Android armeabi-v7a
list(APPEND ARCH_FLAGS -mfpu=neon-vfpv4 -mno-unaligned-access -funsafe-math-optimizations)
else()
# Raspberry Pi 2
list(APPEND ARCH_FLAGS -mfpu=neon-fp-armv8 -mno-unaligned-access -funsafe-math-optimizations)
endif()
endif()
if (${CMAKE_SYSTEM_PROCESSOR} MATCHES "armv8")
# Android arm64-v8a
# Raspberry Pi 3, 4, Zero 2 (32-bit)
list(APPEND ARCH_FLAGS -mno-unaligned-access)
endif()
endif()
elseif (CMAKE_OSX_ARCHITECTURES STREQUAL "x86_64" OR CMAKE_GENERATOR_PLATFORM_LWR MATCHES "^(x86_64|i686|amd64|x64|win32)$" OR
(NOT CMAKE_OSX_ARCHITECTURES AND NOT CMAKE_GENERATOR_PLATFORM_LWR AND
CMAKE_SYSTEM_PROCESSOR MATCHES "^(x86_64|i686|AMD64)$"))
message(STATUS "x86 detected")
if (MSVC)
# instruction set detection for MSVC only
if (LLAMA_NATIVE)
include(cmake/FindSIMD.cmake)
endif ()
if (LLAMA_AVX512)
list(APPEND ARCH_FLAGS /arch:AVX512)
# MSVC has no compile-time flags enabling specific
# AVX512 extensions, neither it defines the
# macros corresponding to the extensions.
# Do it manually.
if (LLAMA_AVX512_VBMI)
add_compile_definitions($<$<COMPILE_LANGUAGE:C>:__AVX512VBMI__>)
add_compile_definitions($<$<COMPILE_LANGUAGE:CXX>:__AVX512VBMI__>)
endif()
if (LLAMA_AVX512_VNNI)
add_compile_definitions($<$<COMPILE_LANGUAGE:C>:__AVX512VNNI__>)
add_compile_definitions($<$<COMPILE_LANGUAGE:CXX>:__AVX512VNNI__>)
endif()
elseif (LLAMA_AVX2)
list(APPEND ARCH_FLAGS /arch:AVX2)
elseif (LLAMA_AVX)
list(APPEND ARCH_FLAGS /arch:AVX)
endif()
else()
if (LLAMA_NATIVE)
list(APPEND ARCH_FLAGS -march=native)
endif()
if (LLAMA_F16C)
list(APPEND ARCH_FLAGS -mf16c)
endif()
if (LLAMA_FMA)
list(APPEND ARCH_FLAGS -mfma)
endif()
if (LLAMA_AVX)
list(APPEND ARCH_FLAGS -mavx)
endif()
if (LLAMA_AVX2)
list(APPEND ARCH_FLAGS -mavx2)
endif()
if (LLAMA_AVX512)
list(APPEND ARCH_FLAGS -mavx512f)
list(APPEND ARCH_FLAGS -mavx512bw)
endif()
if (LLAMA_AVX512_VBMI)
list(APPEND ARCH_FLAGS -mavx512vbmi)
endif()
if (LLAMA_AVX512_VNNI)
list(APPEND ARCH_FLAGS -mavx512vnni)
endif()
endif()
elseif (${CMAKE_SYSTEM_PROCESSOR} MATCHES "ppc64")
message(STATUS "PowerPC detected")
if (${CMAKE_SYSTEM_PROCESSOR} MATCHES "ppc64le")
list(APPEND ARCH_FLAGS -mcpu=powerpc64le)
else()
list(APPEND ARCH_FLAGS -mcpu=native -mtune=native)
#TODO: Add targets for Power8/Power9 (Altivec/VSX) and Power10(MMA) and query for big endian systems (ppc64/le/be)
endif()
else()
message(STATUS "Unknown architecture")
endif()
find_package(CUDA REQUIRED)
add_compile_options("$<$<COMPILE_LANGUAGE:CXX>:${ARCH_FLAGS}>")
add_compile_options("$<$<COMPILE_LANGUAGE:C>:${ARCH_FLAGS}>")
add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/../../third_party/pybind11 ${CMAKE_CURRENT_BINARY_DIR}/third_party/pybind11)
add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/../../third_party/llama.cpp ${CMAKE_CURRENT_BINARY_DIR}/third_party/llama.cpp)
include_directories(${CMAKE_CURRENT_SOURCE_DIR}/../../third_party)
include_directories("${CUDA_INCLUDE_DIRS}")
aux_source_directory(${CMAKE_CURRENT_SOURCE_DIR} SOURCE_DIR1)
aux_source_directory(${CMAKE_CURRENT_SOURCE_DIR}/cpu_backend SOURCE_DIR2)
aux_source_directory(${CMAKE_CURRENT_SOURCE_DIR}/operators/llamafile SOURCE_DIR3)
aux_source_directory(${CMAKE_CURRENT_SOURCE_DIR}/../../third_party/llamafile SOURCE_DIR4)
set(ALL_SOURCES ${SOURCE_DIR1} ${SOURCE_DIR2} ${SOURCE_DIR3} ${SOURCE_DIR4})
message(STATUS "ALL_SOURCES: ${ALL_SOURCES}")
pybind11_add_module(${PROJECT_NAME} MODULE ${ALL_SOURCES})
target_link_libraries(${PROJECT_NAME} PRIVATE llama)
target_link_libraries(${PROJECT_NAME} PRIVATE "/usr/local/cuda/lib64/libcudart.so")
#!/usr/bin/env python
# coding=utf-8
'''
Description :
Author : chenht2022
Date : 2024-07-25 10:31:59
Version : 1.0.0
LastEditors : chenht2022
LastEditTime : 2024-07-25 10:32:51
Copyright (c) 2024 by KVCache.AI, All Rights Reserved.
'''
import os, sys
import time
sys.path.append(os.path.dirname(__file__) + '/../build')
import cpuinfer_ext
import torch
def bench_linear(quant_mode: str):
with torch.inference_mode(mode=True):
input_size = 16384
output_size = 5120
stride = 16
layer_num = 10
CPUInfer = cpuinfer_ext.CPUInfer(64)
warm_up_iter = 1000
test_iter = 10000
hidden_type = 30 # ggml_type::GGML_TYPE_BF16
if quant_mode == "fp32":
proj_type = 0 # ggml_type::GGML_TYPE_F32
bytes_per_elem = 4.000000
elif quant_mode == "fp16":
proj_type = 1 # ggml_type::GGML_TYPE_F16
bytes_per_elem = 2.000000
elif quant_mode == "bf16":
proj_type = 30 # ggml_type::GGML_TYPE_BF16
bytes_per_elem = 2.000000
elif quant_mode == "q8_0":
proj_type = 8 # ggml_type::GGML_TYPE_Q8_0
bytes_per_elem = 1.062500
elif quant_mode == "q6_k":
proj_type = 14 # ggml_type::GGML_TYPE_Q6_K
bytes_per_elem = 0.820312
elif quant_mode == "q5_k_m":
proj_type = 13 # ggml_type::GGML_TYPE_Q5_K
bytes_per_elem = 0.687500
elif quant_mode == "q4_k_m":
proj_type = 12 # ggml_type::GGML_TYPE_Q4_K
bytes_per_elem = 0.562500
elif quant_mode == "q3_k_m":
proj_type = 11 # ggml_type::GGML_TYPE_Q3_K
bytes_per_elem = 0.429688
elif quant_mode == "q2_k":
proj_type = 10 # ggml_type::GGML_TYPE_Q2_K
bytes_per_elem = 0.328125
elif quant_mode == "iq3_xs":
proj_type = 21 # ggml_type::GGML_TYPE_IQ3_S
bytes_per_elem = 0.429688
elif quant_mode == "iq2_xxs":
proj_type = 16 # ggml_type::GGML_TYPE_IQ2_XXS
bytes_per_elem = 0.257812
else:
assert(False)
linears = []
projs = []
for _ in range(layer_num):
proj = torch.randn((output_size, input_size), dtype=torch.float32, device = "cuda").to("cpu").contiguous()
config = cpuinfer_ext.linear.LinearConfig(input_size, output_size, stride, proj.data_ptr(), proj_type, hidden_type)
linear = cpuinfer_ext.linear.Linear(config)
projs.append(proj)
linears.append(linear)
# warm up
for i in range(warm_up_iter):
linear = linears[i % layer_num]
input = torch.randn((1, input_size), dtype=torch.bfloat16).contiguous()
output = torch.empty((1, output_size), dtype=torch.bfloat16).contiguous()
CPUInfer.submit(linear.forward, input.data_ptr(), output.data_ptr())
CPUInfer.sync()
# test
total_time = 0
for i in range(test_iter):
linear = linears[i % layer_num]
input = torch.randn((1, input_size), dtype=torch.bfloat16).contiguous()
output = torch.empty((1, output_size), dtype=torch.bfloat16).contiguous()
start = time.perf_counter()
CPUInfer.submit(linear.forward, input.data_ptr(), output.data_ptr())
CPUInfer.sync()
end = time.perf_counter()
total_time += end - start
print('Quant mode: ', quant_mode)
print('Time(s): ', total_time)
print('Iteration: ', test_iter)
print('Time(us) per iteration: ', total_time / test_iter * 1000000)
print('Bandwidth: ', input_size * output_size * bytes_per_elem * test_iter / total_time / 1000 / 1000 / 1000, 'GB/s')
print('')
bench_linear("fp32")
bench_linear("fp16")
bench_linear("bf16")
bench_linear("q8_0")
bench_linear("q6_k")
bench_linear("q5_k_m")
bench_linear("q4_k_m")
bench_linear("q3_k_m")
bench_linear("q2_k")
# Not supported on __x86_64__
# bench_linear("iq3_xs")
# bench_linear("iq2_xxs")
#!/usr/bin/env python
# coding=utf-8
'''
Description :
Author : chenht2022
Date : 2024-07-25 10:31:59
Version : 1.0.0
LastEditors : chenht2022
LastEditTime : 2024-07-25 10:32:48
Copyright (c) 2024 by KVCache.AI, All Rights Reserved.
'''
import os, sys
import time
import torch
import torch.nn.quantized as nnq
def bench_linear(quant_mode: str):
with torch.inference_mode(mode=True):
input_size = 16384
output_size = 5120
layer_num = 10
warm_up_iter = 1000
test_iter = 10000
if quant_mode == "fp32":
proj_type = torch.float32
bytes_per_elem = 4.000000
elif quant_mode == "fp16":
proj_type = torch.float16
bytes_per_elem = 2.000000
elif quant_mode == "bf16":
proj_type = torch.bfloat16
bytes_per_elem = 2.000000
elif quant_mode == "qint8":
proj_type = torch.qint8
bytes_per_elem = 1.000000
else:
assert(False)
projs = []
for _ in range(layer_num):
proj = torch.randn((output_size, input_size), dtype = torch.float32, device = "cuda").to("cpu").contiguous()
if quant_mode == "qint8":
scale, zero_point = 0.1, 0 # Adjust scale and zero_point based on your dataset
proj_q = torch.quantize_per_tensor(proj, scale, zero_point, torch.qint8)
quantized_layer = nnq.Linear(input_size, output_size)
quantized_layer.set_weight_bias(proj_q, None)
projs.append(quantized_layer)
else:
projs.append(proj.to(proj_type))
# warm up
for i in range(warm_up_iter):
input = torch.randn((1, input_size), dtype=torch.float32).contiguous()
if quant_mode == "qint8":
input_q = torch.quantize_per_tensor(input, scale, zero_point, torch.quint8)
quantized_layer = projs[i % layer_num]
t_output = quantized_layer(input_q)
else:
t_output = torch.mm(input.to(proj_type), projs[i % layer_num].t())
# test
total_time = 0
for i in range(test_iter):
input = torch.randn((1, input_size), dtype=torch.float32).contiguous()
start = time.perf_counter()
if quant_mode == "qint8":
input_q = torch.quantize_per_tensor(input, scale, zero_point, torch.quint8)
quantized_layer = projs[i % layer_num]
t_output = quantized_layer(input_q)
else:
t_output = torch.mm(input.to(proj_type), projs[i % layer_num].t())
end = time.perf_counter()
total_time += end - start
print('Quant mode: ', quant_mode)
print('Time(s): ', total_time)
print('Iteration: ', test_iter)
print('Time(us) per iteration: ', total_time / test_iter * 1000000)
print('Bandwidth: ', input_size * output_size * bytes_per_elem * test_iter / total_time / 1000 / 1000 / 1000, 'GB/s')
print('')
bench_linear("fp32")
bench_linear("fp16")
bench_linear("bf16")
bench_linear("qint8")
#!/usr/bin/env python
# coding=utf-8
'''
Description :
Author : chenht2022
Date : 2024-07-16 10:43:18
Version : 1.0.0
LastEditors : chenht2022
LastEditTime : 2024-07-25 10:32:55
Copyright (c) 2024 by KVCache.AI, All Rights Reserved.
'''
import os, sys
import time
sys.path.append(os.path.dirname(__file__) + '/../build')
import cpuinfer_ext
import torch
def bench_mlp(quant_mode: str):
with torch.inference_mode(mode=True):
hidden_size = 5120
intermediate_size = 3072
stride = 16
layer_num = 10
CPUInfer = cpuinfer_ext.CPUInfer(64)
warm_up_iter = 1000
test_iter = 10000
hidden_type = 30 # ggml_type::GGML_TYPE_BF16
if quant_mode == "fp32":
gate_type = 0 # ggml_type::GGML_TYPE_F32
up_type = 0 # ggml_type::GGML_TYPE_F32
down_type = 0 # ggml_type::GGML_TYPE_F32
bytes_per_elem = 4.000000
elif quant_mode == "fp16":
gate_type = 1 # ggml_type::GGML_TYPE_F16
up_type = 1 # ggml_type::GGML_TYPE_F16
down_type = 1 # ggml_type::GGML_TYPE_F16
bytes_per_elem = 2.000000
elif quant_mode == "bf16":
gate_type = 30 # ggml_type::GGML_TYPE_BF16
up_type = 30 # ggml_type::GGML_TYPE_BF16
down_type = 30 # ggml_type::GGML_TYPE_BF16
bytes_per_elem = 2.000000
elif quant_mode == "q8_0":
gate_type = 8 # ggml_type::GGML_TYPE_Q8_0
up_type = 8 # ggml_type::GGML_TYPE_Q8_0
down_type = 8 # ggml_type::GGML_TYPE_Q8_0
bytes_per_elem = 1.062500
elif quant_mode == "q6_k":
gate_type = 14 # ggml_type::GGML_TYPE_Q6_K
up_type = 14 # ggml_type::GGML_TYPE_Q6_K
down_type = 14 # ggml_type::GGML_TYPE_Q6_K
bytes_per_elem = 0.820312
elif quant_mode == "q5_k_m":
gate_type = 13 # ggml_type::GGML_TYPE_Q5_K
up_type = 13 # ggml_type::GGML_TYPE_Q5_K
down_type = 14 # ggml_type::GGML_TYPE_Q6_K
bytes_per_elem = 0.731771
elif quant_mode == "q4_k_m":
gate_type = 12 # ggml_type::GGML_TYPE_Q4_K
up_type = 12 # ggml_type::GGML_TYPE_Q4_K
down_type = 14 # ggml_type::GGML_TYPE_Q6_K
bytes_per_elem = 0.648437
elif quant_mode == "q3_k_m":
gate_type = 11 # ggml_type::GGML_TYPE_Q3_K
up_type = 11 # ggml_type::GGML_TYPE_Q3_K
down_type = 13 # ggml_type::GGML_TYPE_Q5_K
bytes_per_elem = 0.515625
elif quant_mode == "q2_k":
gate_type = 10 # ggml_type::GGML_TYPE_Q2_K
up_type = 10 # ggml_type::GGML_TYPE_Q2_K
down_type = 11 # ggml_type::GGML_TYPE_Q3_K
bytes_per_elem = 0.328125
elif quant_mode == "iq3_xs":
gate_type = 21 # ggml_type::GGML_TYPE_IQ3_S
up_type = 21 # ggml_type::GGML_TYPE_IQ3_S
down_type = 21 # ggml_type::GGML_TYPE_IQ3_S
bytes_per_elem = 0.429688
elif quant_mode == "iq2_xxs":
gate_type = 16 # ggml_type::GGML_TYPE_IQ2_XXS
up_type = 16 # ggml_type::GGML_TYPE_IQ2_XXS
down_type = 16 # ggml_type::GGML_TYPE_IQ2_XXS
bytes_per_elem = 0.257812
else:
assert(False)
mlps = []
gate_projs = []
up_projs = []
down_projs = []
for _ in range(layer_num):
gate_proj = torch.randn((intermediate_size, hidden_size), dtype=torch.float32, device = "cuda").to("cpu").contiguous()
up_proj = torch.randn((intermediate_size, hidden_size), dtype=torch.float32, device = "cuda").to("cpu").contiguous()
down_proj = torch.randn((hidden_size, intermediate_size), dtype=torch.float32, device = "cuda").to("cpu").contiguous()
config = cpuinfer_ext.mlp.MLPConfig(hidden_size, intermediate_size, stride, gate_proj.data_ptr(), up_proj.data_ptr(), down_proj.data_ptr(), gate_type, up_type, down_type, hidden_type)
mlp = cpuinfer_ext.mlp.MLP(config)
gate_projs.append(gate_proj)
up_projs.append(up_proj)
down_projs.append(down_proj)
mlps.append(mlp)
# warm up
for i in range(warm_up_iter):
mlp = mlps[i % layer_num]
input = torch.randn((1, hidden_size), dtype=torch.bfloat16).contiguous()
output = torch.empty((1, hidden_size), dtype=torch.bfloat16).contiguous()
CPUInfer.submit(mlp.forward, input.data_ptr(), output.data_ptr())
CPUInfer.sync()
# test
total_time = 0
for i in range(test_iter):
mlp = mlps[i % layer_num]
input = torch.randn((1, hidden_size), dtype=torch.bfloat16).contiguous()
output = torch.empty((1, hidden_size), dtype=torch.bfloat16).contiguous()
start = time.perf_counter()
CPUInfer.submit(mlp.forward, input.data_ptr(), output.data_ptr())
CPUInfer.sync()
end = time.perf_counter()
total_time += end - start
print('Quant mode: ', quant_mode)
print('Time(s): ', total_time)
print('Iteration: ', test_iter)
print('Time(us) per iteration: ', total_time / test_iter * 1000000)
print('Bandwidth: ', hidden_size * intermediate_size * 3 * bytes_per_elem * test_iter / total_time / 1000 / 1000 / 1000, 'GB/s')
print('')
bench_mlp("fp32")
bench_mlp("fp16")
bench_mlp("bf16")
bench_mlp("q8_0")
bench_mlp("q6_k")
bench_mlp("q5_k_m")
bench_mlp("q4_k_m")
bench_mlp("q3_k_m")
bench_mlp("q2_k")
# Not supported on __x86_64__
# bench_mlp("iq3_xs")
# bench_mlp("iq2_xxs")
#!/usr/bin/env python
# coding=utf-8
'''
Description :
Author : chenht2022
Date : 2024-07-16 10:43:18
Version : 1.0.0
LastEditors : chenht2022
LastEditTime : 2024-07-25 10:32:53
Copyright (c) 2024 by KVCache.AI, All Rights Reserved.
'''
import os, sys
import time
import torch
import torch.nn.quantized as nnq
def act_fn(x):
return x / (1.0 + torch.exp(-x))
def bench_mlp(quant_mode: str):
with torch.inference_mode(mode=True):
hidden_size = 5120
intermediate_size = 3072
layer_num = 10
warm_up_iter = 1000
test_iter = 10000
if quant_mode == "fp32":
proj_type = torch.float32
bytes_per_elem = 4.000000
elif quant_mode == "fp16":
proj_type = torch.float16
bytes_per_elem = 2.000000
elif quant_mode == "bf16":
proj_type = torch.bfloat16
bytes_per_elem = 2.000000
elif quant_mode == "qint8":
proj_type = torch.qint8
bytes_per_elem = 1.000000
else:
assert(False)
gate_projs = []
up_projs = []
down_projs = []
for _ in range(layer_num):
gate_proj = torch.randn((intermediate_size, hidden_size), dtype=torch.float32, device = "cuda").to("cpu").contiguous()
up_proj = torch.randn((intermediate_size, hidden_size), dtype=torch.float32, device = "cuda").to("cpu").contiguous()
down_proj = torch.randn((hidden_size, intermediate_size), dtype=torch.float32, device = "cuda").to("cpu").contiguous()
if quant_mode == "qint8":
scale, zero_point = 0.1, 0 # Adjust scale and zero_point based on your dataset
gate_proj_q = torch.quantize_per_tensor(gate_proj, scale, zero_point, torch.qint8)
quantized_gate = nnq.Linear(hidden_size, intermediate_size)
quantized_gate.set_weight_bias(gate_proj_q, None)
up_proj_q = torch.quantize_per_tensor(up_proj, scale, zero_point, torch.qint8)
quantized_up = nnq.Linear(hidden_size, intermediate_size)
quantized_up.set_weight_bias(up_proj_q, None)
down_proj_q = torch.quantize_per_tensor(down_proj, scale, zero_point, torch.qint8)
quantized_down = nnq.Linear(intermediate_size, hidden_size)
quantized_down.set_weight_bias(down_proj_q, None)
gate_projs.append(quantized_gate)
up_projs.append(quantized_up)
down_projs.append(quantized_down)
else:
gate_projs.append(gate_proj.to(proj_type))
up_projs.append(up_proj.to(proj_type))
down_projs.append(down_proj.to(proj_type))
# warm up
for i in range(warm_up_iter):
input = torch.randn((1, hidden_size), dtype=torch.float32).contiguous()
if quant_mode == "qint8":
input_q = torch.quantize_per_tensor(input, scale, zero_point, torch.quint8)
quantized_gate = gate_projs[i % layer_num]
gate_buf = quantized_gate(input_q)
quantized_up = up_projs[i % layer_num]
up_buf = quantized_up(input_q)
gate_buf = gate_buf.dequantize()
up_buf = up_buf.dequantize()
intermediate = act_fn(gate_buf) * up_buf
intermediate_q = torch.quantize_per_tensor(intermediate, scale, zero_point, torch.quint8)
quantized_down = down_projs[i % layer_num]
t_output = quantized_down(intermediate_q)
else:
gate_proj = gate_projs[i%layer_num]
up_proj = up_projs[i%layer_num]
down_proj = down_projs[i%layer_num]
gate_buf = torch.mm(input.to(proj_type), gate_proj.t())
up_buf = torch.mm(input.to(proj_type), up_proj.t())
intermediate = act_fn(gate_buf) * up_buf
t_output = torch.mm(intermediate.to(proj_type), down_proj.t())
# test
total_time = 0
for i in range(test_iter):
input = torch.randn((1, hidden_size), dtype=torch.float32).contiguous()
start = time.perf_counter()
if quant_mode == "qint8":
input_q = torch.quantize_per_tensor(input, scale, zero_point, torch.quint8)
quantized_gate = gate_projs[i % layer_num]
gate_buf = quantized_gate(input_q)
quantized_up = up_projs[i % layer_num]
up_buf = quantized_up(input_q)
gate_buf = gate_buf.dequantize()
up_buf = up_buf.dequantize()
intermediate = act_fn(gate_buf) * up_buf
intermediate_q = torch.quantize_per_tensor(intermediate, scale, zero_point, torch.quint8)
quantized_down = down_projs[i % layer_num]
t_output = quantized_down(intermediate_q)
else:
gate_proj = gate_projs[i%layer_num]
up_proj = up_projs[i%layer_num]
down_proj = down_projs[i%layer_num]
gate_buf = torch.mm(input.to(proj_type), gate_proj.t())
up_buf = torch.mm(input.to(proj_type), up_proj.t())
intermediate = act_fn(gate_buf) * up_buf
t_output = torch.mm(intermediate.to(proj_type), down_proj.t())
end = time.perf_counter()
total_time += end - start
print('Quant mode: ', quant_mode)
print('Time(s): ', total_time)
print('Iteration: ', test_iter)
print('Time(us) per iteration: ', total_time / test_iter * 1000000)
print('Bandwidth: ', hidden_size * intermediate_size * 3 * bytes_per_elem * test_iter / total_time / 1000 / 1000 / 1000, 'GB/s')
print('')
bench_mlp("fp32")
bench_mlp("fp16")
bench_mlp("bf16")
bench_mlp("qint8")
#!/usr/bin/env python
# coding=utf-8
'''
Description :
Author : chenht2022
Date : 2024-07-25 10:32:05
Version : 1.0.0
LastEditors : chenht2022
LastEditTime : 2024-07-25 10:33:00
Copyright (c) 2024 by KVCache.AI, All Rights Reserved.
'''
import os, sys
import time
sys.path.append(os.path.dirname(__file__) + '/../build')
import cpuinfer_ext
import torch
def bench_moe(quant_mode: str):
with torch.inference_mode(mode=True):
expert_num = 10
hidden_size = 5120
intermediate_size = 1536
stride = 16
group_min_len = 10
group_max_len = 1024
n_routed_experts = 6
layer_num = 10
qlen = 1
CPUInfer = cpuinfer_ext.CPUInfer(64)
warm_up_iter = 1000
test_iter = 10000
hidden_type = 30 # ggml_type::GGML_TYPE_BF16
if quant_mode == "fp32":
gate_type = 0 # ggml_type::GGML_TYPE_F32
up_type = 0 # ggml_type::GGML_TYPE_F32
down_type = 0 # ggml_type::GGML_TYPE_F32
bytes_per_elem = 4.000000
elif quant_mode == "fp16":
gate_type = 1 # ggml_type::GGML_TYPE_F16
up_type = 1 # ggml_type::GGML_TYPE_F16
down_type = 1 # ggml_type::GGML_TYPE_F16
bytes_per_elem = 2.000000
elif quant_mode == "bf16":
gate_type = 30 # ggml_type::GGML_TYPE_BF16
up_type = 30 # ggml_type::GGML_TYPE_BF16
down_type = 30 # ggml_type::GGML_TYPE_BF16
bytes_per_elem = 2.000000
elif quant_mode == "q8_0":
gate_type = 8 # ggml_type::GGML_TYPE_Q8_0
up_type = 8 # ggml_type::GGML_TYPE_Q8_0
down_type = 8 # ggml_type::GGML_TYPE_Q8_0
bytes_per_elem = 1.062500
elif quant_mode == "q6_k":
gate_type = 14 # ggml_type::GGML_TYPE_Q6_K
up_type = 14 # ggml_type::GGML_TYPE_Q6_K
down_type = 14 # ggml_type::GGML_TYPE_Q6_K
bytes_per_elem = 0.820312
elif quant_mode == "q5_k_m":
gate_type = 13 # ggml_type::GGML_TYPE_Q5_K
up_type = 13 # ggml_type::GGML_TYPE_Q5_K
down_type = 14 # ggml_type::GGML_TYPE_Q6_K
bytes_per_elem = 0.731771
elif quant_mode == "q4_k_m":
gate_type = 12 # ggml_type::GGML_TYPE_Q4_K
up_type = 12 # ggml_type::GGML_TYPE_Q4_K
down_type = 14 # ggml_type::GGML_TYPE_Q6_K
bytes_per_elem = 0.648437
elif quant_mode == "q3_k_m":
gate_type = 11 # ggml_type::GGML_TYPE_Q3_K
up_type = 11 # ggml_type::GGML_TYPE_Q3_K
down_type = 13 # ggml_type::GGML_TYPE_Q5_K
bytes_per_elem = 0.515625
elif quant_mode == "q2_k":
gate_type = 10 # ggml_type::GGML_TYPE_Q2_K
up_type = 10 # ggml_type::GGML_TYPE_Q2_K
down_type = 11 # ggml_type::GGML_TYPE_Q3_K
bytes_per_elem = 0.328125
elif quant_mode == "iq3_xs":
gate_type = 21 # ggml_type::GGML_TYPE_IQ3_S
up_type = 21 # ggml_type::GGML_TYPE_IQ3_S
down_type = 21 # ggml_type::GGML_TYPE_IQ3_S
bytes_per_elem = 0.429688
elif quant_mode == "iq2_xxs":
gate_type = 16 # ggml_type::GGML_TYPE_IQ2_XXS
up_type = 16 # ggml_type::GGML_TYPE_IQ2_XXS
down_type = 16 # ggml_type::GGML_TYPE_IQ2_XXS
bytes_per_elem = 0.257812
else:
assert(False)
moes = []
gate_projs = []
up_projs = []
down_projs = []
for _ in range(layer_num):
gate_proj = torch.randn((expert_num, intermediate_size, hidden_size), dtype=torch.float32, device = "cuda").to("cpu").contiguous()
up_proj = torch.randn((expert_num, intermediate_size, hidden_size), dtype=torch.float32, device = "cuda").to("cpu").contiguous()
down_proj = torch.randn((expert_num, hidden_size, intermediate_size), dtype=torch.float32, device = "cuda").to("cpu").contiguous()
config = cpuinfer_ext.moe.MOEConfig(expert_num, n_routed_experts, hidden_size, intermediate_size, stride, group_min_len, group_max_len, gate_proj.data_ptr(), up_proj.data_ptr(), down_proj.data_ptr(), gate_type, up_type, down_type, hidden_type)
moe = cpuinfer_ext.moe.MOE(config)
gate_projs.append(gate_proj)
up_projs.append(up_proj)
down_projs.append(down_proj)
moes.append(moe)
expert_ids = torch.randint(0, expert_num, (layer_num, qlen, n_routed_experts), dtype=torch.int64, device = "cuda").to("cpu").contiguous()
weights = torch.rand((layer_num, qlen, n_routed_experts), dtype=torch.float32, device = "cuda").to("cpu").contiguous()
input = torch.randn((layer_num, qlen, hidden_size), dtype=torch.bfloat16, device = "cuda").to("cpu").contiguous()
output = torch.empty((layer_num, qlen, hidden_size), dtype=torch.bfloat16, device = "cuda").to("cpu").contiguous()
# warm up
for i in range(warm_up_iter):
CPUInfer.submit(moes[i % layer_num].forward,
qlen,
n_routed_experts,
expert_ids[i % layer_num].data_ptr(),
weights[i % layer_num].data_ptr(),
input[i % layer_num].data_ptr(),
output[i % layer_num].data_ptr())
CPUInfer.sync()
# test
start = time.perf_counter()
for i in range(test_iter):
CPUInfer.submit(moes[i % layer_num].forward,
qlen,
n_routed_experts,
expert_ids[i % layer_num].data_ptr(),
weights[i % layer_num].data_ptr(),
input[i % layer_num].data_ptr(),
output[i % layer_num].data_ptr())
CPUInfer.sync()
end = time.perf_counter()
total_time = end - start
print('Quant mode: ', quant_mode)
print('Time(s): ', total_time)
print('Iteration: ', test_iter)
print('Time(us) per iteration: ', total_time / test_iter * 1000000)
print('Bandwidth: ', hidden_size * intermediate_size * 3 * n_routed_experts * bytes_per_elem * test_iter / total_time / 1000 / 1000 / 1000, 'GB/s')
print('')
bench_moe("fp32")
bench_moe("fp16")
bench_moe("bf16")
bench_moe("q8_0")
bench_moe("q6_k")
bench_moe("q5_k_m")
bench_moe("q4_k_m")
bench_moe("q3_k_m")
bench_moe("q2_k")
# Not supported on __x86_64__
# bench_moe("iq3_xs")
# bench_moe("iq2_xxs")