Merge branch 'main' into issue/300

0166515c · PanZezhong1725 · GitHub · f0300ff3 · a23c4d13 · 0166515c
Unverified Commit 0166515c authored Aug 07, 2025 by PanZezhong1725 Committed by GitHub Aug 07, 2025
20 changed files
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -40,3 +40,13 @@ jobs:

    - name: Python Test
      run: python scripts/python_test.py --cpu
+
+    - name: run infinirt-test --cpu on Linux
+      if: matrix.os == 'ubuntu-latest'
+      run: |
+        ./build/linux/x86_64/release/infinirt-test --cpu
+
+    - name: run infinirt-test --cpu on Windows
+      if: matrix.os == 'windows-latest'
+      run: |
+        .\build\windows\x64\release\infinirt-test.exe --cpu
--- a/DEV.md
+++ b/DEV.md
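+# InfiniCore Developer Guide
+
+Dear developer, thank you for contributing to the InfiniCore open-source project! This document explains how to contribute code to InfiniCore.
+
+## Project Overview
+
+### Module Layout
+
+- infini-utils: utility code shared by all modules.
+- infinirt: the runtime library; depends on infini-utils.
+- infiniop: the operator library; depends on infinirt. Besides the C++ operator implementations, it also includes operators implemented with NineToothed (九齿, Triton-based), whose source files must be generated by a script before compilation. After installation, you can run the unit-test scripts in `test/infiniop` to test it.
+- infiniccl: the communication library; depends on infinirt.
+- utils-test: test code for the utility library; depends on infini-utils.
+- infiniop-test: the operator-library test framework. Unlike the unit tests, it reads GGUF test-case files (see the [`test-case documentation`](test/infiniop-test/README.md)). infiniop must be installed before use.
+- infiniccl-test: test code for the communication library. infiniccl must be installed before use.
+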
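+### Directory Structure
+
+```bash
+├── xmake.lua  # Top-level xmake build configuration, with build options and macro definitions for all platforms
+├── xmake/*.lua  # Per-platform xmake build configuration, with platform-specific build rules
+│
+├── include/  # Public headers; copied to the install directory on installation
+│   ├── infiniop/*.h  # InfiniOP operator library sub-headers
+│   ├── *.h  # Core module headers
+│
+├── src/  # Module source code, including source files and private headers
+│   ├── infiniop/ # InfiniOP operator library sources
+│   │   ├── devices/  # Common code for each device platform
+│   │   ├── ops/ # Operator implementations
+│   │   │   ├── [op]/
+│   │   │   │   ├── [device]/ # Per-hardware-platform operator implementation
+│   │   │   │   ├── operator.cc # C interface implementation of the operator
+│   │   ├── reduce/ # Common code for reduction operators
+│   │   ├── elementwise/  # Common code for elementwise operators
+│   │   ├── *.h  # Core struct definitions
+│   │
+│   ├── infiniop-test/  # InfiniOP operator library test framework
+│   ├── infinirt/ # InfiniRT runtime library sources
+│   ├── infiniccl/ # InfiniCCL collective communication library sources
+│
+├── test/ # Test sources
+│   ├── infiniop/ # InfiniOP operator library unit tests
+│   │       ├── *.py     # Unit-test scripts (depend on each platform's PyTorch)
+│   ├── infiniop-test/
+│   │       ├── test_generate/ # Test-case generation scripts for the operator test framework
+│
+├── scripts/ # Scripts
+│   ├── install.py # Build-and-install script
+│   ├── python_test.py # Runs all unit-test scripts
+```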
+
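+## Development Guide
+
+### Code Submission Workflow
+
+1. On the GitHub repository's issue page, create an issue according to the task type (development or bug). Every commit must correspond to an issue number.
+2. External contributors must submit PRs from a fork of the repository.
+3. Create a branch based on the issue number, named `issue/#` (where # is the issue number). If the name is already taken, append a "-#" sequence number, or add a description after a "/".
+4. Every commit message must start with `issue/#`.
+5. After pushing the branch to the remote, open a Pull Request whose title starts with `issue/#`, and link the PR on the original issue page.
+6. Each PR must have at least two reviewers (such as the module owner and a project administrator) and must include a screenshot showing that the tests pass after the last change.
+7. A PR can be merged only after it passes review, passes the automated tests, and has no code conflicts. Close the original issue after merging.
+
+### How to Develop a New Operator
+
+1. Design the operator interface from the operator definition, add the operator documentation to the [`InfiniCore documentation`](https://github.com/InfiniTensor/InfiniCore-Documentation), and submit a documentation PR.
+2. Add the operator header under `include/infiniop/` and include it in `include/infiniop.h`.
+3. Add an operator implementation directory under `src/infiniop/ops/`, and create an `operator.cc` file in it that implements the interface declared in the header.
+4. Add the platform-specific implementation under `src/infiniop/ops/[op]/[device]/`. Reuse the platforms' common code where possible (for example, elementwise and reduction computation), and put code that may be reused later into the corresponding common-code directory. For example, a CUDA kernel can be shared by several platforms, so consider implementing it in a header and using it from multiple source files.
+5. Once the operator compiles and installs successfully, add a unit-test script under `test/infiniop/` that compares correctness and performance against the PyTorch implementation. Test cases should cover the operator's commonly used data types and shapes. After the tests pass, you can add the script to the one-click test script `scripts/python_test.py` (so that the GitHub automated tests also cover the operator).
+6. Add a test-case script for the operator to the `test/infiniop-test/` operator test framework. The script should contain a class that builds GGUF test cases for the operator and should add a few random test cases in its main function. Verify that the random GGUF test cases pass the framework's test program.
+7. Submit a code PR following the workflow above.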
+
+### C++ Naming and Style Conventions
+
+1. Types
+
+    Internal data structure types use `UpperCamelCase`
+
+    ```c++
+    // Prefer names that start with Infinixx
+    struct InfiniopMatmulCudaDescriptor;
+    
+    template <typename KeyType, typename ValueType>
+    class HashMap; 
+    
+    using ValueMap = std::unordered_map<int, std::string>;
+    ```
+
+    Publicly exposed pointer types and enum types use `infinixx[XxxXxx]_t`
+
+    Constants use `INFINI_UPPER_SNAKE_CASE`
+
+    ```c++
+    typedef struct InfiniopMatmulCudaDescriptor *infiniopMatmulCudaDescriptor_t;
+    
+    typedef enum {
+        // INFINI...
+        INFINI_DTYPE_INVALID = 0,
+    } infiniDtype_t;
+    ```
+
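+2. Ordinary variables, function parameters, and class data members use `snake_case`
+
+    A leading underscore on a member name is reserved for private members; avoid leading underscores elsewhere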
+
+    ```c++
+    int max_count;
+    
+    class Example {
+    public:
+        std::string getUserName(std::string user_id);
+    private:
+        // Private data member names are prefixed with an underscore
+        int _max_count;
+        std::string _user_name;
+    };
+    
+    struct UrlTableProperties {
+        std::string name;
+        int num_entries;
+        static Pool<UrlTableProperties>* pool;
+    };
+    ```
+
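+    When a parameter has the same name as a local variable or a member variable, add a trailing underscore to one of the two names. When a temporary variable inside a function has the same name as a member, add the trailing underscore to the temporary's name. A trailing underscore means "temporary"
+
+    ```c++
+    void setCount(int count_){
+        int count = count_;
+    }
+    ```
+
+3. Functions use `lowerCamelCase`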
+
+    ```c++
+    int getMaxValue() const;
+    ```
+
+4. Place const/volatile qualifiers before the type
+
+    ```c++
+    const void *ptr;
+    const int num = 0;
+    ```
+
+### Code Formatting
+
+This project uses `clang-format-16` for C/C++ code and `black` for Python code. Use the [`scripts/format.py`](/scripts/format.py) script to check and apply formatting.
+
+Run
+
+```shell
+python scripts/format.py -h
+```
+
+to view the script's help message:
+
+```plaintext
+usage: format.py [-h] [--ref REF] [--path [PATH ...]] [--check] [--c C] [--py PY]
+
+options:
+  -h, --help         show this help message and exit
+  --ref REF          Git reference (commit hash) to compare against.
+  --path [PATH ...]  Files to format or check.
+  --check            Check files without modifying them.
+  --c C              C formatter (default: clang-format-16)
+  --py PY            Python formatter (default: black)
+```
+
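+Among the options:
+
+- `ref` and `path` control which files are formatted
+  - If both `ref` and `path` are empty, the currently staged (git added) files are formatted;
+  - Otherwise
+    - If `ref` is set, the given commit is compared with the current code and only modified files are formatted;
+    - If `path` is set, multiple paths can be passed (`--path p0 p1 p2`) and only files under those paths and their subdirectories are formatted;
+- If `--check` is set, the script checks whether the code needs reformatting without modifying any files;
+- `--c` selects the C/C++ formatter (default: `clang-format-16`);
+- `--py` selects the Python formatter (default: `black`).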
+
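+### VS Code Development Setup
+
+For the basic setup, see the [xmake official documentation](https://xmake.io/#/zh-cn/plugin/more_plugins?id=%e9%85%8d%e7%bd%ae-intellsence).
+
+- TL;DR
+  - clangd
+
+    Open *xmake.lua* and save it once to trigger generation of the compile commands; a *.vscode/compile_commands.json* file is generated automatically in the working directory. Then create *settings.json* in the same folder with the following content: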
+
+    > .vscode/settings.json
+
+    ```json
+    {
+        "clangd.arguments": [
+            "--compile-commands-dir=.vscode"
+        ],
+        "xmake.additionalConfigArguments": [
+            // Configure XMAKE_CONFIG_FLAGS here
+            "--nv-gpu=y"
+        ],
+    }
+    ```
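The `--ref`/`--path` rules in the DEV.md formatting section above can be read as a small decision procedure. The sketch below is one possible reading in Python, not the actual logic of `scripts/format.py`: the helper names `git_list_command` and `is_under` are hypothetical, and it assumes that `--path` narrows the files reported for `--ref` when both are given.

```python
from pathlib import PurePosixPath


def git_list_command(ref=None, paths=None):
    """Return the git command that lists candidate files, or None.

    Hypothetical helper: None means "walk the given paths directly".
    """
    if not ref and not paths:
        # Neither option given: only the staged (git added) files.
        return ["git", "diff", "--cached", "--name-only"]
    if ref:
        # Files that differ between the given commit and the working tree.
        return ["git", "diff", "--name-only", ref]
    return None


def is_under(file, paths):
    """True if `file` equals one of `paths` or lies in one of their subdirectories."""
    if not paths:
        return True
    target = PurePosixPath(file)
    return any(
        target == PurePosixPath(p) or PurePosixPath(p) in target.parents
        for p in paths
    )


print(git_list_command())              # command for staged files only
print(git_list_command(ref="HEAD~1"))  # command for files changed since HEAD~1
print(is_under("src/infiniop/ops/add/x.cc", ["src/infiniop"]))
```

Comparing whole path components, rather than string prefixes, keeps `src/infiniop-test/` from matching a `--path src/infiniop` filter.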
--- a/README.md
+++ b/README.md
@@ -24,6 +24,8 @@ InfiniCore 是一个跨平台统一编程工具集，为不同芯片平台的功
 - Cambricon MLU;
 - Kunlunxin XPU;

+For API definitions and usage, see the [`InfiniCore documentation`](https://github.com/InfiniTensor/InfiniCore-Documentation).
+
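 ## Configuration and Usage

 ### One-Click Installation
@@ -50,12 +52,18 @@ python scripts/install.py [XMAKE_CONFIG_FLAGS]
 | `--iluvatar-gpu=[y\|n]`  | Build the Iluvatar GPU interface implementation                    | n
 | `--sugon-dcu=[y\|n]`     | Build the Sugon DCU interface implementation                       | n
 | `--kunlun-xpu=[y\|n]`    | Build the Kunlun XPU interface implementation                      | n
+| `--ninetoothed=[y\|n]`   | Build the NineToothed implementations                              | n
 | `--ccl=[y\|n]`           | Build the InfiniCCL communication library interface implementation | n

 ### Manual Installation

+0. Generate the NineToothed operators (optional)
+
+    See the [Using NineToothed](#using-ninetoothed) section.
+
 1. Project configuration

+   On Windows, building the project with `xmake v2.8.9` is recommended.
   - View the current configuration

     ```shell
@@ -73,6 +81,8 @@ python scripts/install.py [XMAKE_CONFIG_FLAGS]
     ```shell
     # NVIDIA
     # You can specify the CUDA path; the environment variable is usually `CUDA_HOME` or `CUDA_ROOT`
+     # Windows: --cuda="%CUDA_HOME%"
+     # Linux: --cuda=$CUDA_HOME
     xmake f --nv-gpu=true --cuda=$CUDA_HOME -cv

     # Cambricon
@@ -126,62 +136,32 @@ xmake build infiniccl-test
 infiniccl-test --nvidia
 ```

-## Development Guide
+### Using NineToothed

-### Code Formatting
+[NineToothed](https://github.com/InfiniTensor/ninetoothed) is a domain-specific language (DSL) built on Triton that offers higher-level abstractions. Using NineToothed lowers the barrier to operator development and improves development efficiency.

-This project uses the [`scripts/format.py`](/scripts/format.py) script to check and apply code formatting.
+InfiniCore can already integrate operators implemented with NineToothed, but building these implementations is disabled by default. To build the NineToothed implementations in the library, pass `--ninetoothed=y` and complete the following preparation before running the one-click installation script:

-Run
+1. Install NineToothed and the [NineToothed operator library](https://github.com/InfiniTensor/ntops):

 ```shell
-python scripts/format.py -h
+git clone https://github.com/InfiniTensor/ntops.git
+cd ntops
+pip install -e .
 ```

-to view the script's help message:
+Note: installing `ntops` also installs `ninetoothed` as a dependency.

-```plaintext
-usage: format.py [-h] [--ref REF] [--path [PATH ...]] [--check] [--c C] [--py PY]
+2. In the `InfiniCore` directory, run the following command to AOT-compile the NineToothed operators in the library:

-options:
-  -h, --help         show this help message and exit
-  --ref REF          Git reference (commit hash) to compare against.
-  --path [PATH ...]  Files to format or check.
-  --check            Check files without modifying them.
-  --c C              C formatter (default: clang-format-16)
-  --py PY            Python formatter (default: black)
+```shell
+PYTHONPATH=${PYTHONPATH}:src python scripts/build_ntops.py
 ```

-Among the options:
-
-- `ref` and `path` control which files are formatted
-  - If both `ref` and `path` are empty, the currently staged (git added) files are formatted;
-  - Otherwise
-    - If `ref` is set, the given commit is compared with the current code and only modified files are formatted;
-    - If `path` is set, multiple paths can be passed (`--path p0 p1 p2`) and only files under those paths and their subdirectories are formatted;
-- If `--check` is set, the script checks whether the code needs reformatting without modifying any files;
-- `--c` selects the C/C++ formatter (default: `clang-format-16`);
-- `--python` selects the Python formatter `black`;
+Note: if you modify NineToothed-related files and need to rebuild InfiniCore, rerun the command above to regenerate the operators as well.

-### VS Code Development Setup
+3. Follow the instructions above for [One-Click Installation](#one-click-installation) or [Manual Installation](#manual-installation).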

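-For the basic setup, see the [xmake official documentation](https://xmake.io/#/zh-cn/plugin/more_plugins?id=%e9%85%8d%e7%bd%ae-intellsence).
+## How to Contribute

-- TL;DR
-  - clangd
-
-    Open *xmake.lua* and save it once to trigger generation of the compile commands; a *.vscode/compile_commands.json* file is generated automatically in the working directory. Then create *settings.json* in the same folder with the following content: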
-
-    > .vscode/settings.json
-
-    ```json
-    {
-        "clangd.arguments": [
-            "--compile-commands-dir=.vscode"
-        ],
-        "xmake.additionalConfigArguments": [
-            // Configure XMAKE_CONFIG_FLAGS here
-            "--nv-gpu=y"
-        ],
-    }
-    ```
+See the [`InfiniCore Developer Guide`](DEV.md).
--- a/include/infiniop.h
+++ b/include/infiniop.h
@@ -4,15 +4,10 @@
 #include "infiniop/handle.h"
 #include "infiniop/ops/add.h"
 #include "infiniop/ops/attention.h"
-#include "infiniop/ops/avg_pool.h"
 #include "infiniop/ops/causal_softmax.h"
 #include "infiniop/ops/clip.h"
 #include "infiniop/ops/conv.h"
-#include "infiniop/ops/expand.h"
 #include "infiniop/ops/gemm.h"
-#include "infiniop/ops/global_avg_pool.h"
-#include "infiniop/ops/max_pool.h"
-#include "infiniop/ops/mlp.h"
 #include "infiniop/ops/mul.h"
 #include "infiniop/ops/random_sample.h"
 #include "infiniop/ops/rearrange.h"

--- a/include/infiniop/ops/avg_pool.h
+++ b/include/infiniop/ops/avg_pool.h
-#ifndef __INFINIOP_AVG_POOL_API_H__
-#define __INFINIOP_AVG_POOL_API_H__
-
-#include "../operator_descriptor.h"
-
-typedef struct InfiniopDescriptor *infiniopAvgPoolDescriptor_t;
-
-__C __export infiniStatus_t infiniopCreateAvgPoolDescriptor(infiniopHandle_t handle,
-                                                            infiniopAvgPoolDescriptor_t *desc_ptr,
-                                                            infiniopTensorDescriptor_t y,
-                                                            infiniopTensorDescriptor_t x,
-                                                            size_t const *kernel_shape,
-                                                            size_t const *pads,
-                                                            ptrdiff_t const *strides,
-                                                            size_t n);
-
-__C __export infiniStatus_t infiniopGetAvgPoolWorkspaceSize(infiniopAvgPoolDescriptor_t desc, size_t *size);
-
-__C __export infiniStatus_t infiniopAvgPool(infiniopAvgPoolDescriptor_t desc,
-                                            void *workspace, size_t workspace_size,
-                                            void *y, void const *x, void *stream);
-
-__C __export infiniStatus_t infiniopDestroyAvgPoolDescriptor(infiniopAvgPoolDescriptor_t desc);
-#endif
--- a/include/infiniop/ops/conv.h
+++ b/include/infiniop/ops/conv.h
@@ -7,9 +7,10 @@ typedef struct InfiniopDescriptor *infiniopConvDescriptor_t;

 __C __export infiniStatus_t infiniopCreateConvDescriptor(infiniopHandle_t handle,
                                                         infiniopConvDescriptor_t *desc_ptr,
-                                                         infiniopTensorDescriptor_t y,
-                                                         infiniopTensorDescriptor_t x,
-                                                         infiniopTensorDescriptor_t w,
+                                                         infiniopTensorDescriptor_t y_desc,
+                                                         infiniopTensorDescriptor_t x_desc,
+                                                         infiniopTensorDescriptor_t w_desc,
+                                                         infiniopTensorDescriptor_t b_desc,
                                                         void *pads,
                                                         void *strides,
                                                         void *dilations,
@@ -17,7 +18,7 @@ __C __export infiniStatus_t infiniopCreateConvDescriptor(infiniopHandle_t handle

 __C __export infiniStatus_t infiniopGetConvWorkspaceSize(infiniopConvDescriptor_t desc, size_t *size);

-__C __export infiniStatus_t infiniopConv(infiniopConvDescriptor_t desc, void *workspace, size_t workspace_size, void *y, void const *x, void const *w, void *stream);
+__C __export infiniStatus_t infiniopConv(infiniopConvDescriptor_t desc, void *workspace, size_t workspace_size, void *y, const void *x, const void *w, const void *bias, void *stream);

 __C __export infiniStatus_t infiniopDestroyConvDescriptor(infiniopConvDescriptor_t desc);


--- a/include/infiniop/ops/expand.h
+++ b/include/infiniop/ops/expand.h
-#ifndef __INFINIOP_EXPAND_API_H__
-#define __INFINIOP_EXPAND_API_H__
-
-#include "../operator_descriptor.h"
-
-typedef struct InfiniopDescriptor *infiniopExpandDescriptor_t;
-
-__C __export infiniStatus_t infiniopCreateExpandDescriptor(infiniopHandle_t handle,
-                                                           infiniopExpandDescriptor_t *desc_ptr,
-                                                           infiniopTensorDescriptor_t y,
-                                                           infiniopTensorDescriptor_t x);
-
-__C __export infiniStatus_t infiniopExpand(infiniopExpandDescriptor_t desc,
-                                           void *y,
-                                           void const *x,
-                                           void *stream);
-
-__C __export infiniStatus_t infiniopDestroyExpandDescriptor(infiniopExpandDescriptor_t desc);
-
-#endif
--- a/include/infiniop/ops/global_avg_pool.h
+++ b/include/infiniop/ops/global_avg_pool.h
-#ifndef __INFINIOP_GLOBAL_AVG_POOL_API_H__
-#define __INFINIOP_GLOBAL_AVG_POOL_API_H__
-
-#include "../operator_descriptor.h"
-
-typedef struct InfiniopDescriptor *infiniopGlobalAvgPoolDescriptor_t;
-
-__C __export infiniStatus_t infiniopCreateGlobalAvgPoolDescriptor(infiniopHandle_t handle,
-                                                                  infiniopGlobalAvgPoolDescriptor_t *desc_ptr,
-                                                                  infiniopTensorDescriptor_t y,
-                                                                  infiniopTensorDescriptor_t x);
-
-__C __export infiniStatus_t infiniopGetGlobalAvgPoolWorkspaceSize(infiniopGlobalAvgPoolDescriptor_t desc, size_t *size);
-
-__C __export infiniStatus_t infiniopGlobalAvgPool(infiniopGlobalAvgPoolDescriptor_t desc,
-                                                  void *workspace, size_t workspace_size,
-                                                  void *y, void const *x, void *stream);
-
-__C __export infiniStatus_t infiniopDestroyGlobalAvgPoolDescriptor(infiniopGlobalAvgPoolDescriptor_t desc);
-
-#endif
--- a/include/infiniop/ops/max_pool.h
+++ b/include/infiniop/ops/max_pool.h
-#ifndef __INFINIOP_MAX_POOL_API_H__
-#define __INFINIOP_MAX_POOL_API_H__
-
-#include "../operator_descriptor.h"
-
-typedef struct InfiniopDescriptor *infiniopMaxPoolDescriptor_t;
-
-__C __export infiniStatus_t infiniopCreateMaxPoolDescriptor(infiniopHandle_t handle,
-                                                            infiniopMaxPoolDescriptor_t *desc_ptr,
-                                                            infiniopTensorDescriptor_t y,
-                                                            infiniopTensorDescriptor_t x,
-                                                            size_t const *kernel_shape,
-                                                            size_t const *pads,
-                                                            ptrdiff_t const *strides,
-                                                            size_t n);
-
-__C __export infiniStatus_t infiniopGetMaxPoolWorkspaceSize(infiniopMaxPoolDescriptor_t desc, size_t *size);
-
-__C __export infiniStatus_t infiniopMaxPool(infiniopMaxPoolDescriptor_t desc,
-                                            void *workspace, size_t workspace_size,
-                                            void *y, void const *x, void *stream);
-
-__C __export infiniStatus_t infiniopDestroyMaxPoolDescriptor(infiniopMaxPoolDescriptor_t desc);
-#endif
--- a/include/infiniop/ops/mlp.h
+++ b/include/infiniop/ops/mlp.h
-#ifndef __INFINIOP_MLP_API_H__
-#define __INFINIOP_MLP_API_H__
-
-#include "../operator_descriptor.h"
-#include "gemm.h"
-#include "swiglu.h"
-
-typedef struct InfiniopDescriptor *infiniopMLPDescriptor_t;
-
-__C __export infiniStatus_t infiniopCreateMLPDescriptor(infiniopHandle_t handle,
-                                                        infiniopMLPDescriptor_t *desc_ptr,
-                                                        infiniopTensorDescriptor_t y_desc,
-                                                        infiniopTensorDescriptor_t x_desc,
-                                                        infiniopTensorDescriptor_t w12_desc,
-                                                        infiniopTensorDescriptor_t w3_desc,
-                                                        float alpha,
-                                                        char residual);
-
-__C __export infiniStatus_t infiniopGetMLPWorkspaceSize(infiniopMLPDescriptor_t desc, size_t *size);
-
-__C __export infiniStatus_t infiniopMLP(infiniopMLPDescriptor_t desc,
-                                        void *workspace,
-                                        size_t workspace_size,
-                                        void *y,
-                                        const void *x,
-                                        const void *w12,
-                                        const void *w3,
-                                        void *stream);
-
-__C __export infiniStatus_t infiniopDestroyMLPDescriptor(infiniopMLPDescriptor_t desc);
-#endif
--- a/include/infiniop/ops/relu.h
+++ b/include/infiniop/ops/relu.h
@@ -11,8 +11,10 @@ __C __export infiniStatus_t infiniopCreateReluDescriptor(infiniopHandle_t handle
                                                         infiniopTensorDescriptor_t x);

 __C __export infiniStatus_t infiniopRelu(infiniopReluDescriptor_t desc,
+                                         void *workspace,
+                                         size_t workspace_size,
                                         void *y,
-                                         void const *x,
+                                         const void *x,
                                         void *stream);

 __C __export infiniStatus_t infiniopDestroyReluDescriptor(infiniopReluDescriptor_t desc);

--- a/scripts/build_ntops.py
+++ b/scripts/build_ntops.py
+import importlib
+import pathlib
+
+from infiniop.ninetoothed.build import BUILD_DIRECTORY_PATH
+
+CURRENT_FILE_PATH = pathlib.Path(__file__)
+
+SRC_DIR_PATH = CURRENT_FILE_PATH.parent.parent / "src"
+
+
+def _find_and_build_ops():
+    ops_path = SRC_DIR_PATH / "infiniop" / "ops"
+
+    for op_dir in ops_path.iterdir():
+        ninetoothed_path = op_dir / "ninetoothed"
+
+        if ninetoothed_path.is_dir():
+            module_path = ninetoothed_path / "build"
+            relative_path = module_path.relative_to(SRC_DIR_PATH)
+            import_name = ".".join(relative_path.parts)
+            module = importlib.import_module(import_name)
+
+            module.build()
+
+
+if __name__ == "__main__":
+    BUILD_DIRECTORY_PATH.mkdir(parents=True, exist_ok=True)
+
+    _find_and_build_ops()
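A small sketch of how `_find_and_build_ops` above turns a directory under `src/` into a dotted module name. It also suggests why the README command puts `src` on `PYTHONPATH`: the resulting name starts at `infiniop`. The repository root `/repo` and the operator name `relu` are placeholders for illustration; whether any given operator actually ships a `ninetoothed/build.py` is not shown in this diff.

```python
from pathlib import PurePosixPath

# Placeholder location of the src/ directory (assumption for illustration).
SRC_DIR_PATH = PurePosixPath("/repo/src")


def import_name_for(op_name):
    """Mirror the path-to-module-name step used in scripts/build_ntops.py."""
    module_path = SRC_DIR_PATH / "infiniop" / "ops" / op_name / "ninetoothed" / "build"
    # Strip the src/ prefix, then join the remaining path parts with dots.
    relative_path = module_path.relative_to(SRC_DIR_PATH)
    return ".".join(relative_path.parts)


print(import_name_for("relu"))  # infiniop.ops.relu.ninetoothed.build
```

`importlib.import_module` can resolve such a name only if the directory containing `infiniop/` is on the module search path.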
--- a/scripts/format.py
+++ b/scripts/format.py
@@ -62,7 +62,7 @@ def format_file(file: Path, check: bool, formatter) -> bool:
                    text=True,
                    check=True,
                )
-                if process.stderr:
+                if process.returncode != 0:
                    print(f"{Fore.YELLOW}{file} is not formatted.{Style.RESET_ALL}")
                    print(
                        f"Use {Fore.CYAN}{formatter} {file}{Style.RESET_ALL} to format it."
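A note on the `returncode` check in this hunk. The context lines show `check=True` being passed to a call that is itself outside the hunk; assuming that call is `subprocess.run`, a non-zero exit status raises `subprocess.CalledProcessError` before the `if process.returncode != 0` test can run. The sketch below demonstrates that standard-library behaviour with a throwaway child process; `run_and_report` is a hypothetical name and is not taken from `scripts/format.py`.

```python
import subprocess
import sys


def run_and_report(check):
    """Run a child process that exits with status 1 and report what happened."""
    cmd = [sys.executable, "-c", "import sys; sys.exit(1)"]
    try:
        process = subprocess.run(cmd, capture_output=True, text=True, check=check)
    except subprocess.CalledProcessError as err:
        # With check=True a non-zero exit never reaches the code after run().
        return ("raised", err.returncode)
    return ("returned", process.returncode)


print(run_and_report(check=True))
print(run_and_report(check=False))
```

If the surrounding code does not catch `CalledProcessError`, testing `returncode` only has an effect when `check=False`.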

--- a/scripts/python_test.py
+++ b/scripts/python_test.py
@@ -13,16 +13,17 @@ def run_tests(args):
    failed = []
    for test in [
        "add.py",
+        "attention.py",
+        "causal_softmax.py",
+        "clip.py",
        "gemm.py",
+        "mul.py",
        "random_sample.py",
+        "rearrange.py",
        "rms_norm.py",
        "rope.py",
        "sub.py",
        "swiglu.py",
-        "attention.py",
-        "causal_softmax.py",
-        "rearrange.py",
-        "mul.py"
    ]:
        result = subprocess.run(
            f"python {test} {args} --debug", text=True, encoding="utf-8", shell=True

--- a/src/infiniccl-test/infiniccl_test.cpp
+++ b/src/infiniccl-test/infiniccl_test.cpp
@@ -10,7 +10,7 @@
 #define TEST_INFINI(API__) CHECK_API_OR(API__, INFINI_STATUS_SUCCESS, return 1)
 #define TEST_INFINI_THREAD(API__) CHECK_API_OR(API__, INFINI_STATUS_SUCCESS, return nullptr)

-const size_t MAX_COUNT = 100ULL * 1024 * 1024;
+const size_t MAX_COUNT = 8ULL * 1024 * 1024;

 const size_t TEST_COUNTS[] = {
    128,

--- a/src/infiniccl/cuda/infiniccl_cuda.h
+++ b/src/infiniccl/cuda/infiniccl_cuda.h
@@ -4,7 +4,7 @@
 #include "../infiniccl_impl.h"

 // Windows does not support CUDA
-#if defined(ENABLE_CUDA_API) && defined(ENABLE_CCL) && !defined(_WIN32)
+#if (defined(ENABLE_NVIDIA_API) || defined(ENABLE_ILUVATAR_API)) && defined(ENABLE_CCL) && !defined(_WIN32)
 INFINICCL_DEVICE_API_IMPL(cuda)
 #else
 INFINICCL_DEVICE_API_NOOP(cuda)

--- a/src/infiniccl/infiniccl.cc
+++ b/src/infiniccl/infiniccl.cc
@@ -3,7 +3,7 @@
 #include "./ascend/infiniccl_ascend.h"
 #include "./cambricon/infiniccl_cambricon.h"
 #include "./cuda/infiniccl_cuda.h"
-#include "./maca/infiniccl_maca.h"
+#include "./metax/infiniccl_metax.h"

 __C infiniStatus_t infinicclCommInitAll(
    infiniDevice_t device_type,
@@ -13,13 +13,14 @@ __C infiniStatus_t infinicclCommInitAll(

 #define COMM_INIT_ALL(CASE_, NAMESPACE_) \
    case CASE_:                          \
-        return infiniccl::NAMESPACE_::commInitAll(comms, ndevice, device_ids);
+        return infiniccl::NAMESPACE_::commInitAll(comms, ndevice, device_ids)

    switch (device_type) {
-        COMM_INIT_ALL(INFINI_DEVICE_NVIDIA, cuda)
-        COMM_INIT_ALL(INFINI_DEVICE_ASCEND, ascend)
-        COMM_INIT_ALL(INFINI_DEVICE_CAMBRICON, cambricon)
-        COMM_INIT_ALL(INFINI_DEVICE_METAX, maca)
+        COMM_INIT_ALL(INFINI_DEVICE_NVIDIA, cuda);
+        COMM_INIT_ALL(INFINI_DEVICE_ILUVATAR, cuda);
+        COMM_INIT_ALL(INFINI_DEVICE_ASCEND, ascend);
+        COMM_INIT_ALL(INFINI_DEVICE_CAMBRICON, cambricon);
+        COMM_INIT_ALL(INFINI_DEVICE_METAX, metax);
    default:
        return INFINI_STATUS_DEVICE_TYPE_NOT_SUPPORTED;
    }
@@ -34,13 +35,14 @@ __C infiniStatus_t infinicclCommDestroy(infinicclComm_t comm) {

 #define COMM_DESTROY(CASE_, NAMESPACE_) \
    case CASE_:                         \
-        return infiniccl::NAMESPACE_::commDestroy(comm);
+        return infiniccl::NAMESPACE_::commDestroy(comm)

    switch (comm->device_type) {
-        COMM_DESTROY(INFINI_DEVICE_NVIDIA, cuda)
-        COMM_DESTROY(INFINI_DEVICE_ASCEND, ascend)
-        COMM_DESTROY(INFINI_DEVICE_CAMBRICON, cambricon)
-        COMM_DESTROY(INFINI_DEVICE_METAX, maca)
+        COMM_DESTROY(INFINI_DEVICE_NVIDIA, cuda);
+        COMM_DESTROY(INFINI_DEVICE_ILUVATAR, cuda);
+        COMM_DESTROY(INFINI_DEVICE_ASCEND, ascend);
+        COMM_DESTROY(INFINI_DEVICE_CAMBRICON, cambricon);
+        COMM_DESTROY(INFINI_DEVICE_METAX, metax);

    default:
        return INFINI_STATUS_DEVICE_TYPE_NOT_SUPPORTED;
@@ -63,13 +65,14 @@ __C infiniStatus_t infinicclAllReduce(

 #define ALL_REDUCE(CASE_, NAMESPACE_) \
    case CASE_:                       \
-        return infiniccl::NAMESPACE_::allReduce(sendbuf, recvbuf, count, dataype, op, comm, stream);
+        return infiniccl::NAMESPACE_::allReduce(sendbuf, recvbuf, count, dataype, op, comm, stream)

    switch (comm->device_type) {
-        ALL_REDUCE(INFINI_DEVICE_NVIDIA, cuda)
-        ALL_REDUCE(INFINI_DEVICE_ASCEND, ascend)
-        ALL_REDUCE(INFINI_DEVICE_CAMBRICON, cambricon)
-        ALL_REDUCE(INFINI_DEVICE_METAX, maca)
+        ALL_REDUCE(INFINI_DEVICE_NVIDIA, cuda);
+        ALL_REDUCE(INFINI_DEVICE_ILUVATAR, cuda);
+        ALL_REDUCE(INFINI_DEVICE_ASCEND, ascend);
+        ALL_REDUCE(INFINI_DEVICE_CAMBRICON, cambricon);
+        ALL_REDUCE(INFINI_DEVICE_METAX, metax);

    default:
        return INFINI_STATUS_DEVICE_TYPE_NOT_SUPPORTED;

--- a/src/infiniccl/maca/infiniccl_maca.cc
+++ b/src/infiniccl/maca/infiniccl_maca.cc
-#include "infiniccl_maca.h"
+#include "infiniccl_metax.h"

 #include "../../utils.h"

@@ -51,7 +51,7 @@ inline hcclComm_t getHcclComm(infinicclComm_t comm) {
    return static_cast<hcclComm_t>(comm->comm);
 }

-namespace infiniccl::maca {
+namespace infiniccl::metax {

 infiniStatus_t commInitAll(
    infinicclComm_t *comms,
@@ -92,4 +92,4 @@ infiniStatus_t allReduce(

    return INFINI_STATUS_SUCCESS;
 }
-} // namespace infiniccl::maca
+} // namespace infiniccl::metax
--- a/src/infiniccl/maca/infiniccl_maca.h
+++ b/src/infiniccl/maca/infiniccl_maca.h
-#ifndef INFINICCL_MACA_H_
-#define INFINICCL_MACA_H_
+#ifndef INFINICCL_METAX_H_
+#define INFINICCL_METAX_H_

 #include "../infiniccl_impl.h"

 #if defined(ENABLE_METAX_API) && defined(ENABLE_CCL)
-INFINICCL_DEVICE_API_IMPL(maca)
+INFINICCL_DEVICE_API_IMPL(metax)
 #else
-INFINICCL_DEVICE_API_NOOP(maca)
+INFINICCL_DEVICE_API_NOOP(metax)
 #endif

-#endif /* INFINICCL_MACA_H_ */
+#endif /* INFINICCL_METAX_H_ */
--- a/src/infiniop-test/include/gguf.hpp
+++ b/src/infiniop-test/include/gguf.hpp
@@ -141,10 +141,8 @@ typedef enum {

 inline size_t ggmlTypeSize(GGML_TYPE ggml_type) {
    switch (ggml_type) {
-    case GGML_TYPE_F32:
-        return 4;
-    case GGML_TYPE_F16:
-        return 2;
+    case GGML_TYPE_Q8_K:
+        return 1;
    case GGML_TYPE_I8:
        return 1;
    case GGML_TYPE_I16:
@@ -153,10 +151,14 @@ inline size_t ggmlTypeSize(GGML_TYPE ggml_type) {
        return 4;
    case GGML_TYPE_I64:
        return 8;
-    case GGML_TYPE_F64:
-        return 8;
    case GGML_TYPE_BF16:
        return 2;
+    case GGML_TYPE_F16:
+        return 2;
+    case GGML_TYPE_F32:
+        return 4;
+    case GGML_TYPE_F64:
+        return 8;
    default:
        throw std::runtime_error("GGML_TYPE_SIZE: Unsupported GGML_TYPE");
    }