init

Commit c0705977 authored Apr 08, 2026 by wangkaixiong 🚴🏼
20 changed files
--- a/docs/DCU/docs/build/html/_images/disable_net.png
+++ b/docs/DCU/docs/build/html/_images/disable_net.png
--- a/docs/DCU/docs/build/html/_images/hy_smi.png
+++ b/docs/DCU/docs/build/html/_images/hy_smi.png
--- a/docs/DCU/docs/build/html/_images/render.png
+++ b/docs/DCU/docs/build/html/_images/render.png
--- a/docs/DCU/docs/build/html/_sources/Anaconda_Docker.md.txt
+++ b/docs/DCU/docs/build/html/_sources/Anaconda_Docker.md.txt
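+# 1 Using DCU with Anaconda: an example
+## 1.1. Install Anaconda
+   [Anaconda download page](https://www.anaconda.com/download)
+## 1.2. Run ResNet50 classification inference with PyTorch on DCU
+### 1.2.1. Create a virtual environment and point pip at a mainland China mirror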
+```bash
+conda create -n dcu_test python=3.10
+conda activate dcu_test
+pip3 config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
+```
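+### 1.2.2. Download torch and torchvision from the developer community
+[**DAS ecosystem package downloads**](https://cancon.hpccube.com:65024/4/main/)
+1. Download the torch and torchvision `.whl` files to your local machine.
+2. Install them: `pip install *.whl`.
+3. Verify that `torch` installed correctly: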
+    ```shell
+    python -c "import torch;print(torch.cuda.is_available());print(torch.cuda.device_count())"
+    ```
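The one-line check above can also be written as a small script that reports a missing installation instead of raising an exception. This is only a minimal sketch: the function name `check_torch` and the keys of the returned dictionary are illustrative choices, not part of any DCU tooling.

```python
import importlib.util


def check_torch():
    """Report whether torch is importable and whether it sees any device."""
    # find_spec returns None when the package is not installed.
    if importlib.util.find_spec("torch") is None:
        return {"installed": False, "available": False, "device_count": 0}

    import torch  # imported lazily so the script still runs without torch

    return {
        "installed": True,
        # Same two calls as the one-line check above.
        "available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    print(check_torch())
```

If `installed` is true but `available` is false, the wheel is probably not the DCU build or the driver is not visible to the process; that reading is an assumption to confirm against your own environment.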
+### 1.2.3. Run the ResNet50 classification inference code
+```shell
+git clone http://developer.hpccube.com/codes/wangkx1/torch_inference_resnet50.git
+cd torch_inference_resnet50
+python torch_verify.py
+```
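+# 2 Using DCU with Docker
+Overview of the 光源 (sourcefind.cn) images from the DCU developer community:
+[https://sourcefind.cn/#/service-list](https://sourcefind.cn/#/service-list)
+光源 lists deep learning base images built on several DTK versions, images for LLM inference frameworks (vllm, lmdeploy, fastllm, etc.), and images for general-purpose model inference frameworks (migraphx, AITemplate, etc.).
+## 2.1. Install Docker
+Install Docker yourself, following the instructions for your operating system version.
+To check the current operating system version: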
+```bash
+cat /etc/os-release
+```
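+## 2.2. Start a container
+A container created from one of these images provides a ready-to-use DCU deep learning runtime environment:
+### 2.2.1. Prerequisites
+1. A DCU accelerator card is installed, along with its driver.
+2. Docker is installed correctly.
+### 2.2.2. Pull the image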
+```bash
+docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.0-ubuntu20.04-dtk24.04.1-py3.8
+```
+### 2.2.3. Command to start the container
+```bash
+docker run -it \
+--network=host \
+--ipc=host \
+--shm-size=16G \
+--device=/dev/kfd \
+--device=/dev/mkfd \
+--device=/dev/dri \
+-v /opt/hyhal:/opt/hyhal \
+-v your_path:/workspace \
+--group-add video \
+--cap-add=SYS_PTRACE \
+--security-opt seccomp=unconfined \
+--name=dcu_test \
+image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.0-ubuntu20.04-dtk24.04.1-py3.8 \
+/bin/bash
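+```
+Notes:
+1. If you see errors related to libhsa-runtime, add `-v /opt/hyhal:/opt/hyhal` to the startup arguments. If the host has no /opt/hyhal, download hyhal and extract it under /opt/ in the container.
+2. Parameter reference:
+```bash
+-it  # i: keep the container's stdin open; t: allocate a pseudo-terminal
+--network=host  # network mode (none | host | custom network ...)
+--ipc=host  # IPC mode (none | shareable | host ...)
+--shm-size=16G  # size of /dev/shm
+--device=/dev/kfd  # devices to expose (DCU needs /dev/kfd, /dev/mkfd, and /dev/dri)
+--device=/dev/mkfd
+--device=/dev/dri
+-v /opt/hyhal:/opt/hyhal  # images based on dtk23.10 or later need the host directory /opt/hyhal mounted with -v
+-v your_path:/workspace  # mount the working directory
+--group-add video  # add a supplementary group (needed for non-root users to use the DCU)
+--cap-add=SYS_PTRACE  # add a Linux capability (SYS_PTRACE | NET_ADMIN ...)
+--security-opt seccomp=unconfined  # security options (seccomp=unconfined | label=disable ...)
+--name=dcu_test  # container name
+image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.0-ubuntu20.04-dtk24.04.1-py3.8  # image to run
+/bin/bash  # start bash inside the container
+```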
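When the same container has to be started on several machines, it can help to assemble the `docker run` arguments from section 2.2.3 in code rather than retyping them. The sketch below only builds and prints the command. The helper name `build_docker_run`, its parameters, and the placeholder path `/path/to/your_path` are illustrative assumptions, not part of Docker or the DCU tooling.

```python
import shlex


def build_docker_run(image, workdir, name="dcu_test", shm_size="16G"):
    """Assemble the `docker run` argument list used in section 2.2.3."""
    args = [
        "docker", "run", "-it",
        "--network=host",
        "--ipc=host",
        f"--shm-size={shm_size}",
    ]
    # DCU device nodes that must be visible inside the container.
    for dev in ("/dev/kfd", "/dev/mkfd", "/dev/dri"):
        args.append(f"--device={dev}")
    args += [
        "-v", "/opt/hyhal:/opt/hyhal",
        "-v", f"{workdir}:/workspace",
        "--group-add", "video",
        "--cap-add=SYS_PTRACE",
        "--security-opt", "seccomp=unconfined",
        f"--name={name}",
        image,
        "/bin/bash",
    ]
    return args


if __name__ == "__main__":
    image = "image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.0-ubuntu20.04-dtk24.04.1-py3.8"
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_docker_run(image, "/path/to/your_path")))
```

Returning a list rather than a single string keeps each argument separate, so the result can be passed to `subprocess.run` without shell quoting problems.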
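+## 2.3. Run the ResNet50 classification inference code in the container
+### 2.3.1. Verify that `torch` installed correctly
+```shell
+python -c "import torch;print(torch.cuda.is_available());print(torch.cuda.device_count())"
+```
+If it is not available, download and install torch and any other deep learning packages you need from the [developer community](https://cancon.hpccube.com:65024/4/main/):
+1. Download the torch and torchvision `.whl` files to your local machine.
+2. Install them: `pip install *.whl`.
+### 2.3.2. Run the ResNet50 classification inference code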
+```shell
+git clone http://developer.hpccube.com/codes/wangkx1/torch_inference_resnet50.git
+cd torch_inference_resnet50
+python torch_verify.py
+```
--- a/docs/DCU/docs/build/html/_sources/Hy-SMI.md.txt
+++ b/docs/DCU/docs/build/html/_sources/Hy-SMI.md.txt
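+# hy-smi usage guide
+## Understanding hy-smi output
+Running `hy-smi` in a terminal produces output like the following:
+![hy-smi output](./imgs/hy_smi.png)
+What each column means:
+- DCU: the card index (0-7)
+- Temp: current operating temperature of the DCU card
+- AvgPwr: average power draw
+- Perf: current performance mode
+- PwrCap: rated power
+- VRAM%: VRAM (device memory) utilization
+- DCU%: compute core utilization
+- Mode: the default is `Normal`; other modes are not recommended because they reduce performance.
+## Common usage
+- Show the card's product name: `hy-smi --showproductname`
+- Show the resources used by processes currently running on the DCU cards: `hy-smi --showpids`
+- Show the DCU resources used by a specific process: `hy-smi --showpiddcus`
+- Show the driver version: `hy-smi --showdriverversion`
+- Show the `vbios` version: `hy-smi -v`
+- Show detailed VRAM usage: `hy-smi --showmeminfo vram`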
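For scripted health checks, the commands listed above can be called from Python. The sketch below is a thin wrapper under stated assumptions: the helper name `run_hy_smi` is hypothetical, it returns the raw text output without parsing it, and it returns `None` when `hy-smi` is not on `PATH`.

```python
import shutil
import subprocess


def run_hy_smi(*flags, timeout=30):
    """Run hy-smi with the given flags; return its stdout, or None if unavailable."""
    exe = shutil.which("hy-smi")
    if exe is None:
        # hy-smi is not on PATH (for example, the driver is not installed).
        return None
    try:
        result = subprocess.run(
            [exe, *flags],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout


if __name__ == "__main__":
    # Only flags that appear in the list above are used here.
    for flags in (["--showproductname"], ["--showdriverversion"], ["--showmeminfo", "vram"]):
        print(run_hy_smi(*flags))
```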
+## More usage
+For more options, run `hy-smi -h`.
\ No newline at end of file
--- a/docs/DCU/docs/build/html/_sources/NV_GPU_TO_DCU.md.txt
+++ b/docs/DCU/docs/build/html/_sources/NV_GPU_TO_DCU.md.txt
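+# 1 Migrating from NVIDIA GPUs to DCU
+## 1.1. Set up the DCU base environment
+Follow Part 1, `Set up the DCU base environment`, to complete the DCU base environment setup.
+## 1.2. Replace deep learning packages
+Deep learning packages that depend on CUDA must be replaced with the builds from the 光合 developer community.
+Developer community: [https://developer.hpccube.com/tool](https://developer.hpccube.com/tool)
+![AI ecosystem package download page](./imgs/das.png)
+<!-- <center><img src="./imgs/das.png" alt="AI ecosystem package download page" style="zoom:50%;" /></center> -->
+**Manually download the `.whl` file for each required package to your server, then install the wheel:**
+Steps:
+> Note: when replacing a package, its version must match your DTK version.
+1. Configure pip to use a mainland China mirror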
+```bash
+pip3 config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
+pip install pip -U
+```
+2. Install the wheel
+```bash
+pip install ***.whl 
+```
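+## 1.3. DCU adaptation examples
+**Overview of the 光源 ModelZoo from the DCU developer community** (look up the DCU model you need, build the DCU environment from its README, and run the model with one command):
+[https://sourcefind.cn/#/model-zoo/list](https://sourcefind.cn/#/model-zoo/list)
+光源 lists models and frameworks from many AI domains that have been adapted to DCU, such as bert, yolo, resnet, Qwen, Llama, vllm, and lmdeploy.
+**Examples:**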
+- [YOLOv10](https://sourcefind.cn/#/model-zoo/1802637886774013954)
+- [Rep_Vit](https://sourcefind.cn/#/model-zoo/1805170476575846402)
+- [alphafold2_jax](https://sourcefind.cn/#/model-zoo/1712346117256200194)
+- [llama3](https://sourcefind.cn/#/model-zoo/1782218524112154626)
+- [stablediffusion_v2.1](https://sourcefind.cn/#/model-zoo/1793173002231443458)
+- [Qwen1.5](https://sourcefind.cn/#/model-zoo/1793160576505180161)
\ No newline at end of file
--- a/docs/DCU/docs/build/html/_sources/download.md.txt
+++ b/docs/DCU/docs/build/html/_sources/download.md.txt
+## Download center
+- [**Driver downloads**](https://cancon.hpccube.com:65024/6/main) → latest driver → rock-xxx-xxx.aio.run
+- [**DTK downloads**](https://cancon.hpccube.com:65024/1/main)  → latest → your operating system → DTK-version-OS-version-x86_64.tar.gz
+- [**Toolkit downloads (DCU passthrough, Kubernetes plugin, HyQual stress test, toolkit documentation)**](https://cancon.hpccube.com:65024/5/main)
+- [**DAS ecosystem package downloads**](https://cancon.hpccube.com:65024/4/main/)
+- [**光源 home page**](https://sourcefind.cn/#/main-page)
\ No newline at end of file
--- a/docs/DCU/docs/build/html/_sources/faq_cuda_hip.md.txt
+++ b/docs/DCU/docs/build/html/_sources/faq_cuda_hip.md.txt
+FAQ: common problems and fixes when porting CUDA and HIP code
+### Problem 1: texture memory errors
+    /data/wkx/develop/ppl.cv/src/ppl/cv/cuda/warp.hpp:31:8: error: no template named 'texture'
+    static texture<float4, cudaTextureType2D,
+           ^
+#### **Solution:**
+**The legacy CUDA `texture` reference type is no longer available in CUDA 12 and later; it was deprecated in earlier releases and then removed.** Older CUDA versions (such as CUDA 10 or 11) accept global declarations of the form `texture<T, ...>`, but **CUDA 12+ no longer supports this syntax**: texture objects must instead be created with **`cudaTextureObject_t` plus `cudaResourceDesc`/`cudaTextureDesc`**.
+Compiling the legacy code with the DCU cuda-11.8 toolchain works without source changes.
+### Problem 2: launch bounds (256) error
+    Launch params (1024, 1, 1) are larger than launch bounds (256) for kernel _ZL12rms_norm_f32ILi1024EEvPKfPfif please add launch_bounds to kernel define or use --gpu-max-threads-per-block recompile program !
+#### Solution:
+- Option 1: replace `__global__` on every kernel with `__global__ __launch_bounds__(1024)`.
+- Option 2: add `--gpu-max-threads-per-block=1024` to the nvcc or hip compile options.
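The first fix above (adding `__launch_bounds__(1024)` to every kernel) is tedious to apply by hand in a large code base. The sketch below does it with a plain text substitution. It assumes every kernel is introduced by the literal token `__global__`; it does not parse C++, so it also rewrites occurrences in comments or strings, and the name `add_launch_bounds` is illustrative.

```python
import re

# Match `__global__` only when it is not already followed by `__launch_bounds__`.
_GLOBAL = re.compile(r"\b__global__\b(?!\s*__launch_bounds__)")


def add_launch_bounds(source, max_threads=1024):
    """Return `source` with `__launch_bounds__(max_threads)` added to each kernel."""
    return _GLOBAL.sub(f"__global__ __launch_bounds__({max_threads})", source)


if __name__ == "__main__":
    sample = "__global__ void rms_norm(const float *x, float *y) {}\n"
    print(add_launch_bounds(sample), end="")
```

Kernels that already carry a `__launch_bounds__` attribute are left untouched, so running the substitution twice gives the same result as running it once.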
+### Problem 3: inline assembly (asm) code fails to compile
+![4XDLESZEAAQE6](media/image1.png)
+#### Solution:
+Inline PTX support must be enabled explicitly by adding the `-fnline-asm-ptx` option.
+![LMVLGSZEABAFG](media/image2.png)
+### Problem 4: a CUDA application built without source conversion cannot find the math.h header
+![CISLKSZEACABE](media/image3.png)
+#### Solution:
+The `-isystem /usr/include` flag added by the cmake build conflicts with the nvcc compiler.
+Turn on verbose output with `make VERBOSE=1 <project>` to inspect the full header and library dependencies during compilation, then remove `-isystem /usr/include`; the build then succeeds.
+### Problem 5: the open-source pycuda cannot compile .cu files
+#### Solution:
+Following the reference below, modify compiler.py so that it compiles with hip:
+[https://ontrack.hygon.cn/browse/CSD-10705](https://ontrack.hygon.cn/browse/CSD-10705)
+### Problem 6: how to convert all the .cu code in a folder
+For details, see:
+![ppt](media/image4.png)[DCU应用移植介绍-程顺延](https://www.kdocs.cn/l/cmD2M59DD2vk)
+#### Solution:
+1.  Run `hipconvertinplace-perl.sh <CUDA source folder>`.
+2.  After conversion, the original files from the CUDA folder are kept in the current directory in the form `org-name.h/cu.prehip`.
+3.  Because the code will be compiled with hip, rename every `.cu` suffix to `.hip` or `.cpp`.
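The final renaming step can be scripted. The sketch below walks a directory tree and renames every `*.cu` file to `.hip` (or `.cpp`). The function name `rename_cu_files` is an illustrative assumption. It skips a file when the target name already exists, and the `.prehip` backups are not touched because they do not end in `.cu`.

```python
from pathlib import Path


def rename_cu_files(root, new_suffix=".hip"):
    """Rename every *.cu file under `root` to use `new_suffix`; return the new paths."""
    if new_suffix not in (".hip", ".cpp"):
        raise ValueError("new_suffix must be '.hip' or '.cpp'")
    renamed = []
    # Materialize the listing first so renames do not disturb the traversal.
    for path in sorted(Path(root).rglob("*.cu")):
        if not path.is_file():
            continue
        target = path.with_suffix(new_suffix)
        if target.exists():
            # Leave the file alone rather than overwrite an existing one.
            continue
        path.rename(target)
        renamed.append(target)
    return renamed


if __name__ == "__main__":
    for new_path in rename_cu_files("."):
        print(new_path)
```

Build files (CMakeLists.txt, Makefiles) that list the old `.cu` names still need to be updated separately.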
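+### Problem 7: after hip conversion, some non-standard macro definitions are left unconverted and can cause errors
+#### Solution:
+In `CublasHandleManager.h`, change the macro being checked from `CUBLAS_V2_H_` to `ROCM_SYMLINK_HIPBLAS_H`:
+
+    #if !defined(ROCM_SYMLINK_HIPBLAS_H)
+    #error hipblas.h must be included at the very top of any file including CublasHandleManager.h
+    #endif
+
+### Problem 8: `math_constants.h` cannot be found
+#### Solution:
+The CUDA directory of DTK ships `math_constants.h`, and other projects depend on it.
+hip has no corresponding file, so you can copy `math_constants.h` directly into your project.
+`math_constants.h` only defines a set of mathematical constants.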
+### Problem 9: after conversion, some hip kernels do not recognize min
+#### Solution:
+Fix for `min` not being supported in `EddyMatrixKernels.cpp`: replace the `min` call with a conditional expression.
+
+    __global__ void QR(// Input
+                       const float *K,     // Row-first matrices to decompose
+                       unsigned int m,     // Number of rows of K
+                       unsigned int n,     // Number of columns of K
+                       unsigned int nmat,  // Number of matrices
+                       // Output
+                       float *Qt,          // nmat mxm Q matrices
+                       float *R)           // nmat mxn R matrices
+    {
+      extern __shared__ float scratch[];
+      if (blockIdx.x < nmat && threadIdx.x < m) {
+        unsigned int id = threadIdx.x;
+        // unsigned int ntpm = min(m,blockDim.x); // Number of threads per matrix
+        unsigned int ntpm = (m < blockDim.x) ? m : blockDim.x;
+        float *v = scratch;
+        float *w = &scratch[m];
+        const float *lK = &K[blockIdx.x*m*n];
+        float *lQt = &Qt[blockIdx.x*m*m];
+        float *lR = &R[blockIdx.x*m*n];
+        qr_single(lK,m,n,v,w,id,ntpm,lQt,lR);
+      }
+      return;
+    }
+
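+### Problem 10: header file errors when compiling with software stacks after DTK-25.04
+#### Solution:
+Try compiling with `-std=c++17` or `-std=c++14`.
+### Problem 11: compile errors when building code that uses hipRuntime APIs (hipMalloc, hipMemcpy, etc.) with g++
+#### Solution:
+Define the macro `__HIP_PLATFORM_AMD__` at compile time, and add `-l galaxyhip` to the link dependencies.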
--- a/docs/DCU/docs/build/html/_sources/get_started.md.txt
+++ b/docs/DCU/docs/build/html/_sources/get_started.md.txt
--- a/docs/DCU/docs/build/html/_sources/index.rst.txt
+++ b/docs/DCU/docs/build/html/_sources/index.rst.txt
--- a/docs/DCU/docs/build/html/_sources/install_dcu_on_os/base_install_intro.md.txt
+++ b/docs/DCU/docs/build/html/_sources/install_dcu_on_os/base_install_intro.md.txt
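+## 1. Developer community DCU environment installation guide
+This document covers DCU accelerator cards and provides reference guidance for installing and deploying the base software environment and for basic testing.
+We recommend following the document below to install the DCU base environment:
+[**Open the developer community environment setup documentation**](https://cancon.hpccube.com:65024/1/main/latest/Document) → DTK 开发环境安装部署手册.pdf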
\ No newline at end of file
--- a/docs/DCU/docs/build/html/_sources/install_dcu_on_os/centos.md.txt
+++ b/docs/DCU/docs/build/html/_sources/install_dcu_on_os/centos.md.txt
--- a/docs/DCU/docs/build/html/_sources/install_dcu_on_os/ubuntu.md.txt
+++ b/docs/DCU/docs/build/html/_sources/install_dcu_on_os/ubuntu.md.txt
--- a/docs/DCU/docs/build/html/_static/_sphinx_javascript_frameworks_compat.js
+++ b/docs/DCU/docs/build/html/_static/_sphinx_javascript_frameworks_compat.js
--- a/docs/DCU/docs/build/html/_static/basic.css
+++ b/docs/DCU/docs/build/html/_static/basic.css
--- a/docs/DCU/docs/build/html/_static/check-solid.svg
+++ b/docs/DCU/docs/build/html/_static/check-solid.svg
+<svg xmlns="http://www.w3.org/2000/svg" class="icon icon-tabler icon-tabler-check" width="44" height="44" viewBox="0 0 24 24" stroke-width="2" stroke="#22863a" fill="none" stroke-linecap="round" stroke-linejoin="round">
+  <path stroke="none" d="M0 0h24v24H0z" fill="none"/>
+  <path d="M5 12l5 5l10 -10" />
+</svg>
--- a/docs/DCU/docs/build/html/_static/clipboard.min.js
+++ b/docs/DCU/docs/build/html/_static/clipboard.min.js
--- a/docs/DCU/docs/build/html/_static/copy-button.svg
+++ b/docs/DCU/docs/build/html/_static/copy-button.svg
+<svg xmlns="http://www.w3.org/2000/svg" class="icon icon-tabler icon-tabler-copy" width="44" height="44" viewBox="0 0 24 24" stroke-width="1.5" stroke="#000000" fill="none" stroke-linecap="round" stroke-linejoin="round">
+  <path stroke="none" d="M0 0h24v24H0z" fill="none"/>
+  <rect x="8" y="8" width="12" height="12" rx="2" />
+  <path d="M16 8v-2a2 2 0 0 0 -2 -2h-8a2 2 0 0 0 -2 2v8a2 2 0 0 0 2 2h2" />
+</svg>
--- a/docs/DCU/docs/build/html/_static/copybutton.css
+++ b/docs/DCU/docs/build/html/_static/copybutton.css
--- a/docs/DCU/docs/build/html/_static/copybutton.js
+++ b/docs/DCU/docs/build/html/_static/copybutton.js