2 DCU基础环境完整教程-Centos7.6:

2.1 非root用户安装注意事项:

  • 确保非root用户已加入video组,以便能够使用DCU。

    # 对于有sudo权限的非root用户
    sudo usermod -aG video $USER
    
    # 对于无sudo权限的用户,由root执行
    usermod -aG video <userid>
    

2.2. 操作系统设置

确保启动项中不包含nomodeset选项,如果内核以nomodeset选项启动,则驱动可能无法成功加载。

  • 需要保证系统纯净,命令行运行 lsmod | grep amdgpu 为空

  • 关闭 selinux(可选)

    修改/etc/selinux/config,设置SELINUX=disabled

  • 关闭 firewalld(可选)

    systemctl stop firewalld
    systemctl disable firewalld
    

2.3. 关闭内核自动更新:

  • 编辑/etc/yum.conf,在[main]部分添加:

    exclude=kernel* 
    exclude=centos-release*
    

2.4. 更新yum源:

  • 替换为中科大源,针对CentOS 7.6的示例, (注意使用双引号):

    sed -e "s|^mirrorlist=|#mirrorlist=|g" -e "s|^#baseurl=http://mirror.centos.org/centos/\$releasever|baseurl=https://mirrors.ustc.edu.cn/centos-vault/$minorver|g"  -i.bak  /etc/yum.repos.d/CentOS-*.repo
    
  • 替换 CentOS-CR.repo 并且 enable(安装python3需要)

    sed -i "s|^baseurl=http://mirror.centos.org/centos/\$releasever|baseurl=https://mirrors.ustc.edu.cn/centos-vault/$minorver|g;s|enabled=0|enabled=1|g"  /etc/yum.repos.d/CentOS-CR.repo
    
  • 对 Centos7 配置 SCLo 源(安装devtoolset需要):

    编辑 /etc/yum.repos.d/CentOS-SCLo.repo 配置文件;

    vi /etc/yum.repos.d/CentOS-SCLo.repo
    

    /etc/yum.repos.d/CentOS-SCLo.repo 写入以下内容, Esc + :wq 保存退出

    [centos-sclo-sclo]
    name=CentOS-7 - SCLo sclo
    baseurl=https://mirrors.ustc.edu.cn/centos/7/sclo/$basearch/sclo/
    gpgcheck=0
    enabled=1
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-SIG-SCLo
    [centos-sclo-rh]
    name=CentOS-7 - SCLo rh
    baseurl=https://mirrors.ustc.edu.cn/centos/7/sclo/$basearch/rh/
    gpgcheck=0
    enabled=1
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-SIG-SCLo
    
  • 对 Centos7 配置 epel 源(安装cmake3需要)

    编辑 /etc/yum.repos.d/epel-7.repo 配置文件;

    vi /etc/yum.repos.d/epel-7.repo
    

    在/etc/yum.repos.d/epel-7.repo写入以下内容, Esc + :wq 保存退出;

    [epel]
    name=Extra Packages for Enterprise Linux 7 - $basearch
    baseurl=http://mirrors.aliyun.com/epel/7/$basearch
    failovermethod=priority
    enabled=1
    gpgcheck=0
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
    
    [epel-debuginfo]
    name=Extra Packages for Enterprise Linux 7 - $basearch  Debug
    baseurl=http://mirrors.aliyun.com/epel/7/$basearch/debug
    failovermethod=priority
    enabled=0
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
    gpgcheck=0
    
    [epel-source]
    name=Extra Packages for Enterprise Linux 7 - $basearch  Source
    baseurl=http://mirrors.aliyun.com/epel/7/SRPMS
    failovermethod=priority
    enabled=0
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
    gpgcheck=0
    
  • 更新cache

    yum clean all 
    yum makecache 	
    

2.5. 安装相关依赖:

  • 联网执行以下命令安装必要的依赖包:

    #安装 DCU 加速卡驱动所需的依赖包命令
    yum install -y gcc gcc-c++ rpm-build autoconf kernel-devel-$(uname -r) kernel-headers-$(uname -r)
    #安装cmake3
    yum install -y cmake3 
    ln -s /usr/bin/cmake3 /usr/bin/cmake
    
    #安装 DTK 开发环境所需的依赖包命令
    yum install -y centos-release-scl
    yum install -y gcc gcc-c++ gcc-gfortran elfutils elfutils-devel make rpm-build devtoolset-7
    yum install -y libbabeltrace-devel libbabeltrace pciutils-devel libpciaccess-devel
    yum install -y numactl-devel elfutils-libelf-devel mesa-libGL-devel
    yum install -y epel-release cmake3 pciutils-libs pciutils-devel
    yum install -y perl-File-Which perl-File-BaseDir perl-File-Copy-Recursive perl-File-Listing java-1.8.0-openjdk
    yum install -y git python python-pip python-devel python-wheel python3 python3-pip python3-devel python3-wheel sqlite-devel libibverbs
    yum install -y redhat-lsb-core gettext gettext-devel protobuf
    yum install -y perl-Digest perl-Digest-MD5 perl-Data-Dumper vim-common curl libcurl libcurl-devel
    yum install -y doxygen graphviz texlive texlive-xtab texlive-multirow texlive-sectsty texlive-tocloft
    yum install -y texlive-adjustbox deltarpm tcl automake
    
  • 离线安装依赖:
    依赖离线包:rpm_DTK2404_centos7.6_3.10.0-957.tar
    链接:https://pan.baidu.com/s/1jnWfddL4lHWVQb3btD34Iw?pwd=wj6q
    提取码:wj6q
    下载好导入服务器,进行解压:

    1. vi /etc/yum.repos.d/localyum.repo,输入如下内容,请根据解压的路径,修改baseurl的内容,下面配置是在root路径下解压的依赖包;

      [local-repo]
      name=local-repo
      baseurl=file:///root/centos7-dtk24.04
      enabled=1
      gpgcheck=0
      
    2. 开始安装离线依赖包

      sudo yum clean all
      #以安装cmake3为例
      yum --disablerepo="*" --enablerepo="local-repo" install cmake3
      

2.6. 校验系统配置:

设备名称 设备码
Z100L 1d94:55b7
K100 1d94:62b7
K100_AI 1d94:6210
  • 查看DCU设备

# Z100L
root@sugontest79:/mnt#lspci -nn |grep -i 55b7
------------------------------------------------------------------------------------------------------
07:00.0 Display controller [0380]: Chengdu Haiguang IC Design Co., Ltd. ZIFANG [1d94:55b7] (rev 01)
0a:00.0 Display controller [0380]: Chengdu Haiguang IC Design Co., Ltd. ZIFANG [1d94:55b7] (rev 01)
------------------------------------------------------------------------------------------------------

# K100
root@sugontest79:/mnt#lspci -nn |grep -i 62b7
------------------------------------------------------------------------------------------------------
07:00.0 Co-processor [0b40]: Chengdu Haiguang IC Design Co., Ltd. KONGMING [1d94:62b7] (rev 01)
0a:00.0 Co-processor [0b40]: Chengdu Haiguang IC Design Co., Ltd. KONGMING [1d94:62b7] (rev 01)
------------------------------------------------------------------------------------------------------

# K100-AI
root@sugontest79:/mnt#lspci -nn |grep -i 6210
------------------------------------------------------------------------------------------------------
07:00.0 Co-processor [0b40]: Chengdu Haiguang IC Design Co., Ltd. KONGMING [1d94:6210] (rev 01)
0a:00.0 Co-processor [0b40]: Chengdu Haiguang IC Design Co., Ltd. KONGMING [1d94:6210] (rev 01)
------------------------------------------------------------------------------------------------------

注:输出信息可能和截图不完全一致。

  • 配置环境变量


    创建文件 `/etc/profile.d/devtoolset-7.sh`,内容如下:
    `source /opt/rh/devtoolset-7/enable`
    退出当前登录会话重新登录,或者执行source /etc/profile.d/devtoolset-7.sh。

2.7. 安装驱动:

注意:

DTK和rock驱动有对应关系,可参考dcu-环境安装手册,推荐安装最新的使用
-安装驱动之前需要安装基础包,包括cmake、gcc等多种基础依赖包,请先参考DCU环境安装手册完成基础环境包的安装。

驱动下载地址: https://cancon.hpccube.com:65024/6/main → latest 驱动→ rock-xxx-xxx.aio.run

  1. 安装 DCU 加速卡驱动

    chmod 755 rock-5.7.1-6.2.13-V1.0.1a.aio.run
    ./rock-5.7.1-6.2.13-V1.0.1a.aio.run
    
  2. 如果安装过程更新了vbios, 则需要重启机器

    reboot
    
  3. 查看验证是否安装成功

    # 出现如下类似结果则安装成功
    [root@b04r3n02 ~]# lsmod | grep hydcu
    hydcu           1435342     0
    hydcu_sched     34432       1   hydcu
    hyttm           61919       1   hydcu
    hykcl           46567       3   hydcu_sched,hydcu,hyttm
    hy_extra        32140       3   hydcu_sched,hydcu,hykcl
    amd_iommu_v2    18821       1   hydcu
    drm_kms_helper  179394      3   ast,hydcu,hykcl
    drm             429744      8   ast,ttm,hydcu,hykcl,hyttm,drm_kms_helper
    
  4. 卸载驱动步骤:


    如遇到异常情况或需要更新版本,先执行驱动卸载。
    rpm -qa | grep rock   #查询安装的驱动版本
    rmmod hydcu
    rpm -e rock-5.7.1-6.2.18-1.x86_64
    

2.8. 安装DTK:

DTK下载地址: https://cancon.hpccube.com:65024/1/main → latest → 对应的操作系统 → DTK-version-OS-version-x86_64.tar.gz

  1. 安装

    # 解压安装
    tar xvf DTK-24.04.1-CentOS7.6-x86_64.tar.gz -C /opt
    # 创建软连接
    ln -s /opt/dtk-24.04.1 /opt/dtk
    
  2. 设置 DTK 环境变量

    DTK 压缩文件中提供了设置环境变量脚本 env.sh。可以通过 source /opt/dtk/env.sh 的方式临时加载环境变量。为避免多次配置,常用以下方式加载环境变量

    echo "source /opt/dtk/env.sh">> ~/.bashrc
    # 激活环境变量
    source  ~/.bashrc
    
  3. 验证 DCU 环境

    # 查看并执行 hy-smi 或者 rocm-smi 指令查看 dcu 基本信息
    [root@h01r4n04~]# rocm-smi
    # 出现如下内容, 则安装成功
    ===================System Management Interface =================
    ==========================================================
    DCU Temp AvgPwr Fan Perf PwrCap VRAM% DCU%
    0 50.0c 55.0W 0.0% auto 450.0W 0% 0%
    1 50.0c 58.0W 0.0% auto 450.0W 0% 0%
    2 49.0c 58.0W 0.0% auto 450.0W 0% 0% 
    3 49.0c 55.0W 0.0% auto 450.0W 0% 0%
    ==========================================================
    ======================End of SMI Log========================
    

2.9. 验证安装结果:

  1. 使用rocminfo命令检查ROCm系统状态


    终端输入如下内容:
    rocminfo | grep gfx
    
    # 其中Z100/Z100L为gfx906,K100为gfx926,K100_AI为gfx928; 有输出即说明驱动和DTK安装成功
    ------------------------------------------------------------------------------------------------------------
        Name:                    amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
        Name:                    amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
        Name:                    amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
        Name:                    amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
        Name:                    amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
        Name:                    amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
        Name:                    amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
        Name:                    amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
    

  2. 运行hy-smirocm-smi来监控DCU的状态和性能指标;

    # 查看并执行 hy-smi 或者 rocm-smi 指令查看 dcu 基本信息
    [root@h01r4n04~]# rocm-smi
        # 出现如下内容, 则安装成功
    ============================ System Management Interface =============================
    ======================================================================================
    DCU     Temp     AvgPwr     Perf     PwrCap     VRAM%      DCU%      Mode     
    0       42.0C    39.0W      auto     280.0W     0%         0%        Normal   
    1       41.0C    39.0W      auto     280.0W     0%         0%        Normal   
    2       41.0C    36.0W      auto     280.0W     0%         0%        Normal   
    3       40.0C    38.0W      auto     280.0W     0%         0%        Normal   
    4       40.0C    39.0W      auto     280.0W     0%         0%        Normal   
    5       41.0C    41.0W      auto     280.0W     0%         0%        Normal   
    6       42.0C    37.0W      auto     280.0W     0%         0%        Normal   
    7       41.0C    36.0W      auto     280.0W     0%         0%        Normal   
    ======================================================================================
    =================================== End of SMI Log ===================================