<p align="center">
<img align="center" src="doc/imgs/logo.png" width=1600>
</p>

--------------------------------------------------------------------------------

# PaddlePaddle ROCm Build Installation Guide

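The ROCm build of the PaddlePaddle framework supports training and inference on Hygon CPUs and Hygon DCUs. It works with AMD ROCm as well as the Hygon DCU Toolkit (DTK). The currently supported ROCm version is 4.0.1, and several DTK versions are supported. Two installation methods are available: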

- Install from a prebuilt wheel package
- Build and install from source

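**Note**: PaddlePaddle wheel packages built for each DTK version can be downloaded from the AI ecosystem packages section of the [Guanghe Developer Community (光合开发者社区)](https://developer.hpccube.com/tool/#sdk).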

## Method 1: Install from a wheel package

**Note**: Two docker images are currently provided. One is based on CentOS 7.8 & ROCm 4.0.1 and comes with a Python 3.7 wheel package. The other is based on CentOS 7.6 & DTK 22.10.1 and includes the PaddlePaddle 2.3.2 wheel package for Python 3.7 (`image.sourcefind.cn:5000/dcu/admin/base/paddlepaddle:2.3.2-centos7.6-dtk-22.10.1-py37-latest`).

**Step 1**: Prepare a CentOS 7.6 & DTK 22.10.1 runtime environment (using the Paddle image is recommended)

You can pull a docker image with CentOS 7.6 & DTK 22.10.1 preinstalled directly from the image registry:

```bash
# Pull the image
docker pull image.sourcefind.cn:5000/dcu/admin/base/paddlepaddle:2.3.2-centos7.6-dtk-22.10.1-py37-latest

# Start the container; options such as --shm-size and --device must be configured.
# /public/home/xxx is a placeholder for your own host directory.
docker run -it --network=host --name=oneflow_compile --privileged \
    --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G \
    --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 \
    -v /public/home/xxx:/home \
    image.sourcefind.cn:5000/dcu/admin/base/paddlepaddle:2.3.2-centos7.6-dtk-22.10.1-py37-latest /bin/bash

# Check that the container correctly detects the Hygon DCU devices
rocm-smi

# Expected output:
# ======================= ROCm System Management Interface =======================
# ================================= Concise Info =================================
# GPU  Temp   AvgPwr  SCLK     MCLK    Fan   Perf  PwrCap  VRAM%  GPU%
# 0    50.0c  23.0W   1319Mhz  800Mhz  0.0%  auto  300.0W    0%   0%
# 1    48.0c  25.0W   1319Mhz  800Mhz  0.0%  auto  300.0W    0%   0%
# 2    48.0c  24.0W   1319Mhz  800Mhz  0.0%  auto  300.0W    0%   0%
# 3    49.0c  27.0W   1319Mhz  800Mhz  0.0%  auto  300.0W    0%   0%
# ================================================================================
# ============================= End of ROCm SMI Log ==============================
```
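
Before starting the container, it can help to confirm that the device nodes passed through `--device` exist on the host. The following is a minimal sketch; `check_device_nodes` is a hypothetical helper name, not something provided by Docker or DTK.

```bash
# Hypothetical helper: report every path that does not exist and
# return non-zero if at least one is missing.
check_device_nodes() (
    missing=0
    for node in "$@"; do
        if [ ! -e "$node" ]; then
            echo "missing device node: $node" >&2
            missing=1
        fi
    done
    # The function body runs in a subshell, so exit sets its return status
    exit "$missing"
)
```

On a DCU host it could be called as `check_device_nodes /dev/kfd /dev/dri` before `docker run`.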

**Step 2**: This image already ships PaddlePaddle 2.3.2 for Python 3.7. To reinstall it, run:

```bash
pip3 uninstall paddlepaddle-rocm
pip3 install paddlepaddle-2.3.2_dtk2210_git0195561-cp37-cp37m-manylinux2014_x86_64.whl
```
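
pip rejects a wheel whose tags do not match the running interpreter, so it can be useful to read the Python tag from the file name before installing. This is a rough sketch that takes the third field from the end of a standard wheel file name; `wheel_python_tag` is a hypothetical helper, not part of pip.

```bash
# Hypothetical helper: print the Python tag (for example cp37) of a wheel file name.
wheel_python_tag() (
    name=${1##*/}        # drop any directory prefix
    name=${name%.whl}    # drop the extension
    name=${name%-*}      # drop the platform tag
    name=${name%-*}      # drop the ABI tag
    echo "${name##*-}"   # the last remaining field is the Python tag
)

# Usage: compare against the tag of the current interpreter
#   wheel_python_tag paddlepaddle-2.3.2_dtk2210_git0195561-cp37-cp37m-manylinux2014_x86_64.whl
#   python -c "import sys; print('cp%d%d' % sys.version_info[:2])"
```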

**Step 3**: Verify the installed package

After installation, run the following command. If it prints `PaddlePaddle is installed successfully!`, the installation succeeded.

```bash
python -c "import paddle; paddle.utils.run_check()"
```

## Method 2: Build and install from source

**Note**: You can use the CentOS 7.8 & ROCm 4.0.1 build image supported by Paddle. As required by ROCm 4.0.1, the supported compiler is devtoolset-7.

**Step 1**: Prepare a ROCm 4.0.1 build environment (using the Paddle image is recommended)

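You can pull a docker image with ROCm 4.0.1 preinstalled directly from Paddle's official image registry. Then download DTK-22.10.1 from the DCU Toolkit section of the [developer community](https://developer.hpccube.com/tool/#sdk), extract it under `/opt/`, and use it to replace the original ROCm 4.0.1 folder under `/opt`.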

```bash
# Pull the image
docker pull paddlepaddle/paddle:latest-dev-rocm4.0-miopen2.11

# Start the container; options such as --shm-size and --device must be configured
docker run -it --name paddle-rocm-dev --shm-size=128G \
     --device=/dev/kfd --device=/dev/dri --group-add video \
     --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
     paddlepaddle/paddle:latest-dev-rocm4.0-miopen2.11 /bin/bash

# Replace the ROCm 4.0.1 folder under /opt with the extracted DTK-22.10.1
# (download and extraction are described in the step text above)

# Check that the container correctly detects the Hygon DCU devices
rocm-smi

# Expected output:
# ======================= ROCm System Management Interface =======================
# ================================= Concise Info =================================
# GPU  Temp   AvgPwr  SCLK     MCLK    Fan   Perf  PwrCap  VRAM%  GPU%
# 0    50.0c  23.0W   1319Mhz  800Mhz  0.0%  auto  300.0W    0%   0%
# 1    48.0c  25.0W   1319Mhz  800Mhz  0.0%  auto  300.0W    0%   0%
# 2    48.0c  24.0W   1319Mhz  800Mhz  0.0%  auto  300.0W    0%   0%
# 3    49.0c  27.0W   1319Mhz  800Mhz  0.0%  auto  300.0W    0%   0%
# ================================================================================
# ============================= End of ROCm SMI Log ==============================
```
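
The DTK replacement step has no command in this guide. One possible way to do it is sketched below under several assumptions that do not come from the source: the DTK download is a `.tar.gz` archive, the caller knows the directory name it extracts to, and a `rocm` symlink under the prefix is an acceptable way to make `/opt/rocm` point at DTK. `replace_rocm_with_dtk` is a hypothetical helper.

```bash
# Sketch only. Arguments:
#   $1  path to the downloaded DTK archive (assumed to be .tar.gz)
#   $2  install prefix (/opt in this guide)
#   $3  name of the directory the archive extracts to
replace_rocm_with_dtk() (
    archive=$1
    prefix=$2
    dtk_dir=$3

    # Keep a real ROCm directory as a backup instead of deleting it
    if [ -e "$prefix/rocm" ] && [ ! -L "$prefix/rocm" ]; then
        mv "$prefix/rocm" "$prefix/rocm.bak" || exit 1
    fi

    # Extract DTK under the prefix, then point <prefix>/rocm at it
    tar -xzf "$archive" -C "$prefix" &&
    ln -sfn "$prefix/$dtk_dir" "$prefix/rocm"
)
```

Keeping the old directory as `rocm.bak` makes the swap easy to undo if `rocm-smi` no longer works afterwards.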

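Before compiling, check that the following environment variables are set correctly. If any are missing, install the corresponding dependencies and export the variables. Taking the official Paddle image as an example, the environment variables are: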

```bash
# devtoolset-7 should be in PATH and LD_LIBRARY_PATH; if it is not, run:
source /opt/rh/devtoolset-7/enable

# cmake 3.16.0 should be in PATH
export PATH=/opt/cmake-3.16/bin:${PATH}

# ROCm 4.0.1 should be in PATH and LD_LIBRARY_PATH
export PATH=/opt/rocm/opencl/bin:/opt/rocm/bin:${PATH}
export LD_LIBRARY_PATH=/opt/rocm/lib:${LD_LIBRARY_PATH}

# Python 3.7 should be in PATH
# Note: Python 3.7 in the image is installed through miniconda; run `conda activate base` to load the Python 3.7 environment
export PATH=/opt/conda/bin:${PATH}
```
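
A quick way to confirm the exports took effect is to check that each required tool resolves on `PATH`. This is a minimal sketch; `require_commands` is a hypothetical helper, and the list of commands to pass is up to you.

```bash
# Hypothetical helper: report every command that cannot be found on PATH
# and return non-zero if any is missing.
require_commands() (
    status=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "not found on PATH: $cmd" >&2
            status=1
        fi
    done
    exit "$status"
)

# Usage inside the build container:
#   require_commands gcc cmake python rocm-smi || echo "fix PATH before running cmake"
```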

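**Step 2**: Download the Paddle source code and compile it. For the meaning of the CMake options, see the [compile options table](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/install/Tables.html#Compile). To build a specific Paddle version, set the environment variable `PADDLE_VERSION` before compiling.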

```bash
# Download the source code (this checks out the 2.3.2-dtk-22.10.1 branch).
# The target directory is named Paddle explicitly so that the paths below match.
git clone -b 2.3.2-dtk-22.10.1 http://developer.hpccube.com/codes/aicomponent/paddle.git Paddle
cd Paddle

# Create the build directory
mkdir build && cd build

# Set the Paddle version
export PADDLE_VERSION=2.3.2

# Run cmake
export ROCM_PATH=/opt/rocm

cmake .. -DPY_VERSION=3.7 -DWITH_GPU=OFF -DWITH_ROCM=ON \
    -DWITH_RCCL=ON -DWITH_NCCL=OFF -DWITH_TESTING=ON -DWITH_DISTRIBUTE=ON \
    -DCMAKE_BUILD_TYPE=Release -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
    -DCMAKE_VERBOSE_MAKEFILE=OFF -DWITH_TP_CACHE=ON \
    -DROCM_PATH=${ROCM_PATH} -DWITH_MKLDNN=OFF

# Build with the following command
make -j$(nproc)
```
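
If you build from several branches, `PADDLE_VERSION` could be derived from the branch name instead of typed by hand. The sketch below assumes branch names follow the `<paddle version>-dtk-<dtk version>` pattern used in this guide; `version_from_branch` is a hypothetical helper.

```bash
# Hypothetical helper: strip the "-dtk-<dtk version>" suffix from a branch name.
# A name without that suffix is printed unchanged.
version_from_branch() {
    echo "${1%%-dtk-*}"
}

# Usage from inside the source tree:
#   export PADDLE_VERSION=$(version_from_branch "$(git rev-parse --abbrev-ref HEAD)")
```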

**Step 3**: Install and verify the compiled wheel package

After the build finishes, the generated .whl package is in the `Paddle/build/python/dist` directory. Install and verify it with the following commands:

```bash
# Install
python -m pip install -U paddlepaddle_rocm-2.3.2-cp37-cp37m-linux_x86_64.whl

# Verify
python -c "import paddle; paddle.utils.run_check()"
```
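
The exact wheel file name depends on the version and Python tag you built with, so hard-coding it can be brittle. As a small sketch, the newest wheel in the `dist` directory can be located by pattern; `find_built_wheel` is a hypothetical helper, and it assumes file names contain no newlines.

```bash
# Hypothetical helper: print the most recently modified paddlepaddle_rocm
# wheel in the given directory, or return non-zero if there is none.
find_built_wheel() (
    dist_dir=$1
    # ls -t sorts by modification time, newest first
    wheel=$(ls -t "$dist_dir"/paddlepaddle_rocm-*.whl 2>/dev/null | head -n 1)
    [ -n "$wheel" ] || exit 1
    echo "$wheel"
)

# Usage:
#   python -m pip install -U "$(find_built_wheel Paddle/build/python/dist)"
```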

## How to uninstall

Use the following command to uninstall Paddle:

```
pip3 uninstall paddlepaddle-rocm
```