# DTK Container Toolkit

## Introduction

DTK Container Toolkit lets users build and run containers that use DCU devices. The toolkit includes the following packages:

- ```dcu-container-toolkit``` - the DTK container runtime
- ```dcu-ctk``` - the command-line interface for the DTK container tools

## Usage
- --gpus requires Docker version 19+
- CDI requires Docker version 25+

First, make sure DTK is installed.

### Installation

Install the package with dpkg -i or rpm -i. The following commands run automatically after installation:
```sh
$ dcu-ctk --quiet config --config-file=/etc/dcu-container-runtime/config.toml --in-place # generate the configuration file
$ dcu-ctk runtime configure --runtime=docker --set-as-default  # set the runtime in Docker's config.json
```

Restart the Docker service:
```sh
$ sudo systemctl restart docker
```
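
After the restart, it can be worth confirming that the installed Docker meets the version requirements listed under Usage. The following is a minimal sketch and not part of dcu-ctk: `docker_major` and `supported_modes` are hypothetical helper names, and the parsing assumes the usual `Docker version X.Y.Z, build ...` output of `docker --version`.

```sh
# Hypothetical helpers (not shipped with dcu-ctk).

# Extract the major version from text such as "Docker version 25.0.3, build 4debf41".
docker_major() {
    printf '%s\n' "$1" | sed -n 's/^Docker version \([0-9][0-9]*\)\..*/\1/p'
}

# Print the mechanisms whose minimum Docker version is met:
# --gpus needs Docker 19+, CDI needs Docker 25+.
supported_modes() {
    major=$1
    modes=""
    case $major in
        ''|*[!0-9]*) echo "unrecognized version: $major" >&2; return 1 ;;
    esac
    if [ "$major" -ge 19 ]; then modes="gpus"; fi
    if [ "$major" -ge 25 ]; then modes="$modes cdi"; fi
    printf '%s\n' "$modes"
}

# Demo on a fixed string; on a real host pass "$(docker --version)" instead.
supported_modes "$(docker_major 'Docker version 25.0.3, build 4debf41')"   # prints: gpus cdi
```

An empty result means neither mechanism's minimum version is met.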

### Using DCUs in Docker

#### Via the docker CLI
Pass the --gpus flag to docker run to add DCU devices to a container.
```sh
$ docker run -it --gpus all ubuntu:18.04    # add all DCU devices
$ docker run -it --gpus 1 ubuntu:18.04 # add one DCU device (DCU 0)
$ docker run -it --gpus '"device=0,2"' ubuntu:18.04 # add DCU 0 and DCU 2; the inner quotes keep the list as one value
```
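
When the device indices come from a script, the --gpus value can be assembled programmatically. The sketch below is illustrative only: `gpus_value` is a hypothetical helper, not a dcu-ctk command. It wraps the list in literal double quotes, following the form Docker's documentation uses for multi-device requests (`--gpus '"device=0,2"'`), because the Docker CLI splits the flag value on commas.

```sh
# Hypothetical helper (not part of dcu-ctk): build the value for --gpus.
# With no arguments it prints "all"; otherwise it joins the indices with
# commas and wraps the list in literal double quotes.
gpus_value() {
    if [ "$#" -eq 0 ]; then
        printf 'all\n'
        return 0
    fi
    old_ifs=$IFS
    IFS=,
    joined="$*"        # "$*" joins the arguments with the first character of IFS
    IFS=$old_ifs
    printf '"device=%s"\n' "$joined"
}

gpus_value 0 2   # prints: "device=0,2"
```

A possible use is `docker run -it --gpus "$(gpus_value 0 2)" ubuntu:18.04`.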

#### Via the `DCU_VISIBLE_DEVICES` environment variable
Pass the DCU_VISIBLE_DEVICES environment variable with -e on docker run to add DCU devices to a container.
```sh
$ docker run -it -e DCU_VISIBLE_DEVICES=all ubuntu:18.04 # add all DCU devices
$ docker run -it -e DCU_VISIBLE_DEVICES=0 ubuntu:18.04 # add DCU 0
$ docker run -it -e DCU_VISIBLE_DEVICES=0,1 ubuntu:18.04 # add DCU 0 and DCU 1
```
#### Via CDI
- First, generate the CDI spec file
```sh
$ dcu-ctk cdi generate --output=/etc/cdi/dtk.json
```
- Configure Docker to enable CDI
```sh
$ dcu-ctk runtime configure --runtime=docker --set-as-default --cdi.enabled
```
- Use all DCUs
```sh
$ docker run --rm --device c-3000.com/hcu=all -it ubuntu:18.04
```
- Use DCU 0 and DCU 1
```sh
$ docker run --rm --device c-3000.com/hcu=0 --device c-3000.com/hcu=1 -it ubuntu:18.04
```
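
Since each DCU needs its own --device flag, a script that selects devices dynamically may want to generate the flags. A small sketch, assuming the `c-3000.com/hcu` device kind shown in the commands above; `cdi_device_flags` is a hypothetical helper name.

```sh
# Hypothetical helper (not part of dcu-ctk): print one
# "--device c-3000.com/hcu=<id>" flag per argument.
cdi_device_flags() {
    flags=""
    for id in "$@"; do
        flags="$flags --device c-3000.com/hcu=$id"
    done
    printf '%s\n' "${flags# }"   # drop the leading space
}

cdi_device_flags 0 1   # prints: --device c-3000.com/hcu=0 --device c-3000.com/hcu=1
```

The output is meant to be word-split by the shell, so it would be used unquoted, for example `docker run --rm $(cdi_device_flags 0 1) -it ubuntu:18.04`.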
### Using DCUs in Podman
Podman version 2.0+ is required.
- First, configure the Podman runtime
```sh
$ dcu-ctk runtime configure --runtime=podman --set-as-default
```
- Use DCUs through the environment variable
```sh
$ podman run -it -e DCU_VISIBLE_DEVICES=all ubuntu:18.04
```
### List available DCUs
```sh
$ dcu-ctk cdi list
INFO[0000] Found 3 CDI devices
c-3000.com/hcu=0
c-3000.com/hcu=1
c-3000.com/hcu=2
c-3000.com/hcu=all
c-3000.com/hcu=hcu-73873c7a6eb008a1
c-3000.com/hcu=hcu-73873c7a6eb02041
c-3000.com/hcu=hcu-73873c7a6eb040a1
```
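
The listing mixes index-based names, the `all` alias, and ID-based names. If a script only needs the numeric indices, the output can be filtered. This is a sketch that assumes the output format matches the sample above; `cdi_indices` is a hypothetical helper name.

```sh
# Hypothetical helper: read `dcu-ctk cdi list` output on stdin and print only
# the numeric device indices (skips the INFO line, "all", and hcu-<id> names).
cdi_indices() {
    sed -n 's|^c-3000\.com/hcu=\([0-9][0-9]*\)$|\1|p'
}

# Demo on part of the sample output shown above; on a real host use:
#   dcu-ctk cdi list | cdi_indices
cdi_indices <<'EOF'
INFO[0000] Found 3 CDI devices
c-3000.com/hcu=0
c-3000.com/hcu=1
c-3000.com/hcu=2
c-3000.com/hcu=all
c-3000.com/hcu=hcu-73873c7a6eb008a1
EOF
```

For this input the function prints 0, 1 and 2, one per line.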
### File read/write permission limits in rootless Docker
When a non-root user mounts a directory with -v, the original permissions are preserved, so write permission cannot be added to a read-only directory. Running the following command requires root privileges:
```sh
$ dcu-ctk rootless --runtime=docker
```
### DCU Tracker
DCU Tracker monitors DCU usage by containers started with --gpus or -e DCU_VISIBLE_DEVICES, and can also mark DCUs as exclusive or shared. DCU Tracker is disabled by default; enable it with the dcu-ctk command line.

DCU Tracker provides commands to control container access to DCUs. Each DCU can be set to shared or exclusive.
- shared: multiple containers can use the DCU at the same time. This is the default.
- exclusive: only one container can use the DCU at a time.

```sh
$ dcu-ctk dcu-tracker -h
NAME:
   C-3000 DTK Container Toolkit CLI dcu-tracker - DCU Tracker  related commands

USAGE:
   dcu-ctk dcu-tracker [dcu-ids] [accessibility]
       Arguments:
           dcu-ids          Comma-separated list of DCU IDs (comma separated list, range operator, all)
           accessibility    Must be either 'exclusive' or 'shared'

     Examples:
           dcu-ctk dcu-tracker 0,1,2 exclusive
           dcu-ctk dcu-tracker 0,1-2 shared
           dcu-ctk dcu-tracker all shared

   OR

   dcu-ctk dcu-tracker [command] [options]

COMMANDS:
   disable  Disable the DCU Tracker
   enable   Enable the DCU Tracker
   reset    Reset the DCU Tracker
   status   Show Status of DCUs
   help, h  Shows a list of commands or help for one command

OPTIONS:
   --help, -h  show help
```
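
The dcu-ids argument accepts a comma-separated list and a range operator, as in `0,1-2`. The sketch below shows one way such a specification could be expanded into individual IDs, for example to loop over them in a wrapper script. It assumes ranges are inclusive, which the help text does not state explicitly; `expand_dcu_ids` is a hypothetical name, and dcu-ctk performs its own parsing.

```sh
# Hypothetical helper (not part of dcu-ctk): expand a dcu-ids specification
# such as "0,1-2" into a space-separated list. Ranges are assumed inclusive;
# the keyword "all" is passed through unchanged.
expand_dcu_ids() {
    spec=$1
    result=""
    old_ifs=$IFS
    IFS=,                      # split the specification on commas
    for part in $spec; do
        case $part in
            *-*)
                start=${part%-*}
                end=${part#*-}
                i=$start
                while [ "$i" -le "$end" ]; do
                    result="$result $i"
                    i=$((i + 1))
                done
                ;;
            *)
                result="$result $part"
                ;;
        esac
    done
    IFS=$old_ifs
    printf '%s\n' "${result# }"
}

expand_dcu_ids "0,1-2"   # prints: 0 1 2
```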

### Using DCU Tracker

Use rocm-smi to list the DCUs on the node:

```sh
$ rocm-smi


========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    52.0c           56.0W   600Mhz  1000Mhz  0%   auto  400.0W    0%   0%
1    57.0c           48.0W   600Mhz  1000Mhz  0%   auto  400.0W    0%   0%
2    52.0c           66.0W   600Mhz  1000Mhz  0%   auto  400.0W    0%   0%
====================================================================================
=============================== End of ROCm SMI Log ================================

```
- Check the DCU Tracker status
   When DCU Tracker is enabled, each DCU is given shared access by default.

   ```sh
   $ dcu-ctk dcu-tracker status
   ------------------------------------------------------------------------------------------------------------------------
   GPU Id    UUID                     Accessibility       Container Ids
   ------------------------------------------------------------------------------------------------------------------------
   0         0x73873C7A6EB02041       Shared              None
   1         0x73873C7A6EB008A1       Shared              None
   2         0x73873C7A6EB040A1       Shared              None
   ```

   If DCU Tracker is not enabled, the command reports that instead:
   ```sh
   $ dcu-ctk dcu-tracker status
   DCU Tracker is disabled
   ```

- Enable DCU Tracker

  ```sh

  $ dcu-ctk dcu-tracker status
  DCU Tracker is disabled

  $ dcu-ctk dcu-tracker enable
  DCU Tracker has been enabled

  $ dcu-ctk dcu-tracker enable
  DCU Tracker is already enabled
  ```

- Disable DCU Tracker

  ```sh
  $ dcu-ctk dcu-tracker disable
  DCU Tracker has been disabled
  ```

- Set DCU access permissions
   When DCU Tracker is enabled, DCU usage is recorded automatically each time a container starts.

   ```sh
   $ docker run --name slf_dmps -e DCU_VISIBLE_DEVICES=0,1 -it a4dd5be0ca23 

   $ docker run --name slf_dmp -e DCU_VISIBLE_DEVICES=0,1 -it a4dd5be0ca23

   $ dcu-ctk dcu-tracker status
   ------------------------------------------------------------------------------------------------------------------------
   GPU Id    UUID                     Accessibility       Container Ids
   ------------------------------------------------------------------------------------------------------------------------
   0         0x73873C7A6EB02041       Shared              3d07700f961485c678999ea1a0fecaaf0b54f3be51f4a1e9a2f1ae61032b276d
                                                          dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
   1         0x73873C7A6EB008A1       Shared              3d07700f961485c678999ea1a0fecaaf0b54f3be51f4a1e9a2f1ae61032b276d
                                                          dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
   2         0x73873C7A6EB040A1       Shared              None

   $ docker rm slf_dmp 

   $ dcu-ctk dcu-tracker status
   ------------------------------------------------------------------------------------------------------------------------
   GPU Id    UUID                     Accessibility       Container Ids
   ------------------------------------------------------------------------------------------------------------------------
   0         0x73873C7A6EB02041       Shared              dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
   1         0x73873C7A6EB008A1       Shared              dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
   2         0x73873C7A6EB040A1       Shared              None

   ```

- Set a DCU to exclusive

  ```sh
  $ dcu-ctk dcu-tracker 1 exclusive
  DCUs [1] have been made exclusive

  $ dcu-ctk dcu-tracker status
  ------------------------------------------------------------------------------------------------------------------------
  GPU Id    UUID                     Accessibility       Container Ids
  ------------------------------------------------------------------------------------------------------------------------
  0         0x73873C7A6EB02041       Shared              dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
  1         0x73873C7A6EB008A1       Exclusive           dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
  2         0x73873C7A6EB040A1       Shared              None

  $ docker run --name slf_dmp  -e DCU_VISIBLE_DEVICES=0,1 -it a4dd5be0ca23

  docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: failed to 
  create  DTK Container Runtime: failed to construct OCI spec modifier: failed to reserve DCUs: DCUs [1] are exclusive and already in use: unknown.
  ```
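
To avoid the error above in scripts, the status table can be checked before starting a container. The following sketch assumes the column layout of the sample `dcu-ctk dcu-tracker status` output (ID in column 1, accessibility in column 3, first container ID or `None` in column 4); `exclusive_dcus`, `busy_dcus` and `status_sample` are hypothetical names, not dcu-ctk commands.

```sh
# Hypothetical helpers: both read `dcu-ctk dcu-tracker status` output on stdin.

# Print the IDs of DCUs whose accessibility is Exclusive.
exclusive_dcus() {
    awk '$3 == "Exclusive" { print $1 }'
}

# Print the IDs of DCUs that have at least one container attached.
busy_dcus() {
    awk '($3 == "Shared" || $3 == "Exclusive") && $4 != "None" { print $1 }'
}

# Sample rows copied from the status output above, used here in place of a live call.
status_sample() {
    cat <<'EOF'
  GPU Id    UUID                     Accessibility       Container Ids
  0         0x73873C7A6EB02041       Shared              dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
  1         0x73873C7A6EB008A1       Exclusive           dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
  2         0x73873C7A6EB040A1       Shared              None
EOF
}

status_sample | exclusive_dcus   # prints: 1
```

On a real host the same filters would read from `dcu-ctk dcu-tracker status` instead of `status_sample`. A DCU that appears in both lists is exclusive and already in use, so requesting it for a new container is expected to fail as shown above.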