# DCU Container Toolkit

## Introduction

DCU Container Toolkit lets users build and run containers that use DCU devices. The toolkit includes the following components:

- `dcu-container-runtime` - the DCU container runtime
- `dcu-ctk` - the DCU container toolkit command-line tool

## Usage
- `--gpus` requires Docker version 19+
- CDI requires Docker version 25+

First, make sure the driver is installed.
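
The version requirements above can be checked before relying on `--gpus` or CDI. The following is a minimal sketch, not part of the toolkit: the function name `docker_major_at_least` is hypothetical, and it only compares the major version of a version string.

```sh
# Hypothetical helper (not part of dcu-ctk): succeed if the major version
# in a string such as "25.0.3" is at least the given minimum.
docker_major_at_least() {
    version=$1
    minimum=$2
    major=${version%%.*}          # keep everything before the first dot
    case $major in
        ''|*[!0-9]*) return 2 ;;  # not a plain number: report an error
    esac
    [ "$major" -ge "$minimum" ]
}

# Example with a fixed version string.
if docker_major_at_least "24.0.7" 25; then
    echo "CDI supported"
else
    echo "CDI needs Docker 25+"
fi
```

In practice the version string could come from `docker version --format '{{.Server.Version}}'`.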

### Build
If you have network access, build with one of the following commands. Each one starts a Docker container that runs the build.
```sh
make rocky8

make ubuntu22.04
```
For an offline build, run the following command. Only RPM packages are currently supported.
```sh
build.sh 
```

### Installation

Install the package with `dpkg -i` or `rpm -i`. After installation, the following commands run automatically:
```sh
$ dcu-ctk --quiet config --config-file=/etc/dcu-container-runtime/config.toml --in-place  # generate the config file
$ dcu-ctk runtime configure --runtime=docker --set-as-default  # update the runtime in Docker's configuration
```

Restart the Docker service:
```sh
$ sudo systemctl restart docker
```

### Using DCUs in Docker

#### Via the docker CLI
Pass the `--gpus` flag to `docker run` to add HCU devices to a container.
```sh
$ docker run -it --gpus all ubuntu:18.04             # add all HCU devices
$ docker run -it --gpus 1 ubuntu:18.04               # add one HCU device (HCU 0)
$ docker run -it --gpus '"device=0,2"' ubuntu:18.04  # add devices 0 and 2
```
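
When the device indices come from a script, the quoted `device=` value can be assembled programmatically. This is only a sketch; `gpus_device_arg` is a hypothetical helper name, and it simply reproduces the `"device=0,2"` form shown above.

```sh
# Hypothetical helper: join device indices into the value used above,
# e.g. indices 0 and 2 become "device=0,2" (double quotes included).
gpus_device_arg() {
    list=
    for index in "$@"; do
        list="${list:+$list,}$index"   # append with a comma separator
    done
    printf '"device=%s"\n' "$list"
}

gpus_device_arg 0 2
```

The result could then be passed as `docker run -it --gpus "$(gpus_device_arg 0 2)" ubuntu:18.04`.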

#### Via the `DCU_VISIBLE_DEVICES` environment variable
Pass `-e DCU_VISIBLE_DEVICES` to `docker run` to add HCU devices to a container.
```sh
$ docker run -it -e DCU_VISIBLE_DEVICES=all ubuntu:18.04  # add all HCU devices
$ docker run -it -e DCU_VISIBLE_DEVICES=0 ubuntu:18.04    # add HCU device 0
$ docker run -it -e DCU_VISIBLE_DEVICES=0,1 ubuntu:18.04  # add HCU devices 0 and 1
```
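
A script that builds the `DCU_VISIBLE_DEVICES` value may want to validate it first. The sketch below accepts only the forms shown in this README (`all` or a comma-separated list of indices); the function name is hypothetical, and the toolkit itself may accept other forms that this check would reject.

```sh
# Hypothetical validator: accept "all" or a comma-separated list of
# non-negative integers, the only forms shown in this README.
valid_dcu_visible_devices() {
    case $1 in
        all) return 0 ;;
        ''|,*|*,|*,,*|*[!0-9,]*) return 1 ;;   # empty, stray commas, or other characters
        *) return 0 ;;
    esac
}

for value in all 0 0,1 0,,1 gpu0; do
    if valid_dcu_visible_devices "$value"; then
        echo "$value: ok"
    else
        echo "$value: rejected"
    fi
done
```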
#### Via CDI
- First, generate the CDI spec file:
```sh
$ dcu-ctk cdi generate --output=/etc/cdi/dcu.json
```
- Configure Docker to enable CDI:
```sh
$ dcu-ctk runtime configure --runtime=docker --set-as-default --cdi.enabled
```
- Use all DCUs:
```sh
$ docker run --rm --device c-3000.com/hcu=all -it ubuntu:18.04
```
- Use DCU 0 and DCU 1:
```sh
$ docker run --rm --device c-3000.com/hcu=0 --device c-3000.com/hcu=1 -it ubuntu:18.04
```
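
Because each CDI device needs its own `--device` flag, a small helper can generate the flags from a list of device names. This is a sketch under the assumption that the CDI kind is `c-3000.com/hcu`, as in the commands above; `cdi_device_flags` is a hypothetical name.

```sh
# Hypothetical helper: print one "--device c-3000.com/hcu=<name>" pair per
# argument, matching the CDI device names used above.
cdi_device_flags() {
    flags=
    for name in "$@"; do
        flags="${flags:+$flags }--device c-3000.com/hcu=$name"
    done
    printf '%s\n' "$flags"
}

cdi_device_flags 0 1
```

The output is meant to be word-split by the shell, for example `docker run --rm $(cdi_device_flags 0 1) -it ubuntu:18.04`.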
### Using DCUs in Podman
Podman version 2.0+ is required.

- First, set the Podman runtime:
```sh
$ dcu-ctk runtime configure --runtime=podman --set-as-default
```
- Then select devices with the environment variable:
```sh
$ podman run -it -e DCU_VISIBLE_DEVICES=all ubuntu:18.04
```
### Listing available DCUs
```sh
$ dcu-ctk cdi list
INFO[0000] Found 3 CDI devices
c-3000.com/hcu=0
c-3000.com/hcu=1
c-3000.com/hcu=2
c-3000.com/hcu=all
c-3000.com/hcu=hcu-73873c7a6eb008a1
c-3000.com/hcu=hcu-73873c7a6eb02041
c-3000.com/hcu=hcu-73873c7a6eb040a1
```
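
The listing reports 3 devices but prints 7 names. Judging from the output alone, each device appears to be listed under both an index and an `hcu-<id>` name, alongside `all`. A script that only wants the index-style names can filter the output, as in this sketch (`cdi_index_devices` is a hypothetical name):

```sh
# Hypothetical filter: read `dcu-ctk cdi list` output on stdin and print only
# the index-style device names (c-3000.com/hcu=<number>).
cdi_index_devices() {
    grep -E '^c-3000\.com/hcu=[0-9]+$'
}

# Sample input copied from the listing above.
cdi_index_devices <<'EOF'
INFO[0000] Found 3 CDI devices
c-3000.com/hcu=0
c-3000.com/hcu=1
c-3000.com/hcu=2
c-3000.com/hcu=all
c-3000.com/hcu=hcu-73873c7a6eb008a1
c-3000.com/hcu=hcu-73873c7a6eb02041
c-3000.com/hcu=hcu-73873c7a6eb040a1
EOF
```

On a real node the function would be used as `dcu-ctk cdi list | cdi_index_devices`.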
### Restricting file read/write permissions under rootless Docker
When a non-root user mounts a directory with `-v`, the directory can be given write (`w`) permission. To prevent non-root users from deleting files outside their own directories from inside a container, directories that do not belong to the user can be mounted read-only (`ro`).
```sh
$ dcu-ctk rootless --runtime=docker
```
### DCU Tracker
DCU Tracker monitors DCU usage by containers started with `--gpus` or `-e DCU_VISIBLE_DEVICES`, and can also mark each DCU as exclusive or shared. DCU Tracker is disabled by default; enable it with the `dcu-ctk` command line.

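DCU Tracker provides commands to control container access to DCUs. Each DCU can be set to `shared` or `exclusive`:
- `shared`: the DCU can be used by multiple containers at the same time. This is the default.
- `exclusive`: the DCU can be used by only one container at a time.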

```sh
$ dcu-ctk dcu-tracker -h
NAME:
   C-3000 DCU Container Toolkit CLI dcu-tracker - DCU Tracker  related commands

USAGE:
   dcu-ctk dcu-tracker [dcu-ids] [accessibility]
       Arguments:
           dcu-ids          Comma-separated list of DCU IDs (comma separated list, range operator, all)
           accessibility    Must be either 'exclusive' or 'shared'

     Examples:
           dcu-ctk dcu-tracker 0,1,2 exclusive
           dcu-ctk dcu-tracker 0,1-2 shared
           dcu-ctk dcu-tracker all shared

   OR

   dcu-ctk dcu-tracker [command] [options]

COMMANDS:
   disable  Disable the DCU Tracker
   enable   Enable the DCU Tracker
   reset    Reset the DCU Tracker
   status   Show Status of DCUs
   help, h  Shows a list of commands or help for one command

OPTIONS:
   --help, -h  show help
```
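
The `dcu-ids` argument mixes single IDs, ranges, and the keyword `all`. The sketch below shows one way a wrapper script might expand such a value into individual IDs. It assumes ranges are inclusive, which the `0,1-2` example suggests but the help text does not state; `expand_dcu_ids` is a hypothetical helper, not a toolkit command.

```sh
# Hypothetical helper: expand a dcu-ids argument such as "0,1-2" into
# "0 1 2". The keyword "all" is passed through unchanged.
# Assumption: ranges are inclusive on both ends.
expand_dcu_ids() {
    if [ "$1" = all ]; then
        echo all
        return 0
    fi
    result=
    for part in $(printf '%s' "$1" | tr ',' ' '); do
        case $part in
            *-*) first=${part%-*}; last=${part#*-} ;;   # range "a-b"
            *)   first=$part; last=$part ;;             # single ID
        esac
        i=$first
        while [ "$i" -le "$last" ]; do
            result="${result:+$result }$i"
            i=$((i + 1))
        done
    done
    echo "$result"
}

expand_dcu_ids 0,1-2   # prints: 0 1 2
```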

### Using DCU Tracker

Use `rocm-smi` to view the DCUs on the node:
```sh
$ rocm-smi


========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    52.0c           56.0W   600Mhz  1000Mhz  0%   auto  400.0W    0%   0%
1    57.0c           48.0W   600Mhz  1000Mhz  0%   auto  400.0W    0%   0%
2    52.0c           66.0W   600Mhz  1000Mhz  0%   auto  400.0W    0%   0%
====================================================================================
=============================== End of ROCm SMI Log ================================

```
- Check the DCU Tracker status
   When DCU Tracker is enabled, DCUs are given `shared` accessibility by default.

   ```sh
   $ dcu-ctk dcu-tracker status
   ------------------------------------------------------------------------------------------------------------------------
   GPU Id    UUID                     Accessibility       Container Ids
   ------------------------------------------------------------------------------------------------------------------------
   0         0x73873C7A6EB02041       Shared              None
   1         0x73873C7A6EB008A1       Shared              None
   2         0x73873C7A6EB040A1       Shared              None
   ```

   If DCU Tracker is not enabled, the command reports that instead:
   ```sh
   $ dcu-ctk dcu-tracker status
   DCU Tracker is disabled
   ```

- Enable DCU Tracker

  ```sh
  $ dcu-ctk dcu-tracker status
  DCU Tracker is disabled

  $ dcu-ctk dcu-tracker enable
  DCU Tracker has been enabled

  $ dcu-ctk dcu-tracker enable
  DCU Tracker is already enabled
  ```

- Disable DCU Tracker

  ```sh
  $ dcu-ctk dcu-tracker disable
  DCU Tracker has been disabled
  ```

- Set DCU accessibility
   When DCU Tracker is enabled, it automatically records which DCUs each container uses when the container starts.

   ```sh
   $ docker run --name slf_dmps -e DCU_VISIBLE_DEVICES=0,1 -it a4dd5be0ca23

   $ docker run --name slf_dmp -e DCU_VISIBLE_DEVICES=0,1 -it a4dd5be0ca23

   $ dcu-ctk dcu-tracker status
   ------------------------------------------------------------------------------------------------------------------------
   GPU Id    UUID                     Accessibility       Container Ids
   ------------------------------------------------------------------------------------------------------------------------
   0         0x73873C7A6EB02041       Shared              3d07700f961485c678999ea1a0fecaaf0b54f3be51f4a1e9a2f1ae61032b276d
                                                          dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
   1         0x73873C7A6EB008A1       Shared              3d07700f961485c678999ea1a0fecaaf0b54f3be51f4a1e9a2f1ae61032b276d
                                                          dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
   2         0x73873C7A6EB040A1       Shared              None

   $ docker rm slf_dmp

   $ dcu-ctk dcu-tracker status
   ------------------------------------------------------------------------------------------------------------------------
   GPU Id    UUID                     Accessibility       Container Ids
   ------------------------------------------------------------------------------------------------------------------------
   0         0x73873C7A6EB02041       Shared              dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
   1         0x73873C7A6EB008A1       Shared              dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
   2         0x73873C7A6EB040A1       Shared              None

   ```
- Set a DCU to exclusive

  ```sh
  $ dcu-ctk dcu-tracker 1 exclusive
  DCUs [1] have been made exclusive

  $ dcu-ctk dcu-tracker status
  ------------------------------------------------------------------------------------------------------------------------
  GPU Id    UUID                     Accessibility       Container Ids
  ------------------------------------------------------------------------------------------------------------------------
  0         0x73873C7A6EB02041       Shared              dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
  1         0x73873C7A6EB008A1       Exclusive           dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
  2         0x73873C7A6EB040A1       Shared              None

  $ docker run --name slf_dmp -e DCU_VISIBLE_DEVICES=0,1 -it a4dd5be0ca23

  docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: failed to
  create  DTK Container Runtime: failed to construct OCI spec modifier: failed to reserve DCUs: DCUs [1] are exclusive and already in use: unknown.
  ```
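
The failure above can be anticipated by inspecting the status table before starting a container. The following sketch reads status output on stdin and prints the requested DCU IDs that are `Exclusive` and already have a container attached. It assumes the four-column layout shown above; `busy_exclusive_dcus` is a hypothetical helper, not a toolkit command.

```sh
# Hypothetical pre-flight check: read `dcu-ctk dcu-tracker status` output on
# stdin and print the requested DCU IDs that are Exclusive and already in use.
# Assumes the four-column table layout shown above.
busy_exclusive_dcus() {
    awk -v wanted="$1" '
        BEGIN {
            n = split(wanted, ids, ",")
            for (i = 1; i <= n; i++) want[ids[i]] = 1
        }
        # Data rows have at least four fields and start with a numeric DCU id;
        # header, separator, and continuation rows are skipped.
        NF >= 4 && $1 ~ /^[0-9]+$/ && ($1 in want) && $3 == "Exclusive" && $4 != "None" {
            print $1
        }
    '
}

# Sample rows copied from the status table above; prints: 1
busy_exclusive_dcus 0,1 <<'EOF'
GPU Id    UUID                     Accessibility       Container Ids
0         0x73873C7A6EB02041       Shared              dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
1         0x73873C7A6EB008A1       Exclusive           dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
2         0x73873C7A6EB040A1       Shared              None
EOF
```

On a real node the input would come from `dcu-ctk dcu-tracker status | busy_exclusive_dcus 0,1`; an empty result means none of the requested DCUs is blocked.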

### Docker Swarm 

DCU UUIDs can be used with Docker Swarm deployments, enabling precise GPU resource allocation across cluster nodes.

#### Configuring the Docker daemon for Swarm

```json
{
    "default-runtime": "dtk",
    "node-generic-resources": [
        "HY_DCU=0x73873c7a6eb02041",
        "HY_DCU=0x73873c7a6eb008a1",
        "HY_DCU=0x73873c7a6eb040a1"
    ],
    "runtimes": {
        "dtk": {
            "args": [],
            "path": "dcu-container-runtime"
        }
    }
}
```
After updating the configuration, restart the Docker daemon:
```sh
sudo systemctl restart docker
```
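
The `node-generic-resources` entries above use lowercase UUIDs, while the DCU Tracker status table prints them in uppercase. A small helper can generate the entries from UUIDs copied from the table. This is a sketch only: `swarm_resource_entries` is a hypothetical name, and whether Docker requires the lowercase form is an assumption based on the sample configuration.

```sh
# Hypothetical helper: turn UUIDs as printed by the status table
# (e.g. 0x73873C7A6EB02041) into "HY_DCU=<lowercase uuid>" entries.
swarm_resource_entries() {
    for uuid in "$@"; do
        # Lowercase only the UUID, not the HY_DCU key.
        lower=$(printf '%s' "$uuid" | tr '[:upper:]' '[:lower:]')
        printf 'HY_DCU=%s\n' "$lower"
    done
}

swarm_resource_entries 0x73873C7A6EB02041 0x73873C7A6EB008A1
```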

#### Service Definition
Use a docker-compose file to deploy a service with specific DCU requirements:
```yaml
# docker-compose.yml for Swarm deployment
version: '3.8'
services:
  rocm-service:
    image: image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
    tty: true
    stdin_open: true
    command: /bin/bash
    deploy:
      replicas: 1
      resources:
        reservations:
          generic_resources:
            - discrete_resource_spec:
                kind: 'HY_DCU'  # Matches daemon.json key
                value: 2
```

Deploy the service:
```sh
docker stack deploy -c docker-compose.yml rocm-stack
```

### Configuring containerd
To use containerd, first update its configuration file, `/etc/containerd/config.toml`:
```shell
$ dcu-ctk runtime configure --runtime=containerd --set-as-default --cdi.enabled
$ systemctl restart containerd
```
For Kubernetes, use the toolkit together with [dcu-device-plugin](https://download.sourcefind.cn:65024/5/main/Kubernetes%E6%8F%92%E4%BB%B6). With the nerdctl command-line tool, pass `--runtime`:
```shell
$ nerdctl run --rm --runtime dcu -e DCU_VISIBLE_DEVICES=0,1 ubuntu:18.04 bash
```
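### Querying DCU usage by Docker containers
If dcu-tracker is enabled, use it to query which DCUs Docker containers are using. If dcu-tracker is not enabled, use the command below to list the DCUs that Docker containers are actively using. A DCU that is only mounted into a container but not in use does not appear in the list.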
```shell
$ dcu-ctk docker
------------------------------------------------------------------------------------------------------------------------
DCU Id                                  UUID                                              Container Names
------------------------------------------------------------------------------------------------------------------------
0                                       0x73873C7A6EB02041                                peaceful_hawking
------------------------------------------------------------------------------------------------------------------------
```

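### Mounting vDCUs in Docker
vDCUs can be mounted into Docker containers. Start the container with the `-e VDCU_VISIBLE_DEVICES` environment variable. For the constraints that apply to vDCUs, see the [DCU Virtualization User Guide](https://download.sourcefind.cn:65024/directlink/5/Kubernetes%E6%8F%92%E4%BB%B6/DCU%E8%99%9A%E6%8B%9F%E5%8C%96%E7%94%A8%E6%88%B7%E6%8C%87%E5%8D%97.pdf).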
```
# Split DCU 0 into 4 vDCUs, each with 8 GB of memory and 30 compute units
$ hy-smi virtual -d 0 -create-vdevices 4 -vdevice-compute-units 30,30,30,30 -vdevice-memory-size 8192,8192,8192,8192

# Show the allocated vDCUs
$ hy-smi virtual --show-vdevice-info
Virtual Device 0:
        Physical Device: 0
        Compute units: 30
        Global memory: 8589934592 bytes
Virtual Device 1:
        Physical Device: 0
        Compute units: 30
        Global memory: 8589934592 bytes
Virtual Device 2:
        Physical Device: 0
        Compute units: 30
        Global memory: 8589934592 bytes
Virtual Device 3:
        Physical Device: 0
        Compute units: 30
        Global memory: 8589934592 bytes

# Start a container with docker run
$ docker run --rm -e VDCU_VISIBLE_DEVICES=1,2 -it a4dd5be0ca23 /bin/bash

```
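
The per-vDCU lists passed to `hy-smi` above repeat the same value once per vDCU, and the reported `Global memory` of 8589934592 bytes is consistent with a memory size of 8192 given in MiB. The sketch below builds such lists and checks the arithmetic; `repeat_csv` is a hypothetical helper, and the MiB unit is inferred from the output rather than stated in this README.

```sh
# Hypothetical helper: repeat a value N times as a comma-separated list,
# e.g. for the compute-unit and memory-size lists shown above.
repeat_csv() {
    count=$1
    value=$2
    list=
    i=0
    while [ "$i" -lt "$count" ]; do
        list="${list:+$list,}$value"
        i=$((i + 1))
    done
    echo "$list"
}

repeat_csv 4 30      # prints: 30,30,30,30
repeat_csv 4 8192    # prints: 8192,8192,8192,8192

# 8192 MiB expressed in bytes matches the "Global memory" value above.
echo $((8192 * 1024 * 1024))   # prints: 8589934592
```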