"vscode:/vscode.git/clone" did not exist on "1d166e211ceb3221cde9698f107e58596e6aeab8"
# DCU Container Toolkit

## Introduction

DCU Container Toolkit lets users build and run containers that use DCU devices. The toolkit consists of the following components.

- ```dcu-container-toolkit``` - the DCU container runtime
- ```dcu-ctk``` - the command-line interface for the DCU container tools

## Usage
- `--gpus` requires Docker 19 or later
- CDI requires Docker 25 or later

First, make sure DTK is installed.
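
The Docker version requirements listed above can be checked before going further. The following is a minimal sketch and not part of the toolkit: the helper name `docker_feature_support` is hypothetical, and it looks only at the major version number.

```sh
# Hypothetical helper: given a Docker version string, print which of the
# mechanisms described in this README it supports ("gpus", "gpus cdi", or "none").
# Only the major version is inspected.
docker_feature_support() {
    version="$1"
    major="${version%%.*}"            # "25.0.3" -> "25"
    case "$major" in
        ''|*[!0-9]*) echo "unknown"; return 1 ;;
    esac
    if [ "$major" -ge 25 ]; then
        echo "gpus cdi"
    elif [ "$major" -ge 19 ]; then
        echo "gpus"
    else
        echo "none"
    fi
}

# Example (requires a running Docker daemon):
#   docker_feature_support "$(docker version --format '{{.Server.Version}}')"
```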

### Installation

Install the package with `dpkg -i` or `rpm -i`. The following commands run automatically after installation:
```sh
$ dcu-ctk --quiet config --config-file=/etc/dcu-container-runtime/config.toml --in-place  # generate the configuration file
$ dcu-ctk runtime configure --runtime=docker --set-as-default  # set the runtime in Docker's configuration
```

Restart the Docker service:
```sh
$ sudo systemctl restart docker
```

### Using DCUs in Docker

#### Using the docker CLI
Pass the `--gpus` flag to `docker run` to add DCU devices to a container.
```sh
$ docker run -it --gpus all ubuntu:18.04             # add all DCU devices
$ docker run -it --gpus 1 ubuntu:18.04               # add one DCU device (DCU 0)
$ docker run -it --gpus '"device=0,2"' ubuntu:18.04  # add DCU 0 and DCU 2
```

#### Using the `DCU_VISIBLE_DEVICES` environment variable
Set the `DCU_VISIBLE_DEVICES` environment variable with `docker run -e` to add DCU devices to a container.
```sh
$ docker run -it -e DCU_VISIBLE_DEVICES=all ubuntu:18.04  # add all DCU devices
$ docker run -it -e DCU_VISIBLE_DEVICES=0 ubuntu:18.04    # add DCU 0
$ docker run -it -e DCU_VISIBLE_DEVICES=0,1 ubuntu:18.04  # add DCU 0 and DCU 1
```
#### Using CDI
- First, generate the CDI spec file:
```sh
$ dcu-ctk cdi generate --output=/etc/cdi/dtk.json
```
- Configure Docker to enable CDI:
```sh
$ dcu-ctk runtime configure --runtime=docker --set-as-default --cdi.enabled
```
- Use all DCUs:
```sh
$ docker run --rm --device c-3000.com/hcu=all -it ubuntu:18.04
```
- Use DCU 0 and DCU 1:
```sh
$ docker run --rm --device c-3000.com/hcu=0 --device c-3000.com/hcu=1 -it ubuntu:18.04
```
### Using DCUs in Podman
Podman 2.0 or later is required.
- First, configure the Podman runtime:
```sh
$ dcu-ctk runtime configure --runtime=podman --set-as-default
```
- Then use DCUs through the environment variable:
```sh
$ podman run -it -e DCU_VISIBLE_DEVICES=all ubuntu:18.04
```
### Listing available DCUs
```sh
$ dcu-ctk cdi list
INFO[0000] Found 3 CDI devices
c-3000.com/hcu=0
c-3000.com/hcu=1
c-3000.com/hcu=2
c-3000.com/hcu=all
c-3000.com/hcu=hcu-73873c7a6eb008a1
c-3000.com/hcu=hcu-73873c7a6eb02041
c-3000.com/hcu=hcu-73873c7a6eb040a1
```
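The device names printed by `dcu-ctk cdi list` can be passed to `docker run --device`. As a rough sketch, assuming the output format shown above, the index-based names can be turned into `--device` flags as follows. The function name `cdi_device_flags` is made up for this example and is not part of the toolkit.

```sh
# Hypothetical helper: read `dcu-ctk cdi list` output on stdin and print a
# single line of --device flags, one per index-based device name
# (c-3000.com/hcu=0, c-3000.com/hcu=1, ...).
# Log lines, the "all" alias, and the UUID-style names are skipped.
cdi_device_flags() {
    awk '$1 ~ /^c-3000\.com\/hcu=[0-9]+$/ {
             printf "%s--device %s", sep, $1
             sep = " "
         }
         END { if (sep != "") print "" }'
}

# Example (requires dcu-ctk and Docker with CDI enabled):
#   docker run --rm $(dcu-ctk cdi list | cdi_device_flags) -it ubuntu:18.04
```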
### File permission restrictions in rootless Docker
A non-root user who mounts a directory with `-v` keeps the directory's original permissions and cannot add write permission to a read-only directory. The following command must be run with root privileges:
```sh
$ dcu-ctk rootless --runtime=docker
```
### DCU Tracker
DCU Tracker monitors DCU usage by containers started with `--gpus` or `-e DCU_VISIBLE_DEVICES`, and can also mark a DCU as exclusive or shared. DCU Tracker is disabled by default and can be enabled with the `dcu-ctk` command line.

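DCU Tracker provides commands to control container access to DCUs. Each DCU can be set to `shared` or `exclusive`.
- `shared`: the DCU can be used by multiple containers at the same time. This is the default.
- `exclusive`: the DCU can be used by only one container at a time.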

```sh
$ dcu-ctk dcu-tracker -h
NAME:
   C-3000 DCU Container Toolkit CLI dcu-tracker - DCU Tracker  related commands

USAGE:
   dcu-ctk dcu-tracker [dcu-ids] [accessibility]
       Arguments:
           dcu-ids          Comma-separated list of DCU IDs (comma separated list, range operator, all)
           accessibility    Must be either 'exclusive' or 'shared'

     Examples:
           dcu-ctk dcu-tracker 0,1,2 exclusive
           dcu-ctk dcu-tracker 0,1-2 shared
           dcu-ctk dcu-tracker all shared

   OR

   dcu-ctk dcu-tracker [command] [options]

COMMANDS:
   disable  Disable the DCU Tracker
   enable   Enable the DCU Tracker
   reset    Reset the DCU Tracker
   status   Show Status of DCUs
   help, h  Shows a list of commands or help for one command

OPTIONS:
   --help, -h  show help
```
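The `dcu-ids` argument accepts single IDs, ranges, and `all`. To illustrate how a list such as `0,1-2` reads, here is a small sketch that expands it into individual IDs. This is not the toolkit's own parser: the function `expand_dcu_ids` is hypothetical, it assumes ranges are inclusive, it does no input validation, and it passes `all` through unchanged.

```sh
# Hypothetical helper: expand a dcu-ids argument such as "0,1-2" into a
# space-separated list of individual IDs. Ranges are assumed to be inclusive.
expand_dcu_ids() {
    spec="$1"
    if [ "$spec" = "all" ]; then
        echo "all"
        return 0
    fi
    result=""
    old_ifs="$IFS"
    IFS=','                               # split the argument on commas
    for part in $spec; do
        case "$part" in
            *-*)                          # a range such as "1-2"
                i="${part%-*}"
                end="${part#*-}"
                while [ "$i" -le "$end" ]; do
                    result="$result $i"
                    i=$((i + 1))
                done
                ;;
            *)                            # a single ID
                result="$result $part"
                ;;
        esac
    done
    IFS="$old_ifs"
    echo "${result# }"
}
```

Under these assumptions, `expand_dcu_ids 0,1-2` and `expand_dcu_ids 0,1,2` name the same three DCUs.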

### Using DCU Tracker

Use `rocm-smi` to list the DCUs on the node:

```sh
$ rocm-smi


========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    52.0c           56.0W   600Mhz  1000Mhz  0%   auto  400.0W    0%   0%
1    57.0c           48.0W   600Mhz  1000Mhz  0%   auto  400.0W    0%   0%
2    52.0c           66.0W   600Mhz  1000Mhz  0%   auto  400.0W    0%   0%
====================================================================================
=============================== End of ROCm SMI Log ================================

```
- Check the DCU Tracker status
   When DCU Tracker is enabled, every DCU is `shared` by default.

   ```sh
   $ dcu-ctk dcu-tracker status
   ------------------------------------------------------------------------------------------------------------------------
   GPU Id    UUID                     Accessibility       Container Ids
   ------------------------------------------------------------------------------------------------------------------------
   0         0x73873C7A6EB02041       Shared              None
   1         0x73873C7A6EB008A1       Shared              None
   2         0x73873C7A6EB040A1       Shared              None
   ```

   If DCU Tracker is not enabled, the command reports that instead:
   ```sh
   $ dcu-ctk dcu-tracker status
   DCU Tracker is disabled
   ```

- Enable DCU Tracker

  ```sh

  $ dcu-ctk dcu-tracker status
  DCU Tracker is disabled

  $ dcu-ctk dcu-tracker enable
  DCU Tracker has been enabled

  $ dcu-ctk dcu-tracker enable
  DCU Tracker is already enabled
  ```

- Disable DCU Tracker

  ```sh
  $ dcu-ctk dcu-tracker disable
  DCU Tracker has been disabled
  ```

- Set DCU access permissions
   When DCU Tracker is enabled, it automatically records which DCUs each container uses when the container starts.

   ```sh
   $ docker run --name slf_dmps -e DCU_VISIBLE_DEVICES=0,1 -it a4dd5be0ca23 

   $ docker run --name slf_dmp -e DCU_VISIBLE_DEVICES=0,1 -it a4dd5be0ca23

   $ dcu-ctk dcu-tracker status
   ------------------------------------------------------------------------------------------------------------------------
   GPU Id    UUID                     Accessibility       Container Ids
   ------------------------------------------------------------------------------------------------------------------------
   0         0x73873C7A6EB02041       Shared              3d07700f961485c678999ea1a0fecaaf0b54f3be51f4a1e9a2f1ae61032b276d
                                                          dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
   1         0x73873C7A6EB008A1       Shared              3d07700f961485c678999ea1a0fecaaf0b54f3be51f4a1e9a2f1ae61032b276d
                                                          dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
   2         0x73873C7A6EB040A1       Shared              None

   $ docker rm slf_dmp 

   $ dcu-ctk dcu-tracker status
   ------------------------------------------------------------------------------------------------------------------------
   GPU Id    UUID                     Accessibility       Container Ids
   ------------------------------------------------------------------------------------------------------------------------
   0         0x73873C7A6EB02041       Shared              dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
   1         0x73873C7A6EB008A1       Shared              dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
   2         0x73873C7A6EB040A1       Shared              None

   ```
   
- Set a DCU to exclusive
 
  ```sh
  $ dcu-ctk dcu-tracker 1 exclusive
  DCUs [1] have been made exclusive

  $ dcu-ctk dcu-tracker status
  ------------------------------------------------------------------------------------------------------------------------
  GPU Id    UUID                     Accessibility       Container Ids
  ------------------------------------------------------------------------------------------------------------------------
  0         0x73873C7A6EB02041       Shared              dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
  1         0x73873C7A6EB008A1       Exclusive           dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
  2         0x73873C7A6EB040A1       Shared              None

  $ docker run --name slf_dmp  -e DCU_VISIBLE_DEVICES=0,1 -it a4dd5be0ca23

  docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: failed to 
  create  DTK Container Runtime: failed to construct OCI spec modifier: failed to reserve DCUs: DCUs [1] are exclusive and already in use: unknown.
  ```
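When scripting around the tracker, the status table shown in the examples above can be filtered to find DCUs that no container is using. The sketch below assumes the column layout printed above (ID, UUID, accessibility, container IDs). The function name `free_dcus` is hypothetical and not part of `dcu-ctk`.

```sh
# Hypothetical helper: read `dcu-ctk dcu-tracker status` output on stdin and
# print the ID of every DCU whose "Container Ids" column is "None".
# Header, separator, and continuation lines are ignored because their first
# field is not a plain number.
free_dcus() {
    awk '$1 ~ /^[0-9]+$/ && $4 == "None" { print $1 }'
}

# Example (requires dcu-ctk with DCU Tracker enabled):
#   dcu-ctk dcu-tracker status | free_dcus
```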

### Docker Swarm 

DCU UUIDs can be used with Docker Swarm deployments, which allows DCU resources to be allocated precisely across cluster nodes.

#### Configuring the Docker daemon for Swarm

```json
{
    "default-runtime": "dtk",
    "node-generic-resources": [
        "HY_DCU=0x73873c7a6eb02041",
        "HY_DCU=0x73873c7a6eb008a1",
        "HY_DCU=0x73873c7a6eb040a1"
    ],
    "runtimes": {
        "dtk": {
            "args": [],
            "path": "dcu-container-runtime"
        }
    }
}
```
After changing the configuration, restart the Docker daemon:
```sh
sudo systemctl restart docker
```
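The `node-generic-resources` entries in the configuration above use the same UUIDs that `dcu-ctk dcu-tracker status` prints, written in lowercase. As a sketch only, the entries could be generated from that output as shown below. The function name `hy_dcu_resources` is hypothetical, and whether the daemon is sensitive to the case of the UUID is an assumption made here to match the example.

```sh
# Hypothetical helper: read `dcu-ctk dcu-tracker status` output on stdin and
# print one "HY_DCU=<uuid>" entry per DCU, with the UUID lowercased to match
# the daemon.json example.
hy_dcu_resources() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^0x/ { print "HY_DCU=" tolower($2) }'
}

# Example (requires dcu-ctk with DCU Tracker enabled):
#   dcu-ctk dcu-tracker status | hy_dcu_resources
```

Each printed line still has to be quoted and added to the `node-generic-resources` array by hand or with a JSON tool.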

#### Service Definition
Use a docker-compose file to deploy a service with specific DCU requirements:
```yaml
# docker-compose.yml for Swarm deployment
version: '3.8'
services:
  rocm-service:
    image: image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
    tty: true
    stdin_open: true
    command: /bin/bash
    deploy:
      replicas: 1
      resources:
        reservations:
          generic_resources:
            - discrete_resource_spec:
                kind: 'HY_DCU'  # Matches daemon.json key
                value: 2
```

Deploy the service:
```sh
docker stack deploy -c docker-compose.yml rocm-stack
```