# DTK Container Toolkit

## Introduction

DTK Container Toolkit lets users build and run containers that use DCU devices. The toolkit includes the following tools:

- ```dtk-container-toolkit``` - the DTK container runtime
- ```dtk-ctk``` - the DTK container toolkit command-line interface

## Usage
- `--gpus` requires Docker version 19+
- CDI requires Docker version 25+
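
The version minimums above can be checked before choosing a method. The following is a minimal sketch: `supported_methods` is a hypothetical helper (not part of `dtk-ctk`) that only compares the major version against the two thresholds listed above.

```sh
# Hypothetical helper: report which of the two methods above a Docker version supports.
# Thresholds come from the list above: --gpus needs 19+, CDI needs 25+.
supported_methods() {
    major=${1%%.*}                      # "25.0.3" -> "25"
    case "$major" in
        ''|*[!0-9]*) echo "unrecognized version: $1" >&2; return 1 ;;
    esac
    methods=""
    if [ "$major" -ge 19 ]; then methods="--gpus"; fi
    if [ "$major" -ge 25 ]; then methods="$methods cdi"; fi
    printf '%s\n' "${methods:-none}"
}

supported_methods "25.0.3"   # prints: --gpus cdi
supported_methods "20.10.7"  # prints: --gpus
```

On a real host, the version string can be read with `docker version --format '{{.Server.Version}}'` and passed to the helper.
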
First, make sure DTK is installed.

### Installation

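Install the package with `dpkg -i` or `rpm -i`. After installation, the following commands run automatically:
```sh
$ dtk-ctk --quiet config --config-file=/etc/dtk-container-runtime/config.toml --in-place # generate the configuration file
$ dtk-ctk runtime configure --runtime=docker --set-as-default  # update the runtime in Docker's configuration
```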

Restart the Docker service:
```sh
$ sudo systemctl restart docker
```

### Using DCUs in Docker

#### Using the docker CLI
Pass `--gpus` to `docker run` to add HCU devices to a container.
```sh
$ docker run -it --gpus all ubuntu:18.04    # add all HCU devices
$ docker run -it --gpus 1 ubuntu:18.04 # add one HCU device (HCU 0)
$ docker run -it --gpus "device=0,2" ubuntu:18.04 # add devices 0 and 2
```

#### Using the `DTK_VISIBLE_DEVICES` environment variable
Pass `-e DTK_VISIBLE_DEVICES` to `docker run` to add HCU devices to a container.
```sh
docker run -it -e DTK_VISIBLE_DEVICES=all ubuntu:18.04 # add all HCU devices
docker run -it -e DTK_VISIBLE_DEVICES=0 ubuntu:18.04 # add HCU device 0
docker run -it -e DTK_VISIBLE_DEVICES=0,1 ubuntu:18.04 # add HCU devices 0 and 1
```
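
When the device list comes from a script, it can help to validate it before starting the container. The sketch below only prints the command and runs nothing. `dtk_run_cmd` is a hypothetical helper, and it accepts only the forms shown above (`all` or comma-separated indices); whether the runtime accepts other forms is not covered here.

```sh
# Hypothetical helper: validate a device list and print the matching docker command.
# Accepts "all" or comma-separated indices such as "0" or "0,1".
dtk_run_cmd() {
    devices=$1
    image=$2
    case "$devices" in
        all) ;;
        ''|*[!0-9,]*|,*|*,|*,,*)
            echo "invalid device list: $devices" >&2
            return 1 ;;
    esac
    printf 'docker run -it -e DTK_VISIBLE_DEVICES=%s %s\n' "$devices" "$image"
}

dtk_run_cmd 0,1 ubuntu:18.04
# prints: docker run -it -e DTK_VISIBLE_DEVICES=0,1 ubuntu:18.04
```
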
#### Using CDI
- First, generate the CDI spec file
```sh
$ dtk-ctk cdi generate --output=/etc/cdi/dtk.json
```
- Configure Docker to enable CDI
```sh
$ dtk-ctk runtime configure --runtime=docker --set-as-default --cdi.enabled
```
- Use all DCUs
```sh
$ docker run --rm --device c-3000.com/hcu=all -it ubuntu:18.04
```
- Use DCU 0 and DCU 1
```sh
$ docker run --rm --device c-3000.com/hcu=0 --device c-3000.com/hcu=1 -it ubuntu:18.04
```
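
With CDI, each device needs its own `--device` flag, as in the last command above. The following sketch builds those flags from a comma-separated list. `cdi_device_flags` is a hypothetical helper, not part of `dtk-ctk`; the `c-3000.com/hcu` prefix is taken from the examples above.

```sh
# Hypothetical helper: turn "0,1" into one --device flag per DCU.
cdi_device_flags() {
    flags=""
    old_ifs=$IFS
    IFS=,                               # split the argument on commas
    for id in $1; do
        flags="$flags --device c-3000.com/hcu=$id"
    done
    IFS=$old_ifs
    printf '%s\n' "${flags# }"          # drop the leading space
}

echo "docker run --rm $(cdi_device_flags 0,1) -it ubuntu:18.04"
# prints: docker run --rm --device c-3000.com/hcu=0 --device c-3000.com/hcu=1 -it ubuntu:18.04
```
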
### Using DCUs in Podman
Podman version 2.0+ is required.
- First, change Podman's runtime
```sh
$ dtk-ctk runtime configure --runtime=podman --set-as-default
```
- Use DCUs through the environment variable
```sh
$ podman run -it -e DTK_VISIBLE_DEVICES=all ubuntu:18.04
```
### Listing available DCUs
```sh
$ dtk-ctk cdi list
INFO[0000] Found 3 CDI devices
c-3000.com/hcu=0
c-3000.com/hcu=1
c-3000.com/hcu=2
c-3000.com/hcu=all
c-3000.com/hcu=hcu-73873c7a6eb008a1
c-3000.com/hcu=hcu-73873c7a6eb02041
c-3000.com/hcu=hcu-73873c7a6eb040a1
```
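
In the listing above, each of the 3 devices appears to be listed twice (once by index and once by an `hcu-<id>` name), alongside the `all` alias. A script that needs only the index-style names could filter the output as sketched below. `cdi_indexed_devices` is a hypothetical helper, and the sample input is copied from the listing above rather than produced by a live command.

```sh
# Hypothetical helper: keep only index-style names (c-3000.com/hcu=<number>).
cdi_indexed_devices() {
    grep -E '^c-3000\.com/hcu=[0-9]+$'
}

# Sample input copied from the listing above.
cdi_indexed_devices <<'EOF'
INFO[0000] Found 3 CDI devices
c-3000.com/hcu=0
c-3000.com/hcu=1
c-3000.com/hcu=2
c-3000.com/hcu=all
c-3000.com/hcu=hcu-73873c7a6eb008a1
c-3000.com/hcu=hcu-73873c7a6eb02041
c-3000.com/hcu=hcu-73873c7a6eb040a1
EOF
# prints the three lines ending in hcu=0, hcu=1 and hcu=2
```

On a real host the same filter can be applied as `dtk-ctk cdi list | cdi_indexed_devices`.
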
### File read/write permission limits under rootless Docker
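When a non-root user mounts a directory with `-v`, the original permissions are preserved, so write permission cannot be added to a read-only directory. Running the following command requires root privileges: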
```sh
$ dtk-ctk rootless --runtime=docker
```
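### DCU Tracker
DCU Tracker monitors DCU usage by containers started with `--gpus` or `-e DTK_VISIBLE_DEVICES`, and can also mark each DCU as exclusive or shared. DCU Tracker is disabled by default; enable it with the `dtk-ctk` command line.

DCU Tracker provides commands to control container access to DCUs. Each DCU can be set to shared or exclusive:
- shared: the DCU can be used by multiple containers at the same time. This is the default.
- exclusive: the DCU can be used by only one container at a time.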
```sh
$ dtk-ctk dcu-tracker -h
NAME:
   C-3000 DTK Container Toolkit CLI dcu-tracker - DCU Tracker  related commands

USAGE:
   dtk-ctk dcu-tracker [dcu-ids] [accessibility]
       Arguments:
           dcu-ids          Comma-separated list of DCU IDs (comma separated list, range operator, all)
           accessibility    Must be either 'exclusive' or 'shared'

     Examples:
           dtk-ctk dcu-tracker 0,1,2 exclusive
           dtk-ctk dcu-tracker 0,1-2 shared
           dtk-ctk dcu-tracker all shared

   OR

   dtk-ctk dcu-tracker [command] [options]

COMMANDS:
   disable  Disable the DCU Tracker
   enable   Enable the DCU Tracker
   reset    Reset the DCU Tracker
   status   Show Status of DCUs
   help, h  Shows a list of commands or help for one command

OPTIONS:
   --help, -h  show help
```
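
The `dcu-ids` argument in the help text above accepts comma-separated IDs, ranges, or `all`. The sketch below illustrates how such a list expands. `expand_dcu_ids` is a hypothetical helper, not the tool's own parser, and it assumes that ranges are inclusive.

```sh
# Hypothetical helper: expand "0,1-2" into "0 1 2"; "all" passes through unchanged.
# Assumption: a range such as 1-2 includes both endpoints.
expand_dcu_ids() {
    if [ "$1" = "all" ]; then
        echo all
        return 0
    fi
    out=""
    old_ifs=$IFS
    IFS=,                               # split the argument on commas
    for part in $1; do
        case "$part" in
            *-*)
                i=${part%-*}            # range start
                end=${part#*-}          # range end
                while [ "$i" -le "$end" ]; do
                    out="$out $i"
                    i=$((i + 1))
                done ;;
            *) out="$out $part" ;;
        esac
    done
    IFS=$old_ifs
    printf '%s\n' "${out# }"
}

expand_dcu_ids 0,1-2   # prints: 0 1 2
```
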

### Using DCU Tracker
Use `rocm-smi` to view the DCUs on the node:
```sh
$ rocm-smi


========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    52.0c           56.0W   600Mhz  1000Mhz  0%   auto  400.0W    0%   0%
1    57.0c           48.0W   600Mhz  1000Mhz  0%   auto  400.0W    0%   0%
2    52.0c           66.0W   600Mhz  1000Mhz  0%   auto  400.0W    0%   0%
====================================================================================
=============================== End of ROCm SMI Log ================================

```
- Check the DCU Tracker status

   When DCU Tracker is enabled, DCUs are assigned the shared access mode by default.
   ```sh
   $ dtk-ctk dcu-tracker status
   ------------------------------------------------------------------------------------------------------------------------
   GPU Id    UUID                     Accessibility       Container Ids
   ------------------------------------------------------------------------------------------------------------------------
   0         0x73873C7A6EB02041       Shared              None
   1         0x73873C7A6EB008A1       Shared              None
   2         0x73873C7A6EB040A1       Shared              None
   ```

   When DCU Tracker is not enabled, the command reports that instead.
   ```sh
   $ dtk-ctk dcu-tracker status
   DCU Tracker is disabled
   ```
- Enable DCU Tracker
  ```sh
  $ dtk-ctk dcu-tracker status
  DCU Tracker is disabled

  $ dtk-ctk dcu-tracker enable
  DCU Tracker has been enabled

  $ dtk-ctk dcu-tracker enable
  DCU Tracker is already enabled
  ```
- Disable DCU Tracker
  ```sh
  $ dtk-ctk dcu-tracker disable
  DCU Tracker has been disabled
  ```
- Set DCU access permissions

   When DCU Tracker is enabled, it automatically records which DCUs each container uses when the container starts.
   ```sh
   $ docker run --name slf_dmps -e DTK_VISIBLE_DEVICES=0,1 -it a4dd5be0ca23

   $ docker run --name slf_dmp -e DTK_VISIBLE_DEVICES=0,1 -it a4dd5be0ca23

   $ dtk-ctk dcu-tracker status
   ------------------------------------------------------------------------------------------------------------------------
   GPU Id    UUID                     Accessibility       Container Ids
   ------------------------------------------------------------------------------------------------------------------------
   0         0x73873C7A6EB02041       Shared              3d07700f961485c678999ea1a0fecaaf0b54f3be51f4a1e9a2f1ae61032b276d
                                                          dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
   1         0x73873C7A6EB008A1       Shared              3d07700f961485c678999ea1a0fecaaf0b54f3be51f4a1e9a2f1ae61032b276d
                                                          dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
   2         0x73873C7A6EB040A1       Shared              None

   $ docker rm slf_dmp

   $ dtk-ctk dcu-tracker status
   ------------------------------------------------------------------------------------------------------------------------
   GPU Id    UUID                     Accessibility       Container Ids
   ------------------------------------------------------------------------------------------------------------------------
   0         0x73873C7A6EB02041       Shared              dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
   1         0x73873C7A6EB008A1       Shared              dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
   2         0x73873C7A6EB040A1       Shared              None
   ```
- Set a DCU to exclusive

  ```sh
  $ dtk-ctk dcu-tracker 1 exclusive
  DCUs [1] have been made exclusive

  $ dtk-ctk dcu-tracker status
  ------------------------------------------------------------------------------------------------------------------------
  GPU Id    UUID                     Accessibility       Container Ids
  ------------------------------------------------------------------------------------------------------------------------
  0         0x73873C7A6EB02041       Shared              dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
  1         0x73873C7A6EB008A1       Exclusive           dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
  2         0x73873C7A6EB040A1       Shared              None

  # DCU 1 is exclusive and already in use, so a second container requesting it fails to start
  $ docker run --name slf_dmp  -e DTK_VISIBLE_DEVICES=0,1 -it a4dd5be0ca23

  docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: failed to create DTK Container Runtime: failed to construct OCI spec modifier: failed to reserve DCUs: DCUs [1] are exclusive and already in use: unknown.
  ```
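
To avoid the start failure shown above, a script can check the tracker status before launching a container. The sketch below lists DCUs that are exclusive and already held by a container. `busy_exclusive_dcus` is a hypothetical helper, and it assumes the status table keeps the column layout shown in the samples above.

```sh
# Hypothetical helper: read `dtk-ctk dcu-tracker status` output on stdin and
# print the IDs of DCUs that are Exclusive and already have a container.
# Assumes the column order shown above: id, UUID, accessibility, container id.
busy_exclusive_dcus() {
    awk '$1 ~ /^[0-9]+$/ && $3 == "Exclusive" && $4 != "None" { print $1 }'
}

# Sample rows copied from the status output above.
busy_exclusive_dcus <<'EOF'
0         0x73873C7A6EB02041       Shared              dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
1         0x73873C7A6EB008A1       Exclusive           dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
2         0x73873C7A6EB040A1       Shared              None
EOF
# prints: 1
```

On a real host the check can be run as `dtk-ctk dcu-tracker status | busy_exclusive_dcus`.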