# DTK Container Toolkit

## Introduction

DTK Container Toolkit lets users build and run containers that use DCU devices. The toolkit includes the following packages:

- `dcu-container-toolkit` - the DTK container runtime
- `dcu-ctk` - the DTK container toolkit command-line interface

## Usage

- `--gpus` requires Docker version 19+
- CDI requires Docker version 25+

Make sure DTK is installed before you begin.

### Installation

Install the package with `dpkg -i` or `rpm -i`. The following commands run automatically after installation:

```sh
$ dcu-ctk --quiet config --config-file=/etc/dcu-container-runtime/config.toml --in-place  # generate the configuration file
$ dcu-ctk runtime configure --runtime=docker --set-as-default  # set the runtime in docker's config.json
```

Then restart the Docker service:

```sh
$ sudo systemctl restart docker
```

### Using DCUs in Docker

#### Via the docker CLI

Pass the `--gpus` option to `docker run` to add DCU devices to a container.

```sh
$ docker run -it --gpus all ubuntu:18.04           # add all DCU devices
$ docker run -it --gpus 1 ubuntu:18.04             # add one DCU device (device 0)
$ docker run -it --gpus "device=0,2" ubuntu:18.04  # add DCU devices 0 and 2
```

#### Via the `DCU_VISIBLE_DEVICES` environment variable

Pass the environment variable with `-e DCU_VISIBLE_DEVICES` to `docker run` to add DCU devices to a container.

```sh
$ docker run -it -e DCU_VISIBLE_DEVICES=all ubuntu:18.04  # add all DCU devices
$ docker run -it -e DCU_VISIBLE_DEVICES=0 ubuntu:18.04    # add DCU device 0
$ docker run -it -e DCU_VISIBLE_DEVICES=0,1 ubuntu:18.04  # add DCU devices 0 and 1
```

#### Via CDI

- First, generate the CDI spec file

```sh
$ dcu-ctk cdi generate --output=/etc/cdi/dtk.json
```

- Configure Docker to enable CDI

```sh
$ dcu-ctk runtime configure --runtime=docker --set-as-default --cdi.enabled
```

- Use all DCUs

```sh
$ docker run --rm --device c-3000.com/hcu=all -it ubuntu:18.04
```

- Use DCU 0 and DCU 1

```sh
$ docker run --rm --device c-3000.com/hcu=0 --device c-3000.com/hcu=1 -it ubuntu:18.04
```

### Using DCUs in Podman

Podman version 2.0+ is required.

- First, change Podman's runtime

```sh
$ dcu-ctk runtime configure --runtime=podman --set-as-default
```

- Then select devices through the environment variable

```sh
$ podman run -it -e DCU_VISIBLE_DEVICES=all ubuntu:18.04
```

### Listing available DCUs

```sh
$ dcu-ctk cdi list
INFO[0000] Found 3 CDI devices
c-3000.com/hcu=0
c-3000.com/hcu=1
c-3000.com/hcu=2
c-3000.com/hcu=all
c-3000.com/hcu=hcu-73873c7a6eb008a1
c-3000.com/hcu=hcu-73873c7a6eb02041
c-3000.com/hcu=hcu-73873c7a6eb040a1
```

### File read/write permission restrictions in rootless Docker
When a non-root user mounts a directory with `-v`, the directory keeps its original permissions, so write permission cannot be added to a read-only directory. Running the following command requires root privileges.

```sh
$ dcu-ctk rootless --runtime=docker
```

### DCU Tracker

DCU Tracker monitors DCU usage by containers started with `--gpus` or `-e DCU_VISIBLE_DEVICES`. It can also mark a DCU as exclusive or shared. DCU Tracker is disabled by default and can be enabled through the `dcu-ctk` command line.

DCU Tracker provides commands that control container access to DCUs. Each DCU can be set to shared or exclusive:

- shared: the DCU can be used by several containers at the same time. This is the default.
- exclusive: the DCU can be used by only one container at a time.

```sh
$ dcu-ctk dcu-tracker -h
NAME:
   C-3000 DTK Container Toolkit CLI dcu-tracker - DCU Tracker related commands

USAGE:
   dcu-ctk dcu-tracker [dcu-ids] [accessibility]

      Arguments:
         dcu-ids         Comma-separated list of DCU IDs (comma separated list, range operator, all)
         accessibility   Must be either 'exclusive' or 'shared'

      Examples:
         dcu-ctk dcu-tracker 0,1,2 exclusive
         dcu-ctk dcu-tracker 0,1-2 shared
         dcu-ctk dcu-tracker all shared

   OR

   dcu-ctk dcu-tracker [command] [options]

COMMANDS:
   disable  Disable the DCU Tracker
   enable   Enable the DCU Tracker
   reset    Reset the DCU Tracker
   status   Show Status of DCUs
   help, h  Shows a list of commands or help for one command

OPTIONS:
   --help, -h  show help
```

### Using DCU Tracker

Use `rocm-smi` to list the DCUs on the node.

```sh
$ rocm-smi
========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    52.0c           56.0W   600Mhz  1000Mhz  0%   auto  400.0W  0%     0%
1    57.0c           48.0W   600Mhz  1000Mhz  0%   auto  400.0W  0%     0%
2    52.0c           66.0W   600Mhz  1000Mhz  0%   auto  400.0W  0%     0%
====================================================================================
=============================== End of ROCm SMI Log ================================
```

- Check the DCU Tracker status

When DCU Tracker is enabled, every DCU is shared by default.

```sh
$ dcu-ctk dcu-tracker status
------------------------------------------------------------------------------------------------------------------------
GPU Id    UUID                  Accessibility    Container Ids
------------------------------------------------------------------------------------------------------------------------
0         0x73873C7A6EB02041    Shared           None
1         0x73873C7A6EB008A1    Shared           None
2         0x73873C7A6EB040A1    Shared           None
```

When DCU Tracker is not enabled, the command says so:

```sh
$ dcu-ctk dcu-tracker status
DCU Tracker is disabled
```

- Enable DCU Tracker

```sh
$ dcu-ctk dcu-tracker status
DCU Tracker is disabled
$ dcu-ctk dcu-tracker enable
DCU Tracker has been enabled
$ dcu-ctk dcu-tracker enable
DCU Tracker is already enabled
```

- Disable DCU Tracker

```sh
$ dcu-ctk dcu-tracker disable
DCU Tracker has been disabled
```

- Set DCU access permissions

While DCU Tracker is enabled, it automatically records which DCUs each container uses when the container starts. Removing a container also removes it from the record.

```sh
$ docker run --name slf_dmps -e DCU_VISIBLE_DEVICES=0,1 -it a4dd5be0ca23
$ docker run --name slf_dmp -e DCU_VISIBLE_DEVICES=0,1 -it a4dd5be0ca23
$ dcu-ctk dcu-tracker status
------------------------------------------------------------------------------------------------------------------------
GPU Id    UUID                  Accessibility    Container Ids
------------------------------------------------------------------------------------------------------------------------
0         0x73873C7A6EB02041    Shared           3d07700f961485c678999ea1a0fecaaf0b54f3be51f4a1e9a2f1ae61032b276d
                                                 dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
1         0x73873C7A6EB008A1    Shared           3d07700f961485c678999ea1a0fecaaf0b54f3be51f4a1e9a2f1ae61032b276d
                                                 dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
2         0x73873C7A6EB040A1    Shared           None
$ docker rm slf_dmp
$ dcu-ctk dcu-tracker status
------------------------------------------------------------------------------------------------------------------------
GPU Id    UUID                  Accessibility    Container Ids
------------------------------------------------------------------------------------------------------------------------
0         0x73873C7A6EB02041    Shared           dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
1         0x73873C7A6EB008A1    Shared           dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
2         0x73873C7A6EB040A1    Shared           None
```

- Make a DCU exclusive

Once a DCU is exclusive and already in use, starting another container that requests it fails.

```sh
$ dcu-ctk dcu-tracker 1 exclusive
DCUs [1] have been made exclusive
$ dcu-ctk dcu-tracker status
------------------------------------------------------------------------------------------------------------------------
GPU Id    UUID                  Accessibility    Container Ids
------------------------------------------------------------------------------------------------------------------------
0         0x73873C7A6EB02041    Shared           dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
1         0x73873C7A6EB008A1    Exclusive        dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
2         0x73873C7A6EB040A1    Shared           None
$ docker run --name slf_dmp -e DCU_VISIBLE_DEVICES=0,1 -it a4dd5be0ca23
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: failed to create DTK Container Runtime: failed to construct OCI spec modifier: failed to reserve DCUs: DCUs [1] are exclusive and already in use: unknown.
```
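
To avoid the launch failure above, a script can inspect the tracker status before it starts a container. The following is a minimal sketch, not part of `dcu-ctk`: the function name `busy_exclusive_dcus` is hypothetical, and the parsing assumes that the status output keeps the column order GPU Id, UUID, Accessibility, Container Ids.

```sh
#!/bin/sh
# Sketch only: busy_exclusive_dcus is a hypothetical helper, not a dcu-ctk command.
# It reads `dcu-ctk dcu-tracker status` output on stdin and prints the ID of every
# DCU that is marked Exclusive and already has a container attached.
# Assumed column order: GPU Id, UUID, Accessibility, Container Ids.
busy_exclusive_dcus() {
    # Keep only data rows: numeric first field and at least four fields.
    # Header, separator and continuation lines do not match and are skipped.
    awk 'NF >= 4 && $1 ~ /^[0-9]+$/ && $3 == "Exclusive" && $4 != "None" { print $1 }'
}

# Typical use (requires dcu-ctk and an enabled DCU Tracker):
#   dcu-ctk dcu-tracker status | busy_exclusive_dcus
```

If the function prints an ID that you planned to pass in `DCU_VISIBLE_DEVICES`, choose another device or wait until the container holding that DCU is removed.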