# DCU Container Toolkit

## Introduction

DCU Container Toolkit lets users build and run containers that use DCU devices. The toolkit includes the following components.

- `dcu-container-runtime` - the DCU container runtime
- `dcu-ctk` - the DCU container toolkit command-line tool

## Usage

- `--gpus` requires Docker version 19+
- CDI requires Docker version 25+

First, make sure the driver is installed.

### Build

With network access, build with one of the following commands. Each starts a Docker container that runs the build.

```sh
make rocky8
# or
make ubuntu22.04
```

To build offline, run the following command. Only RPM packages are currently supported.

```sh
build.sh
```

### Install

Install the package with `dpkg -i` or `rpm -i`. After installation, the following commands run automatically:

```sh
$ dcu-ctk --quiet config --config-file=/etc/dcu-container-runtime/config.toml --in-place  # generate the configuration file
$ dcu-ctk runtime configure --runtime=docker --set-as-default  # set the runtime in Docker's daemon.json
```

Restart the Docker service:

```sh
$ sudo systemctl restart docker
```

### Using DCUs in Docker

#### Via the docker CLI

Pass the `--gpus` option to `docker run` to add DCU devices to a container.

```sh
$ docker run -it --gpus all ubuntu:18.04  # add all DCU devices
$ docker run -it --gpus 1 ubuntu:18.04  # add one DCU device, DCU 0
$ docker run -it --gpus '"device=0,2"' ubuntu:18.04  # add DCU 0 and DCU 2
```

#### Via the `DCU_VISIBLE_DEVICES` environment variable

Pass the environment variable with `-e DCU_VISIBLE_DEVICES` to `docker run` to add DCU devices to a container.

```sh
$ docker run -it -e DCU_VISIBLE_DEVICES=all ubuntu:18.04  # add all DCU devices
$ docker run -it -e DCU_VISIBLE_DEVICES=0 ubuntu:18.04  # add DCU 0
$ docker run -it -e DCU_VISIBLE_DEVICES=0,1 ubuntu:18.04  # add DCU 0 and DCU 1
```

#### Via CDI

- First, generate the CDI spec file

```sh
$ dcu-ctk cdi generate --output=/etc/cdi/dcu.json
```

- Configure Docker to enable CDI

```sh
$ dcu-ctk runtime configure --runtime=docker --set-as-default --cdi.enabled
```

- Use all DCUs

```sh
$ docker run --rm --device c-3000.com/hcu=all -it ubuntu:18.04
```

- Use DCU 0 and DCU 1

```sh
$ docker run --rm --device c-3000.com/hcu=0 --device c-3000.com/hcu=1 -it ubuntu:18.04
```

### Using DCUs in Podman

Podman version 2.0+ is required.

- First, change Podman's runtime

```sh
$ dcu-ctk runtime configure --runtime=podman --set-as-default
```

- Use DCUs through the environment variable

```sh
$ podman run -it -e DCU_VISIBLE_DEVICES=all ubuntu:18.04
```

### Listing available DCUs

```sh
$ dcu-ctk cdi list
INFO[0000] Found 3 CDI devices
c-3000.com/hcu=0
c-3000.com/hcu=1
c-3000.com/hcu=2
c-3000.com/hcu=all
c-3000.com/hcu=hcu-73873c7a6eb008a1
c-3000.com/hcu=hcu-73873c7a6eb02041
c-3000.com/hcu=hcu-73873c7a6eb040a1
```

### Restricting file read/write permissions under rootless Docker

A non-root user who mounts a directory with `-v` can give that mount write (`w`) permission. To prevent non-root users from deleting files outside their own directories from inside a container, directories that do not belong to the user can be mounted read-only (`ro`).

```sh
$ dcu-ctk rootless --runtime=docker
```

### DCU Tracker

DCU Tracker monitors DCU usage by containers started with `--gpus` or `-e DCU_VISIBLE_DEVICES`. It can also mark a DCU as exclusive or shared. DCU Tracker is disabled by default and can be enabled through the `dcu-ctk` command line.

DCU Tracker provides commands that control container access to DCUs. Each DCU can be set to `shared` or `exclusive`.

- `shared` means the DCU can be used by several containers at the same time. This is the default.
- `exclusive` means the DCU can be used by only one container at a time.

```sh
$ dcu-ctk dcu-tracker -h
NAME:
   C-3000 DCU Container Toolkit CLI dcu-tracker - DCU Tracker related commands

USAGE:
   dcu-ctk dcu-tracker [dcu-ids] [accessibility]

   Arguments:
      dcu-ids         Comma-separated list of DCU IDs (comma separated list, range operator, all)
      accessibility   Must be either 'exclusive' or 'shared'

   Examples:
      dcu-ctk dcu-tracker 0,1,2 exclusive
      dcu-ctk dcu-tracker 0,1-2 shared
      dcu-ctk dcu-tracker all shared

   OR

   dcu-ctk dcu-tracker [command] [options]

COMMANDS:
   disable  Disable the DCU Tracker
   enable   Enable the DCU Tracker
   reset    Reset the DCU Tracker
   status   Show Status of DCUs
   help, h  Shows a list of commands or help for one command

OPTIONS:
   --help, -h  show help
```

### Using DCU Tracker

Use `rocm-smi` to view the DCUs on the node:

```sh
$ rocm-smi
========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    52.0c           56.0W   600Mhz  1000Mhz  0%   auto  400.0W  0%     0%
1    57.0c           48.0W   600Mhz  1000Mhz  0%   auto  400.0W  0%     0%
2    52.0c           66.0W   600Mhz  1000Mhz  0%   auto  400.0W  0%     0%
====================================================================================
=============================== End of ROCm SMI Log ================================
```

- Check the DCU Tracker status

When DCU Tracker is enabled, every DCU is `shared` by default.

```sh
$ dcu-ctk dcu-tracker status
------------------------------------------------------------------------------------------------------------------------
GPU Id    UUID                  Accessibility    Container Ids
------------------------------------------------------------------------------------------------------------------------
0         0x73873C7A6EB02041    Shared           None
1         0x73873C7A6EB008A1    Shared           None
2         0x73873C7A6EB040A1    Shared           None
```

If DCU Tracker is not enabled, the command says so:

```sh
$ dcu-ctk dcu-tracker status
DCU Tracker is disabled
```

- Enable DCU Tracker

```sh
$ dcu-ctk dcu-tracker status
DCU Tracker is disabled
$ dcu-ctk dcu-tracker enable
DCU Tracker has been enabled
$ dcu-ctk dcu-tracker enable
DCU Tracker is already enabled
```

- Disable DCU Tracker

```sh
$ dcu-ctk dcu-tracker disable
DCU Tracker has been disabled
```

- Set DCU accessibility

While DCU Tracker is enabled, the DCUs used by each container are recorded automatically when the container starts.

```sh
$ docker run --name slf_dmps -e DCU_VISIBLE_DEVICES=0,1 -it a4dd5be0ca23
$ docker run --name slf_dmp -e DCU_VISIBLE_DEVICES=0,1 -it a4dd5be0ca23
$ dcu-ctk dcu-tracker status
------------------------------------------------------------------------------------------------------------------------
GPU Id    UUID                  Accessibility    Container Ids
------------------------------------------------------------------------------------------------------------------------
0         0x73873C7A6EB02041    Shared           3d07700f961485c678999ea1a0fecaaf0b54f3be51f4a1e9a2f1ae61032b276d
                                                 dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
1         0x73873C7A6EB008A1    Shared           3d07700f961485c678999ea1a0fecaaf0b54f3be51f4a1e9a2f1ae61032b276d
                                                 dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
2         0x73873C7A6EB040A1    Shared           None
$ docker rm slf_dmp
$ dcu-ctk dcu-tracker status
------------------------------------------------------------------------------------------------------------------------
GPU Id    UUID                  Accessibility    Container Ids
------------------------------------------------------------------------------------------------------------------------
0         0x73873C7A6EB02041    Shared           dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
1         0x73873C7A6EB008A1    Shared           dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
2         0x73873C7A6EB040A1    Shared           None
```

- Make a DCU exclusive

```sh
$ dcu-ctk dcu-tracker 1 exclusive
DCUs [1] have been made exclusive
$ dcu-ctk dcu-tracker status
------------------------------------------------------------------------------------------------------------------------
GPU Id    UUID                  Accessibility    Container Ids
------------------------------------------------------------------------------------------------------------------------
0         0x73873C7A6EB02041    Shared           dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
1         0x73873C7A6EB008A1    Exclusive        dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
2         0x73873C7A6EB040A1    Shared           None
$ docker run --name slf_dmp -e DCU_VISIBLE_DEVICES=0,1 -it a4dd5be0ca23
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: failed to create DTK Container Runtime: failed to construct OCI spec modifier: failed to reserve DCUs: DCUs [1] are exclusive and already in use: unknown.
```

### Docker Swarm

DCU UUIDs can be used with Docker Swarm deployments, which allows precise DCU resource allocation across cluster nodes.

#### Configuring the Docker daemon for Swarm

```json
{
  "default-runtime": "dtk",
  "node-generic-resources": [
    "HY_DCU=0x73873c7a6eb02041",
    "HY_DCU=0x73873c7a6eb008a1",
    "HY_DCU=0x73873c7a6eb040a1"
  ],
  "runtimes": {
    "dtk": {
      "args": [],
      "path": "dcu-container-runtime"
    }
  }
}
```

After changing the configuration, restart the Docker daemon:

```sh
sudo systemctl restart docker
```

#### Service Definition

Use docker-compose to deploy a service with specific DCU requirements:

```yaml
# docker-compose.yml for Swarm deployment
version: '3.8'
services:
  rocm-service:
    image: image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
    tty: true
    stdin_open: true
    command: /bin/bash
    deploy:
      replicas: 1
      resources:
        reservations:
          generic_resources:
            - discrete_resource_spec:
                kind: 'HY_DCU' # Matches daemon.json key
                value: 2
```

Deploy the service:

```sh
docker stack deploy -c docker-compose.yml rocm-stack
```

### Configuring containerd

To use containerd, first update its configuration file `/etc/containerd/config.toml`:

```shell
$ dcu-ctk runtime configure --runtime=containerd --set-as-default --cdi.enabled
$ systemctl restart containerd
```

For Kubernetes, use it together with [dcu-device-plugin](https://download.sourcefind.cn:65024/5/main/Kubernetes%E6%8F%92%E4%BB%B6). With the nerdctl command-line tool, pass `--runtime`:

```shell
$ nerdctl run --rm --runtime dcu -e DCU_VISIBLE_DEVICES=0,1 ubuntu:18.04 bash
```

### Querying DCU usage by Docker containers

If dcu-tracker is enabled, use it to query which Docker containers are using each DCU. If dcu-tracker is not enabled, use the command below to list the DCUs that Docker containers are actively using. A DCU that is only mounted into a container but not in use does not appear in the list.

```shell
$ dcu-ctk docker
------------------------------------------------------------------------------------------------------------------------
DCU Id    UUID                  Container Names
------------------------------------------------------------------------------------------------------------------------
0         0x73873C7A6EB02041    peaceful_hawking
------------------------------------------------------------------------------------------------------------------------
```

### Mounting vDCUs in Docker

vDCUs can be mounted into Docker containers. Start the container with the `-e VDCU_VISIBLE_DEVICES` environment variable. For the constraints that apply to vDCUs, see the [DCU Virtualization User Guide](https://download.sourcefind.cn:65024/directlink/5/Kubernetes%E6%8F%92%E4%BB%B6/DCU%E8%99%9A%E6%8B%9F%E5%8C%96%E7%94%A8%E6%88%B7%E6%8C%87%E5%8D%97.pdf).

```sh
# Split DCU 0 into 4 vDCUs, each with 8 GiB of memory and 30 compute units
$ hy-smi virtual -d 0 -create-vdevices 4 -vdevice-compute-units 30,30,30,30 -vdevice-memory-size 8192,8192,8192,8192

# Show the vDCUs that were created
$ hy-smi virtual --show-vdevice-info
Virtual Device 0:
    Physical Device: 0
    Compute units: 30
    Global memory: 8589934592 bytes
Virtual Device 1:
    Physical Device: 0
    Compute units: 30
    Global memory: 8589934592 bytes
Virtual Device 2:
    Physical Device: 0
    Compute units: 30
    Global memory: 8589934592 bytes
Virtual Device 3:
    Physical Device: 0
    Compute units: 30
    Global memory: 8589934592 bytes

# Start a container with docker run
$ docker run --rm -e VDCU_VISIBLE_DEVICES=1,2 -it a4dd5be0ca23 /bin/bash
```
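
The `hy-smi virtual` command above takes one comma-separated value per vDCU, which is easy to mistype when splitting a DCU into many equal parts. The sketch below shows one possible way to compose those lists and sanity-check the memory size. `repeat_csv` and `mib_to_bytes` are hypothetical helper names, not part of `dcu-ctk` or `hy-smi`. The sketch also assumes that `-vdevice-memory-size` is given in MiB, which is only inferred from 8192 being reported as 8589934592 bytes in the output above. It prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helpers (not part of dcu-ctk or hy-smi) for composing the
# per-vDCU argument lists used in the hy-smi example.

# repeat_csv COUNT VALUE: print VALUE repeated COUNT times, comma-separated.
repeat_csv() {
    count=$1
    value=$2
    out=""
    i=0
    while [ "$i" -lt "$count" ]; do
        # Prepend "previous," only when something has already been collected.
        out="${out:+$out,}$value"
        i=$((i + 1))
    done
    printf '%s\n' "$out"
}

# mib_to_bytes MIB: convert MiB to bytes.
# Assumption: -vdevice-memory-size is expressed in MiB.
mib_to_bytes() {
    printf '%s\n' "$(( $1 * 1024 * 1024 ))"
}

vdcus=4        # number of vDCUs to create on DCU 0
units=30       # compute units per vDCU
mem_mib=8192   # memory per vDCU, assumed to be MiB

# Print the command for review instead of executing it.
echo "hy-smi virtual -d 0 -create-vdevices $vdcus" \
     "-vdevice-compute-units $(repeat_csv "$vdcus" "$units")" \
     "-vdevice-memory-size $(repeat_csv "$vdcus" "$mem_mib")"
echo "expected global memory per vDCU: $(mib_to_bytes "$mem_mib") bytes"
```

Printing the command first keeps the sketch safe to run on a machine without DCUs. Once the printed line matches the intended split, it can be copied and executed manually.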