# DCU Container Toolkit

## Introduction

DCU Container Toolkit lets users build and run containers that use DCU devices. The toolkit includes the following components.

- ```dcu-container-toolkit``` - DCU container runtime
- ```dcu-ctk``` - command-line interface for the DCU container tools

## Usage

- `--gpus` requires Docker version 19+
- CDI requires Docker version 25+

Before you begin, make sure DTK is installed.

### Installation

Install the package with `dpkg -i` or `rpm -i`. After installation, the following commands run automatically:
```sh
$ dcu-ctk --quiet config --config-file=/etc/dcu-container-runtime/config.toml --in-place  # generate the configuration file
$ dcu-ctk runtime configure --runtime=docker --set-as-default  # set the runtime in Docker's daemon.json
```

Then restart the Docker service:
```sh
$ sudo systemctl restart docker
```

### Using DCUs in Docker

#### Via the docker CLI
Pass `--gpus` to `docker run` to add DCU devices to a container.
```sh
$ docker run -it --gpus all ubuntu:18.04            # add all DCU devices
$ docker run -it --gpus 1 ubuntu:18.04              # add one DCU device (DCU 0)
$ docker run -it --gpus '"device=0,2"' ubuntu:18.04 # add DCU 0 and DCU 2
```

#### Via the `DCU_VISIBLE_DEVICES` environment variable
Set `DCU_VISIBLE_DEVICES` with `docker run -e` to add DCU devices to a container.
```sh
$ docker run -it -e DCU_VISIBLE_DEVICES=all ubuntu:18.04  # add all DCU devices
$ docker run -it -e DCU_VISIBLE_DEVICES=0 ubuntu:18.04    # add DCU 0
$ docker run -it -e DCU_VISIBLE_DEVICES=0,1 ubuntu:18.04  # add DCU 0 and DCU 1
```
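
When the device list comes from a script or a job scheduler, it can help to check the value before passing it to `docker run`. The following is a minimal sketch that only accepts the forms shown above (`all`, a single index, or a comma-separated list of indices). `is_valid_dcu_list` is a hypothetical helper, not part of `dcu-ctk`, and the runtime may accept other forms that this check would reject.

```sh
# Hypothetical helper: succeed only if the value looks like "all", "0", or "0,1".
# Assumption: only the forms shown in this README are considered valid here.
is_valid_dcu_list() {
    printf '%s\n' "$1" | grep -Eq '^(all|[0-9]+(,[0-9]+)*)$'
}

# Example:
#   is_valid_dcu_list "0,1" && docker run -it -e DCU_VISIBLE_DEVICES=0,1 ubuntu:18.04
```
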
#### Via CDI
- First, generate the CDI spec file
```sh
$ dcu-ctk cdi generate --output=/etc/cdi/dtk.json
```
- Configure Docker to enable CDI
```sh
$ dcu-ctk runtime configure --runtime=docker --set-as-default --cdi.enabled
```
- Use all DCUs
```sh
$ docker run --rm --device c-3000.com/hcu=all -it ubuntu:18.04
```
- Use DCU 0 and DCU 1
```sh
$ docker run --rm --device c-3000.com/hcu=0 --device c-3000.com/hcu=1 -it ubuntu:18.04
```
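
Because each DCU needs its own `--device` flag, longer device lists get repetitive. The sketch below builds the flags from a list of indices. `cdi_device_flags` is a hypothetical helper introduced for illustration; it only assembles the `c-3000.com/hcu=<index>` names used above and does not check that the devices exist.

```sh
# Hypothetical helper: print one "--device c-3000.com/hcu=<index>" flag per argument.
cdi_device_flags() {
    flags=""
    for idx in "$@"; do
        # Append a separating space only when flags is already non-empty.
        flags="${flags}${flags:+ }--device c-3000.com/hcu=${idx}"
    done
    printf '%s\n' "$flags"
}

# Example (requires Docker with CDI enabled; left unquoted on purpose so the flags split):
#   docker run --rm $(cdi_device_flags 0 1) -it ubuntu:18.04
```
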
### Using DCUs in Podman
Podman version 2.0+ is required.

- First, change Podman's runtime
```sh
$ dcu-ctk runtime configure --runtime=podman --set-as-default
```
- Use DCUs via the environment variable
```sh
$ podman run -it -e DCU_VISIBLE_DEVICES=all ubuntu:18.04
```
### Listing available DCUs
```sh
$ dcu-ctk cdi list
INFO[0000] Found 3 CDI devices
c-3000.com/hcu=0
c-3000.com/hcu=1
c-3000.com/hcu=2
c-3000.com/hcu=all
c-3000.com/hcu=hcu-73873c7a6eb008a1
c-3000.com/hcu=hcu-73873c7a6eb02041
c-3000.com/hcu=hcu-73873c7a6eb040a1
```
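
In the sample output, each DCU appears to be listed under both an index name and an `hcu-<id>` name, alongside the `all` entry. For scripting, a small filter can keep only the index-named entries. This is a sketch that assumes the output format shown above; `cdi_index_devices` is a hypothetical name.

```sh
# Hypothetical filter: read `dcu-ctk cdi list` output on stdin and keep only
# index-named devices such as c-3000.com/hcu=0.
cdi_index_devices() {
    grep -E '^c-3000\.com/hcu=[0-9]+$'
}

# Example:
#   dcu-ctk cdi list | cdi_index_devices
```
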
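### File read/write permission limits in rootless Docker
When a non-root user mounts a directory with `-v`, the original permissions are preserved, and write permission cannot be added to a read-only directory. Running the following command requires root privileges: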
```sh
$ dcu-ctk rootless --runtime=docker
```
### DCU Tracker
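DCU Tracker monitors DCU usage by containers started with `--gpus` or `-e DCU_VISIBLE_DEVICES`, and can also mark each DCU as exclusive or shared. DCU Tracker is disabled by default; enable it with the `dcu-ctk` command line.

DCU Tracker provides commands to control container access to DCUs. Each DCU can be set to `shared` or `exclusive`:
- `shared`: the DCU can be used by multiple containers at the same time. This is the default.
- `exclusive`: the DCU can be used by only one container at a time.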

```sh
$ dcu-ctk dcu-tracker -h
NAME:
   C-3000 DCU Container Toolkit CLI dcu-tracker - DCU Tracker  related commands

USAGE:
   dcu-ctk dcu-tracker [dcu-ids] [accessibility]
       Arguments:
           dcu-ids          Comma-separated list of DCU IDs (comma separated list, range operator, all)
           accessibility    Must be either 'exclusive' or 'shared'

     Examples:
           dcu-ctk dcu-tracker 0,1,2 exclusive
           dcu-ctk dcu-tracker 0,1-2 shared
           dcu-ctk dcu-tracker all shared

   OR

   dcu-ctk dcu-tracker [command] [options]

COMMANDS:
   disable  Disable the DCU Tracker
   enable   Enable the DCU Tracker
   reset    Reset the DCU Tracker
   status   Show Status of DCUs
   help, h  Shows a list of commands or help for one command

OPTIONS:
   --help, -h  show help
```
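
The `dcu-ids` argument accepts a comma-separated list, a range, or `all`. To make the range syntax concrete, here is a minimal sketch that expands a spec such as `0,1-2` into individual IDs. It assumes that `a-b` denotes an inclusive range, which the help text does not state explicitly; `expand_dcu_ids` is a hypothetical helper and is not how `dcu-ctk` necessarily parses the argument.

```sh
# Hypothetical helper: expand "0,1-2" into one DCU ID per line.
# Assumption: "a-b" is an inclusive range; "all" is passed through unchanged.
expand_dcu_ids() {
    spec=$1
    if [ "$spec" = "all" ]; then
        echo all
        return 0
    fi
    # Split the spec on commas into positional parameters.
    old_ifs=$IFS
    IFS=,
    set -- $spec
    IFS=$old_ifs
    for part in "$@"; do
        case $part in
            *-*)
                start=${part%-*}
                end=${part#*-}
                i=$start
                while [ "$i" -le "$end" ]; do
                    echo "$i"
                    i=$((i + 1))
                done
                ;;
            *)
                echo "$part"
                ;;
        esac
    done
}

# Example:
#   expand_dcu_ids 0,1-2
```
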

### Using DCU Tracker

Use `rocm-smi` to view the DCUs on the node:

```sh
$ rocm-smi


========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    52.0c           56.0W   600Mhz  1000Mhz  0%   auto  400.0W    0%   0%
1    57.0c           48.0W   600Mhz  1000Mhz  0%   auto  400.0W    0%   0%
2    52.0c           66.0W   600Mhz  1000Mhz  0%   auto  400.0W    0%   0%
====================================================================================
=============================== End of ROCm SMI Log ================================

```
- Check the DCU Tracker status

   When DCU Tracker is enabled, DCUs are given `shared` access by default.

   ```sh
   $ dcu-ctk dcu-tracker status
   ------------------------------------------------------------------------------------------------------------------------
   GPU Id    UUID                     Accessibility       Container Ids
   ------------------------------------------------------------------------------------------------------------------------
   0         0x73873C7A6EB02041       Shared              None
   1         0x73873C7A6EB008A1       Shared              None
   2         0x73873C7A6EB040A1       Shared              None
   ```

   If DCU Tracker is not enabled, the command reports that instead:
   ```sh
   $ dcu-ctk dcu-tracker status
   DCU Tracker is disabled
   ```

- Enable DCU Tracker

  ```sh

  $ dcu-ctk dcu-tracker status
  DCU Tracker is disabled

  $ dcu-ctk dcu-tracker enable
  DCU Tracker has been enabled

  $ dcu-ctk dcu-tracker enable
  DCU Tracker is already enabled
  ```

- Disable DCU Tracker

  ```sh
  $ dcu-ctk dcu-tracker disable
  DCU Tracker has been disabled
  ```

- Set DCU access permissions

   When DCU Tracker is enabled, it automatically records which DCUs each container uses when the container starts.

   ```sh
   $ docker run --name slf_dmps -e DCU_VISIBLE_DEVICES=0,1 -it a4dd5be0ca23 

   $ docker run --name slf_dmp -e DCU_VISIBLE_DEVICES=0,1 -it a4dd5be0ca23

   $ dcu-ctk dcu-tracker status
   ------------------------------------------------------------------------------------------------------------------------
   GPU Id    UUID                     Accessibility       Container Ids
   ------------------------------------------------------------------------------------------------------------------------
   0         0x73873C7A6EB02041       Shared              3d07700f961485c678999ea1a0fecaaf0b54f3be51f4a1e9a2f1ae61032b276d
                                                          dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
   1         0x73873C7A6EB008A1       Shared              3d07700f961485c678999ea1a0fecaaf0b54f3be51f4a1e9a2f1ae61032b276d
                                                          dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
   2         0x73873C7A6EB040A1       Shared              None

   $ docker rm slf_dmp 

   $ dcu-ctk dcu-tracker status
   ------------------------------------------------------------------------------------------------------------------------
   GPU Id    UUID                     Accessibility       Container Ids
   ------------------------------------------------------------------------------------------------------------------------
   0         0x73873C7A6EB02041       Shared              dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
   1         0x73873C7A6EB008A1       Shared              dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
   2         0x73873C7A6EB040A1       Shared              None

   ```
   
- Mark a DCU as `exclusive`
 
  ```sh
  $ dcu-ctk dcu-tracker 1 exclusive
  DCUs [1] have been made exclusive

  $ dcu-ctk dcu-tracker status
  ------------------------------------------------------------------------------------------------------------------------
  GPU Id    UUID                     Accessibility       Container Ids
  ------------------------------------------------------------------------------------------------------------------------
  0         0x73873C7A6EB02041       Shared              dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
  1         0x73873C7A6EB008A1       Exclusive           dc3c3153ab2e1cde5013a5e5d116cf467894949d1ef4b29ba8caa23a40f66d8d
  2         0x73873C7A6EB040A1       Shared              None

  $ docker run --name slf_dmp  -e DCU_VISIBLE_DEVICES=0,1 -it a4dd5be0ca23

  docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: failed to 
  create  DTK Container Runtime: failed to construct OCI spec modifier: failed to reserve DCUs: DCUs [1] are exclusive and already in use: unknown.
  ```
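
To avoid hitting this error at container start, a script can inspect the tracker status first. The sketch below prints the IDs of DCUs that are marked `Exclusive` and already have a container attached. It assumes the column layout of the `status` output shown above; `busy_exclusive_dcus` is a hypothetical helper.

```sh
# Hypothetical filter: read `dcu-ctk dcu-tracker status` output on stdin and
# print the IDs of DCUs that are Exclusive and already in use.
# Assumption: columns are "GPU Id", "UUID", "Accessibility", "Container Ids".
busy_exclusive_dcus() {
    awk 'NF >= 4 && $1 ~ /^[0-9]+$/ && $3 == "Exclusive" && $4 != "None" { print $1 }'
}

# Example:
#   dcu-ctk dcu-tracker status | busy_exclusive_dcus
```
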

### Docker Swarm 

DCU UUIDs integrate with Docker Swarm deployment, so DCU resources can be allocated precisely across cluster nodes.

#### Configuring the Docker daemon for Swarm

```json
{
    "default-runtime": "dtk",
    "node-generic-resources": [
        "HY_DCU=0x73873c7a6eb02041",
        "HY_DCU=0x73873c7a6eb008a1",
        "HY_DCU=0x73873c7a6eb040a1"
    ],
    "runtimes": {
        "dtk": {
            "args": [],
            "path": "dcu-container-runtime"
        }
    }
}
```
After updating the configuration, restart the Docker daemon:
```sh
sudo systemctl restart docker
```
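
The `node-generic-resources` entries above use the same UUIDs that the tracker status reports, written in lowercase. As a sketch, the entries could be derived from the status table instead of being typed by hand. This assumes the status column layout shown earlier and that lowercase matches what the daemon configuration expects, as in the example; `swarm_generic_resources` is a hypothetical helper.

```sh
# Hypothetical filter: read `dcu-ctk dcu-tracker status` output on stdin and
# print one "HY_DCU=<uuid>" entry per DCU, lowercasing the UUID to match
# the daemon configuration example.
swarm_generic_resources() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^0x/ { print "HY_DCU=" tolower($2) }'
}

# Example:
#   dcu-ctk dcu-tracker status | swarm_generic_resources
```

The output still has to be copied into the daemon configuration as JSON strings; the helper only produces the values.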

#### Service Definition
Use a docker-compose file to deploy a service with specific DCU requirements:
```yaml
# docker-compose.yml for Swarm deployment
version: '3.8'
services:
  rocm-service:
    image: image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
    tty: true
    stdin_open: true
    command: /bin/bash
    deploy:
      replicas: 1
      resources:
        reservations:
          generic_resources:
            - discrete_resource_spec:
                kind: 'HY_DCU'  # Matches daemon.json key
                value: 2
```

Deploy the service:
```sh
docker stack deploy -c docker-compose.yml rocm-stack
```
