Commit 0d096a6e authored by songlinfeng

Update README.md

parent 6b929d34
@@ -231,3 +231,56 @@ GPU Temp (DieEdge) AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: failed to create DTK Container Runtime: failed to construct OCI spec modifier: failed to reserve DCUs: DCUs [1] are exclusive and already in use: unknown.
```
### Docker Swarm
Identifying DCUs by UUID makes them usable in Docker Swarm deployments, enabling precise GPU resource allocation across cluster nodes.
#### Docker Daemon Configuration for Swarm
```json
{
  "default-runtime": "dtk",
  "node-generic-resources": [
    "HY_DCU=0x73873c7a6eb02041",
    "HY_DCU=0x73873c7a6eb008a1",
    "HY_DCU=0x73873c7a6eb040a1"
  ],
  "runtimes": {
    "dtk": {
      "args": [],
      "path": "dcu-container-runtime"
    }
  }
}
```
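Editing `daemon.json` by hand can accidentally drop settings that are already there. The sketch below merges the DTK runtime and the `HY_DCU` entries from the example above into an existing configuration and prints the result for review. The script name `merge_daemon_json.py`, the function `merge_dcu_config`, and the constant `DCU_UUIDS` are hypothetical; the UUIDs are simply copied from the example and must be replaced with the IDs of the DCUs on your own node.

```python
import json
import sys

# Hypothetical: DCU UUIDs copied from the example daemon.json; replace with your node's IDs.
DCU_UUIDS = [
    "0x73873c7a6eb02041",
    "0x73873c7a6eb008a1",
    "0x73873c7a6eb040a1",
]


def merge_dcu_config(existing: dict, uuids: list[str], kind: str = "HY_DCU") -> dict:
    """Return a copy of a daemon.json dict with the DTK runtime and DCU resources added.

    Unrelated keys and generic resources of other kinds are kept; any previous
    entries of `kind` are replaced by the given UUIDs. The input is not modified.
    """
    merged = dict(existing)
    merged["default-runtime"] = "dtk"

    # Add (or overwrite) only the "dtk" runtime, keeping any other runtimes.
    runtimes = dict(merged.get("runtimes", {}))
    runtimes["dtk"] = {"args": [], "path": "dcu-container-runtime"}
    merged["runtimes"] = runtimes

    # Keep resources of other kinds, then advertise one entry per DCU UUID.
    others = [r for r in merged.get("node-generic-resources", [])
              if not r.startswith(kind + "=")]
    merged["node-generic-resources"] = others + [f"{kind}={u}" for u in uuids]
    return merged


if __name__ == "__main__":
    # Usage (hypothetical): python3 merge_daemon_json.py /etc/docker/daemon.json > daemon.json.new
    existing = {}
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as f:
            existing = json.load(f)
    print(json.dumps(merge_dcu_config(existing, DCU_UUIDS), indent=2))
```

Printing instead of writing in place leaves room to diff the result against the current file before copying it to `/etc/docker/daemon.json`.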
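Each `HY_DCU` entry in `node-generic-resources` advertises one DCU, identified by its UUID, to Swarm. After saving these settings to `/etc/docker/daemon.json`, restart the Docker daemon: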
```sh
sudo systemctl restart docker
```
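To confirm that the node actually advertises its DCUs after the restart, the output of `docker node inspect self` (run on a swarm manager) can be checked. The sketch below assumes the Engine API shape for node resources, where each named resource appears under `Description.Resources.GenericResources` as a `NamedResourceSpec` with `Kind` and `Value`. The script name `check_dcu_resources.py` and the function `named_resources` are hypothetical.

```python
import json
import sys


def named_resources(inspect_output: str, kind: str = "HY_DCU") -> list[str]:
    """Return the values of named generic resources of `kind` from `docker node inspect` JSON.

    Assumes the Engine API layout: a JSON array of nodes, each with
    Description.Resources.GenericResources entries holding a NamedResourceSpec
    or DiscreteResourceSpec object.
    """
    values = []
    for node in json.loads(inspect_output):
        resources = node.get("Description", {}).get("Resources", {})
        for res in resources.get("GenericResources") or []:
            spec = res.get("NamedResourceSpec")
            if spec and spec.get("Kind") == kind:
                values.append(spec["Value"])
    return values


if __name__ == "__main__":
    # Usage (hypothetical):
    #   docker node inspect self > node.json
    #   python3 check_dcu_resources.py node.json
    if len(sys.argv) < 2:
        print("usage: check_dcu_resources.py <node-inspect.json>")
    else:
        with open(sys.argv[1]) as f:
            found = named_resources(f.read())
        print(f"{len(found)} HY_DCU resource(s) advertised: {', '.join(found)}")
```

An empty result usually means the daemon was not restarted or the `node-generic-resources` key was not picked up.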
#### Service Definition
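Use a docker-compose file to deploy a service with specific DCU requirements. The `generic_resources` reservation below asks Swarm for two `HY_DCU` resources: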
```yaml
# docker-compose.yml for Swarm deployment
version: '3.8'
services:
  rocm-service:
    image: image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
    tty: true
    stdin_open: true
    command: /bin/bash
    deploy:
      replicas: 1
      resources:
        reservations:
          generic_resources:
            - discrete_resource_spec:
                kind: 'HY_DCU'  # Matches the daemon.json key
                value: 2
```
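The reservation asks for a count (`value: 2`), while each node advertises individual DCU UUIDs, so each task ends up bound to specific DCUs on one node. The toy model below illustrates that idea: it picks the first node with enough free UUIDs and removes the ones it hands out. It is a simplified illustration under stated assumptions, not Swarm's actual scheduler, which also applies placement constraints and spreads tasks across nodes; the `Node` class and `reserve` function are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical swarm node holding its free named resources, keyed by kind."""
    name: str
    free: dict[str, list[str]] = field(default_factory=dict)


def reserve(nodes: list[Node], kind: str, count: int) -> tuple[str, list[str]]:
    """Reserve `count` resources of `kind` on the first node that has enough free.

    Returns the node name and the specific resource values (e.g. DCU UUIDs)
    assigned to the task; those values are removed from the node's free pool.
    """
    for node in nodes:
        pool = node.free.get(kind, [])
        if len(pool) >= count:
            taken, node.free[kind] = pool[:count], pool[count:]
            return node.name, taken
    raise RuntimeError(f"no node has {count} free {kind} resource(s)")


if __name__ == "__main__":
    # Three DCUs from the example daemon.json on a single hypothetical node.
    nodes = [Node("node-a", {"HY_DCU": ["0x73873c7a6eb02041",
                                        "0x73873c7a6eb008a1",
                                        "0x73873c7a6eb040a1"]})]
    print(reserve(nodes, "HY_DCU", 2))  # first replica gets two specific UUIDs
    try:
        reserve(nodes, "HY_DCU", 2)     # only one DCU left, so this request cannot fit
    except RuntimeError as err:
        print(err)
```

In this model a second replica requesting two DCUs cannot be placed on a node with three, which mirrors why `replicas` and `value` have to fit within the DCUs each node advertises.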
Deploy the service:
```sh
docker stack deploy -c docker-compose.yml rocm-stack
```