# gpu-base-image-build
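## Workflow
The workflow has four stages: preparation, image build, image verification, and image packaging.
## Preparation
1. Prepare a bare-metal machine and install [nvidia-docker2](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) and git.
2. Download the code and models needed for image verification (or copy them from Chen Yihang (陈宜航)) and place them in the project root.
   1. Download the code: `git clone http://developer.hpccube.com/codes/chenpangpang/gpu-base-image-test.git`
   2. Download the models: `cd gpu-base-image-test && python hf_down.py`
3. Confirm which images need to be built.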
   - Image build progress: https://bvjoh3z2qoz.feishu.cn/base/BKl6birVbarmzJsnznkcEDFTnV9?table=tbl3bCdS7qfjPn6j&view=vewww0URg8
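
The preparation steps can be sanity-checked before building. The following is a minimal sketch, not part of this repository: `missing_tools` is a hypothetical helper name, and the commented GPU smoke test assumes the NVIDIA container toolkit is installed and the base image can be pulled.

```bash
# Print each command from the argument list that is not found on PATH.
# Runs in a subshell so the loop variable does not leak into the caller.
missing_tools() (
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
)

# Example usage (not executed here):
#   missing=$(missing_tools docker git python)
#   [ -z "$missing" ] || echo "Missing tools: $missing"
#
# GPU runtime smoke test (assumes network access to pull the image):
#   docker run --rm --gpus all nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04 nvidia-smi
```

If `missing_tools` prints nothing, the listed commands are available; the `nvidia-smi` run additionally confirms that containers can see the GPU.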
## Image build
- Build an image based on the [official PyTorch image](https://hub.docker.com/r/pytorch/pytorch)
  ```bash
  cd build_space && \
  ./build_ubuntu.sh jupyterlab \
                    juypterlab-pytorch:2.3.1-py3.10-cuda12.1-ubuntu22.04-devel \
                    pytorch/pytorch:2.3.1-cuda12.1-cudnn8-devel
  ```
  - Argument 1: framework; leave unchanged
  - Argument 2: output image name
  - Argument 3: base image
- Build an image based on the [official NVIDIA image](https://hub.docker.com/r/nvidia/cuda)
  ```bash
  cd build_space && \
  ./build_ubuntu.sh jupyterlab \
                    juypterlab-pytorch:2.3.1-py3.8-cuda12.1-ubuntu22.04-devel \
                    nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04 \
                    TORCH_VERSION="2.3.1" \
                    TORCHVISION_VERSION="0.18.1" \
                    TORCHAUDIO_VERSION="2.3.1" \
                    CONDA_URL="https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-py38_22.11.1-1-Linux-x86_64.sh"
  ```
  - Argument 1: framework; leave unchanged
  - Argument 2: output image name
  - Argument 3: base image
  - TORCH_VERSION: torch version
  - TORCHVISION_VERSION: torchvision version
  - TORCHAUDIO_VERSION: torchaudio version
  - CONDA_URL: URL of the conda installer
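
The output image names in both examples appear to follow the layout `<name>:<torch>-py<python>-cuda<cuda>-<os>-<variant>`. This layout is inferred from those two examples only and is not confirmed by the build script. Under that assumption, a small hypothetical helper (`compose_tag` is not part of the repository) can assemble the second argument so the name stays consistent with the versions being installed:

```bash
# compose_tag NAME TORCH PYTHON CUDA OS VARIANT
# Assembles an image name in the layout assumed from the README examples.
compose_tag() {
  printf '%s:%s-py%s-cuda%s-%s-%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Example usage (not executed here), reusing the values from the NVIDIA-based build:
#   OUTPUT_IMAGE=$(compose_tag juypterlab-pytorch 2.3.1 3.8 12.1 ubuntu22.04 devel)
#   ./build_ubuntu.sh jupyterlab "$OUTPUT_IMAGE" nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04 ...
```

The repository name is passed in rather than hard-coded so the helper reproduces whatever spelling the tracking table uses.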

### Related links
- PyTorch images: https://hub.docker.com/r/pytorch/pytorch/tags
- NVIDIA images: https://hub.docker.com/r/nvidia/cuda/tags
- torch, torchvision, torchaudio, and CUDA version compatibility: https://pytorch.org/get-started/previous-versions/
- conda installers: https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/

## Image verification
1. Version check: run `sh script/1_base_test.sh $IMAGE_NAME`. Sample output:
  ```
  
==========
== CUDA ==
==========

CUDA Version 12.1.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Ubuntu 22.04.3 LTS \n \l

python version:  3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0]
torch version:  2.3.1
torch cuda available:  True
torch cuda version:  12.1
torch cudnn version:  8902
torchvision version:  0.18.1
torchaudio version:  2.3.1
  ```
Confirm that the version information in the output matches the image name, and that `torch cuda` is available.<br>
2. Text generation check: run `sh script/2_text_generate_test.sh $IMAGE_NAME`. Sample output:
  ```
  Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
  Hello, I'm a language model, to be honest." (Hooker)
  
  "Let's start an internal test now, and then
  ```
Confirm that the output is as expected.<br>
3. Image generation check: run `sh script/3_image_generate_test.sh $IMAGE_NAME`. Sample output:
```

==========
== CUDA ==
==========

CUDA Version 12.1.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]
Loading pipeline components...:   0%|          | 0/7 [00:00<?, ?it/s]/opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
Loading pipeline components...: 100%|██████████| 7/7 [00:02<00:00,  3.14it/s]
100%|██████████| 50/50 [00:03<00:00, 13.03it/s]
The output image has been saved as output.png
```
![output.png](assert/output.png)<br>
Confirm that the output image is as expected.
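
The manual comparison in step 1 (versions in the output versus the image name) could be partly automated. The sketch below is an assumption-laden illustration, not a script from this repository: `tag_field`, `log_value`, and `check_versions` are hypothetical names, and the tag layout `<name>:<torch>-py<python>-cuda<cuda>-...` is inferred from the example image names above.

```bash
# Extract one field (torch, python, or cuda) from an image name such as
# name:2.3.1-py3.10-cuda12.1-ubuntu22.04-devel (layout assumed, see lead-in).
tag_field() (
  tag=${1##*:}   # keep only the part after the last ':'
  case $2 in
    torch)  printf '%s\n' "$tag" | cut -d- -f1 ;;
    python) printf '%s\n' "$tag" | cut -d- -f2 | sed 's/^py//' ;;
    cuda)   printf '%s\n' "$tag" | cut -d- -f3 | sed 's/^cuda//' ;;
  esac
)

# Read a test log on stdin and print the first word after "<label>:".
log_value() {
  sed -n "s/^$1: *//p" | head -n 1 | cut -d' ' -f1
}

# check_versions IMAGE_NAME < log
# Returns non-zero if any version in the log disagrees with the image name
# or if torch reports that CUDA is unavailable.
check_versions() (
  log=$(cat)
  rc=0
  for field in python torch cuda; do
    case $field in
      cuda) label="torch cuda version" ;;
      *)    label="$field version" ;;
    esac
    expected=$(tag_field "$1" "$field")
    actual=$(printf '%s\n' "$log" | log_value "$label")
    case $actual in
      # Accept an exact match, a longer patch version, or a local suffix like +cu121.
      "$expected"|"$expected".*|"$expected"+*) echo "ok: $label $actual" ;;
      *) echo "MISMATCH: $label expected $expected, got $actual"; rc=1 ;;
    esac
  done
  printf '%s\n' "$log" | grep -q '^torch cuda available: *True' \
    || { echo "MISMATCH: torch cuda not available"; rc=1; }
  exit $rc
)

# Example usage (not executed here):
#   sh script/1_base_test.sh "$IMAGE_NAME" | check_versions "$IMAGE_NAME"
```

The text and image generation checks still need a human look, since their outputs are not deterministic.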
## Image packaging
Run `docker save -o $FILENAME $IMAGE_NAME` to package the image.
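
Image names contain `/` and `:`, which are awkward in file names. As a minimal sketch, `$FILENAME` could be derived from the image name; `archive_name` is a hypothetical helper and the `.tar` suffix and underscore substitution are assumptions, not a convention defined by this project.

```bash
# Turn an image name into a file name: replace '/' and ':' with '_' and add .tar.
archive_name() {
  printf '%s.tar\n' "$(printf '%s' "$1" | tr '/:' '__')"
}

# Example usage (not executed here):
#   FILENAME=$(archive_name "$IMAGE_NAME")
#   docker save -o "$FILENAME" "$IMAGE_NAME"
```

The archive can later be restored on another machine with `docker load -i "$FILENAME"`.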

## TODO:
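- Fetch image tags from Docker Hub
- Derive base image names and output image names from the schedule and the image tag table
- Feishu bot: automatic statistics on image build status
- Feishu bot: reminders for images pending push
- CI/CD pipeline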