# gpu-base-image-build
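## Workflow
The workflow has four stages: preparation, image build, image verification, and image packaging.
## Preparation
1. Prepare a bare-metal machine and install [nvidia-docker2](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) and git.
2. Download the code and models needed for image verification (or copy them from Chen Yihang (陈宜航)) and place them in the project root.
   1. Download the code: `git clone http://developer.hpccube.com/codes/chenpangpang/gpu-base-image-test.git`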
   2. Download the model (pytorch): `cd gpu-base-image-test/pytorch && python hf_down.py`
3. Confirm which image to build
   - Image build progress: https://bvjoh3z2qoz.feishu.cn/base/BKl6birVbarmzJsnznkcEDFTnV9?table=tbl3bCdS7qfjPn6j&view=vewww0URg8
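
The preparation steps rely on a few commands being on `PATH`. The sketch below is one possible pre-flight check, not part of this repository: the function names `missing_tools` and `check_prereqs` are hypothetical, and the tool list (`docker`, `git`, `python`) is an assumption drawn from the commands used above. It does not verify that the NVIDIA container runtime itself works.

```bash
# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Hypothetical pre-flight check; returns non-zero if any assumed tool is missing.
check_prereqs() {
  missing=$(missing_tools docker git python)
  if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose: one "missing:" line per tool.
    printf 'missing: %s\n' $missing >&2
    return 1
  fi
  echo "all prerequisite commands found"
}
```

Running `check_prereqs` before the clone and download steps reports anything that still needs installing.
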
## Image build
- Build from the [official pytorch image](https://hub.docker.com/r/pytorch/pytorch)
  ```bash
  cd build_space && \
  ./build_ubuntu.sh jupyterlab \
                    juypterlab-pytorch:2.3.1-py3.10-cuda12.1-ubuntu22.04-devel \
                    pytorch/pytorch:2.3.1-cuda12.1-cudnn8-devel
  ```
  - Argument 1: ide; no change needed
  - Argument 2: output image name
  - Argument 3: base image
- Build from the [official nvidia image](https://hub.docker.com/r/nvidia/cuda)
  - pytorch
    ```bash
    cd build_space && \
    ./build_ubuntu.sh jupyterlab \
                      juypterlab-pytorch:2.3.1-py3.8-cuda12.1-ubuntu22.04-devel \
                      nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04 \
                      TORCH_VERSION="2.3.1" \
                      TORCHVISION_VERSION="0.18.1" \
                      TORCHAUDIO_VERSION="2.3.1" \
                      CONDA_URL="https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-py38_22.11.1-1-Linux-x86_64.sh"
    ```
    - Argument 1: ide; no change needed
    - Argument 2: output image name
    - Argument 3: base image
    - TORCH_VERSION: torch version
    - TORCHVISION_VERSION: torchvision version
    - TORCHAUDIO_VERSION: torchaudio version
    - CONDA_URL: URL of the conda installer
  - tensorflow
    ```bash
    cd build_space && \
    ./build_ubuntu.sh jupyterlab \
                      jupyterlab-tensorflow:2.17.0-py3.11-cuda12.3-ubuntu22.04-devel \
                      nvidia/cuda:12.3.2-cudnn9-devel-ubuntu22.04 \
                      TENSORFLOW_VERSION="2.17.0" \
                      CONDA_URL="https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-py311_24.7.1-0-Linux-x86_64.sh"
    ```
    - Argument 1: ide; no change needed
    - Argument 2: output image name
    - Argument 3: base image
    - TENSORFLOW_VERSION: tensorflow version
    - CONDA_URL: URL of the conda installer
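
The nvidia-based builds above pass extra `KEY=VALUE` arguments after the three positional ones. The internals of `build_ubuntu.sh` are not shown in this README, so the following is only a guess at how a wrapper of this kind could forward such pairs to `docker build` as `--build-arg` flags. The function name `to_build_args` is hypothetical, and the real script may work differently.

```bash
# Hypothetical helper: turn KEY=VALUE arguments into docker build flags.
# Prints one "--build-arg KEY=VALUE" per line; arguments without "=" are skipped.
to_build_args() {
  for pair in "$@"; do
    case "$pair" in
      *=*) printf '%s %s\n' --build-arg "$pair" ;;
    esac
  done
}

# Example: show the flags that two of the arguments above would produce.
to_build_args TORCH_VERSION="2.3.1" TORCHVISION_VERSION="0.18.1"
```

Because the output is line-based, this sketch is only suitable for values without spaces or newlines, which holds for the version strings and URLs used here.
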

### Related links
- pytorch images (**choose a devel image**): https://hub.docker.com/r/pytorch/pytorch/tags
- nvidia images (**choose a devel image**): https://hub.docker.com/r/nvidia/cuda/tags
- torch, torchvision, torchaudio, and cuda version compatibility: https://pytorch.org/get-started/previous-versions/
- conda installers: https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/
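
Both `CONDA_URL` values in the build examples follow the same file name pattern on the TUNA mirror. As a convenience, a small helper along these lines could assemble the URL from a Python version and an installer release. The function name `miniconda_url` is hypothetical, and the pattern is inferred from the two URLs in this README, so it is worth confirming that the file exists on the mirror before building.

```bash
# Base URL taken from the TUNA mirror link in this README.
MINICONDA_MIRROR="https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda"

# miniconda_url <python version, e.g. 3.11> <installer release, e.g. 24.7.1-0>
miniconda_url() {
  py_tag="py$(printf '%s' "$1" | tr -d '.')"   # 3.11 -> py311
  printf '%s/Miniconda3-%s_%s-Linux-x86_64.sh\n' "$MINICONDA_MIRROR" "$py_tag" "$2"
}

# Reproduces the CONDA_URL used in the tensorflow example.
miniconda_url 3.11 24.7.1-0
```
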

## Image verification
1. Version check: run `sh script/1_base_test.sh $IMAGE_NAME`. Output:
  ```
  
==========
== CUDA ==
==========

CUDA Version 12.1.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Ubuntu 22.04.3 LTS \n \l

python version:  3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0]
torch version:  2.3.1
torch cuda available:  True
torch cuda version:  12.1
torch cudnn version:  8902
torchvision version:  0.18.1
torchaudio version:  2.3.1
  ```
Confirm that the `version information in the output` matches the `image name`, and that `torch cuda` is available.<br>
2. Text generation check: run `sh script/2_text_generate_test.sh $IMAGE_NAME`. Output:
  ```
  Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
  Hello, I'm a language model, to be honest." (Hooker)
  
  "Let's start an internal test now, and then
  ```
Confirm that the `output` is as expected.<br>
3. Image generation check: run `sh script/3_image_generate_test.sh $IMAGE_NAME`. Output:
```

==========
== CUDA ==
==========

CUDA Version 12.1.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]
Loading pipeline components...:   0%|          | 0/7 [00:00<?, ?it/s]/opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
Loading pipeline components...: 100%|██████████| 7/7 [00:02<00:00,  3.14it/s]
100%|██████████| 50/50 [00:03<00:00, 13.03it/s]
The output image has been saved as output.png
```
![output.png](assert/output.png)<br>
Confirm that the `output image` is as expected.
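
For the version check in step 1, the comparison between image name and reported versions can be partly mechanized. The sketch below assumes the tag layout seen in the examples in this README (`<framework version>-py<python>-cuda<cuda>-ubuntu<os>-devel`). All function names are hypothetical, and nothing here is part of the scripts under `script/`.

```bash
# tag_field <image name> <index>: print the Nth dash-separated field of the tag.
# For "name:2.3.1-py3.10-cuda12.1-ubuntu22.04-devel", field 1 is the framework version.
tag_field() {
  tag=${1##*:}                       # drop everything up to the last colon
  printf '%s\n' "$tag" | cut -d- -f"$2"
}

expected_framework() { tag_field "$1" 1; }
expected_python()    { tag_field "$1" 2 | sed 's/^py//'; }
expected_cuda()      { tag_field "$1" 3 | sed 's/^cuda//'; }

# version_matches <expected prefix> <reported version>
# Succeeds when the reported version equals the prefix or extends it with ".<more>".
version_matches() {
  case "$2" in
    "$1"|"$1".*) return 0 ;;
    *) return 1 ;;
  esac
}
```

With the sample output above, `version_matches "$(expected_python "$IMAGE_NAME")" 3.10.14` succeeds for a `py3.10` tag, while a `py3.8` tag would fail the check.
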
## Image packaging
Run `sh script/save.sh $IMAGE_NAME` to save the image.
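
The contents of `script/save.sh` are not shown in this README. As a rough illustration of what saving an image usually involves, the sketch below exports it with `docker save` and compresses the result. The function names and the archive naming scheme are assumptions, not the script's actual behavior.

```bash
# archive_name <image name>: derive a file name by replacing "/" and ":" with "_".
archive_name() {
  printf '%s.tar.gz\n' "$(printf '%s' "$1" | tr '/:' '__')"
}

# save_image <image name>: export the image and gzip it into the current directory.
# Requires docker; not called here, so defining it has no side effects.
save_image() {
  docker save "$1" | gzip > "$(archive_name "$1")"
}
```

An archive produced this way can be restored on another machine with `docker load -i <file>`.
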

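## TODO:
- Fetch image tags from Docker Hub
- Derive base image names and output image names from the schedule and the image tag table
- Feishu bot: automatically track image build status
- Feishu bot: reminders for images waiting to be pushed
- CI/CD pipeline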
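
As a possible starting point for the first TODO item, the sketch below queries the public Docker Hub v2 tags endpoint and keeps only devel tags. It assumes `curl` and `jq` are installed and reads only the first page of results. The function names are hypothetical, and pagination and rate limits are not handled.

```bash
# list_tags <namespace/repo>: print tag names from the first page of results.
# Assumes curl and jq are available and Docker Hub is reachable.
list_tags() {
  curl -fsSL "https://hub.docker.com/v2/repositories/$1/tags?page_size=100" |
    jq -r '.results[].name'
}

# devel_tags: keep only tag names read from stdin that contain "devel".
devel_tags() {
  grep -- 'devel' || true   # an empty result is not treated as an error
}
```

For example, `list_tags pytorch/pytorch | devel_tags` would list candidate base images for the first build path.
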