# gpu-base-image-build
## Workflow
The workflow has four stages: preparation, image build, image verification, and image packaging.
## Preparation
1. Prepare a bare machine and install [nvidia-docker2](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) and git.
2. Download the code and models needed for image verification (or copy them from Chen Yihang, 陈宜航) and place them in the project root.
   1. Download the code: `git clone http://developer.hpccube.com/codes/chenpangpang/gpu-base-image-test.git`
   2. Download the model (PyTorch): `cd gpu-base-image-test/pytorch && python hf_down.py`
3. Confirm which images need to be built.
   - Image build progress: https://bvjoh3z2qoz.feishu.cn/base/BKl6birVbarmzJsnznkcEDFTnV9?table=tbl3bCdS7qfjPn6j&view=vewww0URg8
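
The preparation steps can be checked before building. The following is a minimal sketch, not part of this repository: the function names `missing_tools` and `has_test_repo` are hypothetical, and the directory layout is assumed from the `git clone` and `cd gpu-base-image-test/pytorch` commands above.

```bash
#!/usr/bin/env bash
# Hypothetical preflight helpers (assumptions, not shipped with this project).

# Print each required tool that is not found on PATH, one per line.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Succeed only if gpu-base-image-test has been cloned into the given project root.
has_test_repo() {
  [ -d "$1/gpu-base-image-test/pytorch" ]
}

# Example, run from the project root:
#   missing_tools docker git
#   has_test_repo . || echo "gpu-base-image-test is missing"
# A GPU smoke test once the container toolkit is installed:
#   docker run --rm --gpus all nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04 nvidia-smi
```

If `missing_tools docker git` prints nothing and `has_test_repo .` succeeds, the machine should be ready for the build step.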
## Image build
- Build from the [official PyTorch image](https://hub.docker.com/r/pytorch/pytorch)
  ```bash
  cd build_space && \
  ./build_ubuntu.sh jupyterlab \
                    jupyterlab-pytorch:2.3.1-py3.10-cuda12.1-ubuntu22.04-devel \
                    pytorch/pytorch:2.3.1-cuda12.1-cudnn8-devel
  ```
  - Argument 1: the IDE; no need to change it
  - Argument 2: output image name
  - Argument 3: base image
- Build from the [official NVIDIA image](https://hub.docker.com/r/nvidia/cuda)
  - pytorch
    ```bash
    cd build_space && \
    ./build_ubuntu.sh jupyterlab \
                      jupyterlab-pytorch:2.3.1-py3.8-cuda12.1-ubuntu22.04-devel \
                      nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04 \
                      TORCH_VERSION="2.3.1" \
                      TORCHVISION_VERSION="0.18.1" \
                      TORCHAUDIO_VERSION="2.3.1" \
                      CONDA_URL="https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-py38_22.11.1-1-Linux-x86_64.sh"
    ```
    - Argument 1: the IDE; no need to change it
    - Argument 2: output image name
    - Argument 3: base image
    - TORCH_VERSION: torch version
    - TORCHVISION_VERSION: torchvision version
    - TORCHAUDIO_VERSION: torchaudio version
    - CONDA_URL: URL of the conda installer
  - tensorflow
    ```bash
    cd build_space && \
    ./build_ubuntu.sh jupyterlab \
                      jupyterlab-tensorflow:2.17.0-py3.11-cuda12.3-ubuntu22.04-devel \
                      nvidia/cuda:12.3.2-cudnn9-devel-ubuntu22.04 \
                      TENSORFLOW_VERSION="2.17.0" \
                      CONDA_URL="https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-py311_24.7.1-0-Linux-x86_64.sh"
    ```
    - Argument 1: the IDE; no need to change it
    - Argument 2: output image name
    - Argument 3: base image
    - TENSORFLOW_VERSION: tensorflow version
    - CONDA_URL: URL of the conda installer
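
The output image names in the examples above appear to follow the pattern `<ide>-<framework>:<framework version>-py<python version>-cuda<cuda version>-<os>-devel`. This pattern is inferred from the examples, not documented by `build_ubuntu.sh`. Under that assumption, a small helper (the name `compose_image_name` is hypothetical) can build the second argument and reduce typing mistakes:

```bash
#!/usr/bin/env bash
# Hypothetical helper: compose an output image name from its parts.
# The naming pattern is an assumption inferred from the README examples.
compose_image_name() {
  local ide=$1 framework=$2 fw_version=$3 py_version=$4 cuda_version=$5 os=$6
  printf '%s-%s:%s-py%s-cuda%s-%s-devel\n' \
    "$ide" "$framework" "$fw_version" "$py_version" "$cuda_version" "$os"
}

# Example:
#   name=$(compose_image_name jupyterlab tensorflow 2.17.0 3.11 12.3 ubuntu22.04)
#   ./build_ubuntu.sh jupyterlab "$name" nvidia/cuda:12.3.2-cudnn9-devel-ubuntu22.04 ...
```

Keeping the name derived from the same version values that are passed to the build makes a mismatch between tag and contents less likely.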


### Related links
- PyTorch images (**choose a devel image**): https://hub.docker.com/r/pytorch/pytorch/tags
- NVIDIA images (**choose a devel image**): https://hub.docker.com/r/nvidia/cuda/tags
- Compatible torch, torchvision, torchaudio, and CUDA versions: https://pytorch.org/get-started/previous-versions/
- Conda installers: https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/
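
Both `CONDA_URL` values used in the build examples share the form `Miniconda3-<python tag>_<release>-Linux-x86_64.sh` on the mirror above. As a sketch under that assumption (the helper name `conda_installer_url` is hypothetical, and the combination you choose must actually exist on the mirror), the URL can be assembled like this:

```bash
#!/usr/bin/env bash
# Hypothetical helper: build a Miniconda installer URL for the TUNA mirror.
# Arguments: python tag (e.g. py38) and installer release (e.g. 22.11.1-1).
conda_installer_url() {
  local py_tag=$1 release=$2
  local mirror="https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda"
  printf '%s/Miniconda3-%s_%s-Linux-x86_64.sh\n' "$mirror" "$py_tag" "$release"
}

# Example:
#   CONDA_URL=$(conda_installer_url py311 24.7.1-0)
```

The python tag should match the `py<version>` part of the output image name, for example `py38` for a `py3.8` image.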

## Image verification
1. Version check: run `sh script/1_base_test.sh $IMAGE_NAME`. Example output:
  ```
  
==========
== CUDA ==
==========

CUDA Version 12.1.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Ubuntu 22.04.3 LTS \n \l

python version:  3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0]
torch version:  2.3.1
torch cuda available:  True
torch cuda version:  12.1
torch cudnn version:  8902
torchvision version:  0.18.1
torchaudio version:  2.3.1
  ```
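Confirm that the version information in the output matches the image name, and that `torch cuda available` is `True`.

2. Text generation check: run `sh script/2_text_generate_test.sh $IMAGE_NAME`. Example output: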
  ```
  Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
  Hello, I'm a language model, to be honest." (Hooker)
  
  "Let's start an internal test now, and then
  ```
Confirm that the generated text is as expected.

3. Image generation check: run `sh script/3_image_generate_test.sh $IMAGE_NAME`. Example output:
```

==========
== CUDA ==
==========

CUDA Version 12.1.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]
Loading pipeline components...:   0%|          | 0/7 [00:00<?, ?it/s]/opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
Loading pipeline components...: 100%|██████████| 7/7 [00:02<00:00,  3.14it/s]
100%|██████████| 50/50 [00:03<00:00, 13.03it/s]
The output image has been saved as output.png
```
![output.png](assert/output.png)<br>
Confirm that the output image is as expected.
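
The manual comparison in the version check (step 1) can be partly automated for PyTorch images. The sketch below is not part of this repository: it assumes the output of `script/1_base_test.sh` has been saved to a log file, that the output uses the `label:  value` lines shown in the example, and that image names follow the pattern used in the build examples. The function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for comparing test output with an image name.

# Extract a field (framework, python, or cuda version) from an image name such as
# jupyterlab-pytorch:2.3.1-py3.10-cuda12.1-ubuntu22.04-devel
tag_field() {
  local image=$1 field=$2 tag
  tag=${image##*:}   # keep only the part after the last ':'
  case $field in
    framework) printf '%s\n' "${tag%%-*}" ;;
    python)    printf '%s\n' "$tag" | sed -n 's/.*-py\([0-9.]*\)-.*/\1/p' ;;
    cuda)      printf '%s\n' "$tag" | sed -n 's/.*-cuda\([0-9.]*\)-.*/\1/p' ;;
  esac
}

# Print the value that follows "<label>:" in the saved test output.
output_value() {
  sed -n "s/^$1: *//p" "$2" | head -n 1
}

# Succeed only if the logged versions match the image name and CUDA is available.
check_versions() {
  local image=$1 log=$2 torch_version
  torch_version=$(output_value 'torch version' "$log")
  # Drop a local build suffix such as "+cu121" if one is present.
  [ "${torch_version%%+*}" = "$(tag_field "$image" framework)" ] || return 1
  case "$(output_value 'python version' "$log")" in
    "$(tag_field "$image" python)".*) ;;   # e.g. 3.10 matches 3.10.14
    *) return 1 ;;
  esac
  [ "$(output_value 'torch cuda version' "$log")" = "$(tag_field "$image" cuda)" ] || return 1
  [ "$(output_value 'torch cuda available' "$log")" = "True" ]
}

# Example:
#   sh script/1_base_test.sh "$IMAGE_NAME" > base_test.log
#   check_versions "$IMAGE_NAME" base_test.log && echo "versions match"
```

The text and image generation checks still need a human look at the output, since there is no fixed expected result to compare against.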
## Image packaging
Run `sh script/save.sh $IMAGE_NAME` to save the image.
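
The contents of `script/save.sh` are not shown in this README. As a rough illustration of what saving an image involves, and not a description of that script, the sketch below writes a compressed archive with `docker save`. The archive naming scheme and the function names are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical equivalent of a save step; script/save.sh may work differently.

# Derive a file name from an image name by replacing '/' and ':' with '_'.
archive_name() {
  printf '%s.tar.gz\n' "$(printf '%s' "$1" | tr '/:' '__')"
}

# Write the image to a gzip-compressed tar archive in the current directory.
save_image() {
  docker save "$1" | gzip > "$(archive_name "$1")"
}

# Example:
#   save_image "$IMAGE_NAME"
#   docker load -i "$(archive_name "$IMAGE_NAME")"   # restore on another machine
```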

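## TODO
- Fetch image tags from Docker Hub
- Derive base image names and output image names from the schedule sheet and the image tag table
- Feishu bot: automatically track image build status
- Feishu bot: reminders for images waiting to be pushed
- CI/CD pipeline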
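
For the first TODO item, one possible starting point is to list tag names and keep only devel images, as the related links recommend. This is only a sketch of an unimplemented item: the function name `devel_tags` is hypothetical, and the commented fetch command assumes `curl` and `jq` are installed and that the Docker Hub tags endpoint returns tag names under `results[].name`.

```bash
#!/usr/bin/env bash
# Hypothetical helper for the "fetch image tags" TODO item.

# Read tag names on stdin (one per line) and keep only devel tags, e.g.
# 2.3.1-cuda12.1-cudnn8-devel or 12.1.0-cudnn8-devel-ubuntu22.04
devel_tags() {
  grep -E -- '-devel(-|$)'
}

# Example (assumed endpoint and JSON layout, first page of results only):
#   curl -fsSL "https://hub.docker.com/v2/repositories/pytorch/pytorch/tags?page_size=100" \
#     | jq -r '.results[].name' | devel_tags
```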