Commit dbdd109b authored by unknown

Add ViT code
# Created by .ignore support plugin (hsz.mobi)
### Python template
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
!/checkpoint/
!/data/
!/logs/
!/output/
.idea
MIT License
Copyright (c) 2020 jeonsworld
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# Vision Transformer (ViT)
## Model Introduction
The Vision Transformer (ViT) is a Transformer-based neural network model proposed by the Google Brain team. Its main goal is to bring the Transformer architecture to computer vision as an alternative to traditional convolutional neural networks (CNNs).

The core idea of ViT is to turn an image into a sequence and process that sequence with a Transformer. The sequence is built by splitting the image into non-overlapping blocks ("patches") and flattening the pixel values of each patch into a vector; these vectors are concatenated into one long sequence and fed to the Transformer. Unlike a CNN, ViT needs no convolution or pooling layers to process the input image. Instead, it models the input sequence with multi-head self-attention layers, which allow interaction and information flow between any two positions in the sequence and therefore capture global relationships in the image more effectively. Finally, the classification result is produced from the model's output by a fully connected layer or an average-pooling layer.

Compared with CNNs, ViT has several advantages. First, it can handle images of arbitrary size, whereas CNNs usually require the input to be cropped or resized. Second, it captures global relationships better, since every position can interact with every other position. Third, it can easily be adapted to other tasks such as object detection or segmentation with only small changes to the output layers. Overall, ViT is a new and promising computer vision model that offers a Transformer-based approach and achieves strong performance across a range of vision tasks.
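As a concrete illustration of the patch-flattening step, here is a minimal PyTorch sketch (illustrative only, not the repository's own code): a `Conv2d` whose kernel size and stride both equal the patch size projects each non-overlapping 16x16 patch to a fixed-length vector in a single pass.
```python
import torch
import torch.nn as nn

# Minimal patch-embedding sketch: kernel_size == stride == patch_size means each
# non-overlapping 16x16 patch is projected to a hidden_dim-dimensional vector.
patch_size, hidden_dim = 16, 768
patch_embed = nn.Conv2d(3, hidden_dim, kernel_size=patch_size, stride=patch_size)

images = torch.randn(2, 3, 224, 224)          # (batch, channels, H, W)
patches = patch_embed(images)                  # (2, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)    # (2, 196, 768): a sequence of patch tokens
print(tokens.shape)
```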
## Model Architecture
The Vision Transformer architecture, shown in the figure below, consists of three main parts: the patch embedding, the Transformer encoder, and the MLP head. ViT splits the input image into patches, projects each patch into a fixed-length vector, and feeds the resulting sequence to the Transformer; from that point the encoder works exactly as in the original Transformer. Because the task here is image classification, a special token is added to the input sequence, and the output corresponding to that token gives the final class prediction. A minimal sketch of these parts follows the figure.
![img](https://pic4.zhimg.com/80/v2-5afd38bd10b279f3a572b13cda399233_1440w.webp)
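A compact sketch of the three parts and the extra classification token, built from stock `torch.nn` modules (illustrative only; the repository's own encoder differs in details such as normalization placement and activation):
```python
import torch
import torch.nn as nn

# Sketch of: patch tokens -> prepend [class] token + position embedding ->
# Transformer encoder -> MLP head on the [class] token output.
batch, num_patches, hidden_dim, num_classes = 2, 196, 768, 10

cls_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, hidden_dim))
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=12,
                               dim_feedforward=3072, batch_first=True),
    num_layers=12)
head = nn.Linear(hidden_dim, num_classes)

patch_tokens = torch.randn(batch, num_patches, hidden_dim)        # from the patch embedding
x = torch.cat([cls_token.expand(batch, -1, -1), patch_tokens], dim=1) + pos_embed
x = encoder(x)                                                     # (batch, 197, hidden_dim)
logits = head(x[:, 0])                                             # predict from the [class] token
print(logits.shape)                                                # torch.Size([2, 10])
```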
## Dataset
The CIFAR-10 dataset can be used for this test.
Prepare the data following the official CIFAR-10 documentation, or download it from the link below and place it in the data directory.
Link: https://pan.baidu.com/s/1ZFMQVBGQZI6UWZKJcTYPAQ?pwd=fq3l (extraction code: fq3l)
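Alternatively, the dataset can be fetched with torchvision. The sketch below assumes 224x224 training inputs; the repository's own transforms in data_utils.py may differ.
```python
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

# Download CIFAR-10 into ./data and resize to the ViT-B_16 pre-training resolution.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)
```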
## ViT Training
### Environment Setup
Docker images for training and inference can be pulled from [光源 (SourceFind)](https://www.sourcefind.cn/#/service-details):
* Training image: `docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.10.0-centos7.6-dtk-22.10.1-py37-latest`
* Install the Python dependencies: `pip install -r requirements.txt`
### Training
Download the pre-trained model and place it in the checkpoint directory:
```
wget https://storage.googleapis.com/vit_models/imagenet21k/ViT-B_16.npz
```
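The `.npz` file holds the named weight arrays exported from Google's JAX implementation; `train.py` takes care of loading them. A quick way to sanity-check the download:
```python
import numpy as np

# Inspect the downloaded checkpoint: an .npz archive of named weight arrays.
weights = np.load("checkpoint/ViT-B_16.npz")
for name in sorted(weights.files)[:5]:   # print a few entry names and shapes
    print(name, weights[name].shape)
```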
Training command:
```
export HIP_VISIBLE_DEVICES=3
python3 -m torch.distributed.launch --nproc_per_node=1 train.py --name cifar10-100_500 --dataset cifar10 --model_type ViT-B_16 --pretrained_dir checkpoint/ViT-B_16.npz --train_batch_size 64 --num_steps 500
```
## Performance and Accuracy
The tests use the CIFAR-10 dataset on a DCU Z100L accelerator card.
Results:
| Cards | Throughput | Accuracy |
| :------: | :------: | :------: |
| 1 | 67.84 samples/s | Best Accuracy = 0.3051 |
### Reference
https://github.com/jeonsworld/ViT-pytorch
# Run
```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 python -m torch.distributed.launch --nproc_per_node=7 train.py --name cifar10-100_500 --dataset cifar10 --model_type ViT-B_16 --pretrained_dir checkpoint/ViT-B_16.npz --fp16 --fp16_opt_level O2 --train_batch_size 64 --num_steps 500
```
> When not using a pretrained model, comment out line 65 of train.py.
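A hypothetical alternative to editing the file by hand is to guard the load behind the existing `--pretrained_dir` argument. The sketch below assumes the statement on line 65 resembles a `load_from` call, which may not match the actual code.
```python
# Hypothetical guard (the actual statement on line 65 of train.py may differ):
# only load pretrained weights when a checkpoint path is supplied.
if args.pretrained_dir:
    model.load_from(np.load(args.pretrained_dir))
```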
# Vision Transformer
Pytorch reimplementation of [Google's repository for the ViT model](https://github.com/google-research/vision_transformer) that was released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
This paper shows that a Transformer applied directly to image patches and pre-trained on large datasets works very well on image recognition tasks.
![fig1](./img/figure1.png)
Vision Transformer achieves state-of-the-art results on image recognition tasks with a standard Transformer encoder and fixed-size patches. To perform classification, the authors use the standard approach of adding an extra learnable "classification token" to the sequence.
![fig2](./img/figure2.png)
## Usage
### 1. Download Pre-trained model (Google's Official Checkpoint)
* [Available models](https://console.cloud.google.com/storage/vit_models/): ViT-B_16(**85.8M**), R50+ViT-B_16(**97.96M**), ViT-B_32(**87.5M**), ViT-L_16(**303.4M**), ViT-L_32(**305.5M**), ViT-H_14(**630.8M**)
* imagenet21k pre-train models
* ViT-B_16, ViT-B_32, ViT-L_16, ViT-L_32, ViT-H_14
* imagenet21k pre-train + imagenet2012 fine-tuned models
* ViT-B_16-224, ViT-B_16, ViT-B_32, ViT-L_16-224, ViT-L_16, ViT-L_32
* Hybrid Model([Resnet50](https://github.com/google-research/big_transfer) + Transformer)
* R50-ViT-B_16
```
# imagenet21k pre-train
wget https://storage.googleapis.com/vit_models/imagenet21k/{MODEL_NAME}.npz
# imagenet21k pre-train + imagenet2012 fine-tuning
wget https://storage.googleapis.com/vit_models/imagenet21k+imagenet2012/{MODEL_NAME}.npz
```
### 2. Train Model
```
python3 train.py --name cifar10-100_500 --dataset cifar10 --model_type ViT-B_16 --pretrained_dir checkpoint/ViT-B_16.npz
```
CIFAR-10 and CIFAR-100 are downloaded and used for training automatically. To use a different dataset, you need to customize [data_utils.py](./utils/data_utils.py).
The default batch size is 512. When GPU memory is insufficient, you can proceed with training by adjusting the value of `--gradient_accumulation_steps`.
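Gradient accumulation keeps the effective batch size while lowering per-step memory: gradients from several micro-batches are summed before a single optimizer step. A self-contained sketch of the idea (illustrative; `train.py` implements its own training loop):
```python
import torch
import torch.nn as nn

# Minimal gradient-accumulation sketch: sum gradients over several micro-batches,
# then take one optimizer step, so 8 x 64 micro-batches behave like one 512 batch.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
accum_steps = 8

optimizer.zero_grad()
for step in range(32):                                # stand-in for iterating a DataLoader
    x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
    loss = criterion(model(x), y)
    (loss / accum_steps).backward()                   # scale so the summed grad matches one big batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()                              # one update per accum_steps micro-batches
        optimizer.zero_grad()
```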
You can also use [Automatic Mixed Precision (AMP)](https://nvidia.github.io/apex/amp.html) to reduce memory usage and train faster:
```
python3 train.py --name cifar10-100_500 --dataset cifar10 --model_type ViT-B_16 --pretrained_dir checkpoint/ViT-B_16.npz --fp16 --fp16_opt_level O2
```
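The `--fp16` path relies on NVIDIA Apex. Roughly, the training loop is wrapped as below (a sketch assuming a `model`, `optimizer`, and loss already set up as in the sections above):
```python
from apex import amp  # NVIDIA Apex must be installed for the --fp16 path

# Rough sketch of Apex AMP at opt_level O2 (mixed precision with dynamic loss scaling).
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

loss = criterion(model(images), labels)
with amp.scale_loss(loss, optimizer) as scaled_loss:  # scale the loss to avoid fp16 underflow
    scaled_loss.backward()
optimizer.step()
```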
## Results
To verify that the converted model weights are correct, we simply compare them with the author's experimental results. We trained using mixed precision, with `--fp16_opt_level` set to O2.
### imagenet-21k
* [**tensorboard**](https://tensorboard.dev/experiment/Oz9GmmQIQCOEr4xbdr8O3Q)
| model | dataset | resolution | acc(official) | acc(this repo) | time |
|:------------:|:---------:|:----------:|:-------------:|:--------------:|:-------:|
| ViT-B_16 | CIFAR-10 | 224x224 | - | 0.9908 | 3h 13m |
| ViT-B_16 | CIFAR-10 | 384x384 | 0.9903 | 0.9906 | 12h 25m |
| ViT-B_16 | CIFAR-100 | 224x224 | - | 0.923 | 3h 9m |
| ViT-B_16 | CIFAR-100 | 384x384 | 0.9264 | 0.9228 | 12h 31m |
| R50-ViT-B_16 | CIFAR-10 | 224x224 | - | 0.9892 | 4h 23m |
| R50-ViT-B_16 | CIFAR-10 | 384x384 | 0.99 | 0.9904 | 15h 40m |
| R50-ViT-B_16 | CIFAR-100 | 224x224 | - | 0.9231 | 4h 18m |
| R50-ViT-B_16 | CIFAR-100 | 384x384 | 0.9231 | 0.9197 | 15h 53m |
| ViT-L_32 | CIFAR-10 | 224x224 | - | 0.9903 | 2h 11m |
| ViT-L_32 | CIFAR-100 | 224x224 | - | 0.9276 | 2h 9m |
| ViT-H_14 | CIFAR-100 | 224x224 | - | WIP | |
### imagenet-21k + imagenet2012
* [**tensorboard**](https://tensorboard.dev/experiment/CXOzjFRqTM6aLCk0jNXgAw/#scalars)
| model | dataset | resolution | acc |
|:------------:|:---------:|:----------:|:------:|
| ViT-B_16-224 | CIFAR-10 | 224x224 | 0.99 |
| ViT-B_16-224 | CIFAR-100 | 224x224 | 0.9245 |
| ViT-L_32 | CIFAR-10 | 224x224 | 0.9903 |
| ViT-L_32 | CIFAR-100 | 224x224 | 0.9285 |
### shorter train
* In the experiments below, we used a resolution of 224x224.
* [**tensorboard**](https://tensorboard.dev/experiment/lpknnMpHRT2qpVrSZi10Ag/#scalars)
| upstream | model | dataset | total_steps /warmup_steps | acc(official) | acc(this repo) |
|:-----------:|:--------:|:---------:|:-------------------------:|:-------------:|:--------------:|
| imagenet21k | ViT-B_16 | CIFAR-10 | 500/100 | 0.9859 | 0.9859 |
| imagenet21k | ViT-B_16 | CIFAR-10 | 1000/100 | 0.9886 | 0.9878 |
| imagenet21k | ViT-B_16 | CIFAR-100 | 500/100 | 0.8917 | 0.9072 |
| imagenet21k | ViT-B_16 | CIFAR-100 | 1000/100 | 0.9115 | 0.9216 |
## Visualization
The ViT consists of a standard Transformer encoder, and the encoder consists of Self-Attention and MLP modules.
The attention map for the input image can be visualized through the attention score of self-attention.
Visualization code can be found at [visualize_attention_map](./visualize_attention_map.ipynb).
![fig3](./img/figure3.png)
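A condensed sketch of the underlying idea (attention rollout): average the heads in each layer, add the identity to account for residual connections, multiply the layer matrices together, and read off the [class]-token row as a 14x14 map over the patches. The notebook's exact procedure may differ; the attention matrices below are random stand-ins.
```python
import torch

# Attention-rollout sketch for a 14x14 patch grid plus the [class] token (197 tokens).
num_layers, heads, tokens = 12, 12, 197
att_mats = [torch.rand(heads, tokens, tokens).softmax(dim=-1) for _ in range(num_layers)]

rollout = torch.eye(tokens)
for att in att_mats:
    att = att.mean(dim=0)                       # average over heads
    att = att + torch.eye(tokens)               # account for residual connections
    att = att / att.sum(dim=-1, keepdim=True)   # re-normalize rows
    rollout = att @ rollout                     # propagate attention through layers

mask = rollout[0, 1:].reshape(14, 14)           # [class]-token attention over the 196 patches
print(mask.shape)                               # upsample to 224x224 to overlay on the image
```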
## Reference
* [Google ViT](https://github.com/google-research/vision_transformer)
* [Pytorch Image Models(timm)](https://github.com/rwightman/pytorch-image-models)
## Citations
```bibtex
@article{dosovitskiy2020,
title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
journal={arXiv preprint arXiv:2010.11929},
year={2020}
}
```