## Datasets

- Create a `data` folder to hold the datasets:
  - `cd MobileVLM && mkdir -p data/pretrain_data data/finetune_data data/benchmark_data` # work_dir is the MobileVLM directory
- Pre-training data
  - `cd ${work_dir}/data/pretrain_data`
  - download the ShareGPT4V-PT from [here](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/blob/main/share-captioner_coco_lcs_sam_1246k_1107.json), which is provided by ShareGPT4V team.
- Multi-task training data
  - `cd ${work_dir}/data/finetune_data`
  - download the annotations of the MobileVLM_V2_FT_Mix2M data from Hugging Face [here](https://huggingface.co/datasets/mtgv/MobileVLM_V2_FT_Mix2M), and download the images of the constituent datasets:
  [Text-VQA](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip),
  [IconQA](https://drive.google.com/file/d/1Xqdt1zMcMZU5N_u1SAIjk-UAclriynGx/edit), [SQA](https://drive.google.com/drive/folders/1w8imCXWYn2LxajmGeGH_g5DaL2rabHev), [SBU](https://huggingface.co/datasets/sbu_captions). Then follow [ShareGPT4V](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md) to download images from:
  [LAION-CC-SBU-558K](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain/blob/main/images.zip), [COCO](http://images.cocodataset.org/zips/train2017.zip), [WebData](https://drive.google.com/drive/folders/1tCUQ-sq6vdshZVkF0ZeF3K4eztkXJgax?usp=sharing), [SAM](https://drive.google.com/file/d/1dKumdOKSXtV7lIXdrG7jsIK_z2vZv2gs/view?usp=drive_link), [GQA](https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip), [OCR-VQA](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing), [TextVQA](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip), [Visual Genome](https://cs.stanford.edu/people/rak248/VG_100K_2) ([Part1](https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip), [Part2](https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip))

- Benchmark evaluation data
  - We evaluate models on a diverse set of 6 benchmarks, *i.e.* GQA, MMBench, MME, POPE, SQA, TextVQA. We do not use beam search during evaluation, so that inference is consistent with the real-time outputs of the chat demo. Follow the instructions below to set up the datasets.
  - <details>
    <summary> Data Download Instructions </summary>

    - download some useful [data/scripts](https://github.com/Meituan-AutoML/MobileVLM/releases/download/v0.1/benchmark_data.zip) pre-collected by us.
      - `unzip benchmark_data.zip && cd benchmark_data`
      - `bmk_dir=${work_dir}/data/benchmark_data`
    - gqa
      - download its image data following the official instructions [here](https://cs.stanford.edu/people/dorarad/gqa/download.html)
      - `cd ${bmk_dir}/gqa && ln -s /path/to/gqa/images images`
    - mme
      - download the data following the official instructions [here](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation).
      - `cd ${bmk_dir}/mme && ln -s /path/to/MME/MME_Benchmark_release_version images`
    - pope
      - download coco from POPE following the official instructions [here](https://github.com/AoiDragon/POPE/tree/e3e39262c85a6a83f26cf5094022a782cb0df58d/output/coco).
      - `cd ${bmk_dir}/pope && ln -s /path/to/pope/coco coco && ln -s /path/to/coco/val2014 val2014`
    - sqa
      - download images from the `data/scienceqa` folder of the ScienceQA [repo](https://github.com/lupantech/ScienceQA).
      - `cd ${bmk_dir}/sqa && ln -s /path/to/sqa/images images`
    - textvqa
      - download images following the instructions [here](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip).
      - `cd ${bmk_dir}/textvqa && ln -s /path/to/textvqa/train_images train_images`
    - mmbench
      - no action is needed.

    </details>
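
One way to fetch and unpack the pre-collected benchmark scripts mentioned above is sketched below. This is only an illustration: it assumes `wget` and `unzip` are available and that `${work_dir}` points at your MobileVLM checkout; the release URL itself comes from the instructions above.

```
# Sketch: place the pre-collected benchmark scripts under ${work_dir}/data.
cd ${work_dir}/data
wget -c https://github.com/Meituan-AutoML/MobileVLM/releases/download/v0.1/benchmark_data.zip
unzip benchmark_data.zip   # expected to expand into a benchmark_data/ directory, matching the layout further below
bmk_dir=${work_dir}/data/benchmark_data
```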

For more details, see the original project's [`README_origin`](./README_origin.md). Because this project relies on a large number of datasets, a mini dataset is not provided here; please decide for yourself whether to download everything, depending on whether you need to fine-tune the model.
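
If you only want the annotation files first (for example, to inspect their format before committing to the full image download), a minimal sketch using the `huggingface-cli` tool from `huggingface_hub` might look like the following. The tool and the local target directories are assumptions; the repository and file names come from the links above.

```
# Sketch: fetch just the pre-training and multi-task annotation files from Hugging Face.
pip install -U "huggingface_hub[cli]"   # provides the huggingface-cli command

# ShareGPT4V-PT captions used for pre-training
huggingface-cli download Lin-Chen/ShareGPT4V share-captioner_coco_lcs_sam_1246k_1107.json \
    --repo-type dataset --local-dir ${work_dir}/data/pretrain_data

# MobileVLM_V2_FT_Mix2M annotations used for multi-task training
huggingface-cli download mtgv/MobileVLM_V2_FT_Mix2M \
    --repo-type dataset --local-dir ${work_dir}/data/finetune_data
```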


The complete data directory structure is as follows:
```
data
├── benchmark_data
│   ├── gqa
│   │   ├── convert_gqa_for_eval.py
│   │   ├── eval.py
│   │   ├── images -> /path/to/your/gqa/images
│   │   ├── llava_gqa_testdev_balanced.jsonl
│   │   └── testdev_balanced_questions.json
│   ├── mmbench
│   │   ├── convert_mmbench_for_submission.py
│   │   ├── eval.py
│   │   └── mmbench_dev_en_20231003.tsv
│   ├── mme
│   │   ├── calculation.py
│   │   ├── convert_answer_to_mme.py
│   │   ├── images -> /path/to/your/MME/MME_Benchmark_release_version
│   │   └── llava_mme.jsonl
│   ├── pope
│   │   ├── coco -> /path/to/your/pope/coco
│   │   ├── eval.py
│   │   ├── llava_pope_test.jsonl
│   │   └── val2014 -> /path/to/your/coco/val2014
│   ├── sqa
│   │   ├── eval.py
│   │   ├── images -> /path/to/your/scienceqa/images
│   │   ├── llava_test_CQM-A.json
│   │   ├── pid_splits.json
│   │   └── problems.json
│   └── textvqa
│       ├── eval.py
│       ├── llava_textvqa_val_v051_ocr.jsonl
│       ├── TextVQA_0.5.1_val.json
│       └── train_images -> /path/to/your/textvqa/train_images
├── finetune_data
│   ├── llava_v1_5_mix665k.json
│   ├── MobileVLM_V2_FT_Mix2M.json
│   ├── coco
│   │   ├── train2017
│   │   └── val2017
│   ├── gqa
│   │   └── images
│   ├── iconqa_data
│   │   └── iconqa
│   │       └── train
│   │           ├── choose_img
│   │           ├── choose_txt
│   │           └── fill_in_blank
│   ├── ocr_vqa
│   │   └── images
│   ├── sam
│   │   └── images
│   ├── SBU
│   │   └── images
│   ├── ScienceQA
│   │   └── train
│   ├── share_textvqa
│   │   └── images
│   ├── textvqa
│   │   └── train_images
│   ├── vg
│   │   ├── VG_100K
│   │   └── VG_100K_2
│   ├── web-celebrity
│   │   └── images
│   ├── web-landmark
│   │   └── images
│   └── wikiart
│       └── images
└── pretrain_data
    ├── share-captioner_coco_lcs_sam_1246k_1107.json
    ├── blip_laion_cc_sbu_558k.json
    ├── images
    ├── coco
    │   └── train2017
    ├── llava
    │   └── llava_pretrain
    └── sam
        └── images
```
For more details, see the original project's [`README_origin`](./README_origin.md).
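
Before launching training, a quick check that the expected files and symlinks are in place can save a failed run. The snippet below is only a sketch: the paths are a small, non-exhaustive selection from the tree above, and `${work_dir}` is assumed to point at your MobileVLM checkout.

```
# Sketch: verify that a few key annotation files and image directories/symlinks exist.
for p in \
    data/pretrain_data/share-captioner_coco_lcs_sam_1246k_1107.json \
    data/finetune_data/MobileVLM_V2_FT_Mix2M.json \
    data/benchmark_data/gqa/images \
    data/benchmark_data/textvqa/train_images
do
    if [ -e "${work_dir}/${p}" ]; then
        echo "OK      ${p}"
    else
        echo "MISSING ${p}"
    fi
done
```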

## Training
For fine-tuning you need to download the pre-trained weights `mtgv/MobileVLM_V2-1.7B`: https://huggingface.co/mtgv/MobileVLM_V2-1.7B

You also need to download the image-text CLIP weights `openai/clip-vit-large-patch14-336`: https://huggingface.co/openai/clip-vit-large-patch14-336
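
If you prefer to fetch both checkpoints from the command line rather than through the browser, one possible sketch with `huggingface-cli` is shown below; the local directory names are just an assumption, so adjust them to wherever your training scripts expect the weights.

```
# Sketch: download the two required checkpoints into local folders.
huggingface-cli download mtgv/MobileVLM_V2-1.7B --local-dir mtgv/MobileVLM_V2-1.7B
huggingface-cli download openai/clip-vit-large-patch14-336 --local-dir openai/clip-vit-large-patch14-336
```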

### Single-node multi-GPU
```
bash run.sh mobilevlm_v2_1.7b pretrain mtgv/MobileVLM_V2-1.7B openai/clip-vit-large-patch14-336 # or: sh pretrain.sh
# The bnb library currently supports fp16 fine-tuning only; support for other precisions will be added later.
# The deep-learning libraries required for fine-tuning are listed in the environment setup above; download the full datasets yourself before running.
```
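
If you need to restrict the run to a subset of the node's GPUs, the standard `CUDA_VISIBLE_DEVICES` environment variable is the usual mechanism. This is a generic sketch and assumes the launcher invoked inside `run.sh` honors the variable, as `torchrun`/`deepspeed` setups typically do.

```
# Sketch: use only GPUs 0-3 for the pre-training run.
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run.sh mobilevlm_v2_1.7b pretrain mtgv/MobileVLM_V2-1.7B openai/clip-vit-large-patch14-336
```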