## Dataset

- Create a `data` folder to hold the data:
  - `cd MobileVLM && mkdir -p data/pretrain_data data/finetune_data data/benchmark_data` # work_dir is MobileVLM
- Pretraining data
  - `cd ${work_dir}/data/pretrain_data`
  - download the ShareGPT4V-PT annotation from [here](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/blob/main/share-captioner_coco_lcs_sam_1246k_1107.json), which is provided by the ShareGPT4V team.
- Multi-task training data
  - `cd ${work_dir}/data/finetune_data`
  - download the annotation of our MobileVLM_V2_FT_Mix2M data from Hugging Face [here](https://huggingface.co/datasets/mtgv/MobileVLM_V2_FT_Mix2M), and download the images from its constituent datasets: [Text-VQA](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip), [IConQA](https://drive.google.com/file/d/1Xqdt1zMcMZU5N_u1SAIjk-UAclriynGx/edit), [SQA](https://drive.google.com/drive/folders/1w8imCXWYn2LxajmGeGH_g5DaL2rabHev), [SBU](https://huggingface.co/datasets/sbu_captions); then follow [ShareGPT4V](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md) to download images from [LAION-CC-SBU-558K](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain/blob/main/images.zip), [COCO](http://images.cocodataset.org/zips/train2017.zip), [WebData](https://drive.google.com/drive/folders/1tCUQ-sq6vdshZVkF0ZeF3K4eztkXJgax?usp=sharing), [SAM](https://drive.google.com/file/d/1dKumdOKSXtV7lIXdrG7jsIK_z2vZv2gs/view?usp=drive_link), [GQA](https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip), [OCR-VQA](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing), [TextVQA](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip), and [VisualGenome](https://cs.stanford.edu/people/rak248/VG_100K_2) ([Part1](https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip), [Part2](https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip)). A download sketch for the annotations is given after the benchmark instructions below.
- Benchmark data
  - We evaluate models on a diverse set of 6 benchmarks, *i.e.* GQA, MMBench, MME, POPE, SQA, and TextVQA. We do not evaluate with beam search, so that the inference process stays consistent with the real-time outputs of the chat demo. Follow the instructions below to set up the datasets.
  - Data download instructions (the symlink steps are also collected into one sketch below)
    - download some useful [data/scripts](https://github.com/Meituan-AutoML/MobileVLM/releases/download/v0.1/benchmark_data.zip) pre-collected by us.
    - `unzip benchmark_data.zip && cd benchmark_data`
    - `bmk_dir=${work_dir}/data/benchmark_data`
    - gqa
      - download its image data following the official instructions [here](https://cs.stanford.edu/people/dorarad/gqa/download.html)
      - `cd ${bmk_dir}/gqa && ln -s /path/to/gqa/images images`
    - mme
      - download the data following the official instructions [here](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation).
      - `cd ${bmk_dir}/mme && ln -s /path/to/MME/MME_Benchmark_release_version images`
    - pope
      - download COCO from POPE following the official instructions [here](https://github.com/AoiDragon/POPE/tree/e3e39262c85a6a83f26cf5094022a782cb0df58d/output/coco).
      - `cd ${bmk_dir}/pope && ln -s /path/to/pope/coco coco && ln -s /path/to/coco/val2014 val2014`
    - sqa
      - download images from the `data/scienceqa` folder of the ScienceQA [repo](https://github.com/lupantech/ScienceQA).
      - `cd ${bmk_dir}/sqa && ln -s /path/to/sqa/images images`
    - textvqa
      - download images following the instructions [here](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip).
      - `cd ${bmk_dir}/textvqa && ln -s /path/to/textvqa/train_images train_images`
    - mmbench
      - no action is needed.
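For convenience, the per-benchmark symlink commands above can be run in a single pass once the raw data is in place. This is just the same `ln -s` steps collected together; every `/path/to/...` is a placeholder for wherever you actually downloaded the data:

```
# Same symlinks as in the list above, collected in one place.
# Replace each /path/to/... placeholder with the real download location.
bmk_dir=${work_dir}/data/benchmark_data
ln -s /path/to/gqa/images                        ${bmk_dir}/gqa/images
ln -s /path/to/MME/MME_Benchmark_release_version ${bmk_dir}/mme/images
ln -s /path/to/pope/coco                         ${bmk_dir}/pope/coco
ln -s /path/to/coco/val2014                      ${bmk_dir}/pope/val2014
ln -s /path/to/sqa/images                        ${bmk_dir}/sqa/images
ln -s /path/to/textvqa/train_images              ${bmk_dir}/textvqa/train_images
# mmbench needs no extra images
```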
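The pretraining and fine-tuning annotations linked earlier can be fetched along these lines. This is a minimal sketch, not the authors' official procedure: the `/resolve/` URL is the direct-download form of the `/blob/` link above, the assumption that the Mix2M annotation ships as top-level JSON files in its dataset repo is unverified, and unzipped folder names may need renaming to match the directory tree below.

```
# Pretraining annotation (ShareGPT4V-PT): /resolve/ is the direct-download form of the /blob/ link above
cd ${work_dir}/data/pretrain_data
wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/share-captioner_coco_lcs_sam_1246k_1107.json

# Fine-tuning annotation (MobileVLM_V2_FT_Mix2M): clone the dataset repo, then keep only the JSON
# (assumes the annotation is stored as top-level JSON files in the repo)
cd ${work_dir}/data/finetune_data
git lfs install
git clone https://huggingface.co/datasets/mtgv/MobileVLM_V2_FT_Mix2M
mv MobileVLM_V2_FT_Mix2M/*.json . && rm -rf MobileVLM_V2_FT_Mix2M

# Two of the image archives as examples; the remaining sources follow the same pattern
wget -P coco http://images.cocodataset.org/zips/train2017.zip && unzip coco/train2017.zip -d coco
wget -P textvqa https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip && unzip textvqa/train_val_images.zip -d textvqa
```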
For more details, see the upstream project's [`README_origin`](./README_origin.md). Because this project relies on so many datasets, no mini dataset is provided here; decide whether to do the full download based on whether you need to fine-tune the model yourself.

The full data directory structure is as follows:

```
data
├── benchmark_data
│   ├── gqa
│   │   ├── convert_gqa_for_eval.py
│   │   ├── eval.py
│   │   ├── images -> /path/to/your/gqa/images
│   │   ├── llava_gqa_testdev_balanced.jsonl
│   │   └── testdev_balanced_questions.json
│   ├── mmbench
│   │   ├── convert_mmbench_for_submission.py
│   │   ├── eval.py
│   │   └── mmbench_dev_en_20231003.tsv
│   ├── mme
│   │   ├── calculation.py
│   │   ├── convert_answer_to_mme.py
│   │   ├── images -> /path/to/your/MME/MME_Benchmark_release_version
│   │   └── llava_mme.jsonl
│   ├── pope
│   │   ├── coco -> /path/to/your/pope/coco
│   │   ├── eval.py
│   │   ├── llava_pope_test.jsonl
│   │   └── val2014 -> /path/to/your/coco/val2014
│   ├── sqa
│   │   ├── eval.py
│   │   ├── images -> /path/to/your/scienceqa/images
│   │   ├── llava_test_CQM-A.json
│   │   ├── pid_splits.json
│   │   └── problems.json
│   └── textvqa
│       ├── eval.py
│       ├── llava_textvqa_val_v051_ocr.jsonl
│       ├── TextVQA_0.5.1_val.json
│       └── train_images -> /path/to/your/textvqa/train_images
├── finetune_data
│   ├── llava_v1_5_mix665k.json
│   ├── MobileVLM_V2_FT_Mix2M.json
│   ├── coco
│   │   ├── train2017
│   │   └── val2017
│   ├── gqa
│   │   └── images
│   ├── iconqa_data
│   │   └── iconqa
│   │       └── train
│   │           ├── choose_img
│   │           ├── choose_txt
│   │           └── fill_in_blank
│   ├── ocr_vqa
│   │   └── images
│   ├── sam
│   │   └── images
│   ├── SBU
│   │   └── images
│   ├── ScienceQA
│   │   └── train
│   ├── share_textvqa
│   │   └── images
│   ├── textvqa
│   │   └── train_images
│   ├── vg
│   │   ├── VG_100K
│   │   └── VG_100K_2
│   ├── web-celebrity
│   │   └── images
│   ├── web-landmark
│   │   └── images
│   └── wikiart
│       └── images
└── pretrain_data
    ├── share-captioner_coco_lcs_sam_1246k_1107.json
    ├── blip_laion_cc_sbu_558k.json
    ├── images
    ├── coco
    │   └── train2017
    ├── llava
    │   └── llava_pretrain
    └── sam
        └── images
```

## Training

Fine-tuning requires the pretrained weights `mtgv/MobileVLM_V2-1.7B` (https://huggingface.co/mtgv/MobileVLM_V2-1.7B) as well as the image-text CLIP weights `openai/clip-vit-large-patch14-336` (https://huggingface.co/openai/clip-vit-large-patch14-336); a possible download command is sketched at the end of this section.

### Single node, multiple GPUs

```
bash run.sh mobilevlm_v2_1.7b pretrain mtgv/MobileVLM_V2-1.7B openai/clip-vit-large-patch14-336
# or: sh pretrain.sh
# The bnb library currently supports only fp16 fine-tuning; other precisions will be opened up gradually later.
# See the environment setup above for the deep-learning libraries required for fine-tuning;
# the script can only be used after you have downloaded the full datasets yourself.
```
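A possible way to fetch the two checkpoints mentioned above is a plain git-lfs clone. The target location under `${work_dir}` is only an assumption; adjust the arguments passed to `run.sh` if you point it at local copies instead of the Hugging Face repo ids.

```
# Fetch the checkpoints needed for fine-tuning (directory layout is illustrative)
cd ${work_dir}
git lfs install
git clone https://huggingface.co/mtgv/MobileVLM_V2-1.7B
git clone https://huggingface.co/openai/clip-vit-large-patch14-336
```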