"...data/git@developer.sourcefind.cn:OpenDAS/megatron-lm.git" did not exist on "dff98d475f3efaf81a080fa43be391b55a7f6243"
Commit b81b2f59 authored by wanglch's avatar wanglch
Browse files

Initial commit

parent f7c86e68
<p align="center"> # TextMonkey
<img src="https://v1.ax1x.com/2024/04/13/7ySieU.png" width="500" style="margin-bottom: 0.2;"/>
<p> TextMonkey是这是一种专为以文本为中心的任务而定制的大型多模态模型 (LMM),包括文档问答 (DocVQA) 和场景文本分析。
<h3 align="center"> <a href="https://arxiv.org/abs/2311.06607">Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models</a></h3> ## 论文
<h2></h2>
- [TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document](https://arxiv.org/abs/2403.04473)
<h5 align="center"> Please give us a star ⭐ for the latest update. </h5>
## Model Architecture

A sliding-window module first divides the input image into non-overlapping patches of 448x448 pixels. Each patch is further subdivided into smaller 14x14-pixel patches, each of which is treated as one token. Using the pretrained CLIP model, the tokens of each window patch are processed separately. To establish connections between the individual window patches, Shifted Window Attention is inserted at intervals between the Transformer blocks. To obtain a hierarchical representation, the input image is also resized to 448x448 and fed to CLIP to extract a global feature. This global feature, together with the features from the sub-images, is then processed by a shared image resampler to align it with the language domain. Finally, a Token Resampler compresses the token length to further reduce redundancy in the language space.

<div align="center">
<img src="./assets/model_structrue.png"/>
</div>
<br>
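To make the patch arithmetic above concrete, here is a minimal PyTorch sketch (illustrative only, not the project's actual implementation) that splits a resized image into non-overlapping 448x448 windows and counts the 14x14 visual tokens each window yields:

```python
import torch

def split_into_windows(image: torch.Tensor, window: int = 448) -> torch.Tensor:
    """Split a (C, H, W) image into non-overlapping (window x window) patches.

    H and W are assumed to be multiples of `window`; returns (N, C, window, window).
    """
    c, _, _ = image.shape
    patches = image.unfold(1, window, window).unfold(2, window, window)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c, window, window)

image = torch.rand(3, 896, 1344)          # an input resized to a 2 x 3 grid of windows
windows = split_into_windows(image)        # -> torch.Size([6, 3, 448, 448])
tokens_per_window = (448 // 14) ** 2       # each 14x14 sub-patch becomes one token -> 1024
print(windows.shape, tokens_per_window)
```

Each window would then go through the CLIP vision tower, and the per-window tokens plus the global 448x448 view are what the shared resampler and the Token Resampler subsequently compress.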
## Algorithm

For unified document-structure learning, this work builds DocStruct4M, a comprehensive structured-parsing dataset assembled from open-source datasets. For document images and web-page screenshots, text layout is expressed mainly with spaces and line breaks. For tables, an extended Markdown syntax can represent cells that span rows or columns while needing far fewer tags than HTML. For charts, Markdown is likewise used to express the underlying numbers, and values are restricted to the significant digits that remain visually legible in the image. For natural images, a caption combined with the OCR text is used.

<div align=center>
<img src="./assets/model_theory.png"/>
</div>
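As a toy illustration of that idea (the actual DocStruct4M serialization conventions are defined by the dataset itself and may differ), a small chart can be flattened into a Markdown table whose values are rounded to the precision that is actually visible in the rendered image:

```python
# Toy example only: flatten chart data into a compact Markdown table,
# keeping just the digits a reader could distinguish in the plotted image.
points = {"2019": 1.2345, "2020": 2.4678, "2021": 3.9012}

lines = ["| year | value |", "| --- | --- |"]
lines += [f"| {year} | {value:.1f} |" for year, value in points.items()]
print("\n".join(lines))
```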
## Environment Setup

### Docker (Method 1)

Pull the Docker image from [光源](https://www.sourcefind.cn/#/service-details) and use it as follows:

```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
docker run -it -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro --shm-size=64G --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name textmonkey <your imageID> bash
cd /path/your_code_data/
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```
### Dockerfile (Method 2)

```
cd /path/your_code_data/docker
docker build --no-cache -t textmonkey:latest .
docker run --shm-size=64G --name mplug-doclocal -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v /path/your_code_data/:/path/your_code_data/ -it textmonkey bash
```
### Anaconda (Method 3)

The DCU-specific deep-learning libraries required by this project can be downloaded from the [光合](https://developer.hpccube.com/tool/) developer community.

```
DTK driver: dtk24.04
python: python3.10
torch: 2.1
torchvision: 0.16.0
deepspeed: 0.12.3
```

`Tips: the DTK driver, Python, Torch and the other DCU-related tool versions listed above must correspond strictly one to one.`

```
conda create -n textmonkey python=3.10
conda activate textmonkey
cd /path/your_code_data/
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple
```
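After installation, an optional sanity check (not part of the repository) can confirm that the versions above are in place and that the DCU devices are visible to PyTorch:

```python
import torch
import torchvision
import deepspeed

# Versions should match the table above: torch 2.1, torchvision 0.16.0, deepspeed 0.12.3.
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("deepspeed:", deepspeed.__version__)

# On the DCU/DTK (ROCm-derived) stack the accelerators are exposed through the CUDA API.
print("devices available:", torch.cuda.is_available(), "count:", torch.cuda.device_count())
```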
## Dataset

Mini dataset: [mm_tutorial](./assets/mm_tutorial)

Full dataset: [MelosY/TextMonkey_Data](https://huggingface.co/datasets/MelosY/TextMonkey_Data)

To train, prepare your training data by putting all samples into a single list and saving it to a JSON file. Each sample is a dictionary with the fields shown in the example below; prepare the full dataset used for regular training in the same structure:

```
[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "user",
        "value": "你好"
      },
      {
        "from": "assistant",
        "value": "我是TextMonkey,一个支持视觉输入的大模型。"
      }
    ]
  },
  {
    "id": "identity_1",
    "conversations": [
      {
        "from": "user",
        "value": "Picture 1: <img>/home/wanglch/projects/TextMonkey/Monkey/assets/mm_tutorial/Chongqing.jpeg</img>\nPicture 2: <img>/home/wanglch/projects/TextMonkey/Monkey/assets/mm_tutorial/Beijing.jpeg</img>\n图中都是哪"
      },
      {
        "from": "assistant",
        "value": "第一张图片是重庆的城市天际线,第二张图片是北京的天际线。"
      }
    ]
  }
]
```
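For reference, a small helper like the following (illustrative, not part of the repository) assembles samples in exactly this structure and writes them to a hypothetical `train.json` that can later be passed to the training script:

```python
import json

def make_sample(sample_id: str, image_paths, question: str, answer: str) -> dict:
    """Build one training sample in the conversation format shown above."""
    img_tags = "".join(
        f"Picture {i + 1}: <img>{path}</img>\n" for i, path in enumerate(image_paths)
    )
    return {
        "id": sample_id,
        "conversations": [
            {"from": "user", "value": img_tags + question},
            {"from": "assistant", "value": answer},
        ],
    }

samples = [
    make_sample("identity_0", [], "你好", "我是TextMonkey,一个支持视觉输入的大模型。"),
    make_sample(
        "identity_1",
        ["./assets/mm_tutorial/Chongqing.jpeg", "./assets/mm_tutorial/Beijing.jpeg"],
        "图中都是哪",
        "第一张图片是重庆的城市天际线,第二张图片是北京的天际线。",
    ),
]

with open("train.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```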
## Training

Adjust the following paths in the training script according to your actual setup:

```
--deepspeed
--model_name_or_path
--data_path
--image_folder
--output_dir
```
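As a purely illustrative sketch of how those flags end up being filled in (the script name and paths below are placeholders; `finetune_textmonkey_dcu.sh` in this repository is the authoritative version):

```
deepspeed finetune_textmonkey.py \
    --deepspeed ds_config.json \
    --model_name_or_path /path/to/TextMonkey \
    --data_path /path/to/train.json \
    --image_folder /path/to/images \
    --output_dir ./output_textmonkey
```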
### Single Node, Multiple Cards

Training requires 8 A800 80G cards.

```
sh finetune_textmonkey_dcu.sh
```
## Inference

### Single Node, Single Card

### Web QA

Change the model path to your local model path.

```
sh textmonkey_inference_web.sh
```
<p align="center">
<img src="https://v1.ax1x.com/2024/04/13/7yS6Ss.jpg" width="666"/> ### 指令问答
<p>
<br> ```
python demo_textmonkey.py
You can download train images of Monkey from [Train](https://pan.baidu.com/s/1svSjXTxWpI-3boALgSeLlw). Extraction code: 4hdh ```
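Besides the demo scripts, the checkpoint can also be queried programmatically. The snippet below is a hedged sketch based on the usual Transformers remote-code pattern for Monkey-family checkpoints and on the `<img>...</img>` prompt format shown in the dataset example above; `demo_textmonkey.py` remains the reference implementation and the exact interface of the released weights may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "/path/to/TextMonkey"   # local weights, e.g. downloaded from ModelScope

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map="cuda", trust_remote_code=True
).eval()

# The prompt mirrors the training data: image tag(s) followed by the question.
prompt = "<img>./assets/mm_tutorial/Chongqing.jpeg</img> What city is shown in this picture?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=100)

# Strip the prompt tokens and decode only the newly generated answer.
answer = tokenizer.decode(
    output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(answer)
```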
## Results

### Web QA

<div align=center>
<img src="./assets/result1.png"/>
</div>

### Instruction QA

<div align=center>
<img src="./assets/result2.png"/>
</div>

### Accuracy

Mini dataset [mm_tutorial](./assets/mm_tutorial); accelerator cards used: K100/A800.

| device | train_loss |
| :------: | :------: |
| K100 | |
| A800 | |
## Application Scenarios

### Algorithm Category

`OCR, dialogue QA`

### Key Application Industries

`Finance, education, government, transportation`

## Pretrained Weights

- [lvskiller/TextMonkey](https://www.modelscope.cn/models/lvskiller/TextMonkey)
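A hedged example of fetching these weights with the ModelScope SDK (the model id comes from the link above; the download directory is an arbitrary choice):

```python
from modelscope import snapshot_download

# Download the TextMonkey checkpoint listed above into a local cache directory.
local_dir = snapshot_download("lvskiller/TextMonkey", cache_dir="./weights")
print("weights downloaded to:", local_dir)
```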
## Source Repository & Issue Feedback

- https://developer.hpccube.com/codes/modelzoo/textmonkey_pytorch.git

## References

- [TextMonkey GitHub](https://github.com/Yuliang-Liu/Monkey/blob/main/monkey_model/text_monkey/README.md)
## Citing Monkey
If you wish to refer to the baseline results published here, please use the following BibTeX entries:
```BibTeX
@inproceedings{li2023monkey,
title={Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models},
author={Li, Zhang and Yang, Biao and Liu, Qiang and Ma, Zhiyin and Zhang, Shuo and Yang, Jingxu and Sun, Yabo and Liu, Yuliang and Bai, Xiang},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2024}
}
@article{liu2024textmonkey,
title={TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document},
author={Liu, Yuliang and Yang, Biao and Liu, Qiang and Li, Zhang and Ma, Zhiyin and Zhang, Shuo and Bai, Xiang},
journal={arXiv preprint arXiv:2403.04473},
year={2024}
}
```
## Acknowledgement
[Qwen-VL](https://github.com/QwenLM/Qwen-VL.git), [LLAMA](https://github.com/meta-llama/llama), [LLaVA](https://github.com/haotian-liu/LLaVA), [OpenCompass](https://github.com/open-compass/opencompass), [InternLM](https://github.com/InternLM/InternLM).
## Copyright
We welcome suggestions to help us improve Monkey. For any query, please contact Dr. Yuliang Liu: ylliu@hust.edu.cn. If you find something interesting, please also feel free to share with us through email or open an issue. Thanks!