<p align="center">
<img src="https://v1.ax1x.com/2024/04/13/7ySieU.png" width="500" style="margin-bottom: 0.2;"/>
</p>
# TextMonkey
<h3 align="center"> <a href="https://arxiv.org/abs/2311.06607">Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models</a></h3>
TextMonkey is a large multimodal model (LMM) tailored for text-centric tasks, including document question answering (DocVQA) and scene text analysis.
<h5 align="center"> Please give us a star ⭐ for the latest update. </h5>
## Paper
- [TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document](https://arxiv.org/abs/2403.04473)

[![arXiv](https://img.shields.io/badge/Arxiv-2311.06607-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2311.06607)
[![License](https://img.shields.io/badge/License-Apache%202.0-yellow)](https://github.com/Yuliang-Liu/Monkey/blob/main/LICENSE)
[![GitHub issues](https://img.shields.io/github/issues/Yuliang-Liu/Monkey?color=critical&label=Issues)](https://github.com/Yuliang-Liu/Monkey/issues?q=is%3Aopen+is%3Aissue)
[![GitHub closed issues](https://img.shields.io/github/issues-closed/Yuliang-Liu/Monkey?color=success&label=Issues)](https://github.com/Yuliang-Liu/Monkey/issues?q=is%3Aissue+is%3Aclosed) <br>

## Model Structure
A sliding window module first divides the input image into non-overlapping patches of 448x448 pixels. These patches are further subdivided into smaller 14x14-pixel patches, each of which is treated as a token. A pre-trained CLIP model then processes the tokens of each window patch separately. To establish connections between the individual window patches, Shifted Window Attention is integrated at intervals between the Transformer blocks. To generate a hierarchical representation, the input image is also resized to 448x448 and fed into CLIP to extract a global feature. This global feature, together with the features from the sub-images, is then processed by a shared image resampler to align it with the language domain. Finally, a Token Resampler further reduces redundancy in the language space by compressing the token length.
<div align="center">
<img src="./assets/model_structrue.png"/>
</div>
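To make the patching scheme above concrete, here is a minimal sketch (not the official implementation; the torch-based splitting, shapes, and function name are illustrative assumptions):

```python
# Minimal sketch of the window/patch split described above: non-overlapping 448x448
# windows, 14x14-pixel patches inside each window (one visual token per patch), plus a
# 448x448 resize of the whole image as the global view. Not the official implementation.
import torch
import torch.nn.functional as F

WINDOW = 448   # window size handled by the (frozen) CLIP vision encoder
PATCH = 14     # patch size inside each window -> (448/14)^2 = 1024 tokens per window

def split_into_windows_and_patches(image: torch.Tensor):
    """image: (3, H, W), with H and W assumed to be multiples of 448."""
    c, h, w = image.shape
    # (num_windows, 3, 448, 448): non-overlapping sliding-window crops
    windows = (image
               .unfold(1, WINDOW, WINDOW)
               .unfold(2, WINDOW, WINDOW)
               .permute(1, 2, 0, 3, 4)
               .reshape(-1, c, WINDOW, WINDOW))
    # (num_windows, 1024, 3*14*14): each 14x14 patch becomes one visual token
    tokens = (windows
              .unfold(2, PATCH, PATCH)
              .unfold(3, PATCH, PATCH)
              .permute(0, 2, 3, 1, 4, 5)
              .reshape(windows.shape[0], -1, c * PATCH * PATCH))
    # Global view: the full image resized to 448x448 for coarse, hierarchical features.
    global_view = F.interpolate(image.unsqueeze(0), size=(WINDOW, WINDOW),
                                mode="bilinear", align_corners=False)
    return windows, tokens, global_view

# Example: an 896x1344 input yields 2*3 = 6 windows of 1024 tokens each, plus the global view.
img = torch.randn(3, 896, 1344)
windows, tokens, global_view = split_into_windows_and_patches(img)
print(windows.shape, tokens.shape, global_view.shape)
# torch.Size([6, 3, 448, 448]) torch.Size([6, 1024, 588]) torch.Size([1, 3, 448, 448])
```

In the real model, the window features, the global feature, the shared image resampler, and the Token Resampler then produce the compressed visual tokens passed to the language model.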
<details open><summary>💡 Monkey series projects ✨</summary><p>

> [CVPR'24] [**Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models**](https://arxiv.org/abs/2311.06607)<br>
> Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, Xiang Bai <br>
[![Paper](https://img.shields.io/badge/Paper-CVPR'24_Highlight-red)](README.md)
[![Source_code](https://img.shields.io/badge/Code-Available-white)](README.md)
[![Demo](https://img.shields.io/badge/Demo-blue)](http://vlrlab-monkey.xyz:7681/)
[![Detailed Caption](https://img.shields.io/badge/Detailed_Caption-yellow)](http://huggingface.co/datasets/echo840/Detailed_Caption)
[![Model Weight](https://img.shields.io/badge/Model_Weight-gray)](http://huggingface.co/echo840/Monkey)
[![Model Weight in Wisemodel](https://img.shields.io/badge/Model_Weight_in_Wisemodel-gray)](https://www.wisemodel.cn/models/HUST-VLRLab/Monkey/)
[![Demo in Wisemodel](https://img.shields.io/badge/Demo_in_Wisemodel-blue)](https://wisemodel.cn/space/gradio/huakeMonkey)

> [**TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document**](https://arxiv.org/abs/2403.04473)<br>
> Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, Xiang Bai <br>
[![arXiv](https://img.shields.io/badge/Arxiv-2403.04473-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2403.04473)
[![Source_code](https://img.shields.io/badge/Code-Available-white)](monkey_model/text_monkey/README.md)
[![Data](https://img.shields.io/badge/Data-yellow)](https://huggingface.co/datasets/MelosY/TextMonkey_Data/tree/main)
[![Model Weight](https://img.shields.io/badge/Model_Weight-gray)](https://www.modelscope.cn/models/lvskiller/TextMonkey)
</p></details>

## Algorithm Principle
For unified document structure learning, this work constructs DocStruct4M, a comprehensive structured-parsing dataset built from open-source datasets. For document images and webpage screenshots, spaces and line breaks are mainly used to represent the text layout. For tables, an improved Markdown syntax can express cells that span rows or columns while using far fewer tags than HTML. For charts, Markdown is likewise used to represent the numerical content, and values are limited to the number of significant digits that remain visually legible in the image. For natural images, the target is a caption together with the OCR text.
<div align=center>
<img src="./assets/model_theory.png"/>
</div>
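As a purely illustrative example of the compact, tag-free table representation described above (the actual DocStruct4M serialization conventions, including how row and column spans are marked, are defined by the dataset itself, not by this snippet):

```python
# Illustrative only: serialize a simple table into Markdown, the style of compact,
# tag-free representation described above, instead of verbose HTML markup.
def table_to_markdown(header, rows):
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(str(cell) for cell in row) + " |" for row in rows]
    return "\n".join(lines)

print(table_to_markdown(["Region", "Q1", "Q2"], [["North", 12, 15], ["South", 9, 11]]))
```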
## News
* ```2024.4.13 ``` 🚀 Source code for [TextMonkey](monkey_model/text_monkey/README.md) is released.
* ```2024.4.5 ``` 🚀 Monkey is nominated as a CVPR 2024 Highlight paper.
* ```2024.3.8 ``` 🚀 We release the paper [TextMonkey](https://arxiv.org/abs/2403.04473).
* ```2024.2.27 ``` 🚀 Monkey is accepted by CVPR 2024.
* ```2024.1.3 ``` 🚀 Release the basic data generation pipeline. [Data Generation](./data_generation)
* ```2023.12.16``` 🚀 Monkey can be trained using 8 NVIDIA 3090 GPUs. See subsection [train](#Train) for details.
* ```2023.11.06``` 🚀 We release the paper [Monkey](https://arxiv.org/abs/2311.06607).

## 🐳 Model Zoo
Monkey-Chat
| Model|Language Model|Transformers(HF) |MMBench-Test|CCBench|MME|SeedBench_IMG|MathVista-MiniTest|HallusionBench-Avg|AI2D Test|OCRBench|
|---------------|---------|-----------------------------------------|---|---|---|---|---|---|---|---|
|Monkey-Chat|Qwen-7B|[🤗echo840/Monkey-Chat](https://huggingface.co/echo840/Monkey-Chat)|72.4|48|1887.4|68.9|34.8|39.3|68.5|534|

## Environment Setup
### Docker (Method 1)
Pull the Docker image from [光源](https://www.sourcefind.cn/#/service-details) and use it as follows:
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
docker run -it -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro --shm-size=64G --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name textmonkey <your imageID> bash
cd /path/your_code_data/
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```
### Dockerfile (Method 2)
```
cd /path/your_code_data/docker
docker build --no-cache -t textmonkey:latest .
docker run --shm-size=64G --name mplug-doclocal -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v /path/your_code_data/:/path/your_code_data/ -it textmonkey bash
```
### Anaconda (Method 3)
The special deep learning libraries required by this project for DCU GPUs can be downloaded and installed from the [光合](https://developer.hpccube.com/tool/) developer community.
```
DTK driver: dtk24.04
python: python3.10
torch: 2.1
torchvision: 0.16.0
deepspeed: 0.12.3
```
`Tips: the versions of the DTK driver, python, and the other DCU-related tools listed above must correspond to each other exactly.`
```
conda create -n textmonkey python=3.10
conda activate textmonkey
cd /path/your_code_data/
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple
```

## Environment
```python
conda create -n monkey python=3.9
conda activate monkey
git clone https://github.com/Yuliang-Liu/Monkey.git
cd ./Monkey
pip install -r requirements.txt
```
You can download the corresponding version of flash_attention from https://github.com/Dao-AILab/flash-attention/releases/ and install it with:
```python
pip install flash_attn-2.3.5+cu117torch2.0cxx11abiFALSE-cp39-cp39-linux_x86_64.whl --no-build-isolation
```

## Train
We also offer Monkey's model definition and training code, which you can explore above. You can execute the training code by running `finetune_ds_debug.sh` for Monkey and `finetune_textmonkey.sh` for TextMonkey.
The json file used for Monkey training can be downloaded at [Link](https://drive.google.com/file/d/18z_uQTe8Jq61V5rgHtxOt85uKBodbvw1/view?usp=sharing).
## Dataset
Mini dataset: [mm_tutorial](./assets/mm_tutorial)
Full dataset: [MelosY/TextMonkey_Data](https://huggingface.co/datasets/MelosY/TextMonkey_Data)
To prepare training data, put all samples into a single list and save it as a JSON file. Each sample is a dictionary with the fields shown in the example below; the full dataset used for regular training should be prepared in the same format:
```
[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "user",
        "value": "你好"
      },
      {
        "from": "assistant",
        "value": "我是TextMonkey,一个支持视觉输入的大模型。"
      }
    ]
  },
  {
    "id": "identity_1",
    "conversations": [
      {
        "from": "user",
        "value": "Picture 1: <img>/home/wanglch/projects/TextMonkey/Monkey/assets/mm_tutorial/Chongqing.jpeg</img>\nPicture 2: <img>/home/wanglch/projects/TextMonkey/Monkey/assets/mm_tutorial/Beijing.jpeg</img>\n图中都是哪"
      },
      {
        "from": "assistant",
        "value": "第一张图片是重庆的城市天际线,第二张图片是北京的天际线。"
      }
    ]
  }
]
```

## Inference
Run the inference code for Monkey and Monkey-Chat:
```
python ./inference.py --model_path MODEL_PATH --image_path IMAGE_PATH --question "YOUR_QUESTION"
```
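For reference, here is a minimal sketch of assembling samples in the conversation format shown in the Dataset section above and writing them to the JSON file consumed by the fine-tuning scripts (the helper and output file name are hypothetical, not part of the repository):

```python
# Hypothetical helper (not part of the repository): build a training JSON in the
# conversation format shown above. Paths, ids and texts are placeholders.
import json

def make_sample(sample_id: str, image_path: str, question: str, answer: str) -> dict:
    """One sample = an id plus a user/assistant conversation pair."""
    return {
        "id": sample_id,
        "conversations": [
            # Images are referenced inline with <img>...</img> tags, as in the example above.
            {"from": "user", "value": f"<img>{image_path}</img>\n{question}"},
            {"from": "assistant", "value": answer},
        ],
    }

samples = [
    make_sample("identity_0", "./assets/mm_tutorial/Beijing.jpeg",
                "Which city is shown in the picture?",
                "The picture shows the skyline of Beijing."),
]

# All samples go into one list, saved as a single JSON file for the fine-tuning scripts.
with open("train.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```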
## Training
Modify the relevant paths in the training script to match your setup:
- `--deepspeed`
- `--model_name_or_path`
- `--data_path`
- `--image_folder`
- `--output_dir`

### Single node, multiple GPUs
Training requires 8× A800 80GB GPUs.
```
sh finetune_textmonkey_dcu.sh
```

## Demo
Demo is fast and easy to use. Simply upload an image from your desktop or phone, or capture one directly.
[Demo_chat](http://vlrlab-monkey.xyz:7681) is also launched as an upgraded version of the original demo to deliver an enhanced interactive experience.

We also provide the source code and the model weight for the original demo, allowing you to customize certain parameters for a more unique experience. The specific operations are as follows:
1. Make sure you have configured the [environment](#environment).
2. You can choose to use the demo offline or online:
    - **Offline:**
        - Download the [Model Weight](http://huggingface.co/echo840/Monkey).
        - Modify `DEFAULT_CKPT_PATH="pathto/Monkey"` in the `demo.py` file to your model weight path.
        - Run the demo using the following command:
          ```
          python demo.py
          ```
    - **Online:**
        - Run the demo and download model weights online with the following command:
          ```
          python demo.py -c echo840/Monkey
          ```

For TextMonkey, you can download the model weight from [Model Weight](https://www.modelscope.cn/models/lvskiller/TextMonkey) and run the demo code:
```python
python demo_textmonkey.py -c model_path
```
Before 14/11/2023, we observed that for some random pictures Monkey can achieve more accurate results than GPT4V.
<br>
<p align="center">
<img src="https://v1.ax1x.com/2024/04/13/7yS2yq.jpg" width="666"/>
</p>
<br>
Before 31/1/2024, Monkey-chat achieved the fifth rank in the Multimodal Model category on [OpenCompass](https://opencompass.org.cn/home).
<br>
<p align="center">
<img src="https://v1.ax1x.com/2024/04/13/7yShXL.jpg" width="666"/>
</p>
<br>

## Dataset
You can download the training and testing data used by Monkey from [Monkey_Data](https://huggingface.co/datasets/echo840/Monkey_Data).

The data from our multi-level description generation method is now open-sourced and available for download at [Link](https://huggingface.co/datasets/echo840/Detailed_Caption). We have already uploaded the images used in multi-level description. Examples:
<br>
<p align="center">
<img src="https://v1.ax1x.com/2024/04/13/7yS6Ss.jpg" width="666"/>
</p>
<br>

You can download train images of Monkey from [Train](https://pan.baidu.com/s/1svSjXTxWpI-3boALgSeLlw). Extraction code: 4hdh
You can download test images and jsonls of Monkey from [Test](https://pan.baidu.com/s/1ABrQKeE9QBeKvtGzXfM8Eg). Extraction code: 5h71
The images are from CC3M, COCO Caption, TextCaps, VQAV2, OKVQA, GQA, ScienceQA, VizWiz, TextVQA, OCRVQA, ESTVQA, STVQA, AI2D and DUE_Benchmark. When using the data, it is necessary to comply with the protocols of the original dataset.

## Inference
### Single node, single GPU
### Web-based Q&A
Change the model path in the script to your local model path, then run:
```
sh textmonkey_inference_web.sh
```
### Instruction-based Q&A
```
python demo_textmonkey.py
```

## Results
### Web-based Q&A
<div align=center>
<img src="./assets/result1.png"/>
</div>
### Instruction-based Q&A
<div align=center>
<img src="./assets/result2.png"/>
</div>

### Accuracy
Mini dataset: [mm_tutorial](./assets/mm_tutorial); accelerator cards used: K100/A800.

| device | train_loss |
| :------: | :------: |
| K100 | |
| A800 | |

## Evaluate
We offer evaluation code for 14 Visual Question Answering (VQA) datasets in the `evaluate_vqa.py` file, facilitating a quick verification of results. The specific operations are as follows:
1. Make sure you have configured the [environment](#environment).
2. Modify `sys.path.append("pathto/Monkey")` to the project path.
3. Prepare the datasets required for evaluation.
4. Run the evaluation code.

Take ESTVQA as an example:
- Prepare data according to the following directory structure:
```
├── data
|   ├── estvqa
|       ├── test_image
|           ├── {image_path0}
|           ├── {image_path1}
|           ·
|           ·
|       ├── estvqa.jsonl
```
- Example of the format of each line of the annotated `.jsonl` file:
```
{"image": "data/estvqa/test_image/011364.jpg", "question": "What is this store?", "answer": "pizzeria", "question_id": 0}
```
- Modify the dictionary `ds_collections`:
```
ds_collections = {
    'estvqa_test': {
        'test': 'data/estvqa/estvqa.jsonl',
        'metric': 'anls',
        'max_new_tokens': 100,
    },
    ...
}
```
- Run the following command:
```
bash eval/eval.sh 'EVAL_PTH' 'SAVE_NAME'
```
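The `anls` metric configured above refers to Average Normalized Levenshtein Similarity. A minimal reference sketch (not the repository's implementation) looks like this:

```python
# Minimal reference sketch of ANLS (Average Normalized Levenshtein Similarity),
# the metric configured for ESTVQA above. Not the repository's implementation.
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def anls(predictions, references, threshold: float = 0.5) -> float:
    """predictions: list[str]; references: list[list[str]] (several answers per question)."""
    scores = []
    for pred, refs in zip(predictions, references):
        sims = []
        for ref in refs:
            p, r = pred.strip().lower(), ref.strip().lower()
            nl = levenshtein(p, r) / max(len(p), len(r), 1)
            sims.append(1.0 - nl)
        best = max(sims)
        # Scores below the threshold are zeroed, then averaged over all questions.
        scores.append(best if best >= threshold else 0.0)
    return sum(scores) / max(len(scores), 1)

print(anls(["pizzeria"], [["pizzeria"]]))  # 1.0
```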
## Citing Monkey
If you wish to refer to the baseline results published here, please use the following BibTeX entries:
```BibTeX
@inproceedings{li2023monkey,
  title={Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models},
  author={Li, Zhang and Yang, Biao and Liu, Qiang and Ma, Zhiyin and Zhang, Shuo and Yang, Jingxu and Sun, Yabo and Liu, Yuliang and Bai, Xiang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
@article{liu2024textmonkey,
  title={TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document},
  author={Liu, Yuliang and Yang, Biao and Liu, Qiang and Li, Zhang and Ma, Zhiyin and Zhang, Shuo and Bai, Xiang},
  journal={arXiv preprint arXiv:2403.04473},
  year={2024}
}
```

## Application Scenarios
### Algorithm Category
`OCR, dialogue and question answering`

### Key Application Industries
`Finance, education, government, transportation`
## Pretrained Weights
- [lvskiller/TextMonkey](https://www.modelscope.cn/models/lvskiller/TextMonkey)

## Source Repository and Issue Feedback
- https://developer.hpccube.com/codes/modelzoo/textmonkey_pytorch.git

## References
- [TextMonkey github](https://github.com/Yuliang-Liu/Monkey/blob/main/monkey_model/text_monkey/README.md)

## Acknowledgement
[Qwen-VL](https://github.com/QwenLM/Qwen-VL.git), [LLAMA](https://github.com/meta-llama/llama), [LLaVA](https://github.com/haotian-liu/LLaVA), [OpenCompass](https://github.com/open-compass/opencompass), [InternLM](https://github.com/InternLM/InternLM).
## Copyright
We welcome suggestions to help us improve Monkey. For any query, please contact Dr. Yuliang Liu: ylliu@hust.edu.cn. If you find something interesting, please also feel free to share it with us through email or by opening an issue. Thanks!