<p align="center">
<img src="https://v1.ax1x.com/2024/04/13/7ySieU.png" width="500" style="margin-bottom: 0.2;"/>
</p>
# TextMonkey
<h3 align="center"> <a href="https://arxiv.org/abs/2311.06607">Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models</a></h3>
TextMonkey is a large multimodal model (LMM) tailored for text-centric tasks, including document question answering (DocVQA) and scene text analysis.
<h5 align="center"> Please give us a star ⭐ for the latest update. </h5>
## Paper
- [TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document](https://arxiv.org/abs/2403.04473)

[![arXiv](https://img.shields.io/badge/Arxiv-2311.06607-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2311.06607)
[![License](https://img.shields.io/badge/License-Apache%202.0-yellow)](https://github.com/Yuliang-Liu/Monkey/blob/main/LICENSE)
[![GitHub issues](https://img.shields.io/github/issues/Yuliang-Liu/Monkey?color=critical&label=Issues)](https://github.com/Yuliang-Liu/Monkey/issues?q=is%3Aopen+is%3Aissue)
[![GitHub closed issues](https://img.shields.io/github/issues-closed/Yuliang-Liu/Monkey?color=success&label=Issues)](https://github.com/Yuliang-Liu/Monkey/issues?q=is%3Aissue+is%3Aclosed) <br>

## Model Structure
A sliding window module first divides the input image into non-overlapping patches of 448x448 pixels. These patches are further subdivided into smaller 14x14-pixel patches, each of which is treated as a token. A pre-trained CLIP model then processes the tokens of each window patch separately. To establish connections between the individual window patches, Shifted Window Attention is integrated at intervals between the Transformer blocks. To generate a hierarchical representation, the input image is also resized to 448x448 and fed into CLIP to extract a global feature. This global feature, together with the features from the sub-images, is then processed by a shared image resampler to align it with the language domain. Finally, a Token Resampler further reduces redundancy in the language space by compressing the token length.
<div align="center">
<img src="./assets/model_structrue.png"/>
</div>
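To make the patching scheme above concrete, here is a minimal sketch (not the official implementation; the torch-based splitting, shapes, and function name are illustrative assumptions):

```python
# Minimal sketch of the window/patch split described above: non-overlapping 448x448
# windows, 14x14-pixel patches inside each window (one visual token per patch), plus a
# 448x448 resize of the whole image as the global view. Not the official implementation.
import torch
import torch.nn.functional as F

WINDOW = 448   # window size handled by the (frozen) CLIP vision encoder
PATCH = 14     # patch size inside each window -> (448/14)^2 = 1024 tokens per window

def split_into_windows_and_patches(image: torch.Tensor):
    """image: (3, H, W), with H and W assumed to be multiples of 448."""
    c, h, w = image.shape
    # (num_windows, 3, 448, 448): non-overlapping sliding-window crops
    windows = (image
               .unfold(1, WINDOW, WINDOW)
               .unfold(2, WINDOW, WINDOW)
               .permute(1, 2, 0, 3, 4)
               .reshape(-1, c, WINDOW, WINDOW))
    # (num_windows, 1024, 3*14*14): each 14x14 patch becomes one visual token
    tokens = (windows
              .unfold(2, PATCH, PATCH)
              .unfold(3, PATCH, PATCH)
              .permute(0, 2, 3, 1, 4, 5)
              .reshape(windows.shape[0], -1, c * PATCH * PATCH))
    # Global view: the full image resized to 448x448 for coarse, hierarchical features.
    global_view = F.interpolate(image.unsqueeze(0), size=(WINDOW, WINDOW),
                                mode="bilinear", align_corners=False)
    return windows, tokens, global_view

# Example: an 896x1344 input yields 2*3 = 6 windows of 1024 tokens each, plus the global view.
img = torch.randn(3, 896, 1344)
windows, tokens, global_view = split_into_windows_and_patches(img)
print(windows.shape, tokens.shape, global_view.shape)
# torch.Size([6, 3, 448, 448]) torch.Size([6, 1024, 588]) torch.Size([1, 3, 448, 448])
```

In the real model, the window features, the global feature, the shared image resampler, and the Token Resampler then produce the compressed visual tokens passed to the language model.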
<details open><summary>💡 Monkey series projects ✨</summary><p>

> [CVPR'24] [**Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models**](https://arxiv.org/abs/2311.06607)<br>
> Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, Xiang Bai <br>
[![Paper](https://img.shields.io/badge/Paper-CVPR'24_Highlight-red)](README.md)
[![Source_code](https://img.shields.io/badge/Code-Available-white)](README.md)
[![Demo](https://img.shields.io/badge/Demo-blue)](http://vlrlab-monkey.xyz:7681/)
[![Detailed Caption](https://img.shields.io/badge/Detailed_Caption-yellow)](http://huggingface.co/datasets/echo840/Detailed_Caption)
[![Model Weight](https://img.shields.io/badge/Model_Weight-gray)](http://huggingface.co/echo840/Monkey)
[![Model Weight in Wisemodel](https://img.shields.io/badge/Model_Weight_in_Wisemodel-gray)](https://www.wisemodel.cn/models/HUST-VLRLab/Monkey/)
[![Demo in Wisemodel](https://img.shields.io/badge/Demo_in_Wisemodel-blue)](https://wisemodel.cn/space/gradio/huakeMonkey)

> [**TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document**](https://arxiv.org/abs/2403.04473)<br>
> Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, Xiang Bai <br>
[![arXiv](https://img.shields.io/badge/Arxiv-2403.04473-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2403.04473)
[![Source_code](https://img.shields.io/badge/Code-Available-white)](monkey_model/text_monkey/README.md)
[![Data](https://img.shields.io/badge/Data-yellow)](https://huggingface.co/datasets/MelosY/TextMonkey_Data/tree/main)
[![Model Weight](https://img.shields.io/badge/Model_Weight-gray)](https://www.modelscope.cn/models/lvskiller/TextMonkey)
</p></details>

## Algorithm Principle
For unified document structure learning, this work constructs DocStruct4M, a comprehensive structured-parsing dataset built from open-source datasets. For document images and webpage screenshots, spaces and line breaks are mainly used to represent the text layout. For tables, an improved Markdown syntax can express cells that span rows or columns while using far fewer tags than HTML. For charts, Markdown is likewise used to represent the numerical content, and values are limited to the number of significant digits that remain visually legible in the image. For natural images, the target is a caption together with the OCR text.
<div align=center>
<img src="./assets/model_theory.png"/>
</div>
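As a purely illustrative example of the compact, tag-free table representation described above (the actual DocStruct4M serialization conventions, including how row and column spans are marked, are defined by the dataset itself, not by this snippet):

```python
# Illustrative only: serialize a simple table into Markdown, the style of compact,
# tag-free representation described above, instead of verbose HTML markup.
def table_to_markdown(header, rows):
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(str(cell) for cell in row) + " |" for row in rows]
    return "\n".join(lines)

print(table_to_markdown(["Region", "Q1", "Q2"], [["North", 12, 15], ["South", 9, 11]]))
```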
## News
* ```2024.4.13 ``` 🚀 Source code for [TextMonkey](monkey_model/text_monkey/README.md) is released.
* ```2024.4.5 ``` 🚀 Monkey is nominated as a CVPR 2024 Highlight paper.
* ```2024.3.8 ``` 🚀 We release the paper [TextMonkey](https://arxiv.org/abs/2403.04473).
* ```2024.2.27 ``` 🚀 Monkey is accepted by CVPR 2024.
* ```2024.1.3 ``` 🚀 Release the basic data generation pipeline. [Data Generation](./data_generation)
* ```2023.12.16``` 🚀 Monkey can be trained using 8 NVIDIA 3090 GPUs. See subsection [train](#Train) for details.
* ```2023.11.06``` 🚀 We release the paper [Monkey](https://arxiv.org/abs/2311.06607).

## 🐳 Model Zoo
Monkey-Chat
| Model|Language Model|Transformers(HF) |MMBench-Test|CCBench|MME|SeedBench_IMG|MathVista-MiniTest|HallusionBench-Avg|AI2D Test|OCRBench|
|---------------|---------|-----------------------------------------|---|---|---|---|---|---|---|---|
|Monkey-Chat|Qwen-7B|[🤗echo840/Monkey-Chat](https://huggingface.co/echo840/Monkey-Chat)|72.4|48|1887.4|68.9|34.8|39.3|68.5|534|

## Environment Setup
### Docker (Method 1)
Pull the Docker image from [光源](https://www.sourcefind.cn/#/service-details) and use it as follows:
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
docker run -it -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro --shm-size=64G --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name textmonkey <your imageID> bash
cd /path/your_code_data/
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```
### Dockerfile (Method 2)
```
cd /path/your_code_data/docker
docker build --no-cache -t textmonkey:latest .
docker run --shm-size=64G --name mplug-doclocal -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v /path/your_code_data/:/path/your_code_data/ -it textmonkey bash
```
### Anaconda (Method 3)
The special deep learning libraries required by this project for DCU GPUs can be downloaded and installed from the [光合](https://developer.hpccube.com/tool/) developer community.
```
DTK driver: dtk24.04
python: python3.10
torch: 2.1
torchvision: 0.16.0
deepspeed: 0.12.3
```
`Tips: the versions of the DTK driver, python, and the other DCU-related tools listed above must correspond to each other exactly.`
```
conda create -n textmonkey python=3.10
conda activate textmonkey
cd /path/your_code_data/
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple
```

## Environment
```python
conda create -n monkey python=3.9
conda activate monkey
git clone https://github.com/Yuliang-Liu/Monkey.git
cd ./Monkey
pip install -r requirements.txt
```
You can download the corresponding version of flash_attention from https://github.com/Dao-AILab/flash-attention/releases/ and install it with:
```python
pip install flash_attn-2.3.5+cu117torch2.0cxx11abiFALSE-cp39-cp39-linux_x86_64.whl --no-build-isolation
```

## Train
We also offer Monkey's model definition and training code, which you can explore above. You can execute the training code by running `finetune_ds_debug.sh` for Monkey and `finetune_textmonkey.sh` for TextMonkey.
The json file used for Monkey training can be downloaded at [Link](https://drive.google.com/file/d/18z_uQTe8Jq61V5rgHtxOt85uKBodbvw1/view?usp=sharing).
## Dataset
Mini dataset: [mm_tutorial](./assets/mm_tutorial)
Full dataset: [MelosY/TextMonkey_Data](https://huggingface.co/datasets/MelosY/TextMonkey_Data)
To prepare training data, put all samples into a single list and save it as a JSON file. Each sample is a dictionary with the fields shown in the example below; the full dataset used for regular training should be prepared in the same format:
```
[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "user",
        "value": "你好"
      },
      {
        "from": "assistant",
        "value": "我是TextMonkey,一个支持视觉输入的大模型。"
      }
    ]
  },
  {
    "id": "identity_1",
    "conversations": [
      {
        "from": "user",
        "value": "Picture 1: <img>/home/wanglch/projects/TextMonkey/Monkey/assets/mm_tutorial/Chongqing.jpeg</img>\nPicture 2: <img>/home/wanglch/projects/TextMonkey/Monkey/assets/mm_tutorial/Beijing.jpeg</img>\n图中都是哪"
      },
      {
        "from": "assistant",
        "value": "第一张图片是重庆的城市天际线,第二张图片是北京的天际线。"
      }
    ]
  }
]
```

## Inference
Run the inference code for Monkey and Monkey-Chat:
```
python ./inference.py --model_path MODEL_PATH --image_path IMAGE_PATH --question "YOUR_QUESTION"
```
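For reference, here is a minimal sketch of assembling samples in the conversation format shown in the Dataset section above and writing them to the JSON file consumed by the fine-tuning scripts (the helper and output file name are hypothetical, not part of the repository):

```python
# Hypothetical helper (not part of the repository): build a training JSON in the
# conversation format shown above. Paths, ids and texts are placeholders.
import json

def make_sample(sample_id: str, image_path: str, question: str, answer: str) -> dict:
    """One sample = an id plus a user/assistant conversation pair."""
    return {
        "id": sample_id,
        "conversations": [
            # Images are referenced inline with <img>...</img> tags, as in the example above.
            {"from": "user", "value": f"<img>{image_path}</img>\n{question}"},
            {"from": "assistant", "value": answer},
        ],
    }

samples = [
    make_sample("identity_0", "./assets/mm_tutorial/Beijing.jpeg",
                "Which city is shown in the picture?",
                "The picture shows the skyline of Beijing."),
]

# All samples go into one list, saved as a single JSON file for the fine-tuning scripts.
with open("train.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```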
## Training
Modify the relevant paths in the training script to match your setup:
- `--deepspeed`
- `--model_name_or_path`
- `--data_path`
- `--image_folder`
- `--output_dir`

### Single node, multiple GPUs
Training requires 8× A800 80GB GPUs.
```
sh finetune_textmonkey_dcu.sh
```

## Demo
Demo is fast and easy to use. Simply upload an image from your desktop or phone, or capture one directly.
[Demo_chat](http://vlrlab-monkey.xyz:7681) is also launched as an upgraded version of the original demo to deliver an enhanced interactive experience.

We also provide the source code and the model weight for the original demo, allowing you to customize certain parameters for a more unique experience. The specific operations are as follows:
1. Make sure you have configured the [environment](#environment).
2. You can choose to use the demo offline or online:
    - **Offline:**
        - Download the [Model Weight](http://huggingface.co/echo840/Monkey).
        - Modify `DEFAULT_CKPT_PATH="pathto/Monkey"` in the `demo.py` file to your model weight path.
        - Run the demo using the following command:
          ```
          python demo.py
          ```
    - **Online:**
        - Run the demo and download model weights online with the following command:
          ```
          python demo.py -c echo840/Monkey
          ```

For TextMonkey, you can download the model weight from [Model Weight](https://www.modelscope.cn/models/lvskiller/TextMonkey) and run the demo code:
```python
python demo_textmonkey.py -c model_path
```
Before 14/11/2023, we observed that for some random pictures Monkey can achieve more accurate results than GPT4V.
<br>
<p align="center">
<img src="https://v1.ax1x.com/2024/04/13/7yS2yq.jpg" width="666"/>
</p>
<br>
Before 31/1/2024, Monkey-chat achieved the fifth rank in the Multimodal Model category on [OpenCompass](https://opencompass.org.cn/home).
<br>
<p align="center">
<img src="https://v1.ax1x.com/2024/04/13/7yShXL.jpg" width="666"/>
</p>
<br>

## Dataset
You can download the training and testing data used by Monkey from [Monkey_Data](https://huggingface.co/datasets/echo840/Monkey_Data).

The data from our multi-level description generation method is now open-sourced and available for download at [Link](https://huggingface.co/datasets/echo840/Detailed_Caption). We have already uploaded the images used in multi-level description. Examples:
<br>
<p align="center">
<img src="https://v1.ax1x.com/2024/04/13/7yS6Ss.jpg" width="666"/>
</p>
<br>

You can download train images of Monkey from [Train](https://pan.baidu.com/s/1svSjXTxWpI-3boALgSeLlw). Extraction code: 4hdh
You can download test images and jsonls of Monkey from [Test](https://pan.baidu.com/s/1ABrQKeE9QBeKvtGzXfM8Eg). Extraction code: 5h71
The images are from CC3M, COCO Caption, TextCaps, VQAV2, OKVQA, GQA, ScienceQA, VizWiz, TextVQA, OCRVQA, ESTVQA, STVQA, AI2D and DUE_Benchmark. When using the data, it is necessary to comply with the protocols of the original dataset.

## Inference
### Single node, single GPU
### Web-based Q&A
Change the model path in the script to your local model path, then run:
```
sh textmonkey_inference_web.sh
```
### Instruction-based Q&A
```
python demo_textmonkey.py
```

## Results
### Web-based Q&A
<div align=center>
<img src="./assets/result1.png"/>
</div>
### Instruction-based Q&A
<div align=center>
<img src="./assets/result2.png"/>
</div>

### Accuracy
Mini dataset: [mm_tutorial](./assets/mm_tutorial); accelerator cards used: K100/A800.

| device | train_loss |
| :------: | :------: |
| K100 | |
| A800 | |

## Evaluate
We offer evaluation code for 14 Visual Question Answering (VQA) datasets in the `evaluate_vqa.py` file, facilitating a quick verification of results. The specific operations are as follows:
1. Make sure you have configured the [environment](#environment).
2. Modify `sys.path.append("pathto/Monkey")` to the project path.
3. Prepare the datasets required for evaluation.
4. Run the evaluation code.

Take ESTVQA as an example:
- Prepare data according to the following directory structure:
```
├── data
|   ├── estvqa
|       ├── test_image
|           ├── {image_path0}
|           ├── {image_path1}
|           ·
|           ·
|       ├── estvqa.jsonl
```
- Example of the format of each line of the annotated `.jsonl` file:
```
{"image": "data/estvqa/test_image/011364.jpg", "question": "What is this store?", "answer": "pizzeria", "question_id": 0}
```
- Modify the dictionary `ds_collections`:
```
ds_collections = {
    'estvqa_test': {
        'test': 'data/estvqa/estvqa.jsonl',
        'metric': 'anls',
        'max_new_tokens': 100,
    },
    ...
}
```
- Run the following command:
```
bash eval/eval.sh 'EVAL_PTH' 'SAVE_NAME'
```
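The `anls` metric configured above refers to Average Normalized Levenshtein Similarity. A minimal reference sketch (not the repository's implementation) looks like this:

```python
# Minimal reference sketch of ANLS (Average Normalized Levenshtein Similarity),
# the metric configured for ESTVQA above. Not the repository's implementation.
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def anls(predictions, references, threshold: float = 0.5) -> float:
    """predictions: list[str]; references: list[list[str]] (several answers per question)."""
    scores = []
    for pred, refs in zip(predictions, references):
        sims = []
        for ref in refs:
            p, r = pred.strip().lower(), ref.strip().lower()
            nl = levenshtein(p, r) / max(len(p), len(r), 1)
            sims.append(1.0 - nl)
        best = max(sims)
        # Scores below the threshold are zeroed, then averaged over all questions.
        scores.append(best if best >= threshold else 0.0)
    return sum(scores) / max(len(scores), 1)

print(anls(["pizzeria"], [["pizzeria"]]))  # 1.0
```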
## Citing Monkey
If you wish to refer to the baseline results published here, please use the following BibTeX entries:
```BibTeX
@inproceedings{li2023monkey,
  title={Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models},
  author={Li, Zhang and Yang, Biao and Liu, Qiang and Ma, Zhiyin and Zhang, Shuo and Yang, Jingxu and Sun, Yabo and Liu, Yuliang and Bai, Xiang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
@article{liu2024textmonkey,
  title={TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document},
  author={Liu, Yuliang and Yang, Biao and Liu, Qiang and Li, Zhang and Ma, Zhiyin and Zhang, Shuo and Bai, Xiang},
  journal={arXiv preprint arXiv:2403.04473},
  year={2024}
}
```

## Application Scenarios
### Algorithm Category
`OCR, dialogue and question answering`

### Key Application Industries
`Finance, education, government, transportation`
## Pretrained Weights
- [lvskiller/TextMonkey](https://www.modelscope.cn/models/lvskiller/TextMonkey)

## Source Repository and Issue Feedback
- https://developer.hpccube.com/codes/modelzoo/textmonkey_pytorch.git

## References
- [TextMonkey github](https://github.com/Yuliang-Liu/Monkey/blob/main/monkey_model/text_monkey/README.md)

## Acknowledgement
[Qwen-VL](https://github.com/QwenLM/Qwen-VL.git), [LLAMA](https://github.com/meta-llama/llama), [LLaVA](https://github.com/haotian-liu/LLaVA), [OpenCompass](https://github.com/open-compass/opencompass), [InternLM](https://github.com/InternLM/InternLM).
## Copyright
We welcome suggestions to help us improve Monkey. For any query, please contact Dr. Yuliang Liu: ylliu@hust.edu.cn. If you find something interesting, please also feel free to share it with us through email or by opening an issue. Thanks!