<strong>Huazhong University of Science and Technology, Kingsoft</strong>
</div>
<p align="center">
<a href="https://arxiv.org/abs/2311.06607">Paper</a>&nbsp;&nbsp; | &nbsp;&nbsp;<a href="http://27.17.252.152:7681/">Demo</a>&nbsp;&nbsp; | &nbsp;&nbsp;<a href="http://huggingface.co/datasets/echo840/Detailed_Caption">Detailed Caption</a>&nbsp;&nbsp; | &nbsp;&nbsp;<a href="http://huggingface.co/echo840/Monkey">Model Weight</a>&nbsp;&nbsp;
<!-- | &nbsp;&nbsp;<a href="Monkey Model">Monkey Models</a>&nbsp; | &nbsp; <a href="http://huggingface.co/echo840/Monkey">Tutorial</a> -->
</p>
- **Support resolution up to 1344 x 896.** Surpassing the standard 448 x 448 resolution typically employed for LMMs, this significant increase in resolution improves the ability to discern and understand inconspicuous or tightly clustered objects and dense text.
- **Enhanced general performance.** We carried out testing across 16 diverse datasets, on which our Monkey model delivers impressive performance in tasks such as Image Captioning, General Visual Question Answering, Text-centric Visual Question Answering, and Document-oriented Visual Question Answering.
## Environment
```python
pip install -r requirements.txt
```
## Demo
[Demo](http://27.17.252.152:7681/) is fast and easy to use. Simply upload an image from your desktop or phone, or capture one directly. As of 14/11/2023, we have observed that for some randomly chosen pictures Monkey can achieve more accurate results than GPT4V.
<br>
<p align="center">
<img src="images/demo_gpt4v_compare4.png" width="900"/>
</p>
<br>
For those who prefer responses in Chinese, use the '生成中文描述' (Generate Chinese Description) button to get descriptions in Chinese.
<br>
<p align="center">
<img src="images/generation_chn.png" width="900"/>
</p>
<br>
We also provide the source code for the demo, allowing you to customize certain parameters for a more tailored experience. The specific steps are as follows:
1. Make sure you have configured the [environment](#environment).
```
python demo.py -c echo840/Monkey
```
In order to generate more detailed captions, we provide some prompt examples so that you can conduct more interesting explorations. You can modify the two variables in the `caption` function to use different prompt inputs for the caption task, as shown below; a minimal inference sketch using one of these prompts follows the prompt list.
```
query = "Generate the detailed caption in English. Answer:"
chat_query = "Generate the detailed caption in English. Answer:"
```
- Generate the detailed caption in English.
- Explain the visual content of the image in great detail.
- Analyze the image in a comprehensive and detailed manner.
- Describe the image in as much detail as possible in English without duplicating it.
- Describe the image in as much detail as possible in English, including as many elements from the image as possible, but without repetition.
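For reference, below is a minimal end-to-end inference sketch using one of the prompts above. It is only an illustration under the following assumptions: the released `echo840/Monkey` checkpoint can be loaded from Hugging Face with `trust_remote_code=True`, and it accepts a Qwen-VL-style `<img>...</img>` query format. Consult `demo.py` in this repository for the authoritative interface.

```python
# Minimal sketch only -- assumes a Qwen-VL-style interface exposed by the
# echo840/Monkey checkpoint via trust_remote_code; see demo.py for the
# authoritative loading and prompting code.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "echo840/Monkey"
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map="cuda", trust_remote_code=True
).eval()
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)

img_path = "images/demo_gpt4v_compare4.png"                    # any local image
prompt = "Generate the detailed caption in English. Answer:"   # one of the prompts above
query = f"<img>{img_path}</img> {prompt}"                      # assumed Qwen-VL-style image tag

inputs = tokenizer(query, return_tensors="pt")
pred = model.generate(
    input_ids=inputs.input_ids.cuda(),
    attention_mask=inputs.attention_mask.cuda(),
    do_sample=False,
    max_new_tokens=512,
)
# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    pred[0][inputs.input_ids.size(1):], skip_special_tokens=True
).strip()
print(response)
```

Swapping `prompt` for any entry in the list above changes the style and level of detail of the generated caption.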
## Dataset
We have open-sourced the data generated by the multi-level description generation method. You can download it at [Detailed Caption](https://huggingface.co/datasets/echo840/Detailed_Caption).
## Evaluate
We offer evaluation code for 14 Visual Question Answering (VQA) datasets in the `evaluate_vqa.py` file, facilitating a quick verification of results. The specific operations are as follows:
```
bash eval/eval.sh 'EVAL_PTH' 'SAVE_NAME'
```
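For orientation, dataset entries are registered in the `ds_collections` dictionary inside `evaluate_vqa.py`. The snippet below is only a hypothetical sketch of what adding your own dataset might look like; the field names are assumptions and should be checked against the actual schema used in that file.

```python
# Hypothetical example entry -- field names are illustrative assumptions;
# verify them against the real ds_collections schema in evaluate_vqa.py.
ds_collections = {
    "my_vqa_dataset": {
        "test": "data/my_vqa/test.jsonl",              # questions, one JSON object per line
        "annotation": "data/my_vqa/annotations.json",  # ground-truth answers
        "metric": "vqa_score",                         # metric used to score predictions
        "max_new_tokens": 10,                          # short answers for VQA-style tasks
    },
}
```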
## Train
We also offer Monkey's model definition and training code, which you can explore above. You can run the training code by executing `finetune_ds_debug.sh`.
**ATTENTION:** Specify the path to your training data, which should be a json file consisting of a list of conversations.
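As a rough illustration of "a json file consisting of a list of conversations", the sketch below builds such a file in Python. The field names (`id`, `conversations`, `from`, `value`) and the `<img>...</img>` tag are assumptions modeled on Qwen-VL-style finetuning data; verify them against the data loader invoked by `finetune_ds_debug.sh`.

```python
# Hypothetical training-data layout -- field names and the <img> tag are
# assumptions; check the loader used by finetune_ds_debug.sh before training.
import json

samples = [
    {
        "id": "0",
        "conversations": [
            {"from": "user", "value": "<img>data/images/0001.jpg</img> Generate the detailed caption in English."},
            {"from": "assistant", "value": "A red double-decker bus is parked beside a stone building on a rainy street ..."},
        ],
    },
]

# Write the list of conversations to the json file passed to the training script.
with open("train_data.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```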
## Performance
<br>
<br>
## Cases
Our model can accurately describe the details in the image.
We also qualitatively compare Monkey with existing LMMs, including GPT4V and Qwen-vl, among others.
<br>
## Citing Monkey
If you wish to refer to the baseline results published here, please use the following BibTeX entries:
If you find Monkey cute, please give us a star. It would be a great encouragement for us.
## Acknowledgement
[Qwen-VL](https://github.com/QwenLM/Qwen-VL.git): the codebase we built upon. Thanks to the authors of Qwen-VL for providing the framework.