<div align="center">
Zhang Li*, Biao Yang*, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, et al.
<strong>Huazhong University of Science and Technology, Kingsoft</strong>
</div>
<p align="center">
<a href="https://arxiv.org/abs/2311.06607">Paper</a>&nbsp&nbsp | &nbsp&nbsp<a href="http://27.17.252.152:7680/">Demo</a>&nbsp&nbsp | &nbsp&nbsp<a href="http://huggingface.co/datasets/echo840/Detailed_Caption">Detailed Caption</a>&nbsp&nbsp | &nbsp&nbsp<a href="http://huggingface.co/echo840/Monkey">Model Weight</a>&nbsp&nbsp
<a href="https://arxiv.org/abs/2311.06607">Paper</a>&nbsp&nbsp | &nbsp&nbsp<a href="http://27.17.252.152:7681/">Demo</a>&nbsp&nbsp | &nbsp&nbsp<a href="http://huggingface.co/datasets/echo840/Detailed_Caption">Detailed Caption</a>&nbsp&nbsp | &nbsp&nbsp<a href="http://huggingface.co/echo840/Monkey">Model Weight</a>&nbsp&nbsp
<!-- | &nbsp&nbsp<a href="Monkey Model">Monkey Models</a>&nbsp | &nbsp <a href="http://huggingface.co/echo840/Monkey">Tutorial</a> -->
</p>
- **Support for resolutions up to 1344 x 896.** Going well beyond the 448 x 448 resolution typically used for LMMs, this higher input resolution improves the ability to discern and understand inconspicuous or tightly clustered objects and dense text.
- **Enhanced general performance.** We evaluated on 16 diverse datasets, where Monkey achieves strong results on tasks such as Image Captioning, General Visual Question Answering, Text-centric Visual Question Answering, and Document-oriented Visual Question Answering.
## Environment
```
pip install -r requirements.txt
```
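After installing the requirements, a quick sanity check can confirm that the core dependencies import correctly. This is a minimal sketch that assumes `torch` and `transformers` are among the requirements, which the Qwen-VL-based codebase relies on:

```python
# Quick sanity check after installation (assumes torch and transformers are
# pulled in by requirements.txt, as the Qwen-VL-based codebase relies on them).
import torch
import transformers

print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"transformers {transformers.__version__}")
```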
## Demo
[Demo](http://27.17.252.152:7681/) is fast and easy to use: simply upload an image from your desktop or phone, or capture one directly. Before 14/11/2023, we observed that on some randomly chosen pictures Monkey can achieve more accurate results than GPT4V.
<br>
<p align="center">
<img src="images/demo_gpt4v_compare4.png" width="900"/>
</p>
<br>
For those who prefer responses in Chinese, use the '生成中文描述' (Generate Chinese Description) button to get descriptions in Chinese.
<br>
<p align="center">
<img src="images/generation_chn.png" width="900"/>
</p>
<br>
We also provide the source code for the demo, allowing you to customize certain parameters for a more tailored experience. The specific steps are as follows:
1. Make sure you have configured the [environment](#environment).
```
python demo.py -c echo840/Monkey
```
To generate more detailed captions, we provide some prompt examples so that you can conduct more interesting explorations. You can modify these two variables in the `caption` function to pass different prompts to the caption task, as shown below; a standalone inference sketch follows the prompt list.
```python
query = "Generate the detailed caption in English. Answer:"
chat_query = "Generate the detailed caption in English. Answer:"
```
- Generate the detailed caption in English.
- Explain the visual content of the image in great detail.
- Analyze the image in a comprehensive and detailed manner.
- Describe the image in as much detail as possible in English without duplicating it.
- Describe the image in as much detail as possible in English, including as many elements from the image as possible, but without repetition.
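For reference, the snippet below sketches how one of these prompts can be sent to the released checkpoint through the Hugging Face `transformers` interface. It is a minimal sketch rather than the repository's own demo code: the `<img>...</img>` prompt format and the generation settings follow the Qwen-VL style that Monkey builds on, and the image path is illustrative.

```python
# Minimal captioning sketch (assumptions: Qwen-VL-style <img> prompt format and
# a trust_remote_code checkpoint; the image path below is illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "echo840/Monkey"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map="cuda", trust_remote_code=True
).eval()

chat_query = "Generate the detailed caption in English. Answer:"
query = f"<img>./images/example.jpg</img> {chat_query}"

inputs = tokenizer(query, return_tensors="pt")
input_ids = inputs.input_ids.to(model.device)
attention_mask = inputs.attention_mask.to(model.device)

output = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    do_sample=False,
    max_new_tokens=256,
)
# Strip the prompt tokens and decode only the newly generated caption.
caption = tokenizer.decode(
    output[0][input_ids.shape[1]:], skip_special_tokens=True
).strip()
print(caption)
```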
## Dataset
We have open-sourced the data generated by the multi-level description generation method. You can download it at [Detailed Caption](https://huggingface.co/datasets/echo840/Detailed_Caption).
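If you want to browse the data programmatically, a minimal sketch with the `datasets` library looks like the following; the split name and record fields are assumptions to verify against the dataset card:

```python
# Browse the Detailed Caption data (split and field names are assumptions;
# check the dataset card on Hugging Face for the actual schema).
from datasets import load_dataset

ds = load_dataset("echo840/Detailed_Caption", split="train")
print(ds)       # shows the detected columns and the number of rows
print(ds[0])    # inspect one record to see the actual field names
```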
## Evaluate
We offer evaluation code for 14 Visual Question Answering (VQA) datasets in the `evaluate_vqa.py` file, facilitating a quick verification of results. The specific operations are as follows:
The datasets to evaluate are defined in the `ds_collections` dictionary in `evaluate_vqa.py`. To run the evaluation:

```
bash eval/eval.sh 'EVAL_PTH' 'SAVE_NAME'
```
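For orientation, each entry of `ds_collections` points the evaluation script at one dataset's files and metric. The entry below is only an illustrative sketch with assumed key names and paths; check `evaluate_vqa.py` for the exact schema the repository uses.

```python
# Illustrative shape of one ds_collections entry (key names and paths are
# assumptions; see evaluate_vqa.py for the schema actually used).
ds_collections = {
    "textvqa_val": {
        "test": "data/textvqa/textvqa_val.jsonl",                   # questions and image paths
        "annotation": "data/textvqa/textvqa_val_annotations.json",  # ground-truth answers
        "metric": "vqa_score",
        "max_new_tokens": 10,
    },
}
```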
## Train
We also offer Monkey's model definition and training code, which you can explore above. Training can be started by executing `finetune_ds_debug.sh`.
**ATTENTION:** Specify the path to your training data, which should be a JSON file consisting of a list of conversations.
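The exact schema is defined by the training scripts, so the snippet below only sketches the general shape of such a file. The field names and the `<img>` tag convention follow the Qwen-VL-style conversation format that the codebase builds on and are assumptions to verify against the data loader:

```python
# Write a minimal training-data file (field names follow the Qwen-VL-style
# conversation schema and are assumptions, not the documented Monkey format).
import json

samples = [
    {
        "id": "sample_0",
        "conversations": [
            {
                "from": "user",
                "value": "<img>./images/example.jpg</img> Generate the detailed caption in English. Answer:",
            },
            {
                "from": "assistant",
                "value": "A detailed description of the image goes here.",
            },
        ],
    }
]

with open("train_data.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```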
## Performance
<br>
<br>
## Cases
Our model can accurately describe the details in the image.
We qualitatively compare with existing LMMs, including GPT4V, Qwen-VL, etc.
## Citing Monkey
If you wish to refer to the baseline results published here, please use the following BibTeX entries:
If you find Monkey cute, please give us a star. It would be a great encouragement for us.
## Acknowledgement
[Qwen-VL](https://github.com/QwenLM/Qwen-VL.git): the codebase we built upon. Thanks to the authors of Qwen for providing the framework.