Unverified Commit 6a3c5678 authored by Yuliang Liu, committed by GitHub

Update README.md

parent d4b0c5b4
@@ -15,39 +15,41 @@ Zhang Li*, Biao Yang*, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun,
</div>
<p align="center">
<a href="updating">Paper will be released soon</a>&nbsp&nbsp | &nbsp&nbsp<a href="http://121.60.58.184:7680/">Demo</a>&nbsp&nbsp
<a href="https://arxiv.org/abs/2311.06607">Paper</a>&nbsp&nbsp | &nbsp&nbsp<a href="http://121.60.58.184:7680/">Demo</a>&nbsp&nbsp | &nbsp&nbsp<a href="updating">Model&Code update soon</a>&nbsp&nbsp
<!-- | &nbsp&nbsp<a href="Monkey Model">Monkey Models</a>&nbsp | &nbsp <a href="updating">Tutorial</a> -->
</p>
-----
**Monkey** is a training-efficient approach that raises the input resolution to 896 x 1344 pixels without pretraining from scratch. To bridge the gap between simple text labels and high-resolution inputs, we propose a multi-level description generation method that automatically provides rich information, guiding the model to learn the contextual association between scenes and objects. With the synergy of these two designs, our model achieves excellent results on multiple benchmarks. Compared with various LMMs, including GPT4V, Monkey shows promising performance in image captioning by attending to textual information and capturing fine details within images; its higher input resolution also enables remarkable performance on document images with dense text.
## Spotlights
- **Contextual associations.** Our method demonstrates a superior ability to infer relationships between targets when answering questions, delivering more comprehensive and insightful results.
- **Supports resolution up to 1344 x 896.** Surpassing the 448 x 448 resolution typically employed for LMMs, this significant increase improves the ability to discern small, tightly clustered objects and dense text (see the tiling sketch after this list).
- **Enhanced general performance.** We carried out testing across 16 diverse datasets, where Monkey performs impressively on tasks such as Image Captioning, General Visual Question Answering, Text-centric Visual Question Answering, and Document-oriented Visual Question Answering.
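To make the resolution claim concrete, here is a minimal sketch of how a 1344 x 896 input can be cut into 448 x 448 crops plus a downscaled global view. This is our own illustration of the idea, not the released Monkey code (the official code is still to be published), and the windowing details are assumptions.

```python
# Illustrative only: tile a high-resolution image into 448 x 448 crops
# plus a resized global view for full-image context.
from PIL import Image

PATCH = 448  # per-crop resolution commonly used by ViT-based encoders

def tile_image(path: str):
    """Resize to 1344 x 896, cut into 448 x 448 crops, and keep a
    downscaled global view of the whole image."""
    img = Image.open(path).convert("RGB").resize((1344, 896))
    crops = [
        img.crop((left, top, left + PATCH, top + PATCH))
        for top in range(0, 896, PATCH)       # 2 rows
        for left in range(0, 1344, PATCH)     # 3 columns
    ]
    global_view = img.resize((PATCH, PATCH))  # coarse whole-image view
    return crops, global_view                 # 6 local crops + 1 global

crops, global_view = tile_image("example.jpg")  # hypothetical input file
print(len(crops))  # -> 6
```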
## Demo
<br>
To use the [demo](http://121.60.58.184:7680/), simply upload an image from your desktop or phone, or capture one directly. Then click 'Generate' to obtain information about the image. If you wish to see different results, feel free to generate multiple times. For responses in Chinese, use the '生成中文描述' ('generate Chinese description') button. A programmatic sketch follows the screenshot below.
<br>
<p align="center">
<img src="images/radar.png" width="800"/>
<img src="images/generation.png" width="900"/>
</p>
<br>
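Since the demo is a hosted Gradio app, it may also be reachable programmatically with `gradio_client`. The endpoint name `/generate` and the argument list below are assumptions, as the demo's API is not documented here; inspect the running app to find the real signature.

```python
# Hypothetical client for the hosted demo. The api_name and argument
# order are guesses; call client.view_api() to list the real endpoints.
from gradio_client import Client

client = Client("http://121.60.58.184:7680/")
result = client.predict(
    "example.jpg",         # local image to upload (hypothetical file)
    api_name="/generate",  # assumed endpoint behind the 'Generate' button
)
print(result)
```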
## Performance
<br>
<p align="center">
<img src="images/generation.png" width="900"/>
<img src="images/radar.png" width="800"/>
</p>
<br>
## Cases
Our model can accurately describe details in images.