# Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

![](images/logo_monkey.png)

<div align="center">
Zhang Li*, Biao Yang*, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu†, Xiang Bai
</div>
<div align="center">
<strong>Huazhong University of Science and Technology, Kingsoft</strong>
</div>
<div align="center">
*Equal Contribution; † Corresponding Author
</div>

**Monkey** introduces a resource-efficient method to enhance input resolution within the LMM paradigm. Building on existing open-source work, we eschew the laborious pre-training phase by starting from an existing LMM (Qwen-VL). We propose a simple but effective module that segments high-resolution images into smaller local patches via a sliding-window technique. Each patch is encoded independently by a static visual encoder, enriched with various LoRA adjustments and a trainable visual resampler. These patch encodings are then combined and fed to the language decoder, together with a resized global image feature that preserves the overall structure. In parallel, we have developed a hierarchical pipeline for improving caption data quality, which excels at generating detailed image descriptions that capture local elements, textual content, and the broader structural context.
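
To make the segmentation step concrete, below is a minimal Python sketch of the sliding-window idea, not the repository's actual implementation: it assumes non-overlapping 448 x 448 windows (the per-segment size implied by the resolutions quoted in Spotlights) and uses Pillow for cropping and resizing.

```python
# Minimal sketch of the sliding-window segmentation described above.
# Assumptions: non-overlapping 448 x 448 windows and evenly divisible input
# sizes; the actual Monkey code may handle overlap and padding differently.
from PIL import Image

WINDOW = 448  # assumed side length of each local segment


def split_into_windows(image: Image.Image, window: int = WINDOW):
    """Cut a high-resolution image into window x window local segments."""
    width, height = image.size
    return [
        image.crop((left, top, left + window, top + window))
        for top in range(0, height, window)
        for left in range(0, width, window)
    ]


def global_view(image: Image.Image, window: int = WINDOW):
    """Resized global image that preserves the overall structure."""
    return image.resize((window, window))


# Example: a 1344 x 896 input yields a 3 x 2 grid of six local segments,
# each encoded independently, plus one resized global view.
img = Image.new("RGB", (1344, 896))
assert len(split_into_windows(img)) == 6
```

Each local segment would then pass through the shared visual encoder (with its LoRA adjustments and resampler), and the resulting features are combined with those of the resized global view before being handed to the language decoder.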

## Spotlights

- **Contextual associations.** Our method demonstrates a superior ability to infer relationships between targets when answering questions, delivering more comprehensive and insightful results.
- **Support for resolutions up to 1344 x 896.** Surpassing the standard 448 x 448 resolution typically employed for LMMs, this significant increase in resolution improves the ability to discern small, tightly clustered objects and dense text.
- **Enhanced general performance.** We carried out testing across 16 diverse datasets; the Monkey model achieves impressive performance in tasks such as Image Captioning, General Visual Question Answering, Text-centric Visual Question Answering, and Document-oriented Visual Question Answering.

## Performance

![](images/radar.png)

## Demo

We have an open demo that everyone can try: [Demo](https://74a00f7621c2ecf691.gradio.live/)

## Cases

Our model can accurately describe almost all the details in the image.

![](images/caption_1.png)

In addition, our model demonstrates capabilities in fine-grained question answering, and can even answer questions that involve world knowledge.

![](images/qa_1.png)

With the power of its large-scale architecture, our model also achieves impressive performance on document-oriented tasks.

![](images/Doc_Chart.png)

## Acknowledgement

[Qwen-VL](https://github.com/QwenLM/Qwen-VL.git): the codebase we built upon. Thanks to the Qwen authors for providing the framework.