"git@developer.sourcefind.cn:gaoqiong/composable_kernel.git" did not exist on "0a2657312ec62a65e92a36cebd7d3b2a3c0712e1"
Commit 30f7a82d authored by lvskiller's avatar lvskiller
Browse files

readme

parent 2975c76d
# Monkey
\ No newline at end of file
# Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
![](images/logo_monkey.png)
<div align="center">
Zhang Li*, Biao Yang*, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu†, Xiang Bai
</div>
<div align="center">
<strong>Huazhong University of Science and Technology, Kingsoft</strong>
</div>
<div align="center">
*Equal Contribution; † Corresponding Author
</div>
**Monkey** introduces a resource-efficient method to enhance input resolution within the LMM paradigm. Building on the wealth of excellent open-source efforts, we skip the laborious pre-training phase by starting from an existing LMM (Qwen-VL). We propose a simple but effective module that segments high-resolution images into smaller local patches via a sliding window. Each patch is encoded independently by a static visual encoder, enriched with individual LoRA adjustments and a trainable visual resampler. The patch encodings are then concatenated and fed to the language decoder, together with a resized global image feature that preserves the overall structure. In parallel, we have developed a hierarchical pipeline for improving caption data quality, which generates detailed image descriptions covering local elements, textual content, and the broader structural context.
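To make the segmentation step concrete, below is a minimal sketch of the sliding-window idea, assuming non-overlapping 448 x 448 windows and using PIL purely for illustration; it is not the released Monkey code.

```python
# Minimal sketch (illustrative assumption, not the Monkey implementation):
# cut a high-resolution image (e.g. 1344 x 896) into 448 x 448 local windows
# and keep a resized 448 x 448 global view alongside them.
from PIL import Image

WINDOW = 448  # patch size expected by the frozen visual encoder

def split_into_windows(image: Image.Image, window: int = WINDOW):
    """Return the resized global view plus non-overlapping local windows."""
    w, h = image.size  # e.g. 1344 x 896 -> a 3 x 2 grid of 6 local windows
    local_patches = [
        image.crop((x, y, x + window, y + window))
        for y in range(0, h, window)
        for x in range(0, w, window)
    ]
    global_view = image.resize((window, window))
    return global_view, local_patches

# Each local window would then pass through the shared visual encoder
# (with its own LoRA adjustment) and a trainable resampler, and the resulting
# tokens would be concatenated with the global-view tokens before the
# language decoder.
```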
## Spotlights
- **Contextual associations.** Our method demonstrates a superior ability to infer relationships between targets when answering questions, delivering more comprehensive and insightful results.
- **Support for resolutions up to 1344 x 896.** Surpassing the standard 448 x 448 resolution typically employed for LMMs, this significant increase improves the ability to discern and understand small, tightly clustered objects and dense text.
- **Enhanced general performance.** We carried out testing across 16 diverse datasets, and Monkey achieves impressive performance in tasks such as Image Captioning, General Visual Question Answering, Text-centric Visual Question Answering, and Document-oriented Visual Question Answering.
## Performance
![](images/radar.png)
## Demo
We have a demo open for everyone to try: [Demo](https://74a00f7621c2ecf691.gradio.live/).
## Cases
Our model is able to accurately describe almost all the details in the image.
![](images/caption_1.png)
Our model has also demonstrated capabilities in fine-grained question answering, and even in answering questions that involve world knowledge.
![](images/qa_1.png)
With the power of a large-scale architecture, we have also achieved impressive performance on document-based tasks.
![](images/Doc_Chart.png)
## Acknowledgement
[Qwen-VL](https://github.com/QwenLM/Qwen-VL.git): the codebase we built upon. Thanks to the authors of Qwen-VL for providing the framework.