Commit 5e887c2c authored by wanglch's avatar wanglch
Browse files

Initial commit

parents
Pipeline #1060 canceled with stages
name: 🐞 Bug
description: 提交错误报告 | File a bug/issue
title: "[BUG] <title>"
labels: []
body:
- type: checkboxes
attributes:
label: 是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?
description: |
请先搜索您遇到的错误是否在已有的issues或讨论中提到过。
Please search to see if an issue / discussion already exists for the bug you encountered.
[Issues](https://github.com/QwenLM/Qwen-7B/issues)
[Discussions](https://github.com/QwenLM/Qwen-7B/discussions)
options:
- label: 我已经搜索过已有的issues和讨论 | I have searched the existing issues / discussions
required: true
- type: checkboxes
attributes:
label: 该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?
description: |
请先搜索您遇到的错误是否已在FAQ中有相关解答。
Please search to see if an answer already exists in FAQ for the bug you encountered.
[FAQ-en](https://github.com/QwenLM/Qwen-7B/blob/main/FAQ.md)
[FAQ-zh](https://github.com/QwenLM/Qwen-7B/blob/main/FAQ_zh.md)
options:
- label: 我已经搜索过FAQ | I have searched FAQ
required: true
- type: textarea
attributes:
label: 当前行为 | Current Behavior
description: |
准确描述遇到的行为。
A concise description of what you're experiencing.
validations:
required: false
- type: textarea
attributes:
label: 期望行为 | Expected Behavior
description: |
准确描述预期的行为。
A concise description of what you expected to happen.
validations:
required: false
- type: textarea
attributes:
label: 复现方法 | Steps To Reproduce
description: |
复现当前行为的详细步骤。
Steps to reproduce the behavior.
placeholder: |
1. In this environment...
2. With this config...
3. Run '...'
4. See error...
validations:
required: false
- type: textarea
attributes:
label: 运行环境 | Environment
description: |
examples:
- **OS**: Ubuntu 20.04
- **Python**: 3.8
- **Transformers**: 4.31.0
- **PyTorch**: 2.0.1
- **CUDA**: 11.4
value: |
- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):
render: Markdown
validations:
required: false
- type: textarea
attributes:
label: 备注 | Anything else?
description: |
您可以在这里补充其他关于该问题背景信息的描述、链接或引用等。
您可以通过点击高亮此区域然后拖动文件的方式上传图片或日志文件。
Links? References? Anything that will give us more context about the issue you are encountering!
Tip: You can attach images or log files by clicking this area to highlight it and then dragging files in.
validations:
required: false
blank_issues_enabled: true
name: "💡 Feature Request"
description: 创建新功能请求 | Create a new ticket for a new feature request
title: "💡 [REQUEST] - <title>"
labels: [
"question"
]
body:
- type: input
id: start_date
attributes:
label: "起始日期 | Start Date"
description: |
起始开发日期
Start of development
placeholder: "month/day/year"
validations:
required: false
- type: textarea
id: implementation_pr
attributes:
label: "实现PR | Implementation PR"
description: |
实现该功能的Pull request
Pull request used
placeholder: "#Pull Request ID"
validations:
required: false
- type: textarea
id: reference_issues
attributes:
label: "相关Issues | Reference Issues"
description: |
与该功能相关的issues
Common issues
placeholder: "#Issues IDs"
validations:
required: false
- type: textarea
id: summary
attributes:
label: "摘要 | Summary"
description: |
简要描述新功能的特点
Provide a brief explanation of the feature
placeholder: |
Describe in a few lines your feature request
validations:
required: true
- type: textarea
id: basic_example
attributes:
label: "基本示例 | Basic Example"
description: Indicate here some basic examples of your feature.
placeholder: A few specific words about your feature request.
validations:
required: true
- type: textarea
id: drawbacks
attributes:
label: "缺陷 | Drawbacks"
description: |
该新功能有哪些缺陷/可能造成哪些影响?
What are the drawbacks/impacts of your feature request ?
placeholder: |
Identify the drawbacks and impacts while being neutral on your feature request
validations:
required: true
- type: textarea
id: unresolved_question
attributes:
label: "未解决问题 | Unresolved questions"
description: |
有哪些尚未解决的问题?
What questions still remain unresolved ?
placeholder: |
Identify any unresolved issues.
validations:
required: false
\ No newline at end of file
__pycache__
*.so
build
.coverage_*
*.egg-info
*~
.vscode/
.idea/
.DS_Store
/private/
Qwen-VL-Chat/
Qwen-VL-Chat-Int4/
SimSun.ttf
## Code of Conduct 🤝
Before diving in, please take a moment to review our Code of Conduct. It sets the tone for our community and emphasizes the importance of respect and inclusivity. [Read the Code of Conduct](LICENSE.md).
## Contribution Types 🦠🚀📚
### Bug Reports 🐞
If you encounter any bugs during your journey, don't fret! We have the Bug Busters ready to help. To report a bug, follow these steps:
1. Check if the bug has already been reported in [GitHub Issues](https://github.com/AI4Finance-Foundation/FinGPT/issues).
2. If it's a new bug, open a new issue with a concise description and provide detailed, step-by-step instructions to reproduce it.
### Feature Requests 💡
Do you have visionary ideas that could elevate FinGPT? Share them with us! When submitting a feature request, be sure to include:
1. A clear and vivid description of the feature you envision.
2. Discuss the impact and potential benefits.
### Documentation 📖
For those with a penchant for words and an eye for detail, consider contributing to our documentation. You can make the documentation more enlightening for everyone. 🧙📜
### Code Contributions 💻
Calling all AI heroes and wizards! You are the secret sauce behind the FinGPT project. To contribute code and save the financial world:
1. **Fork the Repository**: Click the "Fork" button on the top right of the repository's page. This creates your own copy of the project.
2. **Clone your Fork**: In your terminal, use the following command to clone your fork to your local machine:
```bash
git clone https://github.com/YourUsername/FinGPT.git
```
3. **Create a New Branch**: Make a new branch for your adventures. This helps keep the main codebase clean:
```bash
git checkout -b your-feature-branch
```
4. **Work Your Magic**: Implement your code or changes.
5. **Commit and Push**: Use these commands to commit your changes and push them to your fork:
```bash
git commit -m "Your commit message"
git push origin your-feature-branch
```
6. **Create a Pull Request**: Go to the original FinGPT repository and click "New Pull Request." Select your branch, write a description, and submit.
## Seeking Assistance ❓🙋‍♀️
If you find yourself stuck or have questions, remember that our support team is your sidekick. Don't hesitate to reach out. We are here to guide you through the process and provide any necessary assistance.
## Getting Started 🚀🚀
Are you ready to make a mark on the FinGPT project? Grab your cape and join us in our mission to make finance and AI even more incredible. Your contributions are the magic that fuels our journey.
🔗 [FinGPT GitHub Repository](https://github.com/AI4Finance-Foundation/FinGPT)
### May your contributions be as amazing as you are! 🌌🚀
\ No newline at end of file
# FAQ
## 安装&环境
#### 我应该用哪个transformers版本?
建议使用4.31.0。
#### 我把模型和代码下到本地,按照教程无法使用,该怎么办?
答:别着急,先检查你的代码是不是更新到最新版本,然后确认你是否完整地将模型checkpoint下到本地。
#### `qwen.tiktoken`这个文件找不到,怎么办?
这个是我们的tokenizer的merge文件,你必须下载它才能使用我们的tokenizer。注意,如果你使用git clone却没有使用git-lfs,这个文件不会被下载。如果你不了解git-lfs,可点击[官网](https://git-lfs.com/)了解。
#### transformers_stream_generator/tiktoken/accelerate,这几个库提示找不到,怎么办?
运行如下命令:`pip install -r requirements.txt`。相关依赖库在[https://github.com/QwenLM/Qwen-VL/blob/main/requirements.txt](https://github.com/QwenLM/Qwen-VL/blob/main/requirements.txt) 可以找到。
<br><br>
## Demo & 推理
#### 是否提供Demo?
`web_demo_mm.py`提供了Web UI。请查看README相关内容了解更多。
#### Qwen-VL支持流式推理吗?
Qwen-VL当前不支持流式推理。
#### 模型的输出看起来与输入无关/没有遵循指令/看起来呆呆的
请检查是否加载的是Qwen-VL-Chat模型进行推理,Qwen-VL模型是未经align的预训练基模型,不期望具备响应用户指令的能力。我们在模型最新版本已经对`chat`接口内进行了检查,避免您误将预训练模型作为SFT/Chat模型使用。
#### 是否有量化版本模型
目前Qwen-VL不支持量化,后续我们将支持高效的量化推理实现。
#### 处理长序列时效果有问题
请确认是否开启ntk。若要启用这些技巧,请将`config.json`里的`use_dynamc_ntk``use_logn_attn`设置为`true`。最新代码默认为`true`
<br><br>
## Tokenizer
#### bos_id/eos_id/pad_id,这些token id不存在,为什么?
在训练过程中,我们仅使用<|endoftext|>这一token作为sample/document之间的分隔符及padding位置占位符,你可以将bos_id, eos_id, pad_id均指向tokenizer.eod_id。请阅读我们关于tokenizer的文档,了解如何设置这些id。
Tongyi Qianwen LICENSE AGREEMENT
Tongyi Qianwen Release Date: August 23, 2023
By clicking to agree or by using or distributing any portion or element of the Tongyi Qianwen Materials, you will be deemed to have recognized and accepted the content of this Agreement, which is effective immediately.
1. Definitions
a. This Tongyi Qianwen LICENSE AGREEMENT (this "Agreement") shall mean the terms and conditions for use, reproduction, distribution and modification of the Materials as defined by this Agreement.
b. "We"(or "Us") shall mean Alibaba Cloud.
c. "You" (or "Your") shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Materials for any purpose and in any field of use.
d. "Third Parties" shall mean individuals or legal entities that are not under common control with Us or You.
e. "Tongyi Qianwen" shall mean the large language models (including Qwen-VL model and Qwen-VL-Chat model), and software and algorithms, consisting of trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Us.
f. "Materials" shall mean, collectively, Alibaba Cloud's proprietary Tongyi Qianwen and Documentation (and any portion thereof) made available under this Agreement.
g. "Source" form shall mean the preferred form for making modifications, including but not limited to model source code, documentation source, and configuration files.
h. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation,
and conversions to other media types.
2. Grant of Rights
You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Alibaba Cloud's intellectual property or other rights owned by Us embodied in the Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Materials.
3. Redistribution
You may reproduce and distribute copies of the Materials or derivative works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
a. You shall give any other recipients of the Materials or derivative works a copy of this Agreement;
b. You shall cause any modified files to carry prominent notices stating that You changed the files;
c. You shall retain in all copies of the Materials that You distribute the following attribution notices within a "Notice" text file distributed as a part of such copies: "Tongyi Qianwen is licensed under the Tongyi Qianwen LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved."; and
d. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such derivative works as a whole, provided Your use, reproduction, and distribution of the work otherwise complies with the terms and conditions of this Agreement.
4. Restrictions
If you are commercially using the Materials, and your product or service has more than 100 million monthly active users, You shall request a license from Us. You cannot exercise your rights under this Agreement without our express authorization.
5. Rules of use
a. The Materials may be subject to export controls or restrictions in China, the United States or other countries or regions. You shall comply with applicable laws and regulations in your use of the Materials.
b. You can not use the Materials or any output therefrom to improve any other large language model (excluding Tongyi Qianwen or derivative works thereof).
6. Intellectual Property
a. We retain ownership of all intellectual property rights in and to the Materials and derivatives made by or for Us. Conditioned upon compliance with the terms and conditions of this Agreement, with respect to any derivative works and modifications of the Materials that are made by you, you are and will be the owner of such derivative works and modifications.
b. No trademark license is granted to use the trade names, trademarks, service marks, or product names of Us, except as required to fulfill notice requirements under this Agreement or as required for reasonable and customary use in describing and redistributing the Materials.
c. If you commence a lawsuit or other proceedings (including a cross-claim or counterclaim in a lawsuit) against Us or any entity alleging that the Materials or any output therefrom, or any part of the foregoing, infringe any intellectual property or other right owned or licensable by you, then all licences granted to you under this Agreement shall terminate as of the date such lawsuit or other proceeding is commenced or brought.
7. Disclaimer of Warranty and Limitation of Liability
a. We are not obligated to support, update, provide training for, or develop any further version of the Tongyi Qianwen Materials or to grant any license thereto.
b. THE MATERIALS ARE PROVIDED "AS IS" WITHOUT ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT, OR FITNESS FOR A PARTICULAR PURPOSE. WE MAKE NO WARRANTY AND ASSUME NO RESPONSIBILITY FOR THE SAFETY OR STABILITY OF THE MATERIALS AND ANY OUTPUT THEREFROM.
c. IN NO EVENT SHALL WE BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, BUT NOT LIMITED TO ANY DIRECT, OR INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING FROM YOUR USE OR INABILITY TO USE THE MATERIALS OR ANY OUTPUT OF IT, NO MATTER HOW IT’S CAUSED.
d. You will defend, indemnify and hold harmless Us from and against any claim by any third party arising out of or related to your use or distribution of the Materials.
8. Survival and Termination.
a. The term of this Agreement shall commence upon your acceptance of this Agreement or access to the Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein.
b. We may terminate this Agreement if you breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, you must delete and cease use of the Materials. Sections 7 and 9 shall survive the termination of this Agreement.
9. Governing Law and Jurisdiction.
a. This Agreement and any dispute arising out of or relating to it will be governed by the laws of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement.
b. The People's Courts in Hangzhou City shall have exclusive jurisdiction over any dispute arising out of this Agreement.
\ No newline at end of file
# Qwen-VL
Qwen-VL 是阿里云研发的大规模视觉语言模型(Large Vision Language Model, LVLM)。Qwen-VL 可以以图像、文本、检测框作为输入,并以文本和检测框作为输出。
## 论文
- [Qwen-VL: A Versatile Vision-Language Model for
Understanding, Localization, Text Reading, and Beyond](https://arxiv.org/pdf/2308.12966)
## 模型结构
Qwen-VL的多语言视觉语言模型系列,基于Qwen-7B语言模型。该模型通过视觉编码器和位置感知的视觉语言适配器,赋予语言模型视觉理解能力。
<div align="center">
<img src="./assets/transformer.jpg"/>
</div>
## 算法原理
Qwen-VL: Qwen-VL 以 Qwen-7B 的预训练模型作为语言模型的初始化,并以 Openclip ViT-bigG 作为视觉编码器的初始化,中间加入单层随机初始化的 cross-attention,经过约1.5B的图文数据训练得到。最终图像输入分辨率为448。
<div align=center>
<img src="./assets/transformer.png"/>
</div>
Qwen-VL采用了三阶段的训练流程,并在多个视觉语言理解基准测试中取得了领先的成绩。该模型支持多语言、多图像输入,具备细粒度的视觉理解能力。
另外,通过指令调优,生成了交互式的Qwen-VL-Chat模型,在现实世界用户行为的评估中展现出了优异的表现。总体而言,Qwen-VL系列模型在视觉语言理解任务上取得了显著的成果,并在开源社区中具有领先的地位。
<div align=center>
<img src="./assets/qwenvl.jpeg"/>
</div>
## 环境配置
### Docker(方法一)
[光源](https://www.sourcefind.cn/#/service-details)拉取docker镜像的地址与使用步骤
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu22.04-dtk23.10.1-py310
docker run -it -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro --shm-size=64G --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name qwen-vl <your imageID> bash
cd /path/your_code_data/
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
pip install -r requirements_web_demo.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```
### Dockerfile(方法二)
```
cd /path/your_code_data/docker
docker build --no-cache -t qwen-vl:latest .
docker run --shm-size=64G --name qwen-vl -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v /path/your_code_data/:/path/your_code_data/ -it qwen-vl bash
```
### Anaconda(方法三)
关于本项目DCU显卡所需的特殊深度学习库可从[光合](https://developer.hpccube.com/tool/)开发者社区下载安装。
```
DTK驱动:dtk23.10
python:python3.10
torch:2.1
torchvision: 0.16.0
deepspped: 0.12.3
```
`Tips:以上dtk驱动、python、paddle等DCU相关工具版本需要严格一一对应`
关于本项目DCU显卡所需的特殊深度学习库可从[光合](https://developer.hpccube.com/tool/)开发者社区下载安装。
```
conda create -n qwen-vl python=3.10
conda activate qwen-vl
cd /path/your_code_data/
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple
pip install -r requirements_web_demo.txt -i http://mirrors.aliyun.com/pypi/simple
```
## 数据集
迷你数据集 [assets/mm_tutorial](./assets/mm_tutorial)
预训练需要准备你的训练数据,需要将所有样本放到一个列表中并存入json文件中。每个样本对应一个字典,包含id和conversation,其中后者为一个列表。示例如下所示:用于正常训练的完整数据集请按此目录结构进行制备:
```
[
{
"id": "identity_0",
"conversations": [
{
"from": "user",
"value": "你好"
},
{
"from": "assistant",
"value": "我是Qwen-VL,一个支持视觉输入的大模型。"
}
]
},
{
"id": "identity_1",
"conversations": [
{
"from": "user",
"value": "Picture 1: <img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>\n图中的狗是什么品种?"
},
{
"from": "assistant",
"value": "图中是一只拉布拉多犬。"
},
{
"from": "user",
"value": "框出图中的格子衬衫"
},
{
"from": "assistant",
"value": "<ref>格子衬衫</ref><box>(588,499),(725,789)</box>"
}
]
},
{
"id": "identity_2",
"conversations": [
{
"from": "user",
"value": "Picture 1: <img>assets/mm_tutorial/Chongqing.jpeg</img>\nPicture 2: <img>assets/mm_tutorial/Beijing.jpeg</img>\n图中都是哪"
},
{
"from": "assistant",
"value": "第一张图片是重庆的城市天际线,第二张图片是北京的天际线。"
}
]
}
]
```
## 训练
### 单机单卡
```
sh finetune/finetune_lora_single_gpu.sh
```
## 推理
执行多种任务时需要对以下参数进行修改,可使用中文指令,如下:
`'image'= 图片路径`
`'text'= 任务需求`
```
query = tokenizer.from_list_format([
{'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, # Either a local path or an url
{'text': 'Generate the caption in English with grounding:'},
])
```
### 单机单卡
```
python qwen_vl_inference.py
```
## result
### 检测任务
<div align=center>
<img src="./assets/mm_tutorial\2.jpg"/>
</div>
### 车牌识别
<div align=center>
<img src="./assets/mm_tutorial/car.png"/>
</div>
<div align=center>
<img src="./assets/car_num.png"/>
</div>
### 火车票识别
<div align=center>
<img src="./assets/train_ticket.jpg"/>
</div>
<div align=center>
<img src="./assets/train_ticket_info.png"/>
</div>
### 精度
测试数据: [assets/mm_tutorial](./assets/mm_tutorial) ,使用的加速卡:V100S/K100。
| device | train_loss |
| :------: | :------: |
| V100s | 1.9149 |
| K100 | 1.9149 |
## 应用场景
### 算法类别
`ocr`
### 热点应用行业
`金融,教育,政府,科研,制造,能源,交通`
## 预训练权重
- [Qwen/Qwen-VL-Chat](https://hf-mirror.com/Qwen/Qwen-VL-Chat/tree/main)
- [Qwen/Qwen-VL](https://hf-mirror.com/Qwen/Qwen-VL/tree/main)
## 源码仓库及问题反馈
- http://developer.hpccube.com/codes/modelzoo/umt5.git
## 参考资料
- [Qwen-VL: A Versatile Vision-Language Model for
Understanding, Localization, Text Reading, and Beyond](https://arxiv.org/pdf/2308.12966)
- [Qwen-VL github](https://github.com/QwenLM/Qwen-VL)
<p align="left">
中文</a>&nbsp | &nbsp<a href="README.md">English</a>&nbsp&nbsp | &nbsp<a href="README_JA.md">日本語</a>&nbsp| &nbsp<a href="README_KO.md">한국어</a>&nbsp
</p>
<br><br>
<p align="center">
<img src="assets/logo.jpg" width="400"/>
<p>
<br>
<p align="center">
Qwen-VL
<a href="https://huggingface.co/Qwen/Qwen-VL">🤗</a>
<a href="https://modelscope.cn/models/qwen/Qwen-VL/summary">🤖</a>&nbsp |
Qwen-VL-Chat
<a href="https://huggingface.co/Qwen/Qwen-VL-Chat">🤗</a>
<a href="https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary">🤖</a>&nbsp
(Int4:
<a href="https://huggingface.co/Qwen/Qwen-VL-Chat-Int4">🤗</a>
<a href="https://modelscope.cn/models/qwen/Qwen-VL-Chat-Int4/summary">🤖</a>&nbsp) |
Qwen-VL-Plus
<a href="https://huggingface.co/spaces/Qwen/Qwen-VL-Plus">🤗</a>
<a href="https://modelscope.cn/studios/qwen/Qwen-VL-Chat-Demo/summary">🤖</a>&nbsp |
Qwen-VL-Max
<a href="https://huggingface.co/spaces/Qwen/Qwen-VL-Max">🤗</a>
<a href="https://modelscope.cn/studios/qwen/Qwen-VL-Max/summary">🤖</a>&nbsp
<br>
<a href="https://tongyi.aliyun.com/qianwen">Web</a>&nbsp&nbsp | &nbsp&nbsp
<a href="http://ofasys-wlcb.oss-accelerate-overseas.aliyuncs.com/QwenVL/blog/app_qrcode.jpg">APP</a>&nbsp&nbsp | &nbsp&nbsp
<a href="https://help.aliyun.com/zh/dashscope/developer-reference/vl-plus-quick-start">API</a>&nbsp&nbsp | &nbsp&nbsp
<a href="https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png">WeChat</a>&nbsp&nbsp | &nbsp&nbsp
<a href="https://discord.gg/z3GAxXZ9Ce">Discord</a>&nbsp&nbsp | &nbsp&nbsp
<a href="https://arxiv.org/abs/2308.12966">Paper</a>&nbsp&nbsp | &nbsp&nbsp
<a href="TUTORIAL.md">Tutorial</a>
</p>
<br><br>
---
## Qwen-VL-Plus & Qwen-VL-Max
Qwen-VL 系列再次迎来重磅升级,我们推出 Qwen-VL-Plus 和 Qwen-VL-Max 两个升级版的模型。目前支持通过<a href="https://huggingface.co/spaces/Qwen/Qwen-VL-Max">🤗</a><a href="https://modelscope.cn/studios/qwen/Qwen-VL-Max/summary">🤖</a>[网页端](https://qianwen.aliyun.com)[APP](http://ofasys-wlcb.oss-accelerate-overseas.aliyuncs.com/QwenVL/blog/app_qrcode.jpg)[API](https://help.aliyun.com/zh/dashscope/developer-reference/vl-plus-quick-start)免费访问。
| 模型名 | 模型简介 |
| --- | --- |
| Qwen-VL-Plus | 通义千问大规模视觉语言模型增强版。大幅提升细节识别能力和文字识别能力,支持超百万像素分辨率和任意长宽比规格的图像。在广泛的视觉任务上提供**卓越**的性能。 |
| Qwen-VL-Max | 通义千问超大规模视觉语言模型。相比增强版,再次提升视觉推理能力和指令遵循能力,提供更高的视觉感知和认知水平。在更多复杂任务上提供**最佳**的性能。 |
这两个版本的主要技术升级在于:
- 大幅提升图像相关的推理能力;
- 大幅提升对图中细节和文字的识别、提取和分析能力;
- 支持百万像素以上的高清分辨率图,支持各种长宽比的图像;
这两个模型不仅大幅超越此前所有开源 LVLM 模型的最佳水平,并且在多项图文多模态标准测试中获得了堪比 Gemini Ultra 和 GPT4-v 的水准。
甚至,Qwen-VL-Max 在中文问答、中文文字理解相关的任务上超越了 OpenAI的 GPT4-v 和 Google 的 Gemini-Pro。
<table>
<thead>
<tr>
<th>Model</th>
<th>DocVQA<br><sup>(文档理解)</sup></th>
<th>ChartQA<br><sup>(图表理解)</sup></th>
<th>AI2D<br><sup>(科学图例)</sup></th>
<th>TextVQA<br><sup>(文字阅读)</sup></th>
<th>MMMU<br><sup>(多学科问题)</sup></th>
<th>MathVista<br><sup>(数学推理)</sup></th>
<th>MM-Bench-CN<br><sup>(中文问答)</sup></th>
</tr>
</thead>
<tbody align="center">
<tr>
<td>Other Best<br>Open-source LVLM</td>
<td>81.6%<br><sup>(CogAgent)</sup></td>
<td>68.4%<br><sup>(CogAgent)</sup></td>
<td>73.7%<br><sup>(Fuyu-Medium)</sup></td>
<td>76.1%<br><sup>(CogAgent)</sup></td>
<td>45.9%<br><sup>(Yi-VL-34B)</sup></td>
<td>36.7%<br><sup>(SPHINX-V2)</sup></td>
<td>72.4%<br><sup>(InternLM-XComposer-VL)</sup></td>
</tr>
<tr>
<td>Gemini Pro</td>
<td>88.1%</td>
<td>74.1%</td>
<td>73.9%</td>
<td>74.6%</td>
<td>47.9%</td>
<td>45.2%</td>
<td>74.3%</td>
</tr>
<tr>
<td>Gemini Ultra</td>
<td>90.9%</td>
<td>80.8% <sup>1</sup></td>
<td>79.5% <sup>1</sup></td>
<td>82.3% <sup>1</sup></td>
<td>59.4% <sup>1</sup></td>
<td>53.0% <sup>1</sup></td>
<td>-</td>
</tr>
<tr>
<td>GPT-4V</td>
<td>88.4%</td>
<td>78.5%</td>
<td>78.2%</td>
<td>78.0%</td>
<td>56.8%</td>
<td>49.9%</td>
<td>73.9%</td>
</tr>
<tr>
<td><b>Qwen-VL-Plus</b></td>
<td>91.4%</td>
<td>78.1%</td>
<td>75.9%</td>
<td>78.9%</td>
<td>44.0%</td>
<td>43.3%</td>
<td>68.0%</td>
</tr>
<tr>
<td><b>Qwen-VL-Max</b></td>
<td>92.5% <sup>1</sup></td>
<td>79.8% <sup>2</sup></td>
<td>79.3% <sup>2</sup></td>
<td>79.5% <sup>2</sup></td>
<td>51.4% <sup>3</sup></td>
<td>51.0% <sup>2</sup></td>
<td>75.1% <sup>1</sup></td>
</tr>
</tbody>
</table>
所有评测都是在不使用任何外部OCR工具(“only pixel”)的情况下获得的。
---
## 新闻
* 2024年01月18日 我们推出 Qwen-Vl-Max,大幅超越此前所有开源 LVLM 模型的最佳水平,并且在多项图文多模态标准测试中获得了堪比 Gemini Ultra 和 GPT4-v 的水准。直接访问[通义千问网页端或APP](https://qianwen.aliyun.com)就能体验新模型。
* 2023年11月28日 Qwen-VL单模型在[DOCVQA](https://rrc.cvc.uab.es/?ch=17&com=evaluation&task=1)达到了最强水平,超越了GPT4V,PALI-X,与此同时它还是一个通用模型,直接输入图片就能帮你分析理解各种任务。
* 2023年9月12日 更新Qwen-VL-Chat模型,该模型有更鲁棒的中文指令跟随,更好的网页和表格图片理解和问答能力以及更好的对话表现(Touchstone: 中文: 401.2->481.7, 英文: 645.2->711.6)。
* 2023年9月12日 支持Qwen-VL和Qwen-VL-Chat的微调,其中包括全参数微调、LoRA以及Q-LoRA
* 2023年9月8日 感谢[camenduru](https://github.com/camenduru)贡献了[Colab](https://github.com/camenduru/Qwen-VL-Chat-colab)示例,每个人都可以以此为教程,在12G的GPU上做本地或在线的Demo。
* 2023年9月5日 在社区多模态通用模型榜单 [MME Benchmark](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation) 上取得了感知和认知双赛道的当前最好结果。
* 2023年9月4日 在社区多模态通用模型榜单 [SEED-Bench](https://huggingface.co/spaces/AILab-CVC/SEED-Bench_Leaderboard) 上取得了图像理解和视频理解的当前最好结果。
* 2023年9月1日 发布[TouchStone](https://github.com/OFA-Sys/TouchStone) 测评, 这是一个综合评估LVLM能力的测评,它不仅考察模型的视觉描述和推理能力,还包括根据视觉内容的文学创作能力。同时它是将多模态信息用文本表述并用LLMs进行评估的方法。
* 2023年8月31日 发布Qwen-VL-Chat量化模型,**Qwen-VL-Chat-Int4**,该模型显存占用低,推理速度相比半精度模型显著提升,在基准评测上效果损失较小。
* 2023年8月22日 在魔搭社区(ModelScope)和Hugging Face同步推出Qwen-VL和Qwen-VL-Chat模型。同时,我们提供一个[论文](https://arxiv.org/abs/2308.12966)介绍了相关的模型结构、训练细节和模型表现。
---
## Qwen-VL
**Qwen-VL** 是阿里云研发的大规模视觉语言模型(Large Vision Language Model, LVLM)。Qwen-VL 可以以图像、文本、检测框作为输入,并以文本和检测框作为输出。Qwen-VL 系列模型的特点包括:
- **强大的性能**:在四大类多模态任务的标准英文测评中(Zero-shot Captioning/VQA/DocVQA/Grounding)上,均取得同等通用模型大小下最好效果;
- **多语言对话模型**:天然支持英文、中文等多语言对话,端到端支持图片里中英双语的长文本识别;
- **多图交错对话**:支持多图输入和比较,指定图片问答,多图文学创作等;
- **首个支持中文开放域定位的通用模型**:通过中文开放域语言表达进行检测框标注;
- **细粒度识别和理解**:相比于目前其它开源LVLM使用的224分辨率,Qwen-VL是首个开源的448分辨率的LVLM模型。更高分辨率可以提升细粒度的文字识别、文档问答和检测框标注。
<br>
<p align="center">
<img src="assets/demo_vl.gif" width="400"/>
<p>
<br>
目前,我们提供了 Qwen-VL 系列的两个模型:
- Qwen-VL: Qwen-VL 以 Qwen-7B 的预训练模型作为语言模型的初始化,并以 [Openclip ViT-bigG](https://github.com/mlfoundations/open_clip) 作为视觉编码器的初始化,中间加入单层随机初始化的 cross-attention,经过约1.5B的图文数据训练得到。最终图像输入分辨率为448。
- Qwen-VL-Chat: 在 Qwen-VL 的基础上,我们使用对齐机制打造了基于大语言模型的视觉AI助手Qwen-VL-Chat,它支持更灵活的交互方式,包括多图、多轮问答、创作等能力。
<br>
## 评测
我们从三个角度评测了模型的能力:
1.**英文标准 Benchmark** 上评测模型的基础任务能力。目前评测了四大类多模态任务:
- Zero-shot Captioning: 评测模型在未见过数据集上的零样本图片描述能力;
- General VQA: 评测模型的通用问答能力,例如判断题、颜色、个数、类目等问答能力;
- Text-based VQA:评测模型对于图片中文字相关的识别/问答能力,例如文档问答、图表问答、文字问答等;
- Referring Expression Compression:评测模型给定物体描述画检测框的能力;
2. **试金石 (TouchStone)**:为了评测模型整体的图文对话能力和人类对齐水平。我们为此构建了一个基于 GPT4 打分来评测 LVLM 模型的 Benchmark:TouchStone。在 TouchStone-v0.1 中:
- 评测基准总计涵盖 300+张图片、800+道题目、27个类别。包括基础属性问答、人物地标问答、影视作品问答、视觉推理、反事实推理、诗歌创作、故事写作,商品比较、图片解题等**尽可能广泛的类别**
- 为了弥补目前 GPT4 无法直接读取图片的缺陷,我们给所有的带评测图片提供了**人工标注的充分详细描述**,并且将图片的详细描述、问题和模型的输出结果一起交给 GPT4 打分。
- 评测同时包含英文版本和中文版本。
3. **其它多模态通用模型榜单**:我们也在其它多模态通用模型榜单中评测了模型的能力:
- MME Benchmark: 是一个多模态大型语言模型的综合评价基准。它在总共14个子任务上评测**感知和认知**能力,Qwen-VL-Chat在这两个总维度上都实现了当前最好结果。
- SEED-Bench: 是一个包含1.9万选择题的多模态基准测评,通过人工注释的结果评估多模态大模型,涵盖12个评估维度,包括**图像和视频理解**,Qwen-VL和Qwen-VL-chat在这个基准上实现了当前最好结果。
评测结果如下:
Qwen-VL在多个VL任务上相比目前SOTA的Generalist Models都有明显优势,并且在能力范围也覆盖更加全面。
<p align="center">
<img src="assets/radar.png" width="600"/>
<p>
### 零样本图像描述生成(Zero-shot Image Caption) 及 通用视觉问答(General VQA)
<table>
<thead>
<tr>
<th rowspan="2">Model type</th>
<th rowspan="2">Model</th>
<th colspan="2">Zero-shot Captioning</th>
<th colspan="5">General VQA</th>
</tr>
<tr>
<th>NoCaps</th>
<th>Flickr30K</th>
<th>VQAv2<sup>dev</sup></th>
<th>OK-VQA</th>
<th>GQA</th>
<th>SciQA-Img<br>(0-shot)</th>
<th>VizWiz<br>(0-shot)</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td rowspan="10">Generalist<br>Models</td>
<td>Flamingo-9B</td>
<td>-</td>
<td>61.5</td>
<td>51.8</td>
<td>44.7</td>
<td>-</td>
<td>-</td>
<td>28.8</td>
</tr>
<tr>
<td>Flamingo-80B</td>
<td>-</td>
<td>67.2</td>
<td>56.3</td>
<td>50.6</td>
<td>-</td>
<td>-</td>
<td>31.6</td>
</tr>
<tr>
<td>Unified-IO-XL</td>
<td>100.0</td>
<td>-</td>
<td>77.9</td>
<td>54.0</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>Kosmos-1</td>
<td>-</td>
<td>67.1</td>
<td>51.0</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>29.2</td>
</tr>
<tr>
<td>Kosmos-2</td>
<td>-</td>
<td>66.7</td>
<td>45.6</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>BLIP-2 (Vicuna-13B)</td>
<td>103.9</td>
<td>71.6</td>
<td>65.0</td>
<td>45.9</td>
<td>32.3</td>
<td>61.0</td>
<td>19.6</td>
</tr>
<tr>
<td>InstructBLIP (Vicuna-13B)</td>
<td><strong>121.9</strong></td>
<td>82.8</td>
<td>-</td>
<td>-</td>
<td>49.5</td>
<td>63.1</td>
<td>33.4</td>
</tr>
<tr>
<td>Shikra (Vicuna-13B)</td>
<td>-</td>
<td>73.9</td>
<td>77.36</td>
<td>47.16</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td><strong>Qwen-VL (Qwen-7B)</strong></td>
<td>121.4</td>
<td><b>85.8</b></td>
<td><b>78.8</b></td>
<td><b>58.6</b></td>
<td><b>59.3</b></td>
<td>67.1</td>
<td>35.2</td>
</tr>
<!-- <tr>
<td>Qwen-VL (4-shot)</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>63.6</td>
<td>-</td>
<td>-</td>
<td>39.1</td>
</tr> -->
<tr>
<td>Qwen-VL-Chat</td>
<td>120.2</td>
<td>81.0</td>
<td>78.2</td>
<td>56.6</td>
<td>57.5</td>
<td><b>68.2</b></td>
<td><b>38.9</b></td>
</tr>
<!-- <tr>
<td>Qwen-VL-Chat (4-shot)</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>60.6</td>
<td>-</td>
<td>-</td>
<td>44.45</td>
</tr> -->
<tr>
<td>Previous SOTA<br>(Per Task Fine-tuning)</td>
<td>-</td>
<td>127.0<br>(PALI-17B)</td>
<td>84.5<br>(InstructBLIP<br>-FlanT5-XL)</td>
<td>86.1<br>(PALI-X<br>-55B)</td>
<td>66.1<br>(PALI-X<br>-55B)</td>
<td>72.1<br>(CFR)</td>
<td>92.53<br>(LLaVa+<br>GPT-4)</td>
<td>70.9<br>(PALI-X<br>-55B)</td>
</tr>
</tbody>
</table>
- 在 Zero-shot Captioning 中,Qwen-VL 在 Flickr30K 数据集上取得了 **SOTA** 的结果,并在 Nocaps 数据集上取得了和 InstructBlip 可竞争的结果。
- 在 General VQA 中,Qwen-VL 取得了 LVLM 模型同等量级和设定下 **SOTA** 的结果。
### 文本导向的视觉问答(Text-oriented VQA)
<table>
<thead>
<tr>
<th>Model type</th>
<th>Model</th>
<th>TextVQA</th>
<th>DocVQA</th>
<th>ChartQA</th>
<th>AI2D</th>
<th>OCR-VQA</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td rowspan="5">Generalist Models</td>
<td>BLIP-2 (Vicuna-13B)</td>
<td>42.4</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>InstructBLIP (Vicuna-13B)</td>
<td>50.7</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>mPLUG-DocOwl (LLaMA-7B)</td>
<td>52.6</td>
<td>62.2</td>
<td>57.4</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>Pix2Struct-Large (1.3B)</td>
<td>-</td>
<td><b>76.6</b></td>
<td>58.6</td>
<td>42.1</td>
<td>71.3</td>
</tr>
<tr>
<td>Qwen-VL (Qwen-7B)</td>
<td><b>63.8</b></td>
<td>65.1</td>
<td><b>65.7</b></td>
<td><b>62.3</b></td>
<td><b>75.7</b></td>
</tr>
<tr>
<td>Specialist SOTAs<br>(Specialist/Finetuned)</td>
<td>PALI-X-55B (Single-task FT)<br>(Without OCR Pipeline)</td>
<td>71.44</td>
<td>80.0</td>
<td>70.0</td>
<td>81.2</td>
<td>75.0</td>
</tr>
</tbody>
</table>
- 在文字相关的识别/问答评测上,取得了当前规模下通用 LVLM 达到的最好结果。
- 分辨率对上述某几个评测非常重要,大部分 224 分辨率的开源 LVLM 模型无法完成以上评测,或只能通过切图的方式解决。Qwen-VL 将分辨率提升到 448,可以直接以端到端的方式进行以上评测。Qwen-VL 在很多任务上甚至超过了 1024 分辨率的 Pix2Struct-Large 模型。
### 细粒度视觉定位(Referring Expression Comprehension)
<table>
<thead>
<tr>
<th rowspan="2">Model type</th>
<th rowspan="2">Model</th>
<th colspan="3">RefCOCO</th>
<th colspan="3">RefCOCO+</th>
<th colspan="2">RefCOCOg</th>
<th>GRIT</th>
</tr>
<tr>
<th>val</th>
<th>test-A</th>
<th>test-B</th>
<th>val</th>
<th>test-A</th>
<th>test-B</th>
<th>val-u</th>
<th>test-u</th>
<th>refexp</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td rowspan="8">Generalist Models</td>
<td>GPV-2</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>51.50</td>
</tr>
<tr>
<td>OFA-L*</td>
<td>79.96</td>
<td>83.67</td>
<td>76.39</td>
<td>68.29</td>
<td>76.00</td>
<td>61.75</td>
<td>67.57</td>
<td>67.58</td>
<td>61.70</td>
</tr>
<tr>
<td>Unified-IO</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td><b>78.61</b></td>
</tr>
<tr>
<td>VisionLLM-H</td>
<td></td>
<td>86.70</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>Shikra-7B</td>
<td>87.01</td>
<td>90.61</td>
<td>80.24 </td>
<td>81.60</td>
<td>87.36</td>
<td>72.12</td>
<td>82.27</td>
<td>82.19</td>
<td>69.34</td>
</tr>
<tr>
<td>Shikra-13B</td>
<td>87.83 </td>
<td>91.11</td>
<td>81.81</td>
<td>82.89</td>
<td>87.79</td>
<td>74.41</td>
<td>82.64</td>
<td>83.16</td>
<td>69.03</td>
</tr>
<tr>
<td>Qwen-VL-7B</td>
<td><b>89.36</b></td>
<td>92.26</td>
<td><b>85.34</b></td>
<td><b>83.12</b></td>
<td>88.25</td>
<td><b>77.21</b></td>
<td>85.58</td>
<td>85.48</td>
<td>78.22</td>
</tr>
<tr>
<td>Qwen-VL-7B-Chat</td>
<td>88.55</td>
<td><b>92.27</b></td>
<td>84.51</td>
<td>82.82</td>
<td><b>88.59</b></td>
<td>76.79</td>
<td><b>85.96</b></td>
<td><b>86.32</b></td>
<td>-</td>
</tr>
<tr>
<td rowspan="3">Specialist SOTAs<br>(Specialist/Finetuned)</td>
<td>G-DINO-L</td>
<td>90.56&nbsp;&nbsp;</td>
<td>93.19</td>
<td>88.24</td>
<td>82.75</td>
<td>88.95</td>
<td>75.92</td>
<td>86.13</td>
<td>87.02</td>
<td>-</td>
</tr>
<tr>
<td>UNINEXT-H</td>
<td>92.64 </td>
<td>94.33</td>
<td>91.46</td>
<td>85.24</td>
<td>89.63</td>
<td>79.79</td>
<td>88.73</td>
<td>89.37</td>
<td>-</td>
</tr>
<tr>
<td>ONE-PEACE</td>
<td>92.58 </td>
<td>94.18</td>
<td>89.26</td>
<td>88.77</td>
<td>92.21</td>
<td>83.23</td>
<td>89.22</td>
<td>89.27</td>
<td>-</td>
</tr>
</tbody>
</table>
- 在定位任务上,Qwen-VL 全面超过 Shikra-13B,取得了目前 Generalist LVLM 模型上在 Refcoco 上的 **SOTA**
- Qwen-VL 并没有在任何中文定位数据上训练过,但通过中文 Caption 数据和 英文 Grounding 数据的训练,可以 Zero-shot 泛化出中文 Grounding 能力。
我们提供了以上**所有**评测脚本以供复现我们的实验结果。请阅读 [eval_mm/EVALUATION.md](eval_mm/EVALUATION.md) 了解更多信息。
### 对话能力测评
TouchStone 是一个基于 GPT4 打分来评测 LVLM 模型的图文对话能力和人类对齐水平的基准。它涵盖了 300+张图片、800+道题目、27个类别,包括基础属性、人物地标、视觉推理、诗歌创作、故事写作、商品比较、图片解题等**尽可能广泛的类别**。关于 TouchStone 的详细介绍,请参考[touchstone/README_CN.md](touchstone/README_CN.md)了解更多信息。
#### 英语
| Model | Score |
| ---------------- | ----- |
| PandaGPT | 488.5 |
| MiniGPT4 | 531.7 |
| InstructBLIP | 552.4 |
| LLaMA-AdapterV2 | 590.1 |
| LLaVA | 602.7 |
| mPLUG-Owl | 605.4 |
| Qwen-VL-Chat | 645.2 |
| Qwen-VL-Chat-1.1 | 711.6 |
#### 中文
| Model | Score |
| ---------------- | ----- |
| VisualGLM | 247.1 |
| Qwen-VL-Chat | 401.2 |
| Qwen-VL-Chat-1.1 | 481.7 |
Qwen-VL-Chat 模型在中英文的对齐评测中均取得当前 LVLM 模型下的最好结果。
<br>
### 其它榜单测评
#### MME Benchmark
MME是多模态大型语言模型的综合评价基准。它在总共14个子任务上评测**感知和认知**能力。Qwen-VL-Chat在这个基准上实现了SOTAs。完整复现[见此](eval_mm/mme/EVAL_MME.md).
<p align="center">
<img src="eval_mm/mme/perception.jpg" width="600"/>
<p>
<p align="center">
<img src="eval_mm/mme/cognition.jpg" width="600"/>
<p>
#### SEED-Bench
SEED-Bench是一个包含1.9万选择题的多模态基准测评,通过人工注释的结果评估多模态大模型,涵盖12个评估维度,包括**图像和视频理解**。Qwen-VL和Qwen-VL-chat在这个基准上实现了SOTAs。完整复现[见此](eval_mm/seed_bench/EVAL_SEED.md)
<p align="center">
<img src="eval_mm/seed_bench/leaderboard.jpg"/>
<p>
## 部署要求
* python 3.8及以上版本
* pytorch 1.12及以上版本,推荐2.0及以上版本
* 建议使用CUDA 11.4及以上(GPU用户需考虑此选项)
<br>
## 快速使用
我们提供简单的示例来说明如何利用 🤖 ModelScope 和 🤗 Transformers 快速使用 Qwen-VL 和 Qwen-VL-Chat。
在开始前,请确保你已经配置好环境并安装好相关的代码包。最重要的是,确保你满足上述要求,然后安装相关的依赖库。
```bash
pip install -r requirements.txt
```
接下来你可以开始使用Transformers或者ModelScope来使用我们的模型。关于视觉模块的更多用法,请参考[教程](TUTORIAL_zh.md)
#### 🤗 Transformers
如希望使用 Qwen-VL-chat 进行推理,所需要写的只是如下所示的数行代码。**请确保你使用的是最新代码。**
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)
# 请注意:分词器默认行为已更改为默认关闭特殊token攻击防护。
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
# 打开bf16精度,A100、H100、RTX3060、RTX3070等显卡建议启用以节省显存
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# 打开fp16精度,V100、P100、T4等显卡建议启用以节省显存
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# 使用CPU进行推理,需要约32GB内存
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cpu", trust_remote_code=True).eval()
# 默认gpu进行推理,需要约24GB显存
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval()
# 可指定不同的生成长度、top_p等相关超参(transformers 4.32.0及以上无需执行此操作)
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
# 第一轮对话
query = tokenizer.from_list_format([
{'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, # Either a local path or an url
{'text': '这是什么?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# 图中是一名女子在沙滩上和狗玩耍,旁边是一只拉布拉多犬,它们处于沙滩上。
# 第二轮对话
response, history = model.chat(tokenizer, '框出图中击掌的位置', history=history)
print(response)
# <ref>击掌</ref><box>(536,509),(588,602)</box>
image = tokenizer.draw_bbox_on_latest_picture(response, history)
if image:
image.save('1.jpg')
else:
print("no box")
```
<p align="center">
<img src="assets/demo_highfive.jpg" width="500"/>
<p>
运行Qwen-VL同样非常简单。
<summary>运行Qwen-VL</summary>
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True)
# 打开bf16精度,A100、H100、RTX3060、RTX3070等显卡建议启用以节省显存
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="auto", trust_remote_code=True, bf16=True).eval()
# 打开fp16精度,V100、P100、T4等显卡建议启用以节省显存
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="auto", trust_remote_code=True, fp16=True).eval()
# 使用CPU进行推理,需要约32GB内存
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cpu", trust_remote_code=True).eval()
# 默认gpu进行推理,需要约24GB显存
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cuda", trust_remote_code=True).eval()
# 可指定不同的生成长度、top_p等相关超参(transformers 4.32.0及以上无需执行此操作)
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True)
query = tokenizer.from_list_format([
{'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, # Either a local path or an url
{'text': 'Generate the caption in English with grounding:'},
])
inputs = tokenizer(query, return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs)
response = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False)
print(response)
# <img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>Generate the caption in English with grounding:<ref> Woman</ref><box>(451,379),(731,806)</box> and<ref> her dog</ref><box>(219,424),(576,896)</box> playing on the beach<|endoftext|>
image = tokenizer.draw_bbox_on_latest_picture(response)
if image:
image.save('2.jpg')
else:
print("no box")
```
<p align="center">
<img src="assets/demo_spotting_caption.jpg" width="500"/>
<p>
若在使用上述代码时由于各种原因无法从 HuggingFace 拉取模型和代码,可以先从 ModelScope 下载模型及代码至本地,再从本地加载模型:
```python
from modelscope import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer
# Downloading model checkpoint to a local dir model_dir
# model_dir = snapshot_download('qwen/Qwen-VL')
model_dir = snapshot_download('qwen/Qwen-VL-Chat')
# Loading local checkpoints
# trust_remote_code is still set as True since we still load codes from local dir instead of transformers
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_dir,
device_map="cuda",
trust_remote_code=True
).eval()
```
#### 🤖 ModelScope
魔搭(ModelScope)是开源的模型即服务共享平台,为泛AI开发者提供灵活、易用、低成本的一站式模型服务产品。使用ModelScope同样非常简单,代码如下所示:
```python
from modelscope import (
snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig
)
import torch
model_id = 'qwen/Qwen-VL-Chat'
revision = 'v1.0.0'
model_dir = snapshot_download(model_id, revision=revision)
torch.manual_seed(1234)
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
if not hasattr(tokenizer, 'model_dir'):
tokenizer.model_dir = model_dir
# 打开bf16精度,A100、H100、RTX3060、RTX3070等显卡建议启用以节省显存
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, bf16=True).eval()
# 打开fp16精度,V100、P100、T4等显卡建议启用以节省显存
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, fp16=True).eval()
# 使用CPU进行推理,需要约32GB内存
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cpu", trust_remote_code=True).eval()
# 默认gpu进行推理,需要约24GB显存
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True).eval()
# 指定生成超参数(transformers 4.32.0及以上无需执行此操作)
# model.generation_config = GenerationConfig.from_pretrained(model_dir, trust_remote_code=True)
# 第一轮对话
# Either a local path or an url between <img></img> tags.
image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'
response, history = model.chat(tokenizer, query=f'<img>{image_path}</img>这是什么', history=None)
print(response)
# 图中是一名年轻女子在沙滩上和她的狗玩耍,狗的品种是拉布拉多。她们坐在沙滩上,狗的前腿抬起来,与人互动。
# 第二轮对话
response, history = model.chat(tokenizer, '输出击掌的检测框', history=history)
print(response)
# <ref>"击掌"</ref><box>(211,412),(577,891)</box>
image = tokenizer.draw_bbox_on_latest_picture(response, history)
if image:
image.save('output_chat.jpg')
else:
print("no box")
```
<br>
## 量化
### 用法
当前我们提供了基于[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)的量化方案,并提供了Qwen-VL-Chat的Int4量化版本Qwen-VL-Chat-Int4 [点击此处](https://huggingface.co/Qwen/Qwen-VL-Chat-Int4)。该模型在效果评测上几乎无损,并在显存占用和推理速度上具有明显优势。
下文说明如何使用该量化模型。开始之前,请确保你满足要求(如torch2.0及以上、transformers 4.32.0及以上,等)并安装所需的代码库:
```bash
pip install optimum
git clone https://github.com/JustinLin610/AutoGPTQ.git & cd AutoGPTQ
pip install -v .
```
如遇到安装 `auto-gptq` 的问题,建议您前往官方[repo](https://github.com/PanQiWei/AutoGPTQ) 寻找合适的wheel。
随后你便可以按照上述用法****,轻松调用量化模型:
```python
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen-VL-Chat-Int4",
device_map="auto",
trust_remote_code=True
).eval()
# Either a local path or an url between <img></img> tags.
image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'
response, history = model.chat(tokenizer, query=f'<img>{image_path}</img>这是什么', history=None)
print(response)
```
### 效果评测
我们列出不同精度下模型在评测基准 **[TouchStone](https://github.com/OFA-Sys/TouchStone)** 上的表现,并发现量化模型并没有显著性能损失。结果如下所示:
| Quantization | ZH | EN |
| ------------ | :--------: | :-----------: |
| BF16 | 401.2 | 645.2 |
| Int4 | 386.6 | 651.4 |
### 推理速度
我们测算了在输入一张图片(即258个token)的条件下BF16和Int4的模型生成1792 (2048-258) 和 7934 (8192-258) 个token的平均速度。
| Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
| ------------ | :-----------------: | :-----------------: |
| BF16 | 28.87 | 24.32 |
| Int4 | 37.79 | 34.34 |
推理速度测算是在单卡 A100-SXM4-80G GPU上运行,使用PyTorch 2.0.1及CUDA 11.4。
### GPU显存占用
我们还测算了在一张图片输入的条件下BF16和Int4模型生成1792 (2048-258) 和 7934 (8192-258) 个token所需显存。结果如下所示:
| Quantization | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
| ------------ | :---------------------------------: | :-----------------------------------: |
| BF16 | 22.60GB | 28.01GB |
| Int4 | 11.82GB | 17.23GB |
上述速度和显存测算使用[此脚本](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile_mm.py)完成。
<br>
## 微调
我们提供了`finetune.py`这个脚本供用户实现在自己的数据上进行微调的功能,以接入下游任务。此外,我们还提供了shell脚本减少用户的工作量。这个脚本支持 [DeepSpeed](https://github.com/microsoft/DeepSpeed)[FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/) 。我们提供的shell脚本使用了DeepSpeed,因此建议您确保已经安装DeepSpeed。
首先,你需要准备你的训练数据。你需要将所有样本放到一个列表中并存入json文件中。每个样本对应一个字典,包含id和conversation,其中后者为一个列表。示例如下所示:
```json
[
{
"id": "identity_0",
"conversations": [
{
"from": "user",
"value": "你好"
},
{
"from": "assistant",
"value": "我是Qwen-VL,一个支持视觉输入的大模型。"
}
]
},
{
"id": "identity_1",
"conversations": [
{
"from": "user",
"value": "Picture 1: <img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>\n图中的狗是什么品种?"
},
{
"from": "assistant",
"value": "图中是一只拉布拉多犬。"
},
{
"from": "user",
"value": "框出图中的格子衬衫"
},
{
"from": "assistant",
"value": "<ref>格子衬衫</ref><box>(588,499),(725,789)</box>"
}
]
},
{
"id": "identity_2",
"conversations": [
{
"from": "user",
"value": "Picture 1: <img>assets/mm_tutorial/Chongqing.jpeg</img>\nPicture 2: <img>assets/mm_tutorial/Beijing.jpeg</img>\n图中都是哪"
},
{
"from": "assistant",
"value": "第一张图片是重庆的城市天际线,第二张图片是北京的天际线。"
}
]
}
]
```
为针对多样的VL任务,我们增加了一下的特殊tokens: `<img> </img> <ref> </ref> <box> </box>`.
对于带图像输入的内容可表示为 `Picture id: <img>img_path</img>\n{your prompt}`,其中`id`表示对话中的第几张图片。"img_path"可以是本地的图片或网络地址。
对话中的检测框可以表示为`<box>(x1,y1),(x2,y2)</box>`,其中 `(x1, y1)``(x2, y2)`分别对应左上角和右下角的坐标,并且被归一化到`[0, 1000)`的范围内. 检测框对应的文本描述也可以通过`<ref>text_caption</ref>`表示。
准备好数据后,你可以使用我们提供的shell脚本实现微调。注意,你需要在脚本中指定你的数据的路径。
微调脚本能够帮你实现:
- 全参数微调
- LoRA
- Q-LoRA
### 全参数微调
默认下全参数微调在训练过程中更新LLM所有参数。我们的实验中,在微调阶段不更新ViT的参数会取得更好的表现。你可以运行这个脚本开始训练:
```bash
# 分布式训练。由于显存限制将导致单卡训练失败,我们不提供单卡训练脚本。
sh finetune/finetune_ds.sh
```
尤其注意,你需要在脚本中指定正确的模型名称或路径、数据路径、以及模型输出的文件夹路径。如果你想修改deepspeed配置,可以删除掉`--deepspeed`这个输入或者自行根据需求修改DeepSpeed配置json文件。此外,我们支持混合精度训练,因此你可以设置`--bf16 True`或者`--fp16 True`。经验上,如果你的机器支持bf16,我们建议使用bf16,这样可以和我们的预训练和对齐训练保持一致,这也是为什么我们把默认配置设为它的原因。
### LoRA
运行LoRA的方法类似全参数微调。但在开始前,请确保已经安装`peft`代码库。另外,记住要设置正确的模型、数据和输出路径。我们建议你为模型路径使用绝对路径。这是因为LoRA仅存储adapter部分参数,而adapter配置json文件记录了预训练模型的路径,用于读取预训练模型权重。同样,你可以设置bf16或者fp16。
```bash
# 单卡训练
sh finetune/finetune_lora_single_gpu.sh
# 分布式训练
sh finetune/finetune_lora_ds.sh
```
与全参数微调不同,LoRA ([论文](https://arxiv.org/abs/2106.09685)) 只更新adapter层的参数而无需更新原有语言模型的参数。这种方法允许用户用更低的显存开销来训练模型,也意味着更小的计算开销。
注意,如果你使用预训练模型进行LoRA微调,而非chat模型,模型的embedding和输出层的参数将被设为可训练的参数。这是因为预训练模型没有学习过ChatML格式中的特殊token,因此需要将这部分参数设为可训练才能让模型学会理解和预测这些token。这也意味着,假如你的训练引入新的特殊token,你需要通过代码中的`modules_to_save`将这些参数设为可训练的参数。如果你想节省显存占用,可以考虑使用chat模型进行LoRA微调,显存占用将大幅度降低。下文的显存占用和训练速度的记录将详细介绍这部分细节。
### Q-LoRA
如果你依然遇到显存不足的问题,可以考虑使用Q-LoRA ([论文](https://arxiv.org/abs/2305.14314))。该方法使用4比特量化模型以及paged attention等技术实现更小的显存开销。运行Q-LoRA你只需运行如下脚本:
```bash
# 单卡训练
sh finetune/finetune_qlora_single_gpu.sh
# 分布式训练
sh finetune/finetune_qlora_ds.sh
```
我们建议你使用我们提供的Int4量化模型进行训练,即Qwen-VL-Chat-Int4。请**不要使用**非量化模型!与全参数微调以及LoRA不同,Q-LoRA仅支持fp16。此外,上述LoRA关于特殊token的问题在Q-LoRA依然存在。并且,Int4模型的参数无法被设为可训练的参数。所幸的是,我们只提供了Chat模型的Int4模型,因此你不用担心这个问题。但是,如果你执意要在Q-LoRA中引入新的特殊token,很抱歉,我们无法保证你能成功训练。
与全参数微调不同,LoRA和Q-LoRA的训练只需存储adapter部分的参数。假如你需要使用LoRA训练后的模型,你需要使用如下方法。你可以用如下代码读取模型:
```python
from peft import AutoPeftModelForCausalLM
model = AutoPeftModelForCausalLM.from_pretrained(
path_to_adapter, # path to the output directory
device_map="auto",
trust_remote_code=True
).eval()
```
如果你觉得这样一步到位的方式让你很不安心或者影响你接入下游应用,你可以选择先合并并存储模型(LoRA支持合并,Q-LoRA不支持),再用常规方式读取你的新模型,示例如下:
```python
from peft import AutoPeftModelForCausalLM
model = AutoPeftModelForCausalLM.from_pretrained(
path_to_adapter, # path to the output directory
device_map="auto",
trust_remote_code=True
).eval()
merged_model = model.merge_and_unload()
# max_shard_size and safe serialization are not necessary.
# They respectively work for sharding checkpoint and save the model to safetensors
merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_serialization=True)
```
注意:分布式训练需要根据你的需求和机器指定正确的分布式训练超参数。此外,你需要根据你的数据、显存情况和训练速度预期,使用`--model_max_length`设定你的数据长度。
### 显存占用及训练速度
下面记录Qwen_VL模型在单GPU使用LoRA(LoRA (Base)指的是embedding和输出层参与训练,而LoRA (Chat)则不优化这部分参数)和QLoRA时处理不同长度输入的显存占用和训练速度的情况。本次评测运行于单张A100-SXM4-80G GPU,使用CUDA 11.8和Pytorch 2.0。我们统一使用batch size为1,gradient accumulation为8的训练配置,每个样本包含一张图,分别记录输入长度分别为384、512、1024和2048的显存占用(GB)和训练速度(s/iter)。具体数值如下所示:
<table>
<tr>
<th rowspan="2">Method</th><th colspan="4" align="center">Sequence Length</th>
</tr>
<tr>
<th align="center">384</th><th align="center">512</th><th align="center">1024</th><th align="center">2048</th>
</tr>
<tr>
<td>LoRA (Base)</td><td align="center">37.1G / 2.3s/it</td><td align="center">37.3G / 2.4s/it</td><td align="center">38.7G / 3.6s/it</td><td align="center">38.7G / 6.1s/it</td>
</tr>
<tr>
<td>LoRA (Chat)</td><td align="center">23.3G / 2.2s/it</td><td align="center">23.6G / 2.3s/it</td><td align="center">25.1G / 3.5s/it</td><td align="center">27.3G / 5.9s/it</td>
</tr>
<tr>
<td>Q-LoRA</td><td align="center">17.0G / 4.2s/it</td><td align="center">17.2G / 4.5s/it</td><td align="center">18.2G / 5.5s/it</td><td align="center">19.3G / 7.9s/it</td>
</tr>
</table>
<br><br>
## Demo
### Web UI
我们提供了Web UI的demo供用户使用。在开始前,确保已经安装如下代码库:
```
pip install -r requirements_web_demo.txt
```
随后运行如下命令,并点击生成链接:
```
python web_demo_mm.py
```
<br>
## FAQ
如遇到问题,敬请查阅 [FAQ](FAQ_zh.md)以及issue区,如仍无法解决再提交issue。
<br>
## 使用协议
研究人员与开发者可使用Qwen-VL和Qwen-VL-Chat或进行二次开发。我们同样允许商业使用,具体细节请查看[LICENSE](LICENSE)。如需商用,请填写[问卷](https://dashscope.console.aliyun.com/openModelApply/qianwen)申请。
<br>
## 引用
如果你觉得我们的论文和代码对你的研究有帮助,请考虑:star: 和引用 :pencil: :)
```BibTeX
@article{Qwen-VL,
title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
journal={arXiv preprint arXiv:2308.12966},
year={2023}
}
```
<br>
## 联系我们
如果你想给我们的研发团队和产品团队留言,请通过邮件(qianwen_opensource@alibabacloud.com)联系我们。
# Qwen-VL-Chat使用教程
Qwen-VL-Chat是通用多模态大规模语言模型,因此它可以完成多种视觉语言任务。在本教程之中,我们会给出一些简明的例子,用以展示Qwen-VL-Chat在**视觉问答,文字理解,图表数学推理,多图理解和Grounding**(根据指令标注图片中指定区域的包围框)等多方面的能力。请注意,展示的例子远非Qwen-VL-Chat能力的极限,**您可以通过更换不同的输入图像和提示词(Prompt),来进一步挖掘Qwen-VL-Chat的能力!**
## 初始化Qwen-VL-Chat模型
在使用Qwen-VL-Chat之前,您首先需要初始化Qwen-VL-Chat的分词器(Tokenizer)和Qwen-VL-Chat的模型:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
# 如果您希望结果可复现,可以设置随机数种子。
# torch.manual_seed(1234)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval()
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
```
在执行完上述代码后,```tokenizer```将对应Qwen-VL-Chat使用的分词器,而```model```将对应Qwen-VL-Chat的模型。```tokenizer```用于对图文混排输入进行分词和预处理,而```model```则是Qwen-VL-Chat模型本身。
## 使用Qwen-VL-Chat
### **多轮视觉问答**
#### **第一个问题**
首先我们来看一个最简单的例子,如下图所示,文件```assets/mm_tutorial/Rebecca_(1939_poster).jpeg```是1940年电影Rebecca的于1939发布的海报。
![](assets/mm_tutorial/Rebecca_(1939_poster)_Small.jpeg)
我们来问一问Qwen-VL-Chat海报上电影的名称是什么。首先,我们使用tokenizer.from_list_format可以对图文混排输入进行分词与处理:
```python
query = tokenizer.from_list_format([
{'image': 'assets/mm_tutorial/Rebecca_(1939_poster).jpeg'},
{'text': 'What is the name of the movie in the poster?'},
])
```
接下来,我们可以使用```model.chat```向Qwen-VL-Chat模型提问并获得回复。注意在第一次提问时,对话历史为空,因此我们使用```history=None```
```python
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```
您应该会得到类似下列的输出结果:
> The name of the movie in the poster is "Rebecca."
这说明模型正确的回答了问题!根据海报,该电影的名称的确是**Rebecca**
#### **多轮问答**
我们还可以继续向模型发问,例如询问电影的导演是谁。在后续提问时,对话历史并不为空,我们使用```history=history``````model.chat```传递之前的对话历史:
```python
query = tokenizer.from_list_format([
{'text': 'Who directed this movie?'},
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)
```
您应该会得到类似下列的输出结果:
> The movie "Rebecca" was directed by Alfred Hitchcock.
模型再次正确回答了问题!根据海报,该电影的导演是Alfred Hitchcock。
### **文字理解**
Qwen-VL-Chat具有一定的针对包含密集文字图片的理解能力。如下图所示,文件```assets/mm_tutorial/Hospital.jpeg```是一个包含密集文字的医院指示牌。
![](assets/mm_tutorial/Hospital_Small.jpg)
我们可以像之前一样向模型询问医院中各个科室的位置,对话历史为空,因此我们使用```history=None```
```python
query = tokenizer.from_list_format([
{'image': 'assets/mm_tutorial/Hospital.jpg'},
{'text': 'Based on the photo, which floor is the Department of Otorhinolaryngology on?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```
您应该会得到类似下列的输出结果:
> The Department of Otorhinolaryngology is located on the 4th floor.
您同样可以进一步提出后续问题,此时需要使用```history=history``````model.chat```传递之前的对话历史。
```python
query = tokenizer.from_list_format([
{'text': 'Based on the photo, which floor is the Department of Surgery on?'},
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)
```
您应该会得到类似下列的输出结果:
> The Department of Surgery is located on the 3rd floor.
### **图表数学推理**
利用模型的图表理解和数学推理能力,Qwen-VL-Chat还可以完成更复杂的一些任务!如下图所示,文件```assets/mm_tutorial/Menu.jpeg```展示了一家餐厅的菜单。现在我们想知道,如果购买两个Salmon Burger和三个Meat Lover's Pizza需要花多少钱呢?
![](assets/mm_tutorial/Menu.jpeg)
```python
query = tokenizer.from_list_format([
{'image': 'assets/mm_tutorial/Menu.jpeg'},
{'text': 'How much would I pay if I want to order two Salmon Burger and three Meat Lover\'s Pizza? Think carefully step by step.'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```
```Think carefully step by step.```是一个引导模型分步处理复杂任务的常见提示词,如果您需要完成的任务较为复杂,可以试着使用它来提高准确率。您应该会得到类似下列的输出结果:
> To order two Salmon Burgers and three Meat Lover's Pizzas, you would need to pay the following:
>
> 1. For two Salmon Burgers: x2 Salmon Burgers at $10 each = $20
> 2. For three Meat Lover's Pizzas: x3 Meat Lover's Pizzas at $12 each = $36
>
> Therefore, the total cost would be $56.
### **多图理解与中文输入**
在之前的例子中,我们主要展示了Qwen-VL-Chat针对单张图像和英文问题的问答能力。但实际上,Qwen-VL-Chat是支持中文输入的多语言模型,而且也支持多张图片的输入!下面的例子中,我们用中文让Qwen-VL-Chat来为我们比较重庆和北京这两个城市的照片(```assets/mm_tutorial/Chongqing.jpeg```和```assets/mm_tutorial/Beijing.jpeg```):
![](assets/mm_tutorial/Chongqing_Small.jpeg)
![](assets/mm_tutorial/Beijing_Small.jpeg)
```python
query = tokenizer.from_list_format([
{'image': 'assets/mm_tutorial/Chongqing.jpeg'},
{'image': 'assets/mm_tutorial/Beijing.jpeg'},
{'text': '上面两张图片分别是哪两个城市?请对它们进行对比。'},
])
torch.manual_seed(5678)
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```
您应该会得到类似下列的输出结果:
> 第一张图片是重庆的城市天际线,它反映了现代都市的繁华与喧嚣。第二张图片是北京的天际线,它象征着中国首都的现代化和国际化。两座城市都是中国的重要城市,拥有独特的文化和发展历史。
**请注意,城市间的比较是一个具有相当主观性的问题,因此模型产生的回复可能具有相当高的随机性。若不使用```torch.manual_seed(5678)```设置随机数种子,每次的输出结果会不一样。即使您设置了随机数种子,由于软硬件环境的差异,得到的结果也可能与本文档中的有所不同。**
### **Grounding能力**
在最后,我们展示Qwen-VL-Chat模型产生包围框的能力。Qwen-VL-Chat可以根据您的语言描述,在图像中用矩形框框出指定区域。这样说可能有些抽象,让我们来看下面的例子。如下图所示,文件```assets/mm_tutorial/Shanghai.jpg```是上海的一张照片,我们先用常规的提示词,问一下模型图里有什么。
![](assets/mm_tutorial/Shanghai_Small.jpeg)
```python
torch.manual_seed(1234)
query = tokenizer.from_list_format([
{'image': 'assets/mm_tutorial/Shanghai.jpg'},
{'text': '图里有啥'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```
您应该会得到类似下列的输出结果:
> 图中是中国上海的天际线,包括了上海塔、金茂大厦、上海环球金融中心、海洋大厦等著名建筑。
接下来,我们通过使用```请给我框出图中上海环球金融中心和东方明珠```这个提示词来和模型对话,看看会发生什么。注意此时需要使用```history=history```向```model.chat```传递之前的对话历史。
```python
query = tokenizer.from_list_format([
{'text': '请给我框出图中上海环球金融中心和东方明珠'},
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)
```
您应该会得到类似下列的输出结果:
```xml
<ref>上海环球金融中心</ref><box>(667,437),(760,874)</box><ref>东方明珠</ref><box>(506,75),(582,946)</box>
```
Qwen-VL-Chat模型没有手,但也没有拒绝您的请求,而是输出了一些“奇怪”的东西——并不是,实际上,模型的输出以标记语言的形式给出了上海环球金融中心和东方明珠在图中的具体位置。您可以使用下列代码将其可视化:
```python
image = tokenizer.draw_bbox_on_latest_picture(response, history)
image.save('Shanghai_Output.jpg')
```
保存下来的```Shanghai_Output.jpg```结果将类似于下面的截图:
![](assets/mm_tutorial/Shanghai_Output_Small.jpeg)
在此之后,您还可以继续照常和Qwen-VL-Chat对话:
```python
query = tokenizer.from_list_format([
{'text': '帮我写个这座城市的旅游计划'},
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)
```
您应该会得到类似下列的输出结果:
> 好的,以下是一个简单的上海旅游计划:
>
> 第一天:
> 上午:抵达上海,前往酒店办理入住手续。
> 中午:享用午餐后,前往外滩,欣赏黄浦江畔的美景,游览上海地标性建筑如浦发银行大楼、汇丰银行大楼等。
> 下午:游览南京路步行街,购买特色礼品或品尝当地美食。
> 晚上:在南京路附近的餐厅享用晚餐,然后去看上海的夜景。
>
> 第二天:
> 上午:前往上海科技馆,了解科技发展历史,观看各种科技展览。
> 中午:在科技馆附近的餐厅享用午餐。
> 下午:游览世纪公园,欣赏美景并放松身心。
> 晚上:在南京路或附近的陆家嘴地区享用晚餐,然后去看上海的夜景。
>
> 第三天:
> 上午:游览上海迪士尼乐园或上海海昌海洋公园,与各种迪士尼角色互动,或者在海洋公园观看海洋生物表演。
> 中午:在迪士尼乐园或海洋公园附近的餐厅享用午餐。
> 下午:自由活动,可以去购物、品尝当地美食或者去博物馆等。
> 晚上:在酒店附近享用晚餐,然后离开上海。
>
> 当然,以上只是一个简单的计划,上海有许多其他景点和活动,例如参观上海博物馆、游览田子坊、观看上海话剧等。具体计划可以根据个人兴趣和时间进行调整。
**请注意,旅游计划是一个具有相当主观性的问题,因此模型产生的回复可能具有相当高的随机性。若不使用```torch.manual_seed(1234)```设置随机数种子,每次的输出结果会不一样。即使您设置了随机数种子,由于软硬件环境的差异,得到的结果也可能与本文档中的有所不同。**
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment