name: "\U0001F41B Bug Report"
description: Submit a bug report to help us improve CogVLM1.5 / 提交一个 Bug 问题报告来帮助我们改进 CogVLM1.5 模型
body:
- type: textarea
id: system-info
attributes:
label: System Info / 系統信息
description: Your operating environment / 您的运行环境信息
placeholder: Includes Cuda version, Transformers version, Python version, operating system, hardware information (if you suspect a hardware problem)... / 包括Cuda版本,Transformers版本,Python版本,操作系统,硬件信息(如果您怀疑是硬件方面的问题)...
validations:
required: true
- type: textarea
id: who-can-help
attributes:
label: Who can help? / 谁可以帮助到您?
description: |
Your issue will be replied to more quickly if you can figure out the right person to tag with @
All issues are read by one of the maintainers, so if you don't know who to tag, just leave this blank and our maintainer will ping the right person.
Please tag fewer than 3 people.
如果您能找到合适的标签 @,您的问题会更快得到回复。
所有问题都会由我们的维护者阅读,如果您不知道该标记谁,只需留空,我们的维护人员会找到合适的开发组成员来解决问题。
标记的人数不应超过 3 人。
Related demo leader / 相关demo负责人:
- composite_demo: @zR
- openai_demo: @zR
If the bug is not in these subsections, you do not need to tag anyone; our maintainers will find the right person in the development group to solve the problem.
如果不是这些子版块的bug,您可以不指明帮助者,我们的维护人员会找到合适的开发组成员来解决问题。
placeholder: "@Username ..."
- type: checkboxes
id: information-scripts-examples
attributes:
label: Information / 问题信息
description: 'The problem arises when using: / 问题出现在'
options:
- label: "The official example scripts / 官方的示例脚本"
- label: "My own modified scripts / 我自己修改的脚本和任务"
- type: textarea
id: reproduction
validations:
required: true
attributes:
label: Reproduction / 复现过程
description: |
Please provide a code example that reproduces the problem you encountered, preferably with a minimal reproduction unit.
If you have code snippets, error messages, stack traces, please provide them here as well.
Please format your code correctly using code tags. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
Do not use screenshots, as they are difficult to read and (more importantly) do not allow others to copy and paste your code.
请提供能重现您遇到的问题的代码示例,最好是最小复现单元。
如果您有代码片段、错误信息、堆栈跟踪,也请在此提供。
请使用代码标签正确格式化您的代码。请参见 https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
请勿使用截图,因为截图难以阅读,而且(更重要的是)不允许他人复制粘贴您的代码。
placeholder: |
Steps to reproduce the behavior/复现Bug的步骤:
1.
2.
3.
- type: textarea
id: expected-behavior
validations:
required: true
attributes:
label: Expected behavior / 期待表现
description: "A clear and concise description of what you would expect to happen. /简单描述您期望发生的事情。"
name: "\U0001F680 Feature request"
description: Submit a request for a new CogVLM1.5 feature / 提交一个新的 CogVLM1.5 模型的功能建议
labels: [ "feature" ]
body:
- type: textarea
id: feature-request
validations:
required: true
attributes:
label: Feature request / 功能建议
description: |
A brief description of the proposed feature. Links to relevant papers and code are welcome.
对功能建议的简述。最好提供对应的论文和代码链接
- type: textarea
id: motivation
validations:
required: true
attributes:
label: Motivation / 动机
description: |
Your motivation for making the suggestion. If that motivation is related to another GitHub issue, link to it here.
您提出建议的动机。如果该动机与另一个 GitHub 问题有关,请在此处提供对应的链接。
- type: textarea
id: contribution
validations:
required: true
attributes:
label: Your contribution / 您的贡献
description: |
Your PR link or any other link you can help with.
您的PR链接或者其他您能提供帮助的链接。
# Raise valuable PR / 提出有价值的PR
## Caution / 注意事项
Users should keep the following points in mind when submitting PRs:
1. The proposed PR should be about this project.
2. The proposed PR should be focused; if you have multiple ideas or optimizations, split them into separate PRs.
用户在提交PR时候应该注意以下几点:
1. 提出的PR应该是关于本项目的。
2. 提出的PR应该具有针对性,如果具有多个不同的想法和优化方案,应该分配到不同的PR中。
## PRs that should not be proposed / 不应该提出的PR
A PR may be closed or rejected if it falls into any of the following categories:
1. It does not describe the proposed improvement.
2. It combines multiple issues of different types in one PR.
3. It largely duplicates an existing PR.
如果开发者提出关于以下方面的PR,则可能会被直接关闭或拒绝通过。
1. 没有说明改进方案的。
2. 多个不同类型的问题合并在一个PR中的。
3. 提出的PR与已经存在的PR高度重复的。
# Check your PR / 检查您的PR
- [ ] Have you read the Contributor Guidelines, Pull Request section? / 您是否阅读了贡献者指南、Pull Request 部分?
- [ ] Has this been discussed/approved via a Github issue or forum? If so, add a link. / 是否通过 Github 问题或论坛讨论/批准过?如果是,请添加链接。
- [ ] Did you make sure you updated the documentation with your changes? Here are the Documentation Guidelines, and here are the Documentation Formatting Tips. /您是否确保根据您的更改更新了文档?这里是文档指南,这里是文档格式化技巧。
- [ ] Did you write new required tests? / 您是否编写了新的必要测试?
- [ ] Is your PR focused on only one issue? / 您的PR是否仅针对一个问题?
venv
.DS_Store
model_image
model_video
*.idea/
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 2024 CogVLM Model Team @ Zhipu AI
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
The CogVLM License
1. Definitions
“Licensor” means the CogVLM Model Team that distributes its Software.
“Software” means the CogVLM model parameters made available under this license.
2. License Grant
Under the terms and conditions of this license, the Licensor hereby grants you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty-free copyright license.
This license permits you to use all open-source models in this repository free of charge for academic research. Users who wish to use the models for commercial purposes must register [here](https://open.bigmodel.cn/mla/form).
Registered users may use the models for commercial activities free of charge, but must comply with all terms and conditions of this license.
The license notice shall be included in all copies or substantial portions of the Software.
3. Restriction
You will not use, copy, modify, merge, publish, distribute, reproduce, or create derivative works of the Software, in whole or in part, for any military, or illegal purposes.
You will not use the Software for any act that may undermine China's national security and national unity, harm the public interest of society, or infringe upon the rights and interests of human beings.
4. Disclaimer
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
5. Limitation of Liability
EXCEPT TO THE EXTENT PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER BASED IN TORT, NEGLIGENCE, CONTRACT, LIABILITY, OR OTHERWISE WILL ANY LICENSOR BE LIABLE TO YOU FOR ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES, OR ANY OTHER COMMERCIAL LOSSES, EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
6. Dispute Resolution
This license shall be governed and construed in accordance with the laws of People’s Republic of China. Any dispute arising from or in connection with this License shall be submitted to Haidian District People's Court in Beijing.
Note that the license is subject to update to a more comprehensive version. For any questions related to the license and copyright, please contact us at license@zhipuai.cn.
7. Llama3 and EVA-CLIP2 License
For the CogVLM2 open-source models that use the Llama3 series as the base model, the Llama3 license terms (https://llama.meta.com/llama3/license/, a copy of which is included in this repository) and the EVA-CLIP2 license terms (MIT, https://github.com/baaivision/EVA/blob/master/LICENSE) apply to the model weights.
1. 定义
“许可方”是指分发其软件的 CogVLM 模型团队。
“软件”是指根据本许可提供的 CogVLM 模型参数。
2. 许可授予
根据本许可的条款和条件,许可方特此授予您非排他性、全球性、不可转让、不可再许可、可撤销、免版税的版权许可。
本许可允许您免费使用本仓库中的所有开源模型进行学术研究,对于希望将模型用于商业目的的用户,需在[这里](https://open.bigmodel.cn/mla/form)完成登记。
经过登记的用户可以免费使用本模型进行商业活动,但必须遵守本许可的所有条款和条件。
上述版权声明和本许可声明应包含在本软件的所有副本或重要部分中。
3.限制
您不得出于任何军事或非法目的使用、复制、修改、合并、发布、分发、复制或创建本软件的全部或部分衍生作品。
您不得利用本软件从事任何危害国家安全和国家统一、危害社会公共利益、侵犯人身权益的行为。
4.免责声明
本软件“按原样”提供,不提供任何明示或暗示的保证,包括但不限于对适销性、特定用途适用性和非侵权性的保证。在任何情况下,作者或版权持有人均不对任何索赔、损害或其他责任负责,无论是在合同诉讼、侵权行为还是其他方面,只要该等责任是因本软件、本软件的使用或其他涉及本软件的交易而引起或与之相关。
5. 责任限制
除适用法律禁止的范围外,在任何情况下且根据任何法律理论,无论是基于侵权行为、疏忽、合同、责任或其他原因,任何许可方均不对您承担任何直接、间接、特殊、偶然、示范性、 或间接损害,或任何其他商业损失,即使许可人已被告知此类损害的可能性。
6.争议解决
本许可受中华人民共和国法律管辖并按其解释。 因本许可引起的或与本许可有关的任何争议应提交北京市海淀区人民法院。
请注意,许可证可能会更新到更全面的版本。 有关许可和版权的任何问题,请通过 license@zhipuai.cn 与我们联系。
7. Llama3 和 EVA-CLIP2 许可
针对以 Llama3 系列模型作为基座模型的 CogVLM2 开源模型,Llama3 许可条件(https://llama.meta.com/llama3/license/ ,本仓库中包含一份副本)和 EVA-CLIP2 许可条件(MIT,https://github.com/baaivision/EVA/blob/master/LICENSE)适用于模型权重。
# CogVLM2
CogVLM2 is an open-source multimodal large language model that aims to close the gap between open-source models and commercial proprietary models in multimodal understanding. With 19B parameters it achieves performance comparable to GPT-4V and can be used for OCR, video understanding, and document question answering.
## Paper
- [CogVLM: Visual Expert for Pretrained Language Models](https://arxiv.org/abs/2311.03079)
## Model Architecture
CogVLM2 inherits and refines the proven architecture of the previous generation: a powerful vision encoder with about 5 billion parameters, plus a 7-billion-parameter visual expert module integrated into the large language model. Through its own dedicated parameters, the visual expert models the interaction between the visual and language sequences in detail, strengthening visual understanding without weakening the model's original language capabilities. This deep-fusion strategy couples the visual and language modalities far more tightly than shallow alignment. The CogVLM model consists of four basic components: a ViT encoder, an MLP adapter, a pretrained GPT-style large language model, and the visual expert module.
<div align="center">
<img src="./images/architecture.png"/>
</div>
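The sketch below is a minimal, hypothetical PyTorch illustration of the visual-expert idea described above; it is not the actual CogVLM2 implementation, and all module and parameter names are made up. Image-token positions are routed through their own QKV projection, while text-token positions use the original language-model projection.
```python
import torch
import torch.nn as nn

class VisualExpertAttention(nn.Module):
    """Illustrative sketch: separate QKV projections for image vs. text tokens."""
    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads  # hidden_size must be divisible by num_heads
        self.qkv_text = nn.Linear(hidden_size, 3 * hidden_size)   # original LM projection
        self.qkv_image = nn.Linear(hidden_size, 3 * hidden_size)  # visual-expert projection
        self.out_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states: torch.Tensor, token_type_ids: torch.Tensor) -> torch.Tensor:
        # token_type_ids: [B, L], 1 marks image-token positions, 0 marks text tokens.
        is_image = token_type_ids.unsqueeze(-1).bool()
        qkv = torch.where(is_image, self.qkv_image(hidden_states), self.qkv_text(hidden_states))
        q, k, v = qkv.chunk(3, dim=-1)
        B, L, _ = hidden_states.shape
        q = q.view(B, L, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, L, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, L, self.num_heads, self.head_dim).transpose(1, 2)
        out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, L, -1)
        return self.out_proj(out)
```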
## Method
The main weakness of the popular shallow-alignment approaches is the lack of deep fusion between visual and language information. CogVLM instead introduces a trainable visual expert module in the attention and FFN layers: image features and text features are processed separately, with a new set of QKV matrices and MLP layers at every layer. The visual expert bridges the gap between the pretrained language model and the image encoder, preserving the language model's strong performance while keeping inference efficient.
To better process and understand high-resolution documents and web-page images, CogVLM2 supports image inputs at resolutions up to 1344 × 1344. To handle such high-resolution images efficiently, a dedicated downsampling module is placed after the vision encoder. It extracts the key information from the visual sequence and greatly shortens the sequence fed into the language model, which noticeably speeds up inference while preserving accuracy, striking a balance between performance and efficiency.
<div align=center>
<img src="./images/theory.png"/>
</div>
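As a rough, hypothetical sketch of the downsampling idea (again not the real module), a strided 2-D convolution over the grid of ViT patch features shortens the visual sequence before it enters the language model:
```python
import torch
import torch.nn as nn

class VisionDownsample(nn.Module):
    """Sketch: shorten the visual token sequence with a strided 2-D convolution."""
    def __init__(self, vision_dim: int, hidden_size: int, stride: int = 2):
        super().__init__()
        self.conv = nn.Conv2d(vision_dim, hidden_size, kernel_size=stride, stride=stride)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: [B, N, C], where N = H*W patch tokens from the ViT encoder
        # (assumes N is a perfect square so the tokens form a square grid).
        B, N, C = patch_features.shape
        side = int(N ** 0.5)
        x = patch_features.transpose(1, 2).reshape(B, C, side, side)
        x = self.conv(x)                     # [B, hidden, side/stride, side/stride]
        return x.flatten(2).transpose(1, 2)  # [B, N/stride^2, hidden]: 4x fewer tokens for stride=2
```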
## Environment Setup
### Docker (Option 1)
Pull the Docker image from [光源 (SourceFind)](https://www.sourcefind.cn/#/service-details) and use it as follows:
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
docker run -it -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro --shm-size=128G --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name cogvlm2 <your imageID> bash
cd /path/your_code_data/
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```
### Dockerfile (Option 2)
```
cd /path/your_code_data/docker
docker build --no-cache -t cogvlm2:latest .
docker run --shm-size=128G --name cogvlm2 -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v /path/your_code_data/:/path/your_code_data/ -it cogvlm2 bash
```
### Anaconda (Option 3)
The DCU-specific deep learning libraries required by this project can be downloaded from the [光合 (HPC Cube)](https://developer.hpccube.com/tool/) developer community.
```
DTK driver: dtk24.04.1
python: 3.10
torch: 2.1
torchvision: 0.16.0
deepspeed: 0.12.3
```
`Tips: the versions of the DTK driver, Python, torch and other DCU-related tools listed above must match each other exactly.`
```
conda create -n cogvlm2 python=3.10
conda activate cogvlm2
cd /path/your_code_data/
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple
```
## Dataset
Test dataset: [CogVLM-SFT-311K](https://huggingface.co/datasets/THUDM/CogVLM-SFT-311K)
The data format is as follows:
The dataset consists of an images folder and a labels folder (in CogVLM-SFT-311K these are labels_en and labels_zh, corresponding to English and Chinese labels). In the fine-tuning code, you can change the folder names by modifying the corresponding two lines of code.
```
CogVLM-SFT-311K/
├── images
│   └── .....
├── labels_en
│   └── .....
└── labels_zh
    └── .....
```
The images folder contains the image files and the labels folders contain the corresponding label files; image and label files correspond one-to-one by name. Images are stored as jpg files and labels as json files.
Each label file contains a single dialogue between a user and an assistant role; each turn consists of a role field and a content field, as shown below.
```
{
"conversations": [
{
"role": "user",
"content": "What can be inferred about the zebras' behavior and surroundings?"
},
{
"role": "assistant",
"content": "Based on the image, we can infer that the two zebras are likely seeking relief from the sun's heat, as they are standing side by side under the branches of a thorny tree. This shade-providing tree offers some respite from the sun, possibly during the hottest part of the day. The zebras are in a green field with grass, providing them with an ideal environment to graze and eat while staying near their source of shelter. This shows that the zebras' behavior is influenced by the conditions and available resources in their surroundings. It also highlights that these animals adopt strategies to adapt to the fluctuating conditions of their environment, such as cooperation and seeking shelter, to survive and thrive in their natural habitat."
}
]
}
```
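Below is a minimal loading sketch based on the folder layout and field names described above; the helper function and the `labels_en` default are illustrative, so adapt the paths to your own copy of the dataset.
```python
import json
import os

def load_samples(dataset_root: str, label_dir: str = "labels_en"):
    """Yield (image_path, conversations) pairs from a CogVLM-SFT-311K style folder."""
    image_dir = os.path.join(dataset_root, "images")
    for name in sorted(os.listdir(os.path.join(dataset_root, label_dir))):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(dataset_root, label_dir, name), encoding="utf-8") as f:
            label = json.load(f)
        # Image and label files share the same base name; images are .jpg, labels are .json.
        image_path = os.path.join(image_dir, name.replace(".json", ".jpg"))
        yield image_path, label["conversations"]

# Example usage:
# for image_path, conversations in load_samples("CogVLM-SFT-311K"):
#     print(image_path, conversations[0]["content"])
```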
## Training
## Inference
### Single-GPU inference
Before running, modify the model path in the demo script, e.g. `MODEL_PATH = "/home/wanglch/projects/CogVLM2/cogvlm2-llama3-chinese-chat-19B"`, then run:
```
sh cli_demo.sh
```
### Web inference
Modify the model path in `web_demo.py`, then run:
```
sh web_demo.sh
```
## Results
### OCR
<div align=center>
<img src="./images/result2.png"/>
</div>
### Q&A
<div align=center>
<img src="./images/result1.png"/>
</div>
### Accuracy
N/A
## Application Scenarios
### Algorithm Category
`OCR`
### Key Application Industries
`Finance, Education, Transportation, Government`
## Pretrained Weights
- [THUDM/cogvlm2-llama3-chinese-chat-19B](https://modelscope.cn/models/Duxiaoman-DI/XuanYuan-13B-Chat/files)
Fast download center for pretrained weights: [SCNet AIModels](http://113.200.138.88:18080/aimodels)
The pretrained weights for this project can be downloaded from the fast-download channel: [THUDM/cogvlm2-llama3-chinese-chat-19B](http://113.200.138.88:18080/aimodels/cogvlm2-llama3-chinese-chat-19B)
## Source Repository and Issue Feedback
- https://developer.hpccube.com/codes/modelzoo/cogvlm2_pytorch
## References
- [CogVLM: Visual Expert for Pretrained Language Models](https://arxiv.org/abs/2311.03079)
# Basic Demo
[中文版README](./README_zh.md)
### Minimum Requirements
Python: 3.10.12 or above
OS: A Linux operating system with an NVIDIA GPU is recommended to avoid installation issues with the `xformers` library.
GPU requirements are as shown in the table below:
| Model Name | 19B Series Model | Remarks |
|--------------------------------------------|-------------------------------------------|-------------------------------|
| BF16 inference | 42GB | Tested with 2K dialogue text |
| Int4 inference | 16GB | Tested with 2K dialogue text |
| BF16 LoRA tuning (with vision expert) | 73GB (8× A100 80GB GPUs with ZeRO-2) | Trained with 2K dialogue text |
Before running any code, make sure you have all dependencies installed. You can install all dependency packages with the
following command:
```shell
pip install -r requirements.txt
```
## Using CLI Demo
Run this code to start a conversation at the command line. Please note that the model must be loaded on a GPU.
```shell
CUDA_VISIBLE_DEVICES=0 python cli_demo.py
```
If you want to use `int4` (or `int8`) quantization, please use
```shell
CUDA_VISIBLE_DEVICES=0 python cli_demo.py --quant 4
```
If you have multiple GPUs, you can use the following command to load the model across several GPUs, distributing its layers among them.
```shell
python cli_demo_multi_gpus.py
```
In `cli_demo_multi_gpus.py`, we use the `infer_auto_device_map` function to automatically allocate different layers of
the model to different GPUs. You need to set the `max_memory` parameter to specify the maximum memory for each GPU. For
example, if you have two GPUs, each with 23GiB of memory, you can set it like this:
```python
device_map = infer_auto_device_map(
model=model,
max_memory={i: "23GiB" for i in range(torch.cuda.device_count())},
# set 23GiB for each GPU, depends on your GPU memory, you can adjust this value
no_split_module_classes=["CogVLMDecoderLayer"]
)
```
## Using Web Demo
Run this code to start a conversation in the WebUI.
```shell
chainlit run web_demo.py
```
If you want to use int4 or int8 quantization, you can launch it like this:
```shell
CUDA_VISIBLE_DEVICES=0 QUANT=4 chainlit run web_demo.py
```
After starting the conversation, you will be able to interact with the model, as shown below:
<img src="../resources/web_demo.png" alt="web_demo" width="600" />
## Using OpenAI API format
We provide a simple example that launches the model with the command below. After that, you can request conversations with the model in the OpenAI API format (optionally pass `--quant 4`).
```shell
python openai_api_demo.py
```
Developers can call the model through the following code:
```shell
python openai_api_request.py
```
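Because the server speaks the OpenAI API format and `openai>=1.30.1` is already in `requirements.txt`, you can also query it with the official `openai` Python client instead of `openai_api_request.py`. The snippet below is a minimal sketch that assumes the server from `openai_api_demo.py` is running locally on port 8000 and that a `demo.jpg` exists in the current directory:
```python
import base64
from openai import OpenAI

# The demo server does not check the API key, but the client requires a non-empty value.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

with open("demo.jpg", "rb") as f:
    image_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="cogvlm2-19b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)
```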
# Basic Demo
[Read this in English.](./README.md)
### 最低配置要求
Python: 3.10.12 以上版本。
操作系统: 建议使用 Linux 操作系统运行以避免`xformers`库安装问题。建议使用 NVIDIA GPU 以防止兼容性问题。
GPU要求如下表格所示
| 模型名称 | 19B 系列模型 | 备注 |
|-----------------------------|---------------------------------|--------------|
| BF16 / FP16 推理 | 42GB | 测试对话文本长度为2K |
| Int4 推理 | 16GB | 测试对话文本长度为2K |
| BF16 LoRA 微调(带有视觉专家微调) | 73GB (使用8卡 A100 80G 显卡并使用 ZeRO-2) | 训练对话文本长度为 2K |
在运行任何代码之前,请确保你已经安装好了所有的依赖包。你可以通过以下命令来安装所有的依赖包:
```shell
pip install -r requirements.txt
```
## CLI 调用模型
运行本代码以开始在命令行中对话。请注意,模型必须在一张GPU上载入
```shell
CUDA_VISIBLE_DEVICES=0 python cli_demo.py
```
如果您有多张GPU,您可以通过以下代码执行多卡拉起模型,并将模型的不同层分布在不同的GPU上。
```shell
python cli_demo_multi_gpus.py
```
`cli_demo_multi_gpus.py` 中,我们使用了 `infer_auto_device_map`
函数来自动分配模型的不同层到不同的GPU上。你需要设置 `max_memory` 参数来指定每张GPU的最大内存。例如,如果你有两张GPU,每张GPU的内存为23GiB,你可以这样设置:
```python
device_map = infer_auto_device_map(
model=model,
max_memory={i: "23GiB" for i in range(torch.cuda.device_count())},
# set 23GiB for each GPU, depends on your GPU memory, you can adjust this value
no_split_module_classes=["CogVLMDecoderLayer"]
)
```
## Web端在线调用模型
运行本代码以开始在 WebUI 中对话。
```shell
chainlit run web_demo.py
```
拉起对话后,你将能和模型进行对话,效果如下:
<img src="../resources/web_demo.png" alt="web_demo" width="600" />
## OpenAI API
我们提供了一个简单的示例,通过以下代码拉起模型,之后,您可以使用 OpenAI API格式的方式请求和模型的对话。
```shell
python openai_api_demo.py
```
开发者可以通过以下代码来调用模型:
```shell
python openai_api_request.py
```
## CogVLM2
Welcome to use the CogVLM2. Please note:
+ The model comes in two language versions: a Chinese-English version (cogvlm2-llama3-chinese-chat-19B) and an English-only version (cogvlm2-llama3-chat-19B).
该模型支持两种语言:中文版(cogvlm2-llama3-chinese-chat-19B)和纯英语版(cogvlm2-llama3-chat-19B)。
+ For the first conversation, you must upload a picture. The model only supports dialogue with one picture. Uploading
the next picture will overwrite the previous picture information.
在第一次对话时,您必须上传一张图片。该模型只支持与一张图片对话。上传下一张图片会覆盖之前的图片信息。
+ Do not interrupt the conversation halfway, as this may cause CUDA kernel errors.
请勿在对话中途打断,因为这可能会导致CUDA内核错误。
+ In Firefox browser, there might be issues with Chinese input. Please use Chrome browser.
在Firefox浏览器中可能出现中文问题无法输入,请使用Chrome浏览器。
"""
This is a demo for using CogVLM2 in CLI using Single GPU.
We strongly suggest using a GPU with bfloat16 support; otherwise inference will be slow.
Note that only one picture can be processed per conversation, which means you cannot replace or insert another picture during the conversation.
For multi-GPU inference, please use cli_demo_multi_gpus.py.
"""
import torch
import argparse
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[
0] >= 8 else torch.float16
# Argument parser
parser = argparse.ArgumentParser(description="CogVLM2 CLI Demo")
parser.add_argument('--quant', type=int, choices=[4, 8], help='Enable 4-bit or 8-bit precision loading', default=0)
args = parser.parse_args()
if 'int4' in MODEL_PATH:
args.quant = 4
tokenizer = AutoTokenizer.from_pretrained(
MODEL_PATH,
trust_remote_code=True
)
# Check GPU memory
if torch.cuda.is_available() and torch.cuda.get_device_properties(0).total_memory < 48 * 1024 ** 3 and not args.quant:
print("GPU memory is less than 48GB. Please use cli_demo_multi_gpus.py or pass `--quant 4` or `--quant 8`.")
exit()
# Load the model
if args.quant == 4:
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True,
quantization_config=BitsAndBytesConfig(load_in_4bit=True),
low_cpu_mem_usage=True
).eval()
elif args.quant == 8:
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True,
quantization_config=BitsAndBytesConfig(load_in_8bit=True),
low_cpu_mem_usage=True
).eval()
else:
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True
).eval().to(DEVICE)
while True:
image_path = input("image path >>>>> ")
if image_path == '':
print('You did not enter an image path; the following will be a plain-text conversation.')
image = None
text_only_first_query = True
else:
image = Image.open(image_path).convert('RGB')
history = []
while True:
query = input("Human:")
if query == "clear":
break
if image is None:
input_by_model = model.build_conversation_input_ids(
tokenizer,
query=query,
history=history,
template_version='chat'
)
else:
input_by_model = model.build_conversation_input_ids(
tokenizer,
query=query,
history=history,
images=[image],
template_version='chat'
)
inputs = {
'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if image is not None else None,
}
gen_kwargs = {
"max_new_tokens": 2048,
"pad_token_id": 128002,
"top_k": 1,
}
with torch.no_grad():
outputs = model.generate(**inputs, **gen_kwargs)
outputs = outputs[:, inputs['input_ids'].shape[1]:]
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\nCogVLM2:", response)
history.append((query, response))
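"""
Batch inference demo: walk an image folder, build CogVLM2 conversation inputs for each image,
pad them into batches with a custom collate_fn, and generate a short description for every image.
Set `image_folder` below to your own directory before running.
"""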
import os
import time
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"
TORCH_TYPE = torch.bfloat16
device = 'cuda:0'
tokenizer = AutoTokenizer.from_pretrained(
MODEL_PATH,
trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True,
device_map=device,
# load_in_4bit=True,
# low_cpu_mem_usage=True
).eval()
def recur_move_to(item, tgt, criterion_func):
if criterion_func(item):
device_copy = item.to(tgt)
return device_copy
elif isinstance(item, list):
return [recur_move_to(v, tgt, criterion_func) for v in item]
elif isinstance(item, tuple):
return tuple([recur_move_to(v, tgt, criterion_func) for v in item])
elif isinstance(item, dict):
return {k: recur_move_to(v, tgt, criterion_func) for k, v in item.items()}
else:
return item
def collate_fn(features, tokenizer) -> dict:
images = [feature.pop('images', None) for feature in features if 'images' in feature]
tokenizer.pad_token = tokenizer.eos_token
max_length = max(len(feature['input_ids']) for feature in features)
def pad_to_max_length(feature, max_length):
padding_length = max_length - len(feature['input_ids'])
feature['input_ids'] = torch.cat([feature['input_ids'], torch.full((padding_length,), tokenizer.pad_token_id)])
feature['token_type_ids'] = torch.cat([feature['token_type_ids'], torch.zeros(padding_length, dtype=torch.long)])
feature['attention_mask'] = torch.cat([feature['attention_mask'], torch.zeros(padding_length, dtype=torch.long)])
if feature['labels'] is not None:
feature['labels'] = torch.cat([feature['labels'], torch.full((padding_length,), tokenizer.pad_token_id)])
else:
feature['labels'] = torch.full((max_length,), tokenizer.pad_token_id)
return feature
features = [pad_to_max_length(feature, max_length) for feature in features]
batch = {
key: torch.stack([feature[key] for feature in features])
for key in features[0].keys()
}
if images:
batch['images'] = images
return batch
image_folder = "/path/to/folder"
data = []
for root, dirs, files in os.walk(image_folder):
for file in files:
if file.endswith(('.jpg', '.jpeg', '.png', '.bmp', '.tiff', '.webp')):
data.append({"image": os.path.join(root, file)})
length = len(data)
batch_size = 14
query = 'Describe this image in detail, and the description should be between 15 to 80 words.'
for idx in range(0, length, batch_size):
i_list = []
for i in range(batch_size):
if idx + i < length:
i_list.append(data[idx + i])
else:
break
input_sample_list = []
start = time.time()
for i in i_list:
image = Image.open(i["image"]).convert('RGB')
input_sample = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image], template_version='chat')
input_sample_list.append(input_sample)
print(f"Prepare input time: {time.time() - start}")
start = time.time()
input_batch = collate_fn(input_sample_list, tokenizer)
input_batch = recur_move_to(input_batch, device, lambda x: isinstance(x, torch.Tensor))
input_batch = recur_move_to(input_batch, torch.bfloat16, lambda x: isinstance(x, torch.Tensor) and torch.is_floating_point(x))
print(f"Prepare batch time: {time.time() - start}")
gen_kwargs = {
"max_new_tokens": 2048,
"pad_token_id": 128002,
"top_k": 1,
}
start = time.time()
with torch.no_grad():
outputs = model.generate(**input_batch, **gen_kwargs)
outputs = outputs[:, input_batch['input_ids'].shape[1]:]
outputs = tokenizer.batch_decode(outputs)
outlist = [output.split("<|end_of_text|>")[0].strip() for output in outputs]
print(outlist)
print(f"Generate time: {time.time() - start}")
"""
This is a demo for using CogVLM2 in the CLI with multiple GPUs and lower per-GPU memory.
If a single GPU is not enough to run this model, you can use this demo to run it on multiple graphics cards with limited video memory.
Here we assume each graphics card has 24GB of video memory, which is not enough to load the FP16 / BF16 model on its own,
so two graphics cards are needed to load it. We set '23GiB' for each GPU to avoid running out of memory.
At least 2 GPUs are recommended, each with more than 16GB of video memory.
Tested successfully on 3 GPUs with 16GB of video memory each:
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 1 N/A N/A 1890574 C python 13066MiB |
| 2 N/A N/A 1890574 C python 14560MiB |
| 3 N/A N/A 1890574 C python 11164MiB |
+---------------------------------------------------------------------------------------+
"""
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import init_empty_weights, load_checkpoint_and_dispatch, infer_auto_device_map
MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[
0] >= 8 else torch.float16
tokenizer = AutoTokenizer.from_pretrained(
MODEL_PATH,
trust_remote_code=True
)
with init_empty_weights():
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True,
)
num_gpus = torch.cuda.device_count()
max_memory_per_gpu = "16GiB"
if num_gpus > 2:
max_memory_per_gpu = f"{round(42 / num_gpus)}GiB"
device_map = infer_auto_device_map(
model=model,
max_memory={i: max_memory_per_gpu for i in range(num_gpus)},
no_split_module_classes=["CogVLMDecoderLayer"]
)
model = load_checkpoint_and_dispatch(model, MODEL_PATH, device_map=device_map, dtype=TORCH_TYPE)
model = model.eval()
text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"
while True:
image_path = input("image path >>>>> ")
if image_path == '':
print('You did not enter an image path; the following will be a plain-text conversation.')
image = None
text_only_first_query = True
else:
image = Image.open(image_path).convert('RGB')
history = []
while True:
query = input("Human:")
if query == "clear":
break
if image is None:
if text_only_first_query:
query = text_only_template.format(query)
text_only_first_query = False
else:
old_prompt = ''
for _, (old_query, response) in enumerate(history):
old_prompt += old_query + " " + response + "\n"
query = old_prompt + "USER: {} ASSISTANT:".format(query)
if image is None:
input_by_model = model.build_conversation_input_ids(
tokenizer,
query=query,
history=history,
template_version='chat'
)
else:
input_by_model = model.build_conversation_input_ids(
tokenizer,
query=query,
history=history,
images=[image],
template_version='chat'
)
inputs = {
'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if image is not None else None,
}
gen_kwargs = {
"max_new_tokens": 2048,
"pad_token_id": 128002,
"top_k": 1,
}
with torch.no_grad():
outputs = model.generate(**inputs, **gen_kwargs)
outputs = outputs[:, inputs['input_ids'].shape[1]:]
response = tokenizer.decode(outputs[0])
response = response.split("<|end_of_text|>")[0]
print("\nCogVLM2:", response)
history.append((query, response))
import gc
import threading
import time
import base64
from contextlib import asynccontextmanager
from typing import List, Literal, Union, Tuple, Optional
import torch
import uvicorn
import requests
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from loguru import logger
from pydantic import BaseModel, Field
from sse_starlette.sse import EventSourceResponse
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
from PIL import Image
from io import BytesIO
MODEL_PATH = 'THUDM/cogvlm2-llama3-chat-19B'
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[
0] >= 8 else torch.float16
@asynccontextmanager
async def lifespan(app: FastAPI):
"""
An asynchronous context manager for managing the lifecycle of the FastAPI app.
It ensures that GPU memory is cleared after the app's lifecycle ends, which is essential for efficient resource management in GPU environments.
"""
yield
if torch.cuda.is_available():
torch.cuda.empty_cache()
torch.cuda.ipc_collect()
app = FastAPI(lifespan=lifespan)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
class ModelCard(BaseModel):
"""
A Pydantic model representing a model card, which provides metadata about a machine learning model.
It includes fields like model ID, owner, and creation time.
"""
id: str
object: str = "model"
created: int = Field(default_factory=lambda: int(time.time()))
owned_by: str = "owner"
root: Optional[str] = None
parent: Optional[str] = None
permission: Optional[list] = None
class ModelList(BaseModel):
object: str = "list"
data: List[ModelCard] = []
class ImageUrl(BaseModel):
url: str
class TextContent(BaseModel):
type: Literal["text"]
text: str
class ImageUrlContent(BaseModel):
type: Literal["image_url"]
image_url: ImageUrl
ContentItem = Union[TextContent, ImageUrlContent]
class ChatMessageInput(BaseModel):
role: Literal["user", "assistant", "system"]
content: Union[str, List[ContentItem]]
name: Optional[str] = None
class ChatMessageResponse(BaseModel):
role: Literal["assistant"]
content: str = None
name: Optional[str] = None
class DeltaMessage(BaseModel):
role: Optional[Literal["user", "assistant", "system"]] = None
content: Optional[str] = None
class ChatCompletionRequest(BaseModel):
model: str
messages: List[ChatMessageInput]
temperature: Optional[float] = 0.8
top_p: Optional[float] = 0.8
max_tokens: Optional[int] = None
stream: Optional[bool] = False
# Additional parameters
repetition_penalty: Optional[float] = 1.0
class ChatCompletionResponseChoice(BaseModel):
index: int
message: ChatMessageResponse
class ChatCompletionResponseStreamChoice(BaseModel):
index: int
delta: DeltaMessage
class UsageInfo(BaseModel):
prompt_tokens: int = 0
total_tokens: int = 0
completion_tokens: Optional[int] = 0
class ChatCompletionResponse(BaseModel):
model: str
object: Literal["chat.completion", "chat.completion.chunk"]
choices: List[Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice]]
created: Optional[int] = Field(default_factory=lambda: int(time.time()))
usage: Optional[UsageInfo] = None
@app.get("/v1/models", response_model=ModelList)
async def list_models():
"""
An endpoint to list available models. It returns a list of model cards.
This is useful for clients to query and understand what models are available for use.
"""
model_card = ModelCard(id="cogvlm2-19b")
return ModelList(data=[model_card])
@app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
async def create_chat_completion(request: ChatCompletionRequest):
global model, tokenizer
if len(request.messages) < 1 or request.messages[-1].role == "assistant":
raise HTTPException(status_code=400, detail="Invalid request")
gen_params = dict(
messages=request.messages,
temperature=request.temperature,
top_p=request.top_p,
max_tokens=request.max_tokens or 1024,
echo=False,
stream=request.stream,
repetition_penalty=request.repetition_penalty
)
if request.stream:
generate = predict(request.model, gen_params)
return EventSourceResponse(generate, media_type="text/event-stream")
response = generate_cogvlm(model, tokenizer, gen_params)
usage = UsageInfo()
message = ChatMessageResponse(
role="assistant",
content=response["text"],
)
logger.debug(f"==== message ====\n{message}")
choice_data = ChatCompletionResponseChoice(
index=0,
message=message,
)
task_usage = UsageInfo.model_validate(response["usage"])
for usage_key, usage_value in task_usage.model_dump().items():
setattr(usage, usage_key, getattr(usage, usage_key) + usage_value)
return ChatCompletionResponse(model=request.model, choices=[choice_data], object="chat.completion", usage=usage)
def predict(model_id: str, params: dict):
global model, tokenizer
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=DeltaMessage(role="assistant"),
finish_reason=None
)
chunk = ChatCompletionResponse(model=model_id, choices=[choice_data], object="chat.completion.chunk")
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
previous_text = ""
for new_response in generate_stream_cogvlm(model, tokenizer, params):
decoded_unicode = new_response["text"]
delta_text = decoded_unicode[len(previous_text):]
previous_text = decoded_unicode
delta = DeltaMessage(content=delta_text, role="assistant")
choice_data = ChatCompletionResponseStreamChoice(index=0, delta=delta)
chunk = ChatCompletionResponse(model=model_id, choices=[choice_data], object="chat.completion.chunk")
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
choice_data = ChatCompletionResponseStreamChoice(index=0, delta=DeltaMessage())
chunk = ChatCompletionResponse(model=model_id, choices=[choice_data], object="chat.completion.chunk")
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
def generate_cogvlm(model: AutoModelForCausalLM, tokenizer: AutoTokenizer, params: dict):
"""
Generates a response using the CogVLM2 model. It processes the chat history and image data, if any,
and then invokes the model to generate a response.
"""
response = None
for response in generate_stream_cogvlm(model, tokenizer, params):
pass
return response
def process_history_and_images(messages: List[ChatMessageInput]) -> Tuple[
Optional[str], Optional[List[Tuple[str, str]]], Optional[List[Image.Image]]]:
"""
Process history messages to extract text, identify the last user query,
and convert base64 encoded image URLs to PIL images.
Args:
messages(List[ChatMessageInput]): List of ChatMessageInput objects.
return: A tuple of three elements:
- The last user query as a string.
- Text history formatted as a list of tuples for the model.
- List of PIL Image objects extracted from the messages.
"""
formatted_history = []
image_list = []
last_user_query = ''
for i, message in enumerate(messages):
role = message.role
content = message.content
if isinstance(content, list): # text
text_content = ' '.join(item.text for item in content if isinstance(item, TextContent))
else:
text_content = content
if isinstance(content, list): # image
for item in content:
if isinstance(item, ImageUrlContent):
image_url = item.image_url.url
if image_url.startswith("data:image/jpeg;base64,"):
base64_encoded_image = image_url.split("data:image/jpeg;base64,")[1]
image_data = base64.b64decode(base64_encoded_image)
image = Image.open(BytesIO(image_data)).convert('RGB')
else:
response = requests.get(image_url, verify=False)
image = Image.open(BytesIO(response.content)).convert('RGB')
image_list.append(image)
if role == 'user':
if i == len(messages) - 1:  # the last user message
last_user_query = text_content
else:
formatted_history.append((text_content, ''))
elif role == 'assistant':
if formatted_history:
if formatted_history[-1][1] != '':
assert False, f"the last query is answered. answer again. {formatted_history[-1][0]}, {formatted_history[-1][1]}, {text_content}"
formatted_history[-1] = (formatted_history[-1][0], text_content)
else:
assert False, f"assistant reply before user"
else:
assert False, f"unrecognized role: {role}"
return last_user_query, formatted_history, image_list
@torch.inference_mode()
def generate_stream_cogvlm(model: AutoModelForCausalLM, tokenizer: AutoTokenizer, params: dict):
messages = params["messages"]
temperature = float(params.get("temperature", 1.0))
repetition_penalty = float(params.get("repetition_penalty", 1.0))
top_p = float(params.get("top_p", 1.0))
max_new_tokens = int(params.get("max_tokens", 256))
query, history, image_list = process_history_and_images(messages)
input_by_model = model.build_conversation_input_ids(tokenizer, query=query, history=history,
images=[image_list[-1]])
inputs = {
'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]],
}
if 'cross_images' in input_by_model and input_by_model['cross_images']:
inputs['cross_images'] = [[input_by_model['cross_images'][0].to(DEVICE).to(TORCH_TYPE)]]
input_echo_len = len(inputs["input_ids"][0])
streamer = TextIteratorStreamer(
tokenizer=tokenizer,
timeout=60.0,
skip_prompt=True,
skip_special_tokens=True
)
gen_kwargs = {
"repetition_penalty": repetition_penalty,
"max_new_tokens": max_new_tokens,
"do_sample": True if temperature > 1e-5 else False,
"top_p": top_p if temperature > 1e-5 else 0,
"top_k": 1,
'streamer': streamer,
}
if temperature > 1e-5:
gen_kwargs["temperature"] = temperature
generated_text = ""
def generate_text():
with torch.no_grad():
model.generate(**inputs, **gen_kwargs)
generation_thread = threading.Thread(target=generate_text)
generation_thread.start()
total_len = input_echo_len
for next_text in streamer:
generated_text += next_text
total_len = len(tokenizer.encode(generated_text))
yield {
"text": generated_text,
"usage": {
"prompt_tokens": input_echo_len,
"completion_tokens": total_len - input_echo_len,
"total_tokens": total_len,
},
}
generation_thread.join()
yield {
"text": generated_text,
"usage": {
"prompt_tokens": input_echo_len,
"completion_tokens": total_len - input_echo_len,
"total_tokens": total_len,
},
}
gc.collect()
torch.cuda.empty_cache()
if __name__ == "__main__":
# Argument parser
import argparse
parser = argparse.ArgumentParser(description="CogVLM2 Web Demo")
parser.add_argument('--quant', type=int, choices=[4, 8], help='Enable 4-bit or 8-bit precision loading', default=0)
args = parser.parse_args()
if 'int4' in MODEL_PATH:
args.quant = 4
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
# Load the model
if args.quant == 4:
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True,
load_in_4bit=True,
low_cpu_mem_usage=True
).eval()
elif args.quant == 8:
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True,
load_in_8bit=True, # Assuming transformers support this argument; check documentation if not
low_cpu_mem_usage=True
).eval()
else:
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True
).eval().to(DEVICE)
uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)
"""
This script is designed to mimic the OpenAI API interface with CogVLM2 Chat
It demonstrates how to integrate image and text-based input to generate a response.
Currently, the model can only handle a single image.
Therefore, do not use this script to process multiple images in one conversation. (includes images from history)
And it only works on the chat model, not the base model.
"""
import requests
import json
import base64
base_url = "http://127.0.0.1:8000"
def create_chat_completion(model, messages, temperature=0.8, max_tokens=2048, top_p=0.8, use_stream=False):
"""
This function sends a request to the chat API to generate a response based on the given messages.
Args:
model (str): The name of the model to use for generating the response.
messages (list): A list of message dictionaries representing the conversation history.
temperature (float): Controls randomness in response generation. Higher values lead to more random responses.
max_tokens (int): The maximum length of the generated response.
top_p (float): Controls diversity of response by filtering less likely options.
use_stream (bool): Determines whether to use a streaming response or a single response.
The function constructs a JSON payload with the specified parameters and sends a POST request to the API.
It then handles the response, either as a stream (for ongoing responses) or a single message.
"""
data = {
"model": model,
"messages": messages,
"stream": use_stream,
"max_tokens": max_tokens,
"temperature": temperature,
"top_p": top_p,
}
response = requests.post(f"{base_url}/v1/chat/completions", json=data, stream=use_stream)
if response.status_code == 200:
if use_stream:
# Handle the streaming response
for line in response.iter_lines():
if line:
decoded_line = line.decode('utf-8')[6:]
try:
response_json = json.loads(decoded_line)
content = response_json.get("choices", [{}])[0].get("delta", {}).get("content", "")
print(content)
except:
print("Special Token:", decoded_line)
else:
# Handle the non-streaming response
decoded_line = response.json()
content = decoded_line.get("choices", [{}])[0].get("message", "").get("content", "")
print(content)
else:
print("Error:", response.status_code)
return None
def encode_image(image_path):
"""
Encodes an image file into a base64 string.
Args:
image_path (str): The path to the image file.
This function opens the specified image file, reads its content, and encodes it into a base64 string.
The base64 encoding is used to send images over HTTP as text.
"""
with open(image_path, "rb") as image_file:
return base64.b64encode(image_file.read()).decode("utf-8")
def simple_image_chat(use_stream=True, img_path=None):
"""
Facilitates a simple chat interaction involving an image.
Args:
use_stream (bool): Specifies whether to use streaming for chat responses.
img_path (str): Path to the image file to be included in the chat.
This function encodes the specified image and constructs a predefined conversation involving the image.
It then calls `create_chat_completion` to generate a response from the model.
The conversation includes asking about the content of the image and a follow-up question.
"""
img_url = f"data:image/jpeg;base64,{encode_image(img_path)}"
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What’s in this image?",
},
{
"type": "image_url",
"image_url": {
"url": img_url
},
},
],
},
{
"role": "assistant",
"content": "The image displays a wooden boardwalk extending through a vibrant green grassy wetland. The sky is partly cloudy with soft, wispy clouds, indicating nice weather. Vegetation is seen on either side of the boardwalk, and trees are present in the background, suggesting that this area might be a natural reserve or park designed for ecological preservation and outdoor recreation. The boardwalk allows visitors to explore the area without disturbing the natural habitat.",
},
{
"role": "user",
"content": "Do you think this is a spring or winter photo?"
},
]
create_chat_completion("cogvlm2", messages=messages, use_stream=use_stream)
if __name__ == "__main__":
simple_image_chat(use_stream=False, img_path="demo.jpg")
xformers
torch>=2.0.0
torchvision
transformers>=4.40
huggingface-hub>=0.23.0
pillow
chainlit>=1.0
pydantic>=2.7.1
timm>=0.9.16
openai>=1.30.1
loguru>=0.7.2
einops
sse-starlette>=2.1.0
bitsandbytes>=0.43.1 # for int4 quantization
"""
This is a simple chat demo using CogVLM2 model in ChainLit.
"""
import os
import dataclasses
from typing import List
from PIL import Image
import chainlit as cl
from chainlit.input_widget import Slider
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
from huggingface_hub.inference._generated.types import TextGenerationStreamOutput, TextGenerationStreamOutputToken
import threading
import torch
MODEL_PATH = 'THUDM/cogvlm2-llama3-chat-19B'
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[
0] >= 8 else torch.float16
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
quant = int(os.environ.get('QUANT', 0))
if 'int4' in MODEL_PATH:
quant = 4
print(f'Quant = {quant}')
# Load the model
if quant == 4:
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True,
load_in_4bit=True,
low_cpu_mem_usage=True
).eval()
elif quant == 8:
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True,
load_in_8bit=True, # Assuming transformers support this argument; check documentation if not
low_cpu_mem_usage=True
).eval()
else:
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True
).eval().to(DEVICE)
@cl.on_chat_start
def on_chat_start():
print("Welcome use CogVLM2 chat demo")
async def get_response(query, history, gen_kwargs, images=None):
if images is None:
input_by_model = model.build_conversation_input_ids(
tokenizer,
query=query,
history=history,
template_version='chat'
)
else:
input_by_model = model.build_conversation_input_ids(
tokenizer,
query=query,
history=history,
images=images[-1:],  # only use the last image; CogVLM2 only supports one image
template_version='chat'
)
inputs = {
'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if images is not None else None,
}
streamer = TextIteratorStreamer(tokenizer, timeout=60.0, skip_prompt=True, skip_special_tokens=True)
gen_kwargs['streamer'] = streamer
gen_kwargs = {**gen_kwargs, **inputs}
with torch.no_grad():
thread = threading.Thread(target=model.generate, kwargs=gen_kwargs)
thread.start()
for next_text in streamer:
yield TextGenerationStreamOutput(
index=0,
token=TextGenerationStreamOutputToken(
id=0,
logprob=0,
text=next_text,
special=False,
)
)
@dataclasses.dataclass
class Conversation:
"""A class that keeps all conversation history."""
roles: List[str]
messages: List[List[str]]
version: str = "Unknown"
def append_message(self, role, message):
self.messages.append([role, message])
def get_prompt(self):
if not self.messages:
return None, []
last_role, last_msg = self.messages[-2]
if isinstance(last_msg, tuple):
query, _ = last_msg
else:
query = last_msg
history = []
for role, msg in self.messages[:-2]:
if isinstance(msg, tuple):
text, _ = msg
else:
text = msg
if role == "USER":
history.append((text, ""))
else:
if history:
history[-1] = (history[-1][0], text)
return query, history
def get_images(self):
for role, msg in reversed(self.messages):
if isinstance(msg, tuple):
msg, image = msg
if image is None:
continue
if image.mode != 'RGB':
image = image.convert('RGB')
width, height = image.size
if width > 1344 or height > 1344:
max_len = 1344
aspect_ratio = width / height
if width > height:
new_width = max_len
new_height = int(new_width / aspect_ratio)
else:
new_height = max_len
new_width = int(new_height * aspect_ratio)
image = image.resize((new_width, new_height))
return [image]
return None
def copy(self):
return Conversation(
roles=self.roles,
messages=[[x, y] for x, y in self.messages],
version=self.version,
)
def dict(self):
        if self.get_images():
return {
"roles": self.roles,
"messages": [
[x, y[0] if type(y) is tuple else y] for x, y in self.messages
],
}
return {
"roles": self.roles,
"messages": self.messages,
}
default_conversation = Conversation(
roles=("USER", "ASSISTANT"),
messages=()
)
async def request(conversation: Conversation, settings):
gen_kwargs = {
"temperature": settings["temperature"],
"top_p": settings["top_p"],
"max_new_tokens": int(settings["max_token"]),
"top_k": int(settings["top_k"]),
"do_sample": True,
}
query, history = conversation.get_prompt()
images = conversation.get_images()
chainlit_message = cl.Message(content="", author="CogVLM2")
text = ""
async for response in get_response(query, history, gen_kwargs, images):
output = response.token.text
text += output
conversation.messages[-1][-1] = text
await chainlit_message.stream_token(text, is_sequence=True)
await chainlit_message.send()
return conversation
@cl.on_chat_start
async def start():
settings = await cl.ChatSettings(
[
Slider(id="temperature", label="Temperature", initial=0.5, min=0.01, max=1, step=0.05),
Slider(id="top_p", label="Top P", initial=0.7, min=0, max=1, step=0.1),
Slider(id="top_k", label="Top K", initial=5, min=0, max=50, step=1),
Slider(id="max_token", label="Max output tokens", initial=2048, min=0, max=8192, step=1),
]
).send()
conversation = default_conversation.copy()
cl.user_session.set("conversation", conversation)
cl.user_session.set("settings", settings)
@cl.on_settings_update
async def setup_agent(settings):
cl.user_session.set("settings", settings)
@cl.on_message
async def main(message: cl.Message):
image = next(
(
Image.open(file.path)
for file in message.elements or []
if "image" in file.mime and file.path is not None
),
None,
)
conv = cl.user_session.get("conversation") # type: Conversation
settings = cl.user_session.get("settings")
text = message.content
conv_message = (text, image)
conv.append_message(conv.roles[0], conv_message)
conv.append_message(conv.roles[1], None)
conv = await request(conv, settings)
cl.user_session.set("conversation", conv)
"""
This is a demo for using CogVLM2 in the CLI on a single GPU.
We strongly suggest using a GPU with bfloat16 support; otherwise inference will be slow.
Note that only one image can be processed per conversation, which means you cannot replace or add another image once the conversation has started.
For multi-GPU inference, please use cli_demo_multi_gpus.py.
"""
import torch
import argparse
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import os
MODEL_PATH = "./cogvlm2-llama3-chinese-chat-19B"
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16
# Argument parser
parser = argparse.ArgumentParser(description="CogVLM2 CLI Demo")
parser.add_argument('--quant', type=int, choices=[4, 8], help='Enable 4-bit or 8-bit precision loading', default=0)
args = parser.parse_args()
if 'int4' in MODEL_PATH:
args.quant = 4
tokenizer = AutoTokenizer.from_pretrained(
MODEL_PATH,
trust_remote_code=True
)
# Check GPU memory
if torch.cuda.is_available() and torch.cuda.get_device_properties(0).total_memory < 48 * 1024 ** 3 and not args.quant:
print("GPU memory is less than 48GB. Please use cli_demo_multi_gpus.py or pass `--quant 4` or `--quant 8`.")
exit()
# Load the model
if args.quant == 4:
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True,
quantization_config=BitsAndBytesConfig(load_in_4bit=True),
low_cpu_mem_usage=True
).eval()
elif args.quant == 8:
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True,
quantization_config=BitsAndBytesConfig(load_in_8bit=True),
low_cpu_mem_usage=True
).eval()
else:
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True
).eval().to(DEVICE)
while True:
image_path = input("image path >>>>> ")
if image_path == '':
        print('You did not enter an image path; the following will be a text-only conversation.')
image = None
text_only_first_query = True
else:
image = Image.open(image_path).convert('RGB')
history = []
while True:
query = input("Human:")
if query == "clear":
break
if image is None:
input_by_model = model.build_conversation_input_ids(
tokenizer,
query=query,
history=history,
template_version='chat'
)
else:
input_by_model = model.build_conversation_input_ids(
tokenizer,
query=query,
history=history,
images=[image],
template_version='chat'
)
inputs = {
'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if image is not None else None,
}
gen_kwargs = {
"max_new_tokens": 2048,
"pad_token_id": 128002,
"top_k": 1,
}
with torch.no_grad():
outputs = model.generate(**inputs, **gen_kwargs)
outputs = outputs[:, inputs['input_ids'].shape[1]:]
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\nCogVLM2:", response)
history.append((query, response))
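# Interactive controls, as implemented above: enter an image path when prompted
# (or press Enter for a text-only conversation), then chat at the "Human:"
# prompt; type "clear" to end the current conversation and return to the image
# path prompt.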