name: "\U0001F41B Bug Report"
description: Submit a bug report to help us improve CogVLM1.5 / 提交一个 Bug 问题报告来帮助我们改进 CogVLM1.5 模型
body:
- type: textarea
id: system-info
attributes:
label: System Info / 系統信息
description: Your operating environment / 您的运行环境信息
placeholder: Includes Cuda version, Transformers version, Python version, operating system, hardware information (if you suspect a hardware problem)... / 包括Cuda版本,Transformers版本,Python版本,操作系统,硬件信息(如果您怀疑是硬件方面的问题)...
validations:
required: true
- type: textarea
id: who-can-help
attributes:
label: Who can help? / 谁可以帮助到您?
description: |
Your issue will be replied to more quickly if you can figure out the right person to tag with @
All issues are read by one of the maintainers, so if you don't know who to tag, just leave this blank and our maintainer will ping the right person.
Please tag fewer than 3 people.
如果您能找到合适的标签 @,您的问题会更快得到回复。
所有问题都会由我们的维护者阅读,如果您不知道该标记谁,只需留空,我们的维护人员会找到合适的开发组成员来解决问题。
标记的人数不应超过 3 人。
Related demo leader / 相关demo负责人:
- composite_demo: @zR
- openai_demo: @zR
If the bug is not in these subsections, you do not need to tag anyone; our maintainers will find the right person in the development group to solve the problem.
如果不是这些子版块的bug,您可以不指明帮助者,我们的维护人员会找到合适的开发组成员来解决问题。
placeholder: "@Username ..."
- type: checkboxes
id: information-scripts-examples
attributes:
label: Information / 问题信息
description: 'The problem arises when using: / 问题出现在'
options:
- label: "The official example scripts / 官方的示例脚本"
- label: "My own modified scripts / 我自己修改的脚本和任务"
- type: textarea
id: reproduction
validations:
required: true
attributes:
label: Reproduction / 复现过程
description: |
Please provide a code example that reproduces the problem you encountered, preferably with a minimal reproduction unit.
If you have code snippets, error messages, stack traces, please provide them here as well.
Please format your code correctly using code tags. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
Do not use screenshots, as they are difficult to read and (more importantly) do not allow others to copy and paste your code.
请提供能重现您遇到的问题的代码示例,最好是最小复现单元。
如果您有代码片段、错误信息、堆栈跟踪,也请在此提供。
请使用代码标签正确格式化您的代码。请参见 https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
请勿使用截图,因为截图难以阅读,而且(更重要的是)不允许他人复制粘贴您的代码。
placeholder: |
Steps to reproduce the behavior/复现Bug的步骤:
1.
2.
3.
- type: textarea
id: expected-behavior
validations:
required: true
attributes:
label: Expected behavior / 期待表现
description: "A clear and concise description of what you would expect to happen. /简单描述您期望发生的事情。"
name: "\U0001F680 Feature request"
description: Submit a request for a new CogVLM1.5 feature / 提交一个新的 CogVLM1.5 模型的功能建议
labels: [ "feature" ]
body:
- type: textarea
id: feature-request
validations:
required: true
attributes:
label: Feature request / 功能建议
description: |
A brief description of the proposed feature. Links to relevant papers and code are welcome.
对功能建议的简述。最好提供对应的论文和代码链接
- type: textarea
id: motivation
validations:
required: true
attributes:
label: Motivation / 动机
description: |
Your motivation for making the suggestion. If that motivation is related to another GitHub issue, link to it here.
您提出建议的动机。如果该动机与另一个 GitHub 问题有关,请在此处提供对应的链接。
- type: textarea
id: contribution
validations:
required: true
attributes:
label: Your contribution / 您的贡献
description: |
Your PR link or any other link you can help with.
您的PR链接或者其他您能提供帮助的链接。
# Raise valuable PR / 提出有价值的PR
## Caution / 注意事项
Users should keep the following points in mind when submitting PRs:
1. The proposed PR should be about this project.
2. The proposed PR should be focused; if you have multiple ideas or optimizations, split them into separate PRs.
用户在提交PR时候应该注意以下几点:
1. 提出的PR应该是关于本项目的。
2. 提出的PR应该具有针对性,如果具有多个不同的想法和优化方案,应该分配到不同的PR中。
## PRs that should not be proposed / 不应该提出的PR
A PR may be closed or rejected if it falls into any of the following categories:
1. It does not describe the proposed improvement.
2. It combines multiple issues of different types in one PR.
3. It largely duplicates an existing PR.
如果开发者提出关于以下方面的PR,则可能会被直接关闭或拒绝通过。
1. 没有说明改进方案的。
2. 多个不同类型的问题合并在一个PR中的。
3. 提出的PR与已经存在的PR高度重复的。
# Check your PR / 检查您的PR
- [ ] Have you read the Contributor Guidelines, Pull Request section? / 您是否阅读了贡献者指南、Pull Request 部分?
- [ ] Has this been discussed/approved via a Github issue or forum? If so, add a link. / 是否通过 Github 问题或论坛讨论/批准过?如果是,请添加链接。
- [ ] Did you make sure you updated the documentation with your changes? Here are the Documentation Guidelines, and here are the Documentation Formatting Tips. /您是否确保根据您的更改更新了文档?这里是文档指南,这里是文档格式化技巧。
- [ ] Did you write new required tests? / 您是否编写了新的必要测试?
- [ ] Is your PR focused on only one issue? / 您的PR是否仅针对一个问题?
venv
.DS_Store
model_image
model_video
*.idea/
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 2024 CogVLM Model Team @ Zhipu AI
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
The CogVLM License
1. Definitions
“Licensor” means the CogVLM Model Team that distributes its Software.
“Software” means the CogVLM model parameters made available under this license.
2. License Grant
Under the terms and conditions of this license, the Licensor hereby grants you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty-free copyright license.
This license permits you to use all open-source models in this repository free of charge for academic research. Users who wish to use the models for commercial purposes must register [here](https://open.bigmodel.cn/mla/form).
Registered users may use the models for commercial activities free of charge, but must comply with all terms and conditions of this license.
The license notice shall be included in all copies or substantial portions of the Software.
3. Restriction
You will not use, copy, modify, merge, publish, distribute, reproduce, or create derivative works of the Software, in whole or in part, for any military, or illegal purposes.
You will not use the Software for any act that may undermine China's national security and national unity, harm the public interest of society, or infringe upon the rights and interests of human beings.
4. Disclaimer
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
5. Limitation of Liability
EXCEPT TO THE EXTENT PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER BASED IN TORT, NEGLIGENCE, CONTRACT, LIABILITY, OR OTHERWISE WILL ANY LICENSOR BE LIABLE TO YOU FOR ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES, OR ANY OTHER COMMERCIAL LOSSES, EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
6. Dispute Resolution
This license shall be governed and construed in accordance with the laws of People’s Republic of China. Any dispute arising from or in connection with this License shall be submitted to Haidian District People's Court in Beijing.
Note that the license is subject to update to a more comprehensive version. For any questions related to the license and copyright, please contact us at license@zhipuai.cn.
7. Llama3 and EVA-CLIP2 License
For the CogVLM2 open-source models that use the Llama3 series as the base model, the Llama3 license terms (https://llama.meta.com/llama3/license/, a copy of which is included in this repository) and the EVA-CLIP2 license terms (MIT, https://github.com/baaivision/EVA/blob/master/LICENSE) apply to the model weights.
1. 定义
“许可方”是指分发其软件的 CogVLM 模型团队。
“软件”是指根据本许可提供的 CogVLM 模型参数。
2. 许可授予
根据本许可的条款和条件,许可方特此授予您非排他性、全球性、不可转让、不可再许可、可撤销、免版税的版权许可。
本许可允许您免费使用本仓库中的所有开源模型进行学术研究,对于希望将模型用于商业目的的用户,需在[这里](https://open.bigmodel.cn/mla/form)完成登记。
经过登记的用户可以免费使用本模型进行商业活动,但必须遵守本许可的所有条款和条件。
上述版权声明和本许可声明应包含在本软件的所有副本或重要部分中。
3.限制
您不得出于任何军事或非法目的使用、复制、修改、合并、发布、分发、复制或创建本软件的全部或部分衍生作品。
您不得利用本软件从事任何危害国家安全和国家统一、危害社会公共利益、侵犯人身权益的行为。
4.免责声明
本软件“按原样”提供,不提供任何明示或暗示的保证,包括但不限于对适销性、特定用途适用性和非侵权性的保证。在任何情况下,作者或版权持有人均不对任何索赔、损害或其他责任负责,无论是在合同诉讼、侵权行为还是其他方面,只要该等责任是因本软件、本软件的使用或其他涉及本软件的交易而引起或与之相关。
5. 责任限制
除适用法律禁止的范围外,在任何情况下且根据任何法律理论,无论是基于侵权行为、疏忽、合同、责任或其他原因,任何许可方均不对您承担任何直接、间接、特殊、偶然、示范性、 或间接损害,或任何其他商业损失,即使许可人已被告知此类损害的可能性。
6.争议解决
本许可受中华人民共和国法律管辖并按其解释。 因本许可引起的或与本许可有关的任何争议应提交北京市海淀区人民法院。
请注意,许可证可能会更新到更全面的版本。 有关许可和版权的任何问题,请通过 license@zhipuai.cn 与我们联系。
7. Llama3 和 EVA-CLIP2 许可
针对以 Llama3 系列模型作为基座模型的 CogVLM2 开源模型,Llama3 许可条件(https://llama.meta.com/llama3/license/ ,本仓库中包含一份副本)和 EVA-CLIP2 许可条件(MIT,https://github.com/baaivision/EVA/blob/master/LICENSE)适用于模型权重。
# CogVLM2
CogVLM2 is an open-source multimodal large language model that aims to close the gap between open-source models and commercial proprietary models in multimodal understanding. With 19B parameters it achieves performance comparable to GPT-4V and can be used for OCR, video understanding, and document question answering.
## Paper
- [CogVLM: Visual Expert for Pretrained Language Models](https://arxiv.org/abs/2311.03079)
## Model Architecture
CogVLM2 inherits and refines the proven architecture of the previous generation: a powerful vision encoder with about 5 billion parameters, plus a 7-billion-parameter visual expert module integrated into the large language model. Through its own dedicated parameters, the visual expert models the interaction between the visual and language sequences in detail, strengthening visual understanding without weakening the model's original language capabilities. This deep-fusion strategy couples the visual and language modalities far more tightly than shallow alignment. The CogVLM model consists of four basic components: a ViT encoder, an MLP adapter, a pretrained GPT-style large language model, and the visual expert module.
<div align="center">
<img src="./images/architecture.png"/>
</div>
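The sketch below is a minimal, hypothetical PyTorch illustration of the visual-expert idea described above; it is not the actual CogVLM2 implementation, and all module and parameter names are made up. Image-token positions are routed through their own QKV projection, while text-token positions use the original language-model projection.
```python
import torch
import torch.nn as nn

class VisualExpertAttention(nn.Module):
    """Illustrative sketch: separate QKV projections for image vs. text tokens."""
    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads  # hidden_size must be divisible by num_heads
        self.qkv_text = nn.Linear(hidden_size, 3 * hidden_size)   # original LM projection
        self.qkv_image = nn.Linear(hidden_size, 3 * hidden_size)  # visual-expert projection
        self.out_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states: torch.Tensor, token_type_ids: torch.Tensor) -> torch.Tensor:
        # token_type_ids: [B, L], 1 marks image-token positions, 0 marks text tokens.
        is_image = token_type_ids.unsqueeze(-1).bool()
        qkv = torch.where(is_image, self.qkv_image(hidden_states), self.qkv_text(hidden_states))
        q, k, v = qkv.chunk(3, dim=-1)
        B, L, _ = hidden_states.shape
        q = q.view(B, L, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, L, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, L, self.num_heads, self.head_dim).transpose(1, 2)
        out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, L, -1)
        return self.out_proj(out)
```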
## Method
The main weakness of the popular shallow-alignment approaches is the lack of deep fusion between visual and language information. CogVLM instead introduces a trainable visual expert module in the attention and FFN layers: image features and text features are processed separately, with a new set of QKV matrices and MLP layers at every layer. The visual expert bridges the gap between the pretrained language model and the image encoder, preserving the language model's strong performance while keeping inference efficient.
To better process and understand high-resolution documents and web-page images, CogVLM2 supports image inputs at resolutions up to 1344 × 1344. To handle such high-resolution images efficiently, a dedicated downsampling module is placed after the vision encoder. It extracts the key information from the visual sequence and greatly shortens the sequence fed into the language model, which noticeably speeds up inference while preserving accuracy, striking a balance between performance and efficiency.
<div align=center>
<img src="./images/theory.png"/>
</div>
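As a rough, hypothetical sketch of the downsampling idea (again not the real module), a strided 2-D convolution over the grid of ViT patch features shortens the visual sequence before it enters the language model:
```python
import torch
import torch.nn as nn

class VisionDownsample(nn.Module):
    """Sketch: shorten the visual token sequence with a strided 2-D convolution."""
    def __init__(self, vision_dim: int, hidden_size: int, stride: int = 2):
        super().__init__()
        self.conv = nn.Conv2d(vision_dim, hidden_size, kernel_size=stride, stride=stride)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: [B, N, C], where N = H*W patch tokens from the ViT encoder
        # (assumes N is a perfect square so the tokens form a square grid).
        B, N, C = patch_features.shape
        side = int(N ** 0.5)
        x = patch_features.transpose(1, 2).reshape(B, C, side, side)
        x = self.conv(x)                     # [B, hidden, side/stride, side/stride]
        return x.flatten(2).transpose(1, 2)  # [B, N/stride^2, hidden]: 4x fewer tokens for stride=2
```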
## Environment Setup
### Docker (Option 1)
Pull the Docker image from [光源 (SourceFind)](https://www.sourcefind.cn/#/service-details) and use it as follows:
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
docker run -it -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro --shm-size=128G --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name cogvlm2 <your imageID> bash
cd /path/your_code_data/
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```
### Dockerfile (Option 2)
```
cd /path/your_code_data/docker
docker build --no-cache -t cogvlm2:latest .
docker run --shm-size=128G --name cogvlm2 -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v /path/your_code_data/:/path/your_code_data/ -it cogvlm2 bash
```
### Anaconda (Option 3)
The DCU-specific deep learning libraries required by this project can be downloaded from the [光合 (HPC Cube)](https://developer.hpccube.com/tool/) developer community.
```
DTK driver: dtk24.04.1
python: 3.10
torch: 2.1
torchvision: 0.16.0
deepspeed: 0.12.3
```
`Tips: the versions of the DTK driver, Python, torch and other DCU-related tools listed above must match each other exactly.`
```
conda create -n cogvlm2 python=3.10
conda activate cogvlm2
cd /path/your_code_data/
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple
```
## Dataset
Test dataset: [CogVLM-SFT-311K](https://huggingface.co/datasets/THUDM/CogVLM-SFT-311K)
The data format is as follows:
The dataset consists of an images folder and a labels folder (in CogVLM-SFT-311K these are labels_en and labels_zh, corresponding to English and Chinese labels). In the fine-tuning code, you can change the folder names by modifying the corresponding two lines of code.
```
CogVLM-SFT-311K/
├── images
│   └── .....
├── labels_en
│   └── .....
└── labels_zh
    └── .....
```
The images folder contains the image files and the labels folders contain the corresponding label files; image and label files correspond one-to-one by name. Images are stored as jpg files and labels as json files.
Each label file contains a single dialogue between a user and an assistant role; each turn consists of a role field and a content field, as shown below.
```
{
"conversations": [
{
"role": "user",
"content": "What can be inferred about the zebras' behavior and surroundings?"
},
{
"role": "assistant",
"content": "Based on the image, we can infer that the two zebras are likely seeking relief from the sun's heat, as they are standing side by side under the branches of a thorny tree. This shade-providing tree offers some respite from the sun, possibly during the hottest part of the day. The zebras are in a green field with grass, providing them with an ideal environment to graze and eat while staying near their source of shelter. This shows that the zebras' behavior is influenced by the conditions and available resources in their surroundings. It also highlights that these animals adopt strategies to adapt to the fluctuating conditions of their environment, such as cooperation and seeking shelter, to survive and thrive in their natural habitat."
}
]
}
```
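Below is a minimal loading sketch based on the folder layout and field names described above; the helper function and the `labels_en` default are illustrative, so adapt the paths to your own copy of the dataset.
```python
import json
import os

def load_samples(dataset_root: str, label_dir: str = "labels_en"):
    """Yield (image_path, conversations) pairs from a CogVLM-SFT-311K style folder."""
    image_dir = os.path.join(dataset_root, "images")
    for name in sorted(os.listdir(os.path.join(dataset_root, label_dir))):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(dataset_root, label_dir, name), encoding="utf-8") as f:
            label = json.load(f)
        # Image and label files share the same base name; images are .jpg, labels are .json.
        image_path = os.path.join(image_dir, name.replace(".json", ".jpg"))
        yield image_path, label["conversations"]

# Example usage:
# for image_path, conversations in load_samples("CogVLM-SFT-311K"):
#     print(image_path, conversations[0]["content"])
```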
## Training
## Inference
### Single-GPU inference
Before running, modify the model path in the demo script, e.g. `MODEL_PATH = "/home/wanglch/projects/CogVLM2/cogvlm2-llama3-chinese-chat-19B"`, then run:
```
sh cli_demo.sh
```
### Web inference
Modify the model path in `web_demo.py`, then run:
```
sh web_demo.sh
```
## Results
### OCR
<div align=center>
<img src="./images/result2.png"/>
</div>
### Q&A
<div align=center>
<img src="./images/result1.png"/>
</div>
### Accuracy
N/A
## Application Scenarios
### Algorithm Category
`OCR`
### Key Application Industries
`Finance, Education, Transportation, Government`
## Pretrained Weights
- [THUDM/cogvlm2-llama3-chinese-chat-19B](https://modelscope.cn/models/Duxiaoman-DI/XuanYuan-13B-Chat/files)
Fast download center for pretrained weights: [SCNet AIModels](http://113.200.138.88:18080/aimodels)
The pretrained weights for this project can be downloaded from the fast-download channel: [THUDM/cogvlm2-llama3-chinese-chat-19B](http://113.200.138.88:18080/aimodels/cogvlm2-llama3-chinese-chat-19B)
## Source Repository and Issue Feedback
- https://developer.hpccube.com/codes/modelzoo/cogvlm2_pytorch
## References
- [CogVLM: Visual Expert for Pretrained Language Models](https://arxiv.org/abs/2311.03079)
# Basic Demo
[中文版README](./README_zh.md)
### Minimum Requirements
Python: 3.10.12 or above
OS: A Linux operating system with an NVIDIA GPU is recommended to avoid installation issues with the `xformers` library.
GPU requirements are as shown in the table below:
| Model Name | 19B Series Model | Remarks |
|--------------------------------------------|-------------------------------------------|-------------------------------|
| BF16 inference | 42GB | Tested with 2K dialogue text |
| Int4 inference | 16GB | Tested with 2K dialogue text |
| BF16 LoRA tuning (with vision expert) | 73GB (8× A100 80GB GPUs with ZeRO-2) | Trained with 2K dialogue text |
Before running any code, make sure you have all dependencies installed. You can install all dependency packages with the
following command:
```shell
pip install -r requirements.txt
```
## Using CLI Demo
Run this code to start a conversation at the command line. Please note that the model must be loaded on a GPU.
```shell
CUDA_VISIBLE_DEVICES=0 python cli_demo.py
```
If you want to use `int4` (or `int8`) quantization, please use
```shell
CUDA_VISIBLE_DEVICES=0 python cli_demo.py --quant 4
```
If you have multiple GPUs, you can use the following command to load the model across several GPUs, distributing its layers among them.
```shell
python cli_demo_multi_gpus.py
```
In `cli_demo_multi_gpus.py`, we use the `infer_auto_device_map` function to automatically allocate different layers of
the model to different GPUs. You need to set the `max_memory` parameter to specify the maximum memory for each GPU. For
example, if you have two GPUs, each with 23GiB of memory, you can set it like this:
```python
device_map = infer_auto_device_map(
model=model,
max_memory={i: "23GiB" for i in range(torch.cuda.device_count())},
# set 23GiB for each GPU, depends on your GPU memory, you can adjust this value
no_split_module_classes=["CogVLMDecoderLayer"]
)
```
## Using Web Demo
Run this code to start a conversation in the WebUI.
```shell
chainlit run web_demo.py
```
If you want to use int4 or int8 quantization, you can launch it like this:
```shell
CUDA_VISIBLE_DEVICES=0 QUANT=4 chainlit run web_demo.py
```
After starting the conversation, you will be able to interact with the model, as shown below:
<img src="../resources/web_demo.png" alt="web_demo" width="600" />
## Using OpenAI API format
We provide a simple example that launches the model with the command below. After that, you can request conversations with the model in the OpenAI API format (optionally pass `--quant 4`).
```shell
python openai_api_demo.py
```
Developers can call the model through the following code:
```shell
python openai_api_request.py
```
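Because the server speaks the OpenAI API format and `openai>=1.30.1` is already in `requirements.txt`, you can also query it with the official `openai` Python client instead of `openai_api_request.py`. The snippet below is a minimal sketch that assumes the server from `openai_api_demo.py` is running locally on port 8000 and that a `demo.jpg` exists in the current directory:
```python
import base64
from openai import OpenAI

# The demo server does not check the API key, but the client requires a non-empty value.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

with open("demo.jpg", "rb") as f:
    image_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="cogvlm2-19b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)
```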
# Basic Demo
[Read this in English.](./README.md)
### 最低配置要求
Python: 3.10.12 以上版本。
操作系统: 建议使用 Linux 操作系统运行以避免`xformers`库安装问题。建议使用 NVIDIA GPU 以防止兼容性问题。
GPU要求如下表格所示
| 模型名称 | 19B 系列模型 | 备注 |
|-----------------------------|---------------------------------|--------------|
| BF16 / FP16 推理 | 42GB | 测试对话文本长度为2K |
| Int4 推理 | 16GB | 测试对话文本长度为2K |
| BF16 LoRA 微调(带有视觉专家微调) | 73GB (使用8卡 A100 80G 显卡并使用 ZeRO-2) | 训练对话文本长度为 2K |
在运行任何代码之前,请确保你已经安装好了所有的依赖包。你可以通过以下命令来安装所有的依赖包:
```shell
pip install -r requirements.txt
```
## CLI 调用模型
运行本代码以开始在命令行中对话。请注意,模型必须在一张GPU上载入
```shell
CUDA_VISIBLE_DEVICES=0 python cli_demo.py
```
如果您有多张GPU,您可以通过以下代码执行多卡拉起模型,并将模型的不同层分布在不同的GPU上。
```shell
python cli_demo_multi_gpus.py
```
`cli_demo_multi_gpus.py` 中,我们使用了 `infer_auto_device_map`
函数来自动分配模型的不同层到不同的GPU上。你需要设置 `max_memory` 参数来指定每张GPU的最大内存。例如,如果你有两张GPU,每张GPU的内存为23GiB,你可以这样设置:
```python
device_map = infer_auto_device_map(
model=model,
max_memory={i: "23GiB" for i in range(torch.cuda.device_count())},
# set 23GiB for each GPU, depends on your GPU memory, you can adjust this value
no_split_module_classes=["CogVLMDecoderLayer"]
)
```
## Web端在线调用模型
运行本代码以开始在 WebUI 中对话。
```shell
chainlit run web_demo.py
```
拉起对话后,你将能和模型进行对话,效果如下:
<img src="../resources/web_demo.png" alt="web_demo" width="600" />
## OpenAI API
我们提供了一个简单的示例,通过以下代码拉起模型,之后,您可以使用 OpenAI API格式的方式请求和模型的对话。
```shell
python openai_api_demo.py
```
开发者可以通过以下代码来调用模型:
```shell
python openai_api_request.py
```
## CogVLM2
Welcome to use the CogVLM2. Please note:
+ The model comes in two language versions: a Chinese-English version (cogvlm2-llama3-chinese-chat-19B) and an English-only version (cogvlm2-llama3-chat-19B).
该模型支持两种语言:中文版(cogvlm2-llama3-chinese-chat-19B)和纯英语版(cogvlm2-llama3-chat-19B)。
+ For the first conversation, you must upload a picture. The model only supports dialogue with one picture. Uploading
the next picture will overwrite the previous picture information.
在第一次对话时,您必须上传一张图片。该模型只支持与一张图片对话。上传下一张图片会覆盖之前的图片信息。
+ Do not interrupt the conversation halfway, as this may cause CUDA kernel errors.
请勿在对话中途打断,因为这可能会导致CUDA内核错误。
+ In Firefox browser, there might be issues with Chinese input. Please use Chrome browser.
在Firefox浏览器中可能出现中文问题无法输入,请使用Chrome浏览器。
"""
This is a demo for using CogVLM2 in CLI using Single GPU.
We strongly suggest using a GPU with bfloat16 support; otherwise inference will be slow.
Note that only one picture can be processed per conversation, which means you cannot replace or insert another picture during the conversation.
For multi-GPU inference, please use cli_demo_multi_gpus.py.
"""
import torch
import argparse
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[
0] >= 8 else torch.float16
# Argument parser
parser = argparse.ArgumentParser(description="CogVLM2 CLI Demo")
parser.add_argument('--quant', type=int, choices=[4, 8], help='Enable 4-bit or 8-bit precision loading', default=0)
args = parser.parse_args()
if 'int4' in MODEL_PATH:
args.quant = 4
tokenizer = AutoTokenizer.from_pretrained(
MODEL_PATH,
trust_remote_code=True
)
# Check GPU memory
if torch.cuda.is_available() and torch.cuda.get_device_properties(0).total_memory < 48 * 1024 ** 3 and not args.quant:
print("GPU memory is less than 48GB. Please use cli_demo_multi_gpus.py or pass `--quant 4` or `--quant 8`.")
exit()
# Load the model
if args.quant == 4:
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True,
quantization_config=BitsAndBytesConfig(load_in_4bit=True),
low_cpu_mem_usage=True
).eval()
elif args.quant == 8:
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True,
quantization_config=BitsAndBytesConfig(load_in_8bit=True),
low_cpu_mem_usage=True
).eval()
else:
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True
).eval().to(DEVICE)
while True:
image_path = input("image path >>>>> ")
if image_path == '':
print('You did not enter an image path; the following will be a plain-text conversation.')
image = None
text_only_first_query = True
else:
image = Image.open(image_path).convert('RGB')
history = []
while True:
query = input("Human:")
if query == "clear":
break
if image is None:
input_by_model = model.build_conversation_input_ids(
tokenizer,
query=query,
history=history,
template_version='chat'
)
else:
input_by_model = model.build_conversation_input_ids(
tokenizer,
query=query,
history=history,
images=[image],
template_version='chat'
)
inputs = {
'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if image is not None else None,
}
gen_kwargs = {
"max_new_tokens": 2048,
"pad_token_id": 128002,
"top_k": 1,
}
with torch.no_grad():
outputs = model.generate(**inputs, **gen_kwargs)
outputs = outputs[:, inputs['input_ids'].shape[1]:]
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\nCogVLM2:", response)
history.append((query, response))
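"""
Batch inference demo: walk an image folder, build CogVLM2 conversation inputs for each image,
pad them into batches with a custom collate_fn, and generate a short description for every image.
Set `image_folder` below to your own directory before running.
"""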
import os
import time
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"
TORCH_TYPE = torch.bfloat16
device = 'cuda:0'
tokenizer = AutoTokenizer.from_pretrained(
MODEL_PATH,
trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True,
device_map=device,
# load_in_4bit=True,
# low_cpu_mem_usage=True
).eval()
def recur_move_to(item, tgt, criterion_func):
if criterion_func(item):
device_copy = item.to(tgt)
return device_copy
elif isinstance(item, list):
return [recur_move_to(v, tgt, criterion_func) for v in item]
elif isinstance(item, tuple):
return tuple([recur_move_to(v, tgt, criterion_func) for v in item])
elif isinstance(item, dict):
return {k: recur_move_to(v, tgt, criterion_func) for k, v in item.items()}
else:
return item
def collate_fn(features, tokenizer) -> dict:
images = [feature.pop('images', None) for feature in features if 'images' in feature]
tokenizer.pad_token = tokenizer.eos_token
max_length = max(len(feature['input_ids']) for feature in features)
def pad_to_max_length(feature, max_length):
padding_length = max_length - len(feature['input_ids'])
feature['input_ids'] = torch.cat([feature['input_ids'], torch.full((padding_length,), tokenizer.pad_token_id)])
feature['token_type_ids'] = torch.cat([feature['token_type_ids'], torch.zeros(padding_length, dtype=torch.long)])
feature['attention_mask'] = torch.cat([feature['attention_mask'], torch.zeros(padding_length, dtype=torch.long)])
if feature['labels'] is not None:
feature['labels'] = torch.cat([feature['labels'], torch.full((padding_length,), tokenizer.pad_token_id)])
else:
feature['labels'] = torch.full((max_length,), tokenizer.pad_token_id)
return feature
features = [pad_to_max_length(feature, max_length) for feature in features]
batch = {
key: torch.stack([feature[key] for feature in features])
for key in features[0].keys()
}
if images:
batch['images'] = images
return batch
image_folder = "/path/to/folder"
data = []
for root, dirs, files in os.walk(image_folder):
for file in files:
if file.endswith(('.jpg', '.jpeg', '.png', '.bmp', '.tiff', '.webp')):
data.append({"image": os.path.join(root, file)})
length = len(data)
batch_size = 14
query = 'Describe this image in detail, and the description should be between 15 to 80 words.'
for idx in range(0, length, batch_size):
i_list = []
for i in range(batch_size):
if idx + i < length:
i_list.append(data[idx + i])
else:
break
input_sample_list = []
start = time.time()
for i in i_list:
image = Image.open(i["image"]).convert('RGB')
input_sample = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image], template_version='chat')
input_sample_list.append(input_sample)
print(f"Prepare input time: {time.time() - start}")
start = time.time()
input_batch = collate_fn(input_sample_list, tokenizer)
input_batch = recur_move_to(input_batch, device, lambda x: isinstance(x, torch.Tensor))
input_batch = recur_move_to(input_batch, torch.bfloat16, lambda x: isinstance(x, torch.Tensor) and torch.is_floating_point(x))
print(f"Prepare batch time: {time.time() - start}")
gen_kwargs = {
"max_new_tokens": 2048,
"pad_token_id": 128002,
"top_k": 1,
}
start = time.time()
with torch.no_grad():
outputs = model.generate(**input_batch, **gen_kwargs)
outputs = outputs[:, input_batch['input_ids'].shape[1]:]
outputs = tokenizer.batch_decode(outputs)
outlist = [output.split("<|end_of_text|>")[0].strip() for output in outputs]
print(outlist)
print(f"Generate time: {time.time() - start}")
"""
This is a demo for using CogVLM2 in the CLI with multiple GPUs and lower per-GPU memory.
If a single GPU is not enough to run this model, you can use this demo to run it on multiple graphics cards with limited video memory.
Here we assume each graphics card has 24GB of video memory, which is not enough to load the FP16 / BF16 model on its own,
so two graphics cards are needed to load it. We set '23GiB' for each GPU to avoid running out of memory.
At least 2 GPUs are recommended, each with more than 16GB of video memory.
Tested successfully on 3 GPUs with 16GB of video memory each:
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 1 N/A N/A 1890574 C python 13066MiB |
| 2 N/A N/A 1890574 C python 14560MiB |
| 3 N/A N/A 1890574 C python 11164MiB |
+---------------------------------------------------------------------------------------+
"""
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import init_empty_weights, load_checkpoint_and_dispatch, infer_auto_device_map
MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[
0] >= 8 else torch.float16
tokenizer = AutoTokenizer.from_pretrained(
MODEL_PATH,
trust_remote_code=True
)
with init_empty_weights():
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True,
)
num_gpus = torch.cuda.device_count()
max_memory_per_gpu = "16GiB"
if num_gpus > 2:
max_memory_per_gpu = f"{round(42 / num_gpus)}GiB"
device_map = infer_auto_device_map(
model=model,
max_memory={i: max_memory_per_gpu for i in range(num_gpus)},
no_split_module_classes=["CogVLMDecoderLayer"]
)
model = load_checkpoint_and_dispatch(model, MODEL_PATH, device_map=device_map, dtype=TORCH_TYPE)
model = model.eval()
text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"
while True:
image_path = input("image path >>>>> ")
if image_path == '':
print('You did not enter an image path; the following will be a plain-text conversation.')
image = None
text_only_first_query = True
else:
image = Image.open(image_path).convert('RGB')
history = []
while True:
query = input("Human:")
if query == "clear":
break
if image is None:
if text_only_first_query:
query = text_only_template.format(query)
text_only_first_query = False
else:
old_prompt = ''
for _, (old_query, response) in enumerate(history):
old_prompt += old_query + " " + response + "\n"
query = old_prompt + "USER: {} ASSISTANT:".format(query)
if image is None:
input_by_model = model.build_conversation_input_ids(
tokenizer,
query=query,
history=history,
template_version='chat'
)
else:
input_by_model = model.build_conversation_input_ids(
tokenizer,
query=query,
history=history,
images=[image],
template_version='chat'
)
inputs = {
'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if image is not None else None,
}
gen_kwargs = {
"max_new_tokens": 2048,
"pad_token_id": 128002,
"top_k": 1,
}
with torch.no_grad():
outputs = model.generate(**inputs, **gen_kwargs)
outputs = outputs[:, inputs['input_ids'].shape[1]:]
response = tokenizer.decode(outputs[0])
response = response.split("<|end_of_text|>")[0]
print("\nCogVLM2:", response)
history.append((query, response))
import gc
import threading
import time
import base64
from contextlib import asynccontextmanager
from typing import List, Literal, Union, Tuple, Optional
import torch
import uvicorn
import requests
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from loguru import logger
from pydantic import BaseModel, Field
from sse_starlette.sse import EventSourceResponse
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
from PIL import Image
from io import BytesIO
MODEL_PATH = 'THUDM/cogvlm2-llama3-chat-19B'
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[
0] >= 8 else torch.float16
@asynccontextmanager
async def lifespan(app: FastAPI):
"""
An asynchronous context manager for managing the lifecycle of the FastAPI app.
It ensures that GPU memory is cleared after the app's lifecycle ends, which is essential for efficient resource management in GPU environments.
"""
yield
if torch.cuda.is_available():
torch.cuda.empty_cache()
torch.cuda.ipc_collect()
app = FastAPI(lifespan=lifespan)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
class ModelCard(BaseModel):
"""
A Pydantic model representing a model card, which provides metadata about a machine learning model.
It includes fields like model ID, owner, and creation time.
"""
id: str
object: str = "model"
created: int = Field(default_factory=lambda: int(time.time()))
owned_by: str = "owner"
root: Optional[str] = None
parent: Optional[str] = None
permission: Optional[list] = None
class ModelList(BaseModel):
object: str = "list"
data: List[ModelCard] = []
class ImageUrl(BaseModel):
url: str
class TextContent(BaseModel):
type: Literal["text"]
text: str
class ImageUrlContent(BaseModel):
type: Literal["image_url"]
image_url: ImageUrl
ContentItem = Union[TextContent, ImageUrlContent]
class ChatMessageInput(BaseModel):
role: Literal["user", "assistant", "system"]
content: Union[str, List[ContentItem]]
name: Optional[str] = None
class ChatMessageResponse(BaseModel):
role: Literal["assistant"]
content: str = None
name: Optional[str] = None
class DeltaMessage(BaseModel):
role: Optional[Literal["user", "assistant", "system"]] = None
content: Optional[str] = None
class ChatCompletionRequest(BaseModel):
model: str
messages: List[ChatMessageInput]
temperature: Optional[float] = 0.8
top_p: Optional[float] = 0.8
max_tokens: Optional[int] = None
stream: Optional[bool] = False
# Additional parameters
repetition_penalty: Optional[float] = 1.0
class ChatCompletionResponseChoice(BaseModel):
index: int
message: ChatMessageResponse
class ChatCompletionResponseStreamChoice(BaseModel):
index: int
delta: DeltaMessage
class UsageInfo(BaseModel):
prompt_tokens: int = 0
total_tokens: int = 0
completion_tokens: Optional[int] = 0
class ChatCompletionResponse(BaseModel):
model: str
object: Literal["chat.completion", "chat.completion.chunk"]
choices: List[Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice]]
created: Optional[int] = Field(default_factory=lambda: int(time.time()))
usage: Optional[UsageInfo] = None
@app.get("/v1/models", response_model=ModelList)
async def list_models():
"""
An endpoint to list available models. It returns a list of model cards.
This is useful for clients to query and understand what models are available for use.
"""
model_card = ModelCard(id="cogvlm2-19b")
return ModelList(data=[model_card])
@app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
async def create_chat_completion(request: ChatCompletionRequest):
global model, tokenizer
if len(request.messages) < 1 or request.messages[-1].role == "assistant":
raise HTTPException(status_code=400, detail="Invalid request")
gen_params = dict(
messages=request.messages,
temperature=request.temperature,
top_p=request.top_p,
max_tokens=request.max_tokens or 1024,
echo=False,
stream=request.stream,
repetition_penalty=request.repetition_penalty
)
if request.stream:
generate = predict(request.model, gen_params)
return EventSourceResponse(generate, media_type="text/event-stream")
response = generate_cogvlm(model, tokenizer, gen_params)
usage = UsageInfo()
message = ChatMessageResponse(
role="assistant",
content=response["text"],
)
logger.debug(f"==== message ====\n{message}")
choice_data = ChatCompletionResponseChoice(
index=0,
message=message,
)
task_usage = UsageInfo.model_validate(response["usage"])
for usage_key, usage_value in task_usage.model_dump().items():
setattr(usage, usage_key, getattr(usage, usage_key) + usage_value)
return ChatCompletionResponse(model=request.model, choices=[choice_data], object="chat.completion", usage=usage)
def predict(model_id: str, params: dict):
global model, tokenizer
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=DeltaMessage(role="assistant"),
finish_reason=None
)
chunk = ChatCompletionResponse(model=model_id, choices=[choice_data], object="chat.completion.chunk")
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
previous_text = ""
for new_response in generate_stream_cogvlm(model, tokenizer, params):
decoded_unicode = new_response["text"]
delta_text = decoded_unicode[len(previous_text):]
previous_text = decoded_unicode
delta = DeltaMessage(content=delta_text, role="assistant")
choice_data = ChatCompletionResponseStreamChoice(index=0, delta=delta)
chunk = ChatCompletionResponse(model=model_id, choices=[choice_data], object="chat.completion.chunk")
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
choice_data = ChatCompletionResponseStreamChoice(index=0, delta=DeltaMessage())
chunk = ChatCompletionResponse(model=model_id, choices=[choice_data], object="chat.completion.chunk")
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
def generate_cogvlm(model: AutoModelForCausalLM, tokenizer: AutoTokenizer, params: dict):
"""
Generates a response using the CogVLM2 model. It processes the chat history and image data, if any,
and then invokes the model to generate a response.
"""
response = None
for response in generate_stream_cogvlm(model, tokenizer, params):
pass
return response
def process_history_and_images(messages: List[ChatMessageInput]) -> Tuple[
Optional[str], Optional[List[Tuple[str, str]]], Optional[List[Image.Image]]]:
"""
Process history messages to extract text, identify the last user query,
and convert base64 encoded image URLs to PIL images.
Args:
messages(List[ChatMessageInput]): List of ChatMessageInput objects.
return: A tuple of three elements:
- The last user query as a string.
- Text history formatted as a list of tuples for the model.
- List of PIL Image objects extracted from the messages.
"""
formatted_history = []
image_list = []
last_user_query = ''
for i, message in enumerate(messages):
role = message.role
content = message.content
if isinstance(content, list): # text
text_content = ' '.join(item.text for item in content if isinstance(item, TextContent))
else:
text_content = content
if isinstance(content, list): # image
for item in content:
if isinstance(item, ImageUrlContent):
image_url = item.image_url.url
if image_url.startswith("data:image/jpeg;base64,"):
base64_encoded_image = image_url.split("data:image/jpeg;base64,")[1]
image_data = base64.b64decode(base64_encoded_image)
image = Image.open(BytesIO(image_data)).convert('RGB')
else:
response = requests.get(image_url, verify=False)
image = Image.open(BytesIO(response.content)).convert('RGB')
image_list.append(image)
if role == 'user':
if i == len(messages) - 1:  # the last user message
last_user_query = text_content
else:
formatted_history.append((text_content, ''))
elif role == 'assistant':
if formatted_history:
if formatted_history[-1][1] != '':
assert False, f"the last query is answered. answer again. {formatted_history[-1][0]}, {formatted_history[-1][1]}, {text_content}"
formatted_history[-1] = (formatted_history[-1][0], text_content)
else:
assert False, f"assistant reply before user"
else:
assert False, f"unrecognized role: {role}"
return last_user_query, formatted_history, image_list
@torch.inference_mode()
def generate_stream_cogvlm(model: AutoModelForCausalLM, tokenizer: AutoTokenizer, params: dict):
messages = params["messages"]
temperature = float(params.get("temperature", 1.0))
repetition_penalty = float(params.get("repetition_penalty", 1.0))
top_p = float(params.get("top_p", 1.0))
max_new_tokens = int(params.get("max_tokens", 256))
query, history, image_list = process_history_and_images(messages)
input_by_model = model.build_conversation_input_ids(tokenizer, query=query, history=history,
images=[image_list[-1]])
inputs = {
'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]],
}
if 'cross_images' in input_by_model and input_by_model['cross_images']:
inputs['cross_images'] = [[input_by_model['cross_images'][0].to(DEVICE).to(TORCH_TYPE)]]
input_echo_len = len(inputs["input_ids"][0])
streamer = TextIteratorStreamer(
tokenizer=tokenizer,
timeout=60.0,
skip_prompt=True,
skip_special_tokens=True
)
gen_kwargs = {
"repetition_penalty": repetition_penalty,
"max_new_tokens": max_new_tokens,
"do_sample": True if temperature > 1e-5 else False,
"top_p": top_p if temperature > 1e-5 else 0,
"top_k": 1,
'streamer': streamer,
}
if temperature > 1e-5:
gen_kwargs["temperature"] = temperature
generated_text = ""
def generate_text():
with torch.no_grad():
model.generate(**inputs, **gen_kwargs)
generation_thread = threading.Thread(target=generate_text)
generation_thread.start()
total_len = input_echo_len
for next_text in streamer:
generated_text += next_text
total_len = len(tokenizer.encode(generated_text))
yield {
"text": generated_text,
"usage": {
"prompt_tokens": input_echo_len,
"completion_tokens": total_len - input_echo_len,
"total_tokens": total_len,
},
}
generation_thread.join()
yield {
"text": generated_text,
"usage": {
"prompt_tokens": input_echo_len,
"completion_tokens": total_len - input_echo_len,
"total_tokens": total_len,
},
}
gc.collect()
torch.cuda.empty_cache()
if __name__ == "__main__":
# Argument parser
import argparse
parser = argparse.ArgumentParser(description="CogVLM2 Web Demo")
parser.add_argument('--quant', type=int, choices=[4, 8], help='Enable 4-bit or 8-bit precision loading', default=0)
args = parser.parse_args()
if 'int4' in MODEL_PATH:
args.quant = 4
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
# Load the model
if args.quant == 4:
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True,
load_in_4bit=True,
low_cpu_mem_usage=True
).eval()
elif args.quant == 8:
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True,
load_in_8bit=True, # Assuming transformers support this argument; check documentation if not
low_cpu_mem_usage=True
).eval()
else:
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True
).eval().to(DEVICE)
uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)
"""
This script is designed to mimic the OpenAI API interface with CogVLM2 Chat
It demonstrates how to integrate image and text-based input to generate a response.
Currently, the model can only handle a single image.
Therefore, do not use this script to process multiple images in one conversation. (includes images from history)
And it only works on the chat model, not the base model.
"""
import requests
import json
import base64
base_url = "http://127.0.0.1:8000"
def create_chat_completion(model, messages, temperature=0.8, max_tokens=2048, top_p=0.8, use_stream=False):
"""
This function sends a request to the chat API to generate a response based on the given messages.
Args:
model (str): The name of the model to use for generating the response.
messages (list): A list of message dictionaries representing the conversation history.
temperature (float): Controls randomness in response generation. Higher values lead to more random responses.
max_tokens (int): The maximum length of the generated response.
top_p (float): Controls diversity of response by filtering less likely options.
use_stream (bool): Determines whether to use a streaming response or a single response.
The function constructs a JSON payload with the specified parameters and sends a POST request to the API.
It then handles the response, either as a stream (for ongoing responses) or a single message.
"""
data = {
"model": model,
"messages": messages,
"stream": use_stream,
"max_tokens": max_tokens,
"temperature": temperature,
"top_p": top_p,
}
response = requests.post(f"{base_url}/v1/chat/completions", json=data, stream=use_stream)
if response.status_code == 200:
if use_stream:
# Handle the streaming response
for line in response.iter_lines():
if line:
decoded_line = line.decode('utf-8')[6:]
try:
response_json = json.loads(decoded_line)
content = response_json.get("choices", [{}])[0].get("delta", {}).get("content", "")
print(content)
except:
print("Special Token:", decoded_line)
else:
# Handle the non-streaming response
decoded_line = response.json()
content = decoded_line.get("choices", [{}])[0].get("message", "").get("content", "")
print(content)
else:
print("Error:", response.status_code)
return None
def encode_image(image_path):
"""
Encodes an image file into a base64 string.
Args:
image_path (str): The path to the image file.
This function opens the specified image file, reads its content, and encodes it into a base64 string.
The base64 encoding is used to send images over HTTP as text.
"""
with open(image_path, "rb") as image_file:
return base64.b64encode(image_file.read()).decode("utf-8")
def simple_image_chat(use_stream=True, img_path=None):
"""
Facilitates a simple chat interaction involving an image.
Args:
use_stream (bool): Specifies whether to use streaming for chat responses.
img_path (str): Path to the image file to be included in the chat.
This function encodes the specified image and constructs a predefined conversation involving the image.
It then calls `create_chat_completion` to generate a response from the model.
The conversation includes asking about the content of the image and a follow-up question.
"""
img_url = f"data:image/jpeg;base64,{encode_image(img_path)}"
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What’s in this image?",
},
{
"type": "image_url",
"image_url": {
"url": img_url
},
},
],
},
{
"role": "assistant",
"content": "The image displays a wooden boardwalk extending through a vibrant green grassy wetland. The sky is partly cloudy with soft, wispy clouds, indicating nice weather. Vegetation is seen on either side of the boardwalk, and trees are present in the background, suggesting that this area might be a natural reserve or park designed for ecological preservation and outdoor recreation. The boardwalk allows visitors to explore the area without disturbing the natural habitat.",
},
{
"role": "user",
"content": "Do you think this is a spring or winter photo?"
},
]
create_chat_completion("cogvlm2", messages=messages, use_stream=use_stream)
if __name__ == "__main__":
simple_image_chat(use_stream=False, img_path="demo.jpg")
xformers
torch>=2.0.0
torchvision
transformers>=4.40
huggingface-hub>=0.23.0
pillow
chainlit>=1.0
pydantic>=2.7.1
timm>=0.9.16
openai>=1.30.1
loguru>=0.7.2
einops
sse-starlette>=2.1.0
bitsandbytes>=0.43.1 # for int4 quantization
"""
This is a simple chat demo using CogVLM2 model in ChainLit.
"""
import os
import dataclasses
from typing import List
from PIL import Image
import chainlit as cl
from chainlit.input_widget import Slider
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
from huggingface_hub.inference._generated.types import TextGenerationStreamOutput, TextGenerationStreamOutputToken
import threading
import torch
MODEL_PATH = 'THUDM/cogvlm2-llama3-chat-19B'
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[
0] >= 8 else torch.float16
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
quant = int(os.environ.get('QUANT', 0))
if 'int4' in MODEL_PATH:
quant = 4
print(f'Quant = {quant}')
# Load the model
if quant == 4:
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True,
load_in_4bit=True,
low_cpu_mem_usage=True
).eval()
elif quant == 8:
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True,
load_in_8bit=True, # Assuming transformers support this argument; check documentation if not
low_cpu_mem_usage=True
).eval()
else:
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True
).eval().to(DEVICE)
@cl.on_chat_start
def on_chat_start():
print("Welcome use CogVLM2 chat demo")
async def get_response(query, history, gen_kwargs, images=None):
if images is None:
input_by_model = model.build_conversation_input_ids(
tokenizer,
query=query,
history=history,
template_version='chat'
)
else:
input_by_model = model.build_conversation_input_ids(
tokenizer,
query=query,
history=history,
images=images[-1:],  # only use the last image; CogVLM2 only supports one image
template_version='chat'
)
inputs = {
'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if images is not None else None,
}
streamer = TextIteratorStreamer(tokenizer, timeout=60.0, skip_prompt=True, skip_special_tokens=True)
gen_kwargs['streamer'] = streamer
gen_kwargs = {**gen_kwargs, **inputs}
with torch.no_grad():
thread = threading.Thread(target=model.generate, kwargs=gen_kwargs)
thread.start()
for next_text in streamer:
yield TextGenerationStreamOutput(
index=0,
token=TextGenerationStreamOutputToken(
id=0,
logprob=0,
text=next_text,
special=False,
)
)
@dataclasses.dataclass
class Conversation:
"""A class that keeps all conversation history."""
roles: List[str]
messages: List[List[str]]
version: str = "Unknown"
def append_message(self, role, message):
self.messages.append([role, message])
def get_prompt(self):
if not self.messages:
return None, []
last_role, last_msg = self.messages[-2]
if isinstance(last_msg, tuple):
query, _ = last_msg
else:
query = last_msg
history = []
for role, msg in self.messages[:-2]:
if isinstance(msg, tuple):
text, _ = msg
else:
text = msg
if role == "USER":
history.append((text, ""))
else:
if history:
history[-1] = (history[-1][0], text)
return query, history
def get_images(self):
for role, msg in reversed(self.messages):
if isinstance(msg, tuple):
msg, image = msg
if image is None:
continue
if image.mode != 'RGB':
image = image.convert('RGB')
width, height = image.size
if width > 1344 or height > 1344:
max_len = 1344
aspect_ratio = width / height
if width > height:
new_width = max_len
new_height = int(new_width / aspect_ratio)
else:
new_height = max_len
new_width = int(new_height * aspect_ratio)
image = image.resize((new_width, new_height))
return [image]
return None
def copy(self):
return Conversation(
roles=self.roles,
messages=[[x, y] for x, y in self.messages],
version=self.version,
)
def dict(self):
        if self.get_images():
return {
"roles": self.roles,
"messages": [
[x, y[0] if type(y) is tuple else y] for x, y in self.messages
],
}
return {
"roles": self.roles,
"messages": self.messages,
}
default_conversation = Conversation(
roles=("USER", "ASSISTANT"),
messages=()
)
async def request(conversation: Conversation, settings):
gen_kwargs = {
"temperature": settings["temperature"],
"top_p": settings["top_p"],
"max_new_tokens": int(settings["max_token"]),
"top_k": int(settings["top_k"]),
"do_sample": True,
}
query, history = conversation.get_prompt()
images = conversation.get_images()
chainlit_message = cl.Message(content="", author="CogVLM2")
text = ""
async for response in get_response(query, history, gen_kwargs, images):
output = response.token.text
text += output
conversation.messages[-1][-1] = text
await chainlit_message.stream_token(text, is_sequence=True)
await chainlit_message.send()
return conversation
@cl.on_chat_start
async def start():
settings = await cl.ChatSettings(
[
Slider(id="temperature", label="Temperature", initial=0.5, min=0.01, max=1, step=0.05),
Slider(id="top_p", label="Top P", initial=0.7, min=0, max=1, step=0.1),
Slider(id="top_k", label="Top K", initial=5, min=0, max=50, step=1),
Slider(id="max_token", label="Max output tokens", initial=2048, min=0, max=8192, step=1),
]
).send()
conversation = default_conversation.copy()
cl.user_session.set("conversation", conversation)
cl.user_session.set("settings", settings)
@cl.on_settings_update
async def setup_agent(settings):
cl.user_session.set("settings", settings)
@cl.on_message
async def main(message: cl.Message):
image = next(
(
Image.open(file.path)
for file in message.elements or []
if "image" in file.mime and file.path is not None
),
None,
)
conv = cl.user_session.get("conversation") # type: Conversation
settings = cl.user_session.get("settings")
text = message.content
conv_message = (text, image)
conv.append_message(conv.roles[0], conv_message)
conv.append_message(conv.roles[1], None)
conv = await request(conv, settings)
cl.user_session.set("conversation", conv)
"""
This is a demo for using CogVLM2 in the CLI on a single GPU.
We strongly suggest using a GPU with bfloat16 support; otherwise inference will be slow.
Note that only one image can be processed per conversation, which means you cannot replace or add another image once the conversation has started.
For multi-GPU inference, please use cli_demo_multi_gpus.py.
"""
import torch
import argparse
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import os
MODEL_PATH = "./cogvlm2-llama3-chinese-chat-19B"
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16
# Argument parser
parser = argparse.ArgumentParser(description="CogVLM2 CLI Demo")
parser.add_argument('--quant', type=int, choices=[4, 8], help='Enable 4-bit or 8-bit precision loading', default=0)
args = parser.parse_args()
if 'int4' in MODEL_PATH:
args.quant = 4
tokenizer = AutoTokenizer.from_pretrained(
MODEL_PATH,
trust_remote_code=True
)
# Check GPU memory
if torch.cuda.is_available() and torch.cuda.get_device_properties(0).total_memory < 48 * 1024 ** 3 and not args.quant:
print("GPU memory is less than 48GB. Please use cli_demo_multi_gpus.py or pass `--quant 4` or `--quant 8`.")
exit()
# Load the model
if args.quant == 4:
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True,
quantization_config=BitsAndBytesConfig(load_in_4bit=True),
low_cpu_mem_usage=True
).eval()
elif args.quant == 8:
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True,
quantization_config=BitsAndBytesConfig(load_in_8bit=True),
low_cpu_mem_usage=True
).eval()
else:
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True
).eval().to(DEVICE)
while True:
image_path = input("image path >>>>> ")
if image_path == '':
        print('You did not enter an image path; the following will be a text-only conversation.')
image = None
text_only_first_query = True
else:
image = Image.open(image_path).convert('RGB')
history = []
while True:
query = input("Human:")
if query == "clear":
break
if image is None:
input_by_model = model.build_conversation_input_ids(
tokenizer,
query=query,
history=history,
template_version='chat'
)
else:
input_by_model = model.build_conversation_input_ids(
tokenizer,
query=query,
history=history,
images=[image],
template_version='chat'
)
inputs = {
'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if image is not None else None,
}
gen_kwargs = {
"max_new_tokens": 2048,
"pad_token_id": 128002,
"top_k": 1,
}
with torch.no_grad():
outputs = model.generate(**inputs, **gen_kwargs)
outputs = outputs[:, inputs['input_ids'].shape[1]:]
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\nCogVLM2:", response)
history.append((query, response))
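# Interactive controls, as implemented above: enter an image path when prompted
# (or press Enter for a text-only conversation), then chat at the "Human:"
# prompt; type "clear" to end the current conversation and return to the image
# path prompt.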