---
license: mit
datasets:
- laion/laion2B-en
- laion/laion-coco
- laion/laion2B-multi
- kakaobrain/coyo-700m
- conceptual_captions
- wanng/wukong100m
pipeline_tag: image-feature-extraction
---
# InternViT-300M-448px
[\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[📜 InternVL 1.0 Paper\]](https://arxiv.org/abs/2312.14238) [\[📜 InternVL 1.5 Report\]](https://arxiv.org/abs/2404.16821)
[\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#quick-start) [\[📖 中文解读\]](https://zhuanlan.zhihu.com/p/706547971) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)
This update primarily focuses on enhancing the efficiency of the vision foundation model. We developed InternViT-300M-448px by distilling knowledge from the robust vision foundation model, [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5). Like its predecessor, InternViT-300M-448px features a dynamic input resolution of 448×448, with a basic tile size of 448×448. During training, it allows for 1 to 12 tiles, and expands to 1 to 40 tiles during testing. Additionally, it inherits the powerful robustness, OCR capability, and high-resolution processing capacity from InternViT-6B-448px-V1-5.
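As a rough illustration of the tiling scheme described above, the following sketch (a simplification, not the official InternVL preprocessing; the grid-selection heuristic and resizing details are assumptions) splits an arbitrary-resolution image into at most `max_tiles` crops of 448×448:

```python
from PIL import Image

def tile_image(img: Image.Image, tile: int = 448, max_tiles: int = 12):
    """Pick a cols x rows grid (cols * rows <= max_tiles) whose aspect ratio is
    closest to the image's, resize to that grid, and cut out tile x tile crops."""
    w, h = img.size
    best, best_diff = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            diff = abs(w / h - cols / rows)
            if diff < best_diff:
                best, best_diff = (cols, rows), diff
    cols, rows = best
    img = img.resize((cols * tile, rows * tile))
    return [img.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
            for r in range(rows) for c in range(cols)]
```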
## Model Details
- **Model Type:** vision foundation model, feature backbone
- **Model Stats:**
  - Params (M): 304
  - Image size: 448 x 448, training with 1 - 12 tiles
- **Pretrain Dataset:** LAION-en, LAION-zh, COYO, GRIT, COCO, TextCaps, Objects365, OpenImages, All-Seeing, Wukong-OCR, LaionCOCO-OCR, and other OCR-related datasets.
To enhance the OCR capability of the model, we have incorporated additional OCR data alongside the general caption datasets. Specifically, we utilized PaddleOCR to perform Chinese OCR on images from Wukong and English OCR on images from LAION-COCO.
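The exact OCR pipeline is not published here, but a minimal PaddleOCR pass of the kind described above might look like the following (the file name and settings are placeholders; the actual training pipeline may differ):

```python
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang='ch')    # use lang='en' for the LAION-COCO pass
result = ocr.ocr('wukong_example.jpg', cls=True)  # 'wukong_example.jpg' is a placeholder path
texts = [line[1][0] for line in result[0]]        # each detection is [box, (text, score)]
print(' '.join(texts))                            # recognized text used as OCR supervision
```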
## Model Usage (Image Embeddings)
```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
model = AutoModel.from_pretrained(
    'OpenGVLab/InternViT-300M-448px',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()

image = Image.open('./examples/image1.jpg').convert('RGB')

image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-300M-448px')

pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

outputs = model(pixel_values)
```
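The returned `outputs` should expose both per-patch features and a pooled image embedding (assuming the remote code returns a standard `transformers` pooled output, which is how InternViT checkpoints are typically wrapped):

```python
patch_features = outputs.last_hidden_state   # (1, num_tokens, hidden_size) token-level features
image_embedding = outputs.pooler_output      # (1, hidden_size) pooled embedding for retrieval/similarity
```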
## Citation
If you find this project useful in your research, please consider citing:
```BibTeX
@article{chen2023internvl,
  title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2312.14238},
  year={2023}
}

@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}
```
Tencent is pleased to support the open source community by making VITA available.
Copyright (C) 2024 THL A29 Limited, a Tencent company. All rights reserved.
VITA is licensed under the Apache License Version 2.0 except for the third-party components listed below.
Terms of the Apache License Version 2.0:
--------------------------------------------------------------------
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
You must give any other recipients of the Work or Derivative Works a copy of this License; and
You must cause any modified files to carry prominent notices stating that You changed the files; and
You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
Other dependencies and licenses:
Open Source Software Licensed under the BSD 3-Clause License and Other Licenses of the Third-Party Components therein:
--------------------------------------------------------------------
1. Pytorch
Copyright (c) 2016-present, Facebook Inc. All rights reserved.
All contributions by Facebook:
Copyright (c) 2016 Facebook Inc.
All contributions by Google:
Copyright (c) 2015 Google Inc.
All rights reserved.
All contributions by Yangqing Jia:
Copyright (c) 2015 Yangqing Jia
All rights reserved.
All contributions by Kakao Brain:
Copyright 2019-2020 Kakao Brain
All contributions by Cruise LLC:
Copyright (c) 2022 Cruise LLC.
All rights reserved.
All contributions from Caffe:
Copyright(c) 2013, 2014, 2015, the respective contributors
All rights reserved.
All other contributions:
Copyright(c) 2015, 2016 the respective contributors
All rights reserved.
Terms of the BSD 3-Clause License:
--------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
For the license of other third party components, please refer to the following URL:
https://github.com/pytorch/pytorch/blob/v2.3.1/NOTICE
Open Source Software Licensed under the Apache License Version 2.0:
--------------------------------------------------------------------
1. Transformers
Copyright 2018- The Hugging Face team. All rights reserved.
Source code of this software can be obtained from: https://github.com/huggingface/transformers
2. mistralai/Mixtral-8x7B-Instruct-v0.1
Copyright (c) mistralai/Mixtral-8x7B-Instruct-v0.1 original author and authors
Source code of this software can be obtained from: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
3. vllm
Copyright 2023 The vLLM team.
Source code of this software can be obtained from: https://github.com/vllm-project/vllm
4. gradio
Copyright (c) gradio original author and authors
Source code of this software can be obtained from: https://github.com/gradio-app/gradio
5. tencentcloud
Copyright (c) 2017-2021 THL A29 Limited, a Tencent company. All Rights Reserved.
Source code of this software can be obtained from: https://github.com/TencentCloud/tencentcloud-sdk-python
Terms of the Apache License Version 2.0:
--------------------------------------------------------------------
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
You must give any other recipients of the Work or Derivative Works a copy of this License; and
You must cause any modified files to carry prominent notices stating that You changed the files; and
You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
Open Source Software Licensed under the PSF 2.0:
--------------------------------------------------------------------
1. Python
Copyright ©2001-2024. Python Software Foundation
Terms of the PSF 2.0:
--------------------------------------------------------------------
A. HISTORY OF THE SOFTWARE
==========================
Python was created in the early 1990s by Guido van Rossum at Stichting
Mathematisch Centrum (CWI, see https://www.cwi.nl) in the Netherlands
as a successor of a language called ABC. Guido remains Python's
principal author, although it includes many contributions from others.
In 1995, Guido continued his work on Python at the Corporation for
National Research Initiatives (CNRI, see https://www.cnri.reston.va.us)
in Reston, Virginia where he released several versions of the
software.
In May 2000, Guido and the Python core development team moved to
BeOpen.com to form the BeOpen PythonLabs team. In October of the same
year, the PythonLabs team moved to Digital Creations, which became
Zope Corporation. In 2001, the Python Software Foundation (PSF, see
https://www.python.org/psf/) was formed, a non-profit organization
created specifically to own Python-related Intellectual Property.
Zope Corporation was a sponsoring member of the PSF.
All Python releases are Open Source (see https://opensource.org for
the Open Source Definition). Historically, most, but not all, Python
releases have also been GPL-compatible; the table below summarizes
the various releases.
Release Derived Year Owner GPL-
from compatible? (1)
0.9.0 thru 1.2 1991-1995 CWI yes
1.3 thru 1.5.2 1.2 1995-1999 CNRI yes
1.6 1.5.2 2000 CNRI no
2.0 1.6 2000 BeOpen.com no
1.6.1 1.6 2001 CNRI yes (2)
2.1 2.0+1.6.1 2001 PSF no
2.0.1 2.0+1.6.1 2001 PSF yes
2.1.1 2.1+2.0.1 2001 PSF yes
2.1.2 2.1.1 2002 PSF yes
2.1.3 2.1.2 2002 PSF yes
2.2 and above 2.1.1 2001-now PSF yes
Footnotes:
(1) GPL-compatible doesn't mean that we're distributing Python under
the GPL. All Python licenses, unlike the GPL, let you distribute
a modified version without making your changes open source. The
GPL-compatible licenses make it possible to combine Python with
other software that is released under the GPL; the others don't.
(2) According to Richard Stallman, 1.6.1 is not GPL-compatible,
because its license has a choice of law clause. According to
CNRI, however, Stallman's lawyer has told CNRI's lawyer that 1.6.1
is "not incompatible" with the GPL.
Thanks to the many outside volunteers who have worked under Guido's
direction to make these releases possible.
B. TERMS AND CONDITIONS FOR ACCESSING OR OTHERWISE USING PYTHON
===============================================================
Python software and documentation are licensed under the
Python Software Foundation License Version 2.
Starting with Python 3.8.6, examples, recipes, and other code in
the documentation are dual licensed under the PSF License Version 2
and the Zero-Clause BSD license.
Some software incorporated into Python is under different licenses.
The licenses are listed with code falling under that license.
PYTHON SOFTWARE FOUNDATION LICENSE VERSION 2
--------------------------------------------
1. This LICENSE AGREEMENT is between the Python Software Foundation
("PSF"), and the Individual or Organization ("Licensee") accessing and
otherwise using this software ("Python") in source or binary form and
its associated documentation.
2. Subject to the terms and conditions of this License Agreement, PSF hereby
grants Licensee a nonexclusive, royalty-free, world-wide license to reproduce,
analyze, test, perform and/or display publicly, prepare derivative works,
distribute, and otherwise use Python alone or in any derivative version,
provided, however, that PSF's License Agreement and PSF's notice of copyright,
i.e., "Copyright (c) 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010,
2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023 Python Software Foundation;
All Rights Reserved" are retained in Python alone or in any derivative version
prepared by Licensee.
3. In the event Licensee prepares a derivative work that is based on
or incorporates Python or any part thereof, and wants to make
the derivative work available to others as provided herein, then
Licensee hereby agrees to include in any such work a brief summary of
the changes made to Python.
4. PSF is making Python available to Licensee on an "AS IS"
basis. PSF MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PSF MAKES NO AND
DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS
FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF PYTHON WILL NOT
INFRINGE ANY THIRD PARTY RIGHTS.
5. PSF SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF PYTHON
FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS AS
A RESULT OF MODIFYING, DISTRIBUTING, OR OTHERWISE USING PYTHON,
OR ANY DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF.
6. This License Agreement will automatically terminate upon a material
breach of its terms and conditions.
7. Nothing in this License Agreement shall be deemed to create any
relationship of agency, partnership, or joint venture between PSF and
Licensee. This License Agreement does not grant permission to use PSF
trademarks or trade name in a trademark sense to endorse or promote
products or services of Licensee, or any third party.
8. By copying, installing or otherwise using Python, Licensee
agrees to be bound by the terms and conditions of this License
Agreement.
BEOPEN.COM LICENSE AGREEMENT FOR PYTHON 2.0
-------------------------------------------
BEOPEN PYTHON OPEN SOURCE LICENSE AGREEMENT VERSION 1
1. This LICENSE AGREEMENT is between BeOpen.com ("BeOpen"), having an
office at 160 Saratoga Avenue, Santa Clara, CA 95051, and the
Individual or Organization ("Licensee") accessing and otherwise using
this software in source or binary form and its associated
documentation ("the Software").
2. Subject to the terms and conditions of this BeOpen Python License
Agreement, BeOpen hereby grants Licensee a non-exclusive,
royalty-free, world-wide license to reproduce, analyze, test, perform
and/or display publicly, prepare derivative works, distribute, and
otherwise use the Software alone or in any derivative version,
provided, however, that the BeOpen Python License is retained in the
Software, alone or in any derivative version prepared by Licensee.
3. BeOpen is making the Software available to Licensee on an "AS IS"
basis. BEOPEN MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, BEOPEN MAKES NO AND
DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS
FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE SOFTWARE WILL NOT
INFRINGE ANY THIRD PARTY RIGHTS.
4. BEOPEN SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF THE
SOFTWARE FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS
AS A RESULT OF USING, MODIFYING OR DISTRIBUTING THE SOFTWARE, OR ANY
DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF.
5. This License Agreement will automatically terminate upon a material
breach of its terms and conditions.
6. This License Agreement shall be governed by and interpreted in all
respects by the law of the State of California, excluding conflict of
law provisions. Nothing in this License Agreement shall be deemed to
create any relationship of agency, partnership, or joint venture
between BeOpen and Licensee. This License Agreement does not grant
permission to use BeOpen trademarks or trade names in a trademark
sense to endorse or promote products or services of Licensee, or any
third party. As an exception, the "BeOpen Python" logos available at
http://www.pythonlabs.com/logos.html may be used according to the
permissions granted on that web page.
7. By copying, installing or otherwise using the software, Licensee
agrees to be bound by the terms and conditions of this License
Agreement.
CNRI LICENSE AGREEMENT FOR PYTHON 1.6.1
---------------------------------------
1. This LICENSE AGREEMENT is between the Corporation for National
Research Initiatives, having an office at 1895 Preston White Drive,
Reston, VA 20191 ("CNRI"), and the Individual or Organization
("Licensee") accessing and otherwise using Python 1.6.1 software in
source or binary form and its associated documentation.
2. Subject to the terms and conditions of this License Agreement, CNRI
hereby grants Licensee a nonexclusive, royalty-free, world-wide
license to reproduce, analyze, test, perform and/or display publicly,
prepare derivative works, distribute, and otherwise use Python 1.6.1
alone or in any derivative version, provided, however, that CNRI's
License Agreement and CNRI's notice of copyright, i.e., "Copyright (c)
1995-2001 Corporation for National Research Initiatives; All Rights
Reserved" are retained in Python 1.6.1 alone or in any derivative
version prepared by Licensee. Alternately, in lieu of CNRI's License
Agreement, Licensee may substitute the following text (omitting the
quotes): "Python 1.6.1 is made available subject to the terms and
conditions in CNRI's License Agreement. This Agreement together with
Python 1.6.1 may be located on the internet using the following
unique, persistent identifier (known as a handle): 1895.22/1013. This
Agreement may also be obtained from a proxy server on the internet
using the following URL: http://hdl.handle.net/1895.22/1013".
3. In the event Licensee prepares a derivative work that is based on
or incorporates Python 1.6.1 or any part thereof, and wants to make
the derivative work available to others as provided herein, then
Licensee hereby agrees to include in any such work a brief summary of
the changes made to Python 1.6.1.
4. CNRI is making Python 1.6.1 available to Licensee on an "AS IS"
basis. CNRI MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, CNRI MAKES NO AND
DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS
FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF PYTHON 1.6.1 WILL NOT
INFRINGE ANY THIRD PARTY RIGHTS.
5. CNRI SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF PYTHON
1.6.1 FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS AS
A RESULT OF MODIFYING, DISTRIBUTING, OR OTHERWISE USING PYTHON 1.6.1,
OR ANY DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF.
6. This License Agreement will automatically terminate upon a material
breach of its terms and conditions.
7. This License Agreement shall be governed by the federal
intellectual property law of the United States, including without
limitation the federal copyright law, and, to the extent such
U.S. federal law does not apply, by the law of the Commonwealth of
Virginia, excluding Virginia's conflict of law provisions.
Notwithstanding the foregoing, with regard to derivative works based
on Python 1.6.1 that incorporate non-separable material that was
previously distributed under the GNU General Public License (GPL), the
law of the Commonwealth of Virginia shall govern this License
Agreement only as to issues arising under or with respect to
Paragraphs 4, 5, and 7 of this License Agreement. Nothing in this
License Agreement shall be deemed to create any relationship of
agency, partnership, or joint venture between CNRI and Licensee. This
License Agreement does not grant permission to use CNRI trademarks or
trade name in a trademark sense to endorse or promote products or
services of Licensee, or any third party.
8. By clicking on the "ACCEPT" button where indicated, or by copying,
installing or otherwise using Python 1.6.1, Licensee agrees to be
bound by the terms and conditions of this License Agreement.
ACCEPT
CWI LICENSE AGREEMENT FOR PYTHON 0.9.0 THROUGH 1.2
--------------------------------------------------
Copyright (c) 1991 - 1995, Stichting Mathematisch Centrum Amsterdam,
The Netherlands. All rights reserved.
Permission to use, copy, modify, and distribute this software and its
documentation for any purpose and without fee is hereby granted,
provided that the above copyright notice appear in all copies and that
both that copyright notice and this permission notice appear in
supporting documentation, and that the name of Stichting Mathematisch
Centrum or CWI not be used in advertising or publicity pertaining to
distribution of the software without specific, written prior
permission.
STICHTING MATHEMATISCH CENTRUM DISCLAIMS ALL WARRANTIES WITH REGARD TO
THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS, IN NO EVENT SHALL STICHTING MATHEMATISCH CENTRUM BE LIABLE
FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT
OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
ZERO-CLAUSE BSD LICENSE FOR CODE IN THE PYTHON DOCUMENTATION
----------------------------------------------------------------------
Permission to use, copy, modify, and/or distribute this software for any
purpose with or without fee is hereby granted.
THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH
REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT,
INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR
OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
PERFORMANCE OF THIS SOFTWARE.
Open Source Software Licensed under the BSD 3-Clause License:
--------------------------------------------------------------------
1. numpy
Copyright (c) 2005-2023, NumPy Developers.
All rights reserved.
2. torchvision
Copyright (c) Soumith Chintala 2016,
All rights reserved.
A copy of the BSD 3-Clause License is included in this file.
Open Source Software Licensed under the HPND License:
--------------------------------------------------------------------
1. Pillow
Copyright © 2010-2024 by Jeffrey A. Clark and contributors
Terms of the HPND License:
--------------------------------------------------------------------
The Python Imaging Library (PIL) is
Copyright © 1997-2011 by Secret Labs AB
Copyright © 1995-2011 by Fredrik Lundh and contributors
Pillow is the friendly PIL fork. It is
Copyright © 2010-2024 by Jeffrey A. Clark and contributors
Like PIL, Pillow is licensed under the open source HPND License:
By obtaining, using, and/or copying this software and/or its associated
documentation, you agree that you have read, understood, and will comply
with the following terms and conditions:
Permission to use, copy, modify and distribute this software and its
documentation for any purpose and without fee is hereby granted,
provided that the above copyright notice appears in all copies, and that
both that copyright notice and this permission notice appear in supporting
documentation, and that the name of Secret Labs AB or the author not be
used in advertising or publicity pertaining to distribution of the software
without specific, written prior permission.
SECRET LABS AB AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS
SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS.
IN NO EVENT SHALL SECRET LABS AB OR THE AUTHOR BE LIABLE FOR ANY SPECIAL,
INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE
OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
PERFORMANCE OF THIS SOFTWARE.
Open Source Software Licensed under the Apache License Version 2.0 and Other Licenses of the Third-Party Components therein:
--------------------------------------------------------------------
1. decord
Copyright (c) 2019 by Contributors if not otherwise specified
Source code of this software can be obtained from: https://github.com/dmlc/decord
A copy of the Apache License Version 2.0 is included in this file.
For the license of other third party components, please refer to the following URL:
https://github.com/dmlc/decord/tree/master/3rdparty
Open Source Software Licensed under the BSD 2-Clause License and Other Licenses of the Third-Party Components therein:
--------------------------------------------------------------------
1. torchaudio
Copyright (c) 2017 Facebook Inc. (Soumith Chintala),
All rights reserved.
Terms of the BSD 2-Clause License:
--------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
For the license of other third party components, please refer to the following URL:
https://github.com/pytorch/audio/blob/v2.3.1/third_party/LICENSES_BUNDLED.txt
Open Source Software Licensed under the MIT License and Other Licenses of the Third-Party Components therein:
--------------------------------------------------------------------
1. tqdm
Copyright (c) 2013 noamraph
A copy of the MIT License is included in this file.
For the license of other third party components, please refer to the following URL:
https://github.com/tqdm/tqdm/blob/v4.66.4/LICENCE
Open Source Software Licensed under the MIT License:
--------------------------------------------------------------------
1. einops
Copyright (c) 2018 Alex Rogozhnikov
2. pyaudio
Copyright (c) 2006 Hubert Pham
3. onnxruntime
Copyright (c) Microsoft Corporation
A copy of the MIT License is included in this file.
# VITA
VITA can process video, images, text, and audio, delivers an advanced multimodal interactive experience, and can be activated without a wake-up word or button.
## Paper
`VITA: Towards Open-Source Interactive Omni Multimodal LLM`
- https://arxiv.org/pdf/2408.05211
## Model Architecture
The backbone VITA uses for feature extraction is Mixtral 8x7B, combined with separate encoders for audio, images, and video; each encoder is connected to Mixtral through an MLP.
<div align=center>
<img src="./doc/Mixtral.png"/>
</div>
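A minimal sketch of such an MLP connector is shown below; the hidden sizes (1024 for InternViT-300M features, 4096 for Mixtral) and the two-layer design are assumptions for illustration, not the exact VITA projector:

```python
import torch
import torch.nn as nn

# Hypothetical projector: map vision-encoder features into the LLM embedding space.
vision_projector = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
)

visual_tokens = torch.randn(1, 256, 1024)     # (batch, num_patches, vit_hidden)
llm_ready = vision_projector(visual_tokens)   # (batch, num_patches, llm_hidden)
# llm_ready can now be concatenated with the text token embeddings fed to Mixtral.
```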
The main difference between Mixtral 8x7B and LLaMA is its sparse mixture-of-experts (SMoE) layers.
<div align=center>
<img src="./doc/SMoE.png"/>
</div>
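To make the SMoE idea concrete, here is a toy top-2 routed MoE feed-forward layer in the spirit of Mixtral's block (dimensions and details are illustrative only and differ from the real model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTop2MoE(nn.Module):
    """Minimal top-2 sparse MoE feed-forward layer (illustration only)."""
    def __init__(self, hidden: int = 512, ffn: int = 2048, n_experts: int = 8):
        super().__init__()
        self.gate = nn.Linear(hidden, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, ffn), nn.SiLU(), nn.Linear(ffn, hidden))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (tokens, hidden)
        logits = self.gate(x)                               # (tokens, n_experts)
        weights, idx = torch.topk(logits, k=2, dim=-1)      # route each token to 2 experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```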
## Algorithm Overview
Built on the text modality, VITA seamlessly integrates three additional modalities: audio, image, and video. The core approach is fine-tuning to align text with the other modalities, carried out in three main steps: bilingual instruction tuning of the LLM, multimodal alignment, and multimodal instruction tuning, developed jointly with the interaction pipeline.
<div align=center>
<img src="./doc/vita.png"/>
</div>
## Environment Setup
```
mv vita_pytorch VITA # drop the framework suffix from the directory name
```
### Docker (Option 1)
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-ubuntu22.04-dtk24.04.2-py3.10
# Replace <your IMAGE ID> with the ID of the image pulled above; for this image it is 83714c19d308
docker run -it --shm-size=64G -v $PWD/VITA:/home/VITA -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name vita <your IMAGE ID> bash
cd /home/VITA
pip install -r requirements.txt # requirements.txt
# install ffmpeg
apt update
apt-get install ffmpeg
# install gradio
pip install gradio==5.4.0 # gradio
cp -r frpc_linux_amd64 /usr/local/lib/python3.10/site-packages/gradio/frpc_linux_amd64_v0.3
chmod +x /usr/local/lib/python3.10/site-packages/gradio/frpc_linux_amd64_v0.3
```
### Dockerfile (Option 2)
```
cd VITA/docker
docker build --no-cache -t vita:latest .
docker run --shm-size=64G --name vita -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v $PWD/../../VITA:/home/VITA -it vita bash
# If installing the environment through the Dockerfile takes too long, comment out the pip installs inside it and install the Python libraries after the container starts: pip install -r requirements.txt
cd /home/VITA
# install ffmpeg
apt update
apt-get install ffmpeg
# install gradio
pip install gradio==5.4.0 # gradio
cp -r frpc_linux_amd64 /usr/local/lib/python3.10/site-packages/gradio/frpc_linux_amd64_v0.3
chmod +x /usr/local/lib/python3.10/site-packages/gradio/frpc_linux_amd64_v0.3
```
### Anaconda (Option 3)
1. The DCU-specific deep learning libraries required by this project can be downloaded and installed from the developer community (光合开发者社区):
- https://developer.hpccube.com/tool/
```
DTK driver: dtk24.04.2
python: python3.10
torch: 2.3.0
torchvision: 0.18.1
torchaudio: 2.1.2
triton: 2.1.0
flash-attn: 2.0.4
deepspeed: 0.14.2
apex: 1.3.0
xformers: 0.0.25
```
`Tip: the versions of the DTK driver, python, torch, and the other DCU-related tools above must correspond exactly.`
2. Install the remaining (non-DCU-specific) libraries following requirements.txt:
```
cd VITA
pip install -r requirements.txt # requirements.txt
# install ffmpeg
apt update
apt-get install ffmpeg
# install gradio
pip install gradio==5.4.0 # gradio
cp -r frpc_linux_amd64 /usr/local/lib/python3.10/site-packages/gradio/frpc_linux_amd64_v0.3
chmod +x /usr/local/lib/python3.10/site-packages/gradio/frpc_linux_amd64_v0.3
```
## Datasets
[ShareGPT4V](http://113.200.138.88:18080/aidatasets/project-dependency/lin-chen/ShareGPT4V.git), [coco2017](http://113.200.138.88:18080/aidatasets/coco2017.git), [LLaVA-Pretrain](http://113.200.138.88:18080/aidatasets/liuhaotian/llava-pretrain.git), `sam`, `web-celebrity`, and similar public datasets are the ones referenced here; download sources for some of the latter can be requested from the paper's authors. The `custom` dataset is one you build yourself when fine-tuning for your own scenario; `input_wavs` holds the audio files it needs and `input_imgs` the image files, both used for prompts. None of these datasets are required for inference.
1. To fine-tune for your own scenario, create a JSON file `custom.json` as shown below. Each entry is a paired multimodal sample; the `set` value `sharegpt4` is the key that tells the loader to read image or video data. A small helper sketch for writing this file follows the example.
```
[
    ...
    {
        "set": "sharegpt4",
        "id": "000000000164",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\n<audio>\n"
            },
            {
                "from": "gpt", // following the setting of LLaVA, "gpt" is only used to indicate that this is the ground truth of the model output
                "value": "This is a well-organized kitchen with a clean, modern aesthetic. The kitchen features a white countertop against a white wall, creating a bright and airy atmosphere. "
            }
        ],
        "image": "coco/images/train2017/000000000164.jpg",
        "audio": [
            "new_value_dict_0717/output_wavs/f61cf238b7872b4903e1fc15dcb5a50c.wav"
        ]
    },
    ...
]
```
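If it is convenient, the entry above can also be written programmatically; the sketch below simply serializes the same example record (paths and caption are the example values and should be replaced with your own):

```python
import json

entry = {
    "set": "sharegpt4",
    "id": "000000000164",
    "conversations": [
        {"from": "human", "value": "<image>\n<audio>\n"},
        {"from": "gpt", "value": "This is a well-organized kitchen with a clean, modern aesthetic."},
    ],
    "image": "coco/images/train2017/000000000164.jpg",
    "audio": ["new_value_dict_0717/output_wavs/f61cf238b7872b4903e1fc15dcb5a50c.wav"],
}

with open("custom.json", "w", encoding="utf-8") as f:
    json.dump([entry], f, ensure_ascii=False, indent=2)
```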
2. After completing the JSON file, edit the config file [./vita/config/dataset_config.py](./vita/config/dataset_config.py): put the audio folder in `AudioFolder`, the set (image/video) folder in `sharegpt4`, and the JSON file path in `chat_path`:
```
AudioFolder = ""
FolderDict = {
    #### NaturalCap
    "sharegpt4": "",
}
#### NaturalCap
ShareGPT4V = {"chat_path": ""}
```
3. Move the custom dataset under ShareGPT4V for use:
```
mv custom/* ShareGPT4V/
```
The data should be organized as follows:
```
/home/VITA
├── input_wavs
├── xxx.wav
└── audio
├── xxx.wav
...
├── input_imgs
├── xxx.jpg
...
├── ShareGPT4V
├── custom.json
├── coco/images/train2017
├── xxx.jpg
...
...
└── sharegpt4v_instruct_gpt4-vision_cap100k.json
├── coco
└── train2017
├── xxx.jpg
...
├── llava
└── llava_pretrain
└── images
...
├── sam
└── images
└── web-celebrity
└── images
```
For more information, see the upstream project's [`README_origin`](./README_origin.md).
## Training
Not provided (hardware requirement: the authors estimate six 8-GPU machines are needed).
## Inference
```
# Option 1: PyTorch
# Text query
HIP_VISIBLE_DEVICES=0,1 python video_audio_demo.py --model_path VITA/VITA_ckpt --image_path asset/vita_log2.png --model_type mixtral-8x7b --conv_mode mixtral_two --question "请描述这张图片。"
# The audio inference feature in infer.sh will be opened up later: torchaudio support for .wav and other audio files is still being adapted for DCU and will be downloadable from the developer community once ready.
# Option 2: vLLM (still being optimized)
# First prepare the vLLM inference environment
apt update
apt-get install portaudio19-dev
apt-get install libasound-dev portaudio19-dev libportaudio2 libportaudiocpp0
pip install -r web_demo/web_demo_requirements.txt
pip install whl/vllm-0.6.2+das.opt1.85def94.dtk24042-cp310-cp310-linux_x86_64.whl
pip install whl/flash_attn-2.6.1+das.opt2.08f8827.dtk24042-cp310-cp310-linux_x86_64.whl
cp -r VITA/VITA_ckpt/ demo_VITA_ckpt/
cd ./web_demo/vllm_tools
cp -rf model_weight_file/* ../../demo_VITA_ckpt/
cp -rf vllm_file/* /usr/local/lib/python3.10/site-packages/vllm/model_executor/models/ # run `pip show vllm` to locate the vllm installation
cp -rf multimodal/* /usr/local/lib/python3.10/site-packages/vllm/multimodal/
# Run inference
python -m web_demo.web_ability_demo demo_VITA_ckpt/
```
For more information, see the upstream project's [`README_origin`](./README_origin.md).
## Results
Example inference output for text-based Q&A:
`Input:`
```
Image: asset/vita_log2.png
Text: "请描述这张图片。"
```
`Output:`
```
<3> 这张图片展示了一个标志和一段文字。标志位于图片的上方,由一个橙色的“VITA”字样组成,字体设计独特,具有现代感。标志下方是一段蓝色的文字,内容为“Towards Open-Source Interactive Omni Multimodal LLM”。这段文字传达了一个信息,即该项目或产品正在向开源、交互式、全模态的语言模型(LLM)发展。
整体来看,这张图片传达了一种科技感和未来感,暗示着该项目或产品在语言模型领域的创新和进步。
```
### Accuracy
DCU accuracy is consistent with GPU; inference framework: PyTorch.
## Application Scenarios
### Algorithm Category
`Conversational Q&A`
### Key Application Industries
`Manufacturing, media, finance, energy, healthcare, smart home, education`
## Pretrained Weights
After downloading the pretrained weights, arrange them as follows:
```
/home/VITA
├── VITA/VITA_ckpt
├── audio-encoder-2wh_zh_en_audioset_Mixtral-8x7B_New-base-tunning
└── InternViT-300M-448px
```
Fast download hub for pretrained weights: [SCNet AIModels](http://113.200.138.88:18080/aimodels). The pretrained weights used in this project can be fetched from the fast download channel: [VITA/VITA_ckpt](http://113.200.138.88:18080/aimodels/vita-mllm/VITA.git), [InternViT-300M-448px](http://113.200.138.88:18080/aimodels/opengvlab/InternViT-300M-448px.git)
Hugging Face download links: [VITA/VITA_ckpt](https://huggingface.co/VITA-MLLM/VITA), [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px)
## Source Repository and Issue Reporting
- http://developer.sourcefind.cn/codes/modelzoo/vita_pytorch.git
## References
- https://github.com/VITA-MLLM/VITA.git
# VITA: Towards Open-Source Interactive Omni Multimodal LLM
<p align="center">
<img src="./asset/vita_log2.png" width="100%" height="100%">
</p>
<font size=7><div align='center' > [[🍎 Project Page](https://vita-home.github.io/)] [[📖 arXiv Paper](https://arxiv.org/pdf/2408.05211)] [[🤗 Hugging Face](https://huggingface.co/VITA-MLLM)] [[💬 WeChat (微信)](./asset/wechat_4.jpg)]</div></font>
---
<p align="center">
<img src="./asset/vita.png" width="85%" height="85%">
</p>
## 🔥 News
* **`2024.09.06`** 🌟 The training code, deployment code, and model weights **have been released**. Thanks for the long wait!
* **`2024.08.12`** 🌟 We are very proud to launch VITA, the first-ever open-source interactive omni multimodal LLM! The open-source code has been submitted and is under internal review. We are moving the process forward as quickly as possible, so stay tuned!
## Contents <!-- omit in toc -->
- [VITA Overview](#-vita-overview)
- [Experimental Results](#-experimental-results)
- [Training](#-training)
- [Requirements and Installation](#requirements-and-installation)
- [Data Preparation](#data-preparation)
- [Continual Training](#continual-training)
- [Inference](#-inference)
- [Quick Start](#quick-start)
- [Demo](#demo)
- [Basic Demo](#-basic-demo)
- [Real-Time Interactive Demo](#-real-time-interactive-demo)
## 👀 VITA Overview
The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce **VITA**, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of **V**ideo, **I**mage, **T**ext, and **A**udio modalities, which at the same time offers an advanced multimodal interactive experience. Our work distinguishes itself from existing open-source MLLMs through **three key features**:
- **Omni Multimodal Understanding**. VITA demonstrates robust foundational capabilities of multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks.
- **Non-awakening Interaction**. VITA can be activated and respond to user audio questions in the environment without the need for a wake-up word or button.
- **Audio Interrupt Interaction**. VITA is able to simultaneously track and filter external queries in real-time. This allows users to interrupt the model's generation at any time with new questions, and VITA will respond to the new query accordingly.
<p align="center">
<img src="./asset/VITA_features.png" width="88%" height="88%">
</p>
VITA is capable of processing inputs in the form of pure text/audio, as well as video/image combined with text/audio. Besides, two key techniques are adopted to advance the multimodal interactive experience:
- **State Token**. We set different state tokens for different query inputs. <1> corresponds to an effective audio query, such as “what is the biggest animal in the world?”, to which we expect a response from the model. <2> corresponds to noisy audio, such as someone in the environment calling the user to eat, to which we expect the model not to reply. <3> corresponds to a text query, i.e., a question given by the user in text form. During training, we teach the model to automatically distinguish these kinds of input queries; during deployment, <2> allows us to implement non-awakening interaction.
- **Duplex Scheme**. We further introduce a duplex scheme for audio-interrupt interaction. Two models run at the same time: the generation model handles user queries while the other model monitors the environment. If the user interrupts with another effective audio query, the monitoring model aggregates the historical context and responds to the latest query, while the generation model pauses and switches to monitoring, i.e., the two models swap roles. A toy sketch of the state-token handling follows the figure below.
<p align="center">
<img src="./asset/VITA_duplex.png" width="88%" height="88%">
</p>
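As a toy illustration of the state-token mechanism (this is not the actual VITA code; the token spelling and dispatch logic are assumptions), the deployment side can simply inspect the leading token of the generated answer:

```python
EFFECTIVE_AUDIO, NOISY_AUDIO, TEXT_QUERY = "<1>", "<2>", "<3>"

def handle_generation(generated: str):
    """Return the answer to surface, or None when the model flagged noisy audio."""
    if generated.startswith(NOISY_AUDIO):
        return None                              # non-awakening behaviour: stay silent
    for token in (EFFECTIVE_AUDIO, TEXT_QUERY):
        if generated.startswith(token):
            return generated[len(token):].lstrip()
    return generated                             # no state token found, pass through

print(handle_generation("<2>someone is talking nearby"))              # -> None
print(handle_generation("<1>The blue whale is the biggest animal."))  # -> answer text
```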
## 📈 Experimental Results
- **Comparison of official Mixtral 8x7B Instruct and our trained Mixtral 8x7B**.
<p align="center">
<img src="./asset/language_eval2.png" width="68%" height="50%">
</p>
- **Evaluation of Error Rate on ASR tasks.**
<p align="center">
<img src="./asset/audio_eval.jpg" width="96%" height="96%">
</p>
- **Evaluation on image and video understanding.**
<p align="center">
<img src="./asset/visual_eval.jpg" width="100%" height="100%">
</p>
## ⭐ Training
### Requirements and Installation
```
git clone https://github.com/VITA-MLLM/VITA
cd VITA
conda create -n vita python=3.10 -y
conda activate vita
pip install --upgrade pip
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
### Data Preparation
- An example json file of the training data:
```
[
    ...
    {
        "set": "sharegpt4",
        "id": "000000000164",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\n<audio>\n"
            },
            {
                "from": "gpt", // following the setting of LLaVA, "gpt" is only used to indicate that this is the ground truth of the model output
                "value": "This is a well-organized kitchen with a clean, modern aesthetic. The kitchen features a white countertop against a white wall, creating a bright and airy atmosphere. "
            }
        ],
        "image": "coco/images/train2017/000000000164.jpg",
        "audio": [
            "new_value_dict_0717/output_wavs/f61cf238b7872b4903e1fc15dcb5a50c.wav"
        ]
    },
    ...
]
```
- The `set` field is used to retrieve the image or video folder for data loading. You should add its key-value pair to the `FolderDict` in [./vita/config/dataset_config.py](./vita/config/dataset_config.py):
```python
AudioFolder = ""
FolderDict = {
#### NaturalCap
"sharegpt4": "",
}
#### NaturalCap
ShareGPT4V = {"chat_path": ""}
```
- Set the JSON path for `"chat_path"` in the corresponding dictionary in [./vita/config/dataset_config.py](./vita/config/dataset_config.py).
- Set the audio folder path for `AudioFolder` in [./vita/config/dataset_config.py](./vita/config/dataset_config.py).
- Add the data class to `DataConfig` in [./vita/config/\_\_init\_\_.py](./vita/config/__init__.py) (a sketch of how these settings resolve a sample's paths follows the snippet below):
```python
from .dataset_config import *
NaturalCap = [ShareGPT4V]
DataConfig = {
"Pretrain_video": NaturalCap,
}
```
### Continual Training
- Download the required weights: (1) [VITA checkpoint](https://huggingface.co/VITA-MLLM/VITA), (2) [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px), and (3) [our pretrained audio encoder](https://huggingface.co/VITA-MLLM/VITA) from Stage-2 audio-language alignment (see Fig. 3 in the paper).
- Replace the paths in [./script/train/finetuneTask_nodes.sh](./script/train/finetuneTask_nodes.sh):
```bash
...
--model_name_or_path VITA_ckpt \
...
--vision_tower InternViT-300M-448px \
...
--audio_encoder audio-encoder-2wh_zh_en_audioset_Mixtral-8x7B_New-base-tunning \
...
```
- Execute the following commands to start the training process:
```bash
export PYTHONPATH=./
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
OUTPUT_DIR=/mnt/cfs/lhj/videomllm_ckpt/outputs/vita_video_audio
bash script/train/finetuneTask_nodes.sh ${OUTPUT_DIR}
```
## 📐 Inference
### Quick Start
- Text query
```bash
CUDA_VISIBLE_DEVICES=0,1 python video_audio_demo.py \
--model_path [vita/path] \
--image_path asset/vita_log2.png \
--model_type mixtral-8x7b \
--conv_mode mixtral_two \
    --question "请描述这张图片。"
```
- Audio query
```bash
CUDA_VISIBLE_DEVICES=0,1 python video_audio_demo.py \
--model_path [vita/path] \
--image_path asset/vita_log2.png \
--model_type mixtral-8x7b \
--conv_mode mixtral_two \
--audio_path asset/q1.wav
```
- Noisy audio query
```bash
CUDA_VISIBLE_DEVICES=0,1 python video_audio_demo.py \
--model_path [vita/path] \
--image_path asset/vita_log2.png \
--model_type mixtral-8x7b \
--conv_mode mixtral_two \
--audio_path asset/q2.wav
```
### Demo
We have accelerated the model using [vLLM](https://github.com/vllm-project/vllm).
Since VITA has not yet been integrated into vLLM, you need to make some modifications to the vLLM code to adapt it for VITA.
```bash
conda create -n vita_demo python==3.10
conda activate vita_demo
pip install -r web_demo/web_demo_requirements.txt
# Copy the weights into a new directory for the demo
cp -r VITA_ckpt/ demo_VITA_ckpt/
cd ./web_demo/vllm_tools
cp -rf model_weight_file/* ../../demo_VITA_ckpt/
cp -rf vllm_file/* your_anaconda/envs/vita_demo/lib/python3.10/site-packages/vllm/model_executor/models/
```
#### 📍 Basic Demo
https://github.com/user-attachments/assets/bdc7e9d1-a7d3-432e-aae8-5de493a5c042
```bash
python -m web_demo.web_ability_demo demo_VITA_ckpt/
```
#### 📍 Real-Time Interactive Demo
To have a good interactive experience, please pay attention to the following three points:
- **Ensure a high-speed network connection**.
- **Use high-performance GPUs for deployment**. In the demo video, we use 4 NVIDIA H20 GPUs; A800, H800, or A100 GPUs will perform even better.
- **Maintain a quiet environment**.
https://github.com/user-attachments/assets/5f375464-a77c-4dce-b2b5-7897c230bb9b
To run the real-time interactive demo, you need to make the following preparations:
- Prepare a VAD (Voice Activity Detection) module.
You can download [silero_vad.onnx](https://github.com/snakers4/silero-vad/tree/v4.0/files) and [silero_vad.jit](https://github.com/snakers4/silero-vad/tree/v4.0/files) and place the files in the `./web_demo/wakeup_and_vad/resource/` directory (a minimal loading sketch is given after the launch command below).
- Prepare a TTS (Text-to-Speech) module and modify the `tts_transform_text` function in [./web_demo/web_interactive_demo.py](./web_demo/web_interactive_demo.py).
The demo uses the Tencent Cloud API by default.
You can register with [Tencent Cloud](https://cloud.tencent.com/product/tts) to obtain a TTS API key,
then fill it in on line 43 of [./web_demo/web_interactive_demo.py](./web_demo/web_interactive_demo.py).
You can also use other APIs or open-source TTS modules.
- For a better real-time interactive experience, set `max_dynamic_patch` to 1 in `demo_VITA_ckpt/config.json`, e.g., as in the sketch below.
When you run the basic demo, you can keep the default value of 12 to enhance the model's visual capabilities.
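A minimal sketch of this config tweak, assuming the `demo_VITA_ckpt` copy created above:
```python
# Minimal sketch: switch max_dynamic_patch before launching the interactive demo.
import json

cfg_path = "demo_VITA_ckpt/config.json"
with open(cfg_path) as f:
    cfg = json.load(f)
cfg["max_dynamic_patch"] = 1   # use 1 for real-time interaction, 12 for the basic demo
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```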
```bash
python -m web_demo.web_interactive_demo
```
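For the VAD module mentioned above, here is a minimal sketch of running the silero-vad JIT checkpoint on an audio chunk; the chunk length, threshold, and file path are assumptions, and the actual integration in `web_demo/wakeup_and_vad` may differ.
```python
# Minimal sketch of silero-vad speech detection; chunk size and threshold are
# assumptions, and the demo's own integration may differ.
import torch

vad = torch.jit.load("web_demo/wakeup_and_vad/resource/silero_vad.jit")
vad.eval()

sample_rate = 16000
chunk = torch.zeros(1536)                        # ~96 ms of audio at 16 kHz (silence here)
speech_prob = vad(chunk, sample_rate).item()     # probability that the chunk contains speech
print(f"speech probability: {speech_prob:.3f}")  # gate the pipeline on, e.g., prob > 0.5
```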
## ✒️ Citation
If you find our work helpful for your research, please consider citing:
```bibtex
@article{fu2024vita,
title={Vita: Towards open-source interactive omni multimodal llm},
author={Fu, Chaoyou and Lin, Haojia and Long, Zuwei and Shen, Yunhang and Zhao, Meng and Zhang, Yifan and Wang, Xiong and Yin, Di and Ma, Long and Zheng, Xiawu and others},
journal={arXiv preprint arXiv:2408.05211},
year={2024}
}
```
## &#x1F4E3; Statement
**VITA is trained on a large-scale open-source corpus, and its outputs have inherent randomness. Any content generated by VITA does not represent the views of the model developers. We are not responsible for any problems arising from the use, misuse, or dissemination of VITA, including but not limited to public opinion risks and data security issues.**
## 📜 Related Works
Explore our related research:
- **[Awesome-MLLM]** [A Survey on Multimodal Large Language Models](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models)
- **[MME]** [MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation)
- **[Video-MME]** [Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis](https://github.com/BradyFU/Video-MME)
## 👍 Acknowledgement
VITA is built with reference to the following outstanding works: [LLaVA-1.5](https://github.com/haotian-liu/LLaVA), [Bunny](https://github.com/BAAI-DCAI/Bunny), [ChatUnivi](https://github.com/PKU-YuanGroup/Chat-UniVi), [InternVL](https://github.com/OpenGVLab/InternVL), [InternViT](https://huggingface.co/OpenGVLab/InternViT-300M-448px), and [Mixtral 8x7B](https://mistral.ai/news/mixtral-of-experts/).
Thanks!
---
license: cc-by-nc-4.0
task_categories:
- visual-question-answering
- question-answering
language:
- en
pretty_name: ShareGPT4V Captions 1.2M Dataset Card
size_categories:
- 1M<n
configs:
- config_name: ShareGPT4V
data_files: sharegpt4v_instruct_gpt4-vision_cap100k.json
- config_name: ShareGPT4V-PT
data_files: share-captioner_coco_lcs_sam_1246k_1107.json
---
# News
**[2024/5/8]** We released **[ShareGPT4Video](https://sharegpt4video.github.io/)**, a large-scale video-caption dataset, with **40K** captions annotated by GPT4V and **4.8M** captions annotated by our ShareCaptioner-Video, covering **300** and **3,000** hours of video, respectively!
# ShareGPT4V 1.2M Dataset Card
## Dataset details
**Dataset type:**
ShareGPT4V Captions 1.2M is a set of multimodal caption data powered by GPT4-Vision.
It is constructed to enhance modality alignment and fine-grained visual concept perception in Large Multi-Modal Models (LMMs) during both the pre-training and supervised fine-tuning stages, with the aim of bringing LMMs toward GPT4-Vision-level capabilities. A minimal loading sketch follows the file list below.
* sharegpt4v_instruct_gpt4-vision_cap100k.json is generated by GPT4-Vision (ShareGPT4V).
* share-captioner_coco_lcs_sam_1246k_1107.json is generated by our Share-Captioner trained on GPT4-Vision-generated data (ShareGPT4V-PT).
* sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json is curated from sharegpt4v_instruct_gpt4-vision_cap100k.json for the supervised fine-tuning stage.
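The following is a minimal sketch, assuming one of the JSON files listed above has been downloaded locally; the `image`/`conversations` field names follow the LLaVA-style caption format used by ShareGPT4V.
```python
# Minimal sketch: load a locally downloaded caption JSON and inspect one record;
# field names follow the LLaVA-style format (an assumption to verify on your copy).
from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files="sharegpt4v_instruct_gpt4-vision_cap100k.json",
    split="train",
)
print(ds[0]["image"])
print(ds[0]["conversations"][0]["value"][:80])
```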
**Dataset date:**
ShareGPT4V Captions 1.2M was collected on November 7, 2023.
**Paper or resources for more information:**
[[Project](https://ShareGPT4V.github.io/)] [[Paper](https://huggingface.co/papers/2311.12793)] [[Code](https://github.com/ShareGPT4Omni/ShareGPT4V)]
**License:**
Attribution-NonCommercial 4.0 International
Use of this dataset should abide by OpenAI's terms of use: https://openai.com/policies/terms-of-use
## Intended use
**Primary intended uses:**
The primary use of ShareGPT4V Captions 1.2M is research on large multimodal models and chatbots.
**Primary intended users:**
The primary intended users of this dataset are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
---
license: apache-2.0
---
## Quick Start
Please refer to this [Repo](https://github.com/VITA-MLLM/VITA).
## ACCEPTABLE USE POLICY
Any license on the model is subject to your compliance with the Acceptable Use Policy, and You must not violate (or encourage or permit anyone else to violate) any term of the Acceptable Use Policy. Tencent reserves the right to update this Acceptable Use Policy from time to time.
Tencent endeavors to promote safe and fair use of its tools and features, including VITA. You agree not to use VITA or any of its derivatives:
1. In any way that violates any applicable national, federal, state, local, international or any other law or regulation;
2. To harm Yourself or others;
3. To repurpose or distribute output from VITA or any of its derivatives to harm Yourself or others;
4. To override or circumvent the safety guardrails and safeguards We have put in place;
5. For the purpose of exploiting, harming or attempting to exploit or harm minors in any way;
6. To generate or disseminate verifiably false information and/or content with the purpose of harming others or influencing elections;
7. To generate or facilitate false online engagement, including fake reviews and other means of fake online engagement;
8. To intentionally defame, disparage or otherwise harass others;
9. To generate and/or disseminate malware (including ransomware) or any other content to be used for the purpose of harming electronic systems;
10. To generate or disseminate personal identifiable information with the purpose of harming others;
11. To generate or disseminate information (including images, code, posts, articles) and place the information in any public context (including through the use of bot-generated tweets) without expressly and conspicuously identifying that the information and/or content is machine generated;
12. To impersonate another individual without consent, authorization, or legal right;
13. To make high-stakes automated decisions in domains that affect an individual’s safety, rights or wellbeing (e.g., law enforcement, migration, medicine/health, management of critical infrastructure, safety components of products, essential services, credit, employment, housing, education, social scoring, or insurance);
14. In a manner that violates or disrespects the social ethics and moral standards of other countries or regions;
15. To perform, facilitate, threaten, incite, plan, promote or encourage violent extremism or terrorism;
16. For any use intended to discriminate against or harm individuals or groups based on protected characteristics or categories, online or offline social behavior or known or predicted personal or personality characteristics;
17. To intentionally exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm;
18. For military purposes;
19. To engage in the unauthorized or unlicensed practice of any profession including, but not limited to, financial, legal, medical/health, or other professional practices.
{
"_name_or_path": "Mixtral-8x7B_modVocab/mg2hg",
"architectures": [
"VITAMixtralForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"freeze_audio_encoder": true,
"freeze_audio_encoder_adapter": true,
"freeze_mm_mlp_adapter": false,
"hidden_act": "silu",
"hidden_size": 4096,
"image_aspect_ratio": "square",
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 32768,
"mm_audio_encoder": "audio-encoder-2wh_zh_en_audioset_Mixtral-8x7B_New-base-tunning",
"mm_hidden_size": 4096,
"mm_projector_lr": null,
"mm_projector_type": "mlp2x_gelu",
"mm_vision_tower": "InternViT-300M-448px",
"model_type": "vita-mixtral",
"num_attention_heads": 32,
"num_experts_per_tok": 2,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"num_local_experts": 8,
"output_router_logits": false,
"rms_norm_eps": 1e-05,
"rope_theta": 1000000.0,
"router_aux_loss_coef": 0.02,
"router_jitter_noise": 0.0,
"sliding_window": null,
"tie_word_embeddings": false,
"tokenizer_model_max_length": 9100,
"tokenizer_padding_side": "right",
"torch_dtype": "bfloat16",
"transformers_version": "4.41.1",
"tune_mm_mlp_adapter": false,
"unfreeze_vision_tower": false,
"use_cache": false,
"use_mm_proj": true,
"use_s2": false,
"vocab_size": 51760
}
from .dataset_config import *
NaturalCap = [ShareGPT4V]
DataConfig = {
"Pretrain_video": NaturalCap,
}
NoPatchSets = ["khair", "jester"]
AudioFolder = "input_wavs"
FolderDict = {
#### NaturalCap
"sharegpt4": "ShareGPT4V",
}
#### NaturalCap
# ShareGPT4V = {"chat_path": "ShareGPT4V/sharegpt4v_instruct_gpt4-vision_cap100k.json"}
ShareGPT4V = {"chat_path": "ShareGPT4V/custom.json"}
# Model Constants
MAX_IMAGE_LENGTH = 16 # 8#16#32
MIN_IMAGE_LENGTH = 4
IGNORE_INDEX = -100
IMAGE_TOKEN_INDEX = -200
AUDIO_TOKEN_INDEX = -500
DEFAULT_IMAGE_TOKEN = "<image>"
DEFAULT_VIDEO_TOKEN = "<video>"
DEFAULT_AUDIO_TOKEN = "<audio>"
CONTROLLER_HEART_BEAT_EXPIRATION = 30
LOGDIR = "gradio-logs"
WORKER_HEART_BEAT_INTERVAL = 15
DEFAULT_DATA_RATIO = 1.0
GLOBAL_WEIGHTS_PATH = ""
import dataclasses
from enum import Enum, auto
from typing import List
class SeparatorStyle(Enum):
"""Different separator style."""
TWO = auto()
PLAIN = auto()
MixtralZh = auto()
MixtralTwo = auto()
@dataclasses.dataclass
class Conversation:
"""A class that keeps all conversation history."""
system: str
roles: List[str]
messages: List[List[str]]
offset: int
sep_style: SeparatorStyle
sep: str = "###"
sep2: str = None
version: str = "Unknown"
skip_next: bool = False
def get_prompt(self, modality=None):
messages = self.messages
if len(messages) > 0 and type(messages[0][1]) is tuple:
messages = self.messages.copy()
init_role, init_msg = messages[0].copy()
init_msg = init_msg[0].replace("<image>", "").strip()
if "mmtag" in self.version:
messages[0] = (init_role, init_msg)
messages.insert(0, (self.roles[0], "<Image><image></Image>"))
messages.insert(1, (self.roles[1], "Received."))
else:
messages[0] = (init_role, "<image>\n" + init_msg)
if self.sep_style == SeparatorStyle.TWO:
seps = [self.sep, self.sep2]
ret = self.system + seps[0]
for i, (role, message) in enumerate(messages):
if message:
if type(message) is tuple:
message, _, _ = message
ret += role + ": " + message + seps[i % 2]
else:
ret += role + ":"
elif self.sep_style == SeparatorStyle.MixtralZh:
seps = [self.sep, self.sep2]
ret = "system:" + self.system + seps[0]
for i, (role, message) in enumerate(messages):
if message:
if type(message) is tuple:
message, _, _ = message
ret += "\n" + role + ":" + message + seps[i % 2]
else:
ret += "\n" + role + ":"
elif self.sep_style == SeparatorStyle.MixtralTwo:
seps = [self.sep, self.sep2]
has_image = False
for i, (role, message) in enumerate(messages):
if message and "<image>" in message:
has_image = True
break
if has_image:
assert modality == "image" or modality == "video"
if modality == "image":
self.system = self.system[0]
elif modality == "video":
self.system = self.system[1]
else:
raise ValueError
else:
assert modality == "lang"
self.system = self.system[2]
ret = "system:" + self.system + seps[0]
for i, (role, message) in enumerate(messages):
if message:
if type(message) is tuple:
message, _, _ = message
ret += "\n" + role + ":" + message + seps[i % 2]
else:
ret += "\n" + role + ":"
elif self.sep_style == SeparatorStyle.PLAIN:
seps = [self.sep, self.sep2]
ret = self.system
for i, (role, message) in enumerate(messages):
if message:
if type(message) is tuple:
message, _, _ = message
ret += message + seps[i % 2]
else:
ret += ""
else:
raise ValueError(f"Invalid style: {self.sep_style}")
return ret
def append_message(self, role, message):
self.messages.append([role, message])
def get_images(self, return_pil=False):
images = []
for i, (role, msg) in enumerate(self.messages[self.offset :]):
if i % 2 == 0:
if type(msg) is tuple:
import base64
from io import BytesIO
from PIL import Image
msg, image, image_process_mode = msg
if image_process_mode == "Pad":
def expand2square(pil_img, background_color=(122, 116, 104)):
width, height = pil_img.size
if width == height:
return pil_img
elif width > height:
result = Image.new(pil_img.mode, (width, width), background_color)
result.paste(pil_img, (0, (width - height) // 2))
return result
else:
result = Image.new(pil_img.mode, (height, height), background_color)
result.paste(pil_img, ((height - width) // 2, 0))
return result
image = expand2square(image)
elif image_process_mode in ["Default", "Crop"]:
pass
elif image_process_mode == "Resize":
image = image.resize((336, 336))
else:
raise ValueError(f"Invalid image_process_mode: {image_process_mode}")
if return_pil:
images.append(image)
else:
buffered = BytesIO()
image.save(buffered, format="PNG")
img_b64_str = base64.b64encode(buffered.getvalue()).decode()
images.append(img_b64_str)
return images
def to_gradio_chatbot(self):
ret = []
for i, (role, msg) in enumerate(self.messages[self.offset :]):
if i % 2 == 0:
if type(msg) is tuple:
import base64
from io import BytesIO
msg, image, image_process_mode = msg
max_hw, min_hw = max(image.size), min(image.size)
aspect_ratio = max_hw / min_hw
max_len, min_len = 800, 400
shortest_edge = int(min(max_len / aspect_ratio, min_len, min_hw))
longest_edge = int(shortest_edge * aspect_ratio)
W, H = image.size
if H > W:
H, W = longest_edge, shortest_edge
else:
H, W = shortest_edge, longest_edge
image = image.resize((W, H))
buffered = BytesIO()
image.save(buffered, format="JPEG")
img_b64_str = base64.b64encode(buffered.getvalue()).decode()
img_str = (
f'<img src="data:image/png;base64,{img_b64_str}" alt="user upload image" />'
)
msg = img_str + msg.replace("<image>", "").strip()
ret.append([msg, None])
else:
ret.append([msg, None])
else:
ret[-1][-1] = msg
return ret
def copy(self):
return Conversation(
system=self.system,
roles=self.roles,
messages=[[x, y] for x, y in self.messages],
offset=self.offset,
sep_style=self.sep_style,
sep=self.sep,
sep2=self.sep2,
version=self.version,
)
def dict(self):
if len(self.get_images()) > 0:
return {
"system": self.system,
"roles": self.roles,
"messages": [[x, y[0] if type(y) is tuple else y] for x, y in self.messages],
"offset": self.offset,
"sep": self.sep,
"sep2": self.sep2,
}
return {
"system": self.system,
"roles": self.roles,
"messages": self.messages,
"offset": self.offset,
"sep": self.sep,
"sep2": self.sep2,
}
conv_mixtral_zh = Conversation(
system="你是一个人工智能机器人。\n- 你是研究社区开发的大语言模型。你的设计宗旨是有益、诚实且无害。\n- 你支持使用用户选择的多种语言流利地进行交流并解答用户的问题。\n- 如果用户更正你生成的错误答案,你会向用户致歉并与用户探讨正确的答案。",
roles=("user", "bot"),
version="mixtral_zh",
messages=(),
offset=0,
sep_style=SeparatorStyle.MixtralZh,
sep="</s>",
sep2="</s>",
)
conv_mixtral_two = Conversation(
system=[
"You are an AI robot and your name is VITA. \n- You are a multimodal large language model developed by the open source community. Your aim is to be helpful, honest and harmless. \n- You support the ability to communicate fluently and answer user questions in multiple languages of the user's choice. \n- If the user corrects the wrong answer you generated, you will apologize and discuss the correct answer with the user. \n- You must answer the question strictly according to the content of the image given by the user, and it is strictly forbidden to answer the question without the content of the image. Please note that you are seeing the image, not the video.",
"You are an AI robot and your name is VITA. \n- You are a multimodal large language model developed by the open source community. Your aim is to be helpful, honest and harmless. \n- You support the ability to communicate fluently and answer user questions in multiple languages of the user's choice. \n- If the user corrects the wrong answer you generated, you will apologize and discuss the correct answer with the user. \n- You must answer the question strictly according to the content of the video given by the user, and it is strictly forbidden to answer the question without the content of the video. Please note that you are seeing the video, not the image.",
"You are an AI robot and your name is VITA. \n- You are a multimodal large language model developed by the open source community. Your aim is to be helpful, honest and harmless. \n- You support the ability to communicate fluently and answer user questions in multiple languages of the user's choice. \n- If the user corrects the wrong answer you generated, you will apologize and discuss the correct answer with the user.",
],
roles=("user", "bot"),
version="mixtral_two",
messages=(),
offset=0,
sep_style=SeparatorStyle.MixtralTwo,
sep="</s>",
sep2="</s>",
)
conv_phi3 = Conversation(
system="A chat between a curious user and an artificial intelligence assistant. "
"The assistant gives helpful, detailed, and polite answers to the user's questions.",
roles=("USER", "ASSISTANT"),
version="phi3",
messages=(),
offset=0,
sep_style=SeparatorStyle.TWO,
sep=" ",
sep2="<|endoftext|>",
)
conv_minicpm = Conversation(
system="A chat between a curious user and an artificial intelligence assistant. "
"The assistant gives helpful, detailed, and polite answers to the user's questions.",
roles=("USER", "ASSISTANT"),
version="minicpm",
messages=(),
offset=0,
sep_style=SeparatorStyle.TWO,
sep=" ",
sep2="</s>",
)
conv_llama = Conversation(
system="A chat between a curious user and an artificial intelligence assistant. "
"The assistant gives helpful, detailed, and polite answers to the user's questions.",
roles=("USER", "ASSISTANT"),
version="llama",
messages=(),
offset=0,
sep_style=SeparatorStyle.TWO,
sep=" ",
sep2="<|end_of_text|>",
)
conv_plain = Conversation(
system="",
roles=("", ""),
messages=(),
offset=0,
sep_style=SeparatorStyle.PLAIN,
sep="\n",
)
default_conversation = conv_mixtral_two
conv_templates = {
"default": conv_mixtral_two,
"mixtral_zh": conv_mixtral_zh,
"mixtral_two": conv_mixtral_two,
"phi3": conv_phi3,
"plain": conv_plain,
"minicpm": conv_minicpm,
"llama": conv_llama,
}
if __name__ == "__main__":
print(default_conversation.get_prompt())
from .language_model.vita_mixtral import VITAMixtralConfig, VITAMixtralForCausalLM
import os
import warnings
import torch
from transformers import AutoConfig, AutoTokenizer, BitsAndBytesConfig, logging
from vita.constants import GLOBAL_WEIGHTS_PATH
from vita.model import *
logging.set_verbosity_error()
warnings.filterwarnings("ignore")
def load_pretrained_model(
model_path,
model_base,
model_name,
model_type,
load_8bit=False,
load_4bit=False,
device_map="auto",
device="cuda",
**kwargs,
):
if model_type not in {"mixtral-8x7b"}:
raise ValueError(f"Unknown Model Type {model_type}")
kwargs = {"device_map": device_map, **kwargs}
if device != "cuda":
kwargs["device_map"] = {"": device}
if load_8bit:
kwargs["load_in_8bit"] = True
elif load_4bit:
kwargs["load_in_4bit"] = True
kwargs["quantization_config"] = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
)
else:
kwargs["torch_dtype"] = torch.float16
# Load VITA model
if "lora" in model_name.lower() and model_base is None:
warnings.warn(
"There is `lora` in model name but no `model_base` is provided. If you are loading a LoRA model, please provide the `model_base` argument."
)
if "lora" in model_name.lower() and model_base is not None:
lora_cfg_pretrained = AutoConfig.from_pretrained(model_path)
print("Loading VITA from base model...")
if model_type == "mixtral-8x7b":
# import pdb; pdb.set_trace()
device_map = {
"model.embed_tokens": 0,
"model.layers.0": 0,
"model.layers.1": 0,
"model.layers.2": 0,
"model.layers.3": 0,
"model.layers.4": 0,
"model.layers.5": 0,
"model.layers.6": 0,
"model.layers.7": 0,
"model.layers.8": 0,
"model.layers.9": 0,
"model.layers.10": 0,
"model.layers.11": 0,
"model.layers.12": 0,
"model.layers.13": 0,
"model.layers.14": 0,
"model.layers.15": 0,
"model.layers.16": 1,
"model.layers.17": 1,
"model.layers.18": 1,
"model.layers.19": 1,
"model.layers.20": 1,
"model.layers.21": 1,
"model.layers.22": 1,
"model.layers.23": 1,
"model.layers.24": 1,
"model.layers.25": 1,
"model.layers.26": 1,
"model.layers.27": 1,
"model.layers.28": 1,
"model.layers.29": 1,
"model.layers.30": 1,
"model.layers.31": 1,
"model.norm": 1,
"model.vision_tower": 1,
"model.mm_projector": 1,
"model.audio_encoder": 1,
"lm_head": 1,
}
device_map["model.audio_encoder"] = 0
kwargs.update(device_map=device_map)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model = VITAMixtralForCausalLM.from_pretrained(
model_path, low_cpu_mem_usage=True, **kwargs
)
token_num, token_dim = model.lm_head.out_features, model.lm_head.in_features
if model.lm_head.weight.shape[0] != token_num:
model.lm_head.weight = torch.nn.Parameter(
torch.empty(token_num, token_dim, device=model.device, dtype=model.dtype)
)
model.model.embed_tokens.weight = torch.nn.Parameter(
torch.empty(token_num, token_dim, device=model.device, dtype=model.dtype)
)
print("Loading additional VITA weights...")
if os.path.exists(os.path.join(model_path, "non_lora_trainables.bin")):
non_lora_trainables = torch.load(
os.path.join(model_path, "non_lora_trainables.bin"), map_location="cpu"
)
else:
# this is probably from HF Hub
from huggingface_hub import hf_hub_download
def load_from_hf(repo_id, filename, subfolder=None):
cache_file = hf_hub_download(
repo_id=repo_id, filename=filename, subfolder=subfolder
)
return torch.load(cache_file, map_location="cpu")
non_lora_trainables = load_from_hf(model_path, "non_lora_trainables.bin")
non_lora_trainables = {
(k[11:] if k.startswith("base_model.") else k): v
for k, v in non_lora_trainables.items()
}
if any(k.startswith("model.model.") for k in non_lora_trainables):
non_lora_trainables = {
(k[6:] if k.startswith("model.") else k): v for k, v in non_lora_trainables.items()
}
model.load_state_dict(non_lora_trainables, strict=False)
from peft import PeftModel
print("Loading LoRA weights...")
model = PeftModel.from_pretrained(model, model_path)
print("Merging LoRA weights...")
model = model.merge_and_unload()
print("Model is loaded...")
elif model_base is not None:
# this may be mm projector only
print("Loading VITA from base model...")
cfg_pretrained = AutoConfig.from_pretrained(model_path)
if model_type == "mixtral-8x7b":
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=True)
model = VITAMixtralForCausalLM.from_pretrained(
model_base, low_cpu_mem_usage=True, **kwargs
)
from types import SimpleNamespace
model_args = {
"vision_tower": f"{GLOBAL_WEIGHTS_PATH}/InternViT-300M-448px",
"pretrain_mm_mlp_adapter": None,
"mm_projector_type": "mlp2x_gelu",
}
model_args = SimpleNamespace(**model_args)
model.get_model().initialize_vision_modules(model_args=model_args)
mm_projector_weights = torch.load(
os.path.join(model_path, "mm_projector.bin"), map_location="cpu"
)
mm_projector_weights = {k: v.to(torch.float16) for k, v in mm_projector_weights.items()}
model.load_state_dict(mm_projector_weights, strict=False)
model.model.mm_projector.to(device="cuda", dtype=torch.float16)
model.model.vision_tower.to(device="cuda", dtype=torch.float16)
else:
if model_type == "mixtral-8x7b":
# import pdb; pdb.set_trace()
device_map = {
"model.embed_tokens": 0,
"model.layers.0": 0,
"model.layers.1": 0,
"model.layers.2": 0,
"model.layers.3": 0,
"model.layers.4": 0,
"model.layers.5": 0,
"model.layers.6": 0,
"model.layers.7": 0,
"model.layers.8": 0,
"model.layers.9": 0,
"model.layers.10": 0,
"model.layers.11": 0,
"model.layers.12": 0,
"model.layers.13": 0,
"model.layers.14": 0,
"model.layers.15": 0,
"model.layers.16": 1,
"model.layers.17": 1,
"model.layers.18": 1,
"model.layers.19": 1,
"model.layers.20": 1,
"model.layers.21": 1,
"model.layers.22": 1,
"model.layers.23": 1,
"model.layers.24": 1,
"model.layers.25": 1,
"model.layers.26": 1,
"model.layers.27": 1,
"model.layers.28": 1,
"model.layers.29": 1,
"model.layers.30": 1,
"model.layers.31": 1,
"model.norm": 1,
"model.vision_tower": 1,
"model.mm_projector": 1,
"model.audio_encoder": 1,
"lm_head": 1,
}
device_map["model.audio_encoder"] = 0
kwargs.update(device_map=device_map)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model = VITAMixtralForCausalLM.from_pretrained(
model_path, low_cpu_mem_usage=True, **kwargs
)
# model.hf_device_map
model.resize_token_embeddings(len(tokenizer))
vision_tower = model.get_vision_tower()
if not vision_tower.is_loaded:
vision_tower.load_model()
num_params = sum(p.numel() for p in vision_tower.parameters())
print("the number of vision encoder params: {}M".format(num_params / 1024 / 1024))
if getattr(model.config, "unfreeze_vision_tower", False):
if "lora" in model_name.lower():
assert model_base is not None
vision_non_lora_trainables = {
k[19:]: v
for k, v in non_lora_trainables.items()
if k.startswith("model.vision_tower.")
}
vision_tower.load_state_dict(vision_non_lora_trainables, strict=False)
else:
assert model_base is None
from safetensors.torch import load_file
vision_weights = {}
for file_name in os.listdir(model_path):
if file_name.endswith("safetensors"):
vision_weights.update(
{
k[19:]: v
for k, v in load_file(os.path.join(model_path, file_name)).items()
if k.startswith("model.vision_tower.")
}
)
vision_tower.load_state_dict(vision_weights, strict=True)
# from types import SimpleNamespace
# model_args = {
# 'audio_encoder': f"{GLOBAL_WEIGHTS_PATH}/audio-encoder-2wh_zh_en_audioset_Mixtral-8x7B_New-base-tunning',
# 'freeze_audio_encoder': True,
# 'freeze_audio_encoder_adapter': True
# }
# model_args = SimpleNamespace(**model_args)
# model.get_model().initialize_audio_modules(model_args=model_args)
# audio_encoder = model.get_audio_encoder()
# import pdb; pdb.set_trace()
# if (not getattr(model.config, "freeze_audio_encoder", True)) and (not getattr(model.config, "freeze_audio_encoder_adapter", True)):
# from safetensors.torch import load_file
# audio_weights = {}
# for file_name in os.listdir(model_path):
# if file_name.endswith('safetensors'):
# audio_weights.update(
# {k[20:]: v for k, v in load_file(os.path.join(model_path, file_name)).items() if
# k.startswith('model.audio_encoder.')})
# audio_encoder.load_state_dict(audio_weights, strict=True)
# audio_encoder.eval()
# import pdb; pdb.set_trace()
# import pdb; pdb.set_trace()
# from safetensors.torch import load_file
# audio_weights = {}
# for file_name in os.listdir(model_path):
# if file_name.endswith('safetensors'):
# audio_weights.update(
# {k[20:]: v for k, v in load_file(os.path.join(model_path, file_name)).items() if
# k.startswith('model.audio_encoder.')})
# import pdb; pdb.set_trace()
vision_tower.to(device=device, dtype=torch.float16)
image_processor = vision_tower.image_processor
if hasattr(model.config, "max_sequence_length"):
context_len = model.config.max_sequence_length
else:
context_len = 2048
if model.generation_config.pad_token_id is None:
model.generation_config.pad_token_id = model.generation_config.eos_token_id
if model_type == "phi-3":
model.generation_config.eos_token_id = tokenizer.eos_token_id
return tokenizer, model, image_processor, context_len