TTS Speech Synthesis Disclaimer
1. General Provisions
This disclaimer applies to all users of Index-TTS (hereinafter "this project"). By using this project, you acknowledge that you have read, understood, and agree to be bound by this disclaimer in full.
2. Usage Restrictions
2.1 This project is provided solely for technical research, learning, and lawful creative applications; it must not be used for any activity that violates laws or regulations.
2.2 Users must not use this project to:
a) synthesize the voices of political figures, public figures, or any individual without authorization;
b) create content that defames, insults, discriminates against, or harms the reputation or rights of others;
c) commit fraud, identity theft, or any other form of illegal activity;
d) spread false information or incite public panic;
e) infringe the intellectual property, likeness, or privacy rights of others;
f) use synthesized voices for commercial purposes without authorization;
g) violate the regulatory requirements of specific industries (e.g., finance, healthcare);
h) create or use inappropriate voice content involving minors;
i) produce content that may threaten national security;
j) violate any jurisdiction's laws and regulations concerning deepfake technology.
3. Intellectual Property and Licensing
3.1 This project is open-sourced under the [open-source license type] license.
3.2 Users bear sole responsibility for all content produced while using this project and for any resulting legal liability.
4. Limitation of Liability
4.1 The project developers accept no responsibility for any direct or indirect consequences arising from use of this project.
4.2 The project developers do not guarantee that this project will meet all user requirements, nor that it will run without interruption or error.
4.3 The project developers accept no liability for any legal dispute, loss, or damage arising from a user's use of this project.
5. Governing Law
5.1 This disclaimer is governed by the laws of [country/region].
5.2 If any provision of this disclaimer conflicts with applicable law, the applicable law prevails.
6. Updates to This Disclaimer
6.1 The project developers reserve the right to update this disclaimer at any time; updated versions take effect upon publication.
6.2 Users should review this disclaimer periodically for changes.
7. Miscellaneous
7.1 Before using this project, users should ensure that their usage complies with the laws and regulations of their jurisdiction.
7.2 In the event of any legal dispute arising from use of this project, users shall cooperate with relevant investigations and bear the corresponding responsibility.
Last updated: March 17, 2025
Developer: Bilibili Index Team
# FROM vllm/vllm-openai:latest
FROM vllm/vllm-openai:v0.9.0
RUN apt-get update && \
apt-get install -y --no-install-recommends \
ffmpeg \
build-essential \
libsndfile1 \
libsm6 \
libxext6 \
&& \
rm -rf /var/lib/apt/lists/* && \
ln -sf /usr/bin/python3 /usr/bin/python
WORKDIR /app
COPY requirements.txt requirements.txt
RUN pip install --no-cache-dir --break-system-packages -r requirements.txt
# COPY assets /app/assets
COPY indextts /app/indextts
COPY tools /app/tools
COPY patch_vllm.py /app/patch_vllm.py
COPY api_server.py /app/api_server.py
COPY convert_hf_format.py /app/convert_hf_format.py
COPY convert_hf_format.sh /app/convert_hf_format.sh
COPY entrypoint.sh /app/entrypoint.sh
ENTRYPOINT ["/app/entrypoint.sh"]
bilibili Index-TTS Model License Agreement
Version 1.0, March 17, 2025
Copyright (c) 2025 bilibili Index
Part I: Preamble
Large generative models are being widely adopted and used, but there are concerns about their potential misuse, whether due to technical limitations or ethical considerations. This license aims to promote the open and responsible downstream use of the accompanying model.
Accordingly, you and bilibili Index agree as follows:
1. Definitions
"License" means the terms and conditions for use, reproduction, and distribution defined in this document.
"Data" means a collection of information and/or content extracted from the dataset used with the Model, including data used to train, pretrain, or otherwise evaluate the Model. The Data is not licensed under this License.
"Output" means the result of operating the Model, as embodied in the resulting informational content.
"Model" means any accompanying machine-learning-based component (including checkpoints), consisting of learned weights and parameters (including optimizer states).
"Derivatives of the Model" means all modifications to the Model made available by bilibili Index under this License, works based on the Model, or any other model created or initialized by transferring the Model's weights, parameters, activations, or output patterns to another model so that the other model performs similarly to this Model, including but not limited to distillation methods that use intermediate data representations, or methods that generate synthetic data from the Model for training another model.
"Supplementary Materials" means the accompanying source code and scripts used to define, run, load, benchmark, or evaluate the Model, and, if any, accompanying documentation, tutorials, examples, etc., used to prepare data for training or evaluation.
"Distribution" means transmitting, copying, publishing, or otherwise sharing the Model or Derivatives of the Model with a third party, including making the Model available as a hosted service by electronic or other remote means, such as API- or web-based access.
"bilibili Index" (or "we") means Shanghai Kuanyu Digital Technology Co., Ltd. or any of its affiliates.
"You" (or "Your") means the individual or legal entity exercising the permissions granted by this License and/or using the Model for any purpose and in any field of use, including use of the Model in end-use applications such as chatbots or translators.
"Third party" means an individual or legal entity that is not under common control with bilibili Index or You.
"Commercial use" means using the bilibili Index-TTS model, directly or indirectly, to operate, promote, or generate revenue for an entity or individual, or for any other profit-making purpose.
Part II: License and License Restrictions
Subject to the terms and conditions of this License, the Licensor hereby grants You a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty-free copyright license. You may use this license for non-commercial purposes. The Licensor claims no rights in the Output You generate using the bilibili Index-TTS model or in Derivatives of the Model based on it, provided You satisfy the following license restrictions:
1. You must not use, copy, modify, merge, publish, distribute, reproduce, or create derivative works of all or part of the bilibili Index-TTS model for any military or illegal purpose. You agree to strictly comply with the use restrictions listed in Appendix A of this agreement when using the model licensed by bilibili Index or Derivatives of the Model.
2. If You plan to use the bilibili Index-TTS model or Derivatives of the Model for commercial purposes, You must register with the Licensor in advance through the contact information provided in the supplementary provisions of this agreement and obtain the Licensor's written authorization.
3. Your use and modification of the bilibili Index-TTS model (including use of its Output or of Derivatives of the Model) must not violate the laws and regulations of any country, in particular those of the People's Republic of China, and must not infringe the lawful rights and interests of any third party, including but not limited to personality rights such as portrait rights, reputation rights, and privacy rights; intellectual property rights such as copyrights, patents, and trade secrets; or other property rights.
4. You must provide any third-party user of the bilibili Index-TTS model or its Derivatives with the source of the bilibili Index-TTS model and a copy of this agreement.
5. If You modify the bilibili Index-TTS model to create a Derivative, You must state the modifications prominently, and such modifications must not violate the license restrictions of this agreement, nor may they permit, assist, or otherwise enable a third party to violate those restrictions.
Part III: Intellectual Property
1. Ownership of the bilibili Index-TTS model and its related intellectual property rests solely with the Licensor.
2. Under no circumstances may You, without the Licensor's prior written consent, use any trademark, service mark, trade name, domain name, website name, or other distinctive brand feature of the Licensor (collectively, "Marks"), including but not limited to expressly or implicitly representing Yourself as the "Licensor". Without the Licensor's prior written consent, You may not display or use the aforementioned Marks, alone or in combination, apply to register them as trademarks, register them as domain names, or represent to others, expressly or implicitly, any right to display, use, or otherwise deal with these Marks. You bear full legal responsibility for any loss caused to the Licensor or others by Your use of the Marks in violation of this agreement.
3. Within the scope of the license, You may modify the bilibili Index-TTS model to create Derivatives of the Model, and You may claim intellectual property rights in the portions of those Derivatives that embody Your creative work.
Part IV: Disclaimer and Limitation of Liability
1. Under no circumstances shall the Licensor be liable for any direct, indirect, or incidental consequences, or any other losses or damages, arising from or related to Your use of the bilibili Index-TTS model under this agreement. If the Licensor suffers losses as a result, You shall fully compensate the Licensor.
2. The model parameters in the model are only an example. If You need to meet other requirements, You must train the model Yourself and comply with the license agreements of the corresponding datasets. You are responsible for the intellectual property risks involved in the Output of the bilibili Index-TTS model and in Derivatives of the Model, and for any direct, indirect, or incidental consequences and other losses or damages related thereto.
3. Although the Licensor strives to maintain the compliance and accuracy of the data throughout all stages of training the bilibili Index-TTS model, due to the scale of the model and the randomness inherent in its probabilistic nature, the accuracy of its output cannot be guaranteed, and the model may be misled. The Licensor therefore declares that it assumes no liability for data security issues, reputational risks, or any risks and liabilities arising from the bilibili Index-TTS model being misled, misused, distributed, or used improperly as a result of Your use of the model and its source code.
4. Losses or damages in this agreement include but are not limited to the following (whether such losses or damages are unforeseeable, foreseeable, known, or otherwise): (i) loss of revenue; (ii) loss of actual or anticipated profits; (iii) loss of use of money; (iv) loss of anticipated savings; (v) loss of business; (vi) loss of opportunity; (vii) loss of goodwill or reputation; (viii) loss of use of software; or (ix) any indirect, incidental, special, or consequential loss or damage.
5. Unless required by applicable law or agreed to in writing by the Licensor, the Licensor licenses the bilibili Index-TTS model on an "AS IS" basis. The Licensor provides no warranties of any kind, express or implied, with respect to the bilibili Index-TTS model under this agreement, including but not limited to any warranty or condition of title, any warranty or condition of merchantability, any warranty or condition of fitness for a particular purpose, any warranty of any kind as to past, present, or future non-infringement of the bilibili Index-TTS model, or any warranty arising from any course of dealing or usage of trade (such as proposals, specifications, or samples). You alone bear the risks and consequences of using, copying, redistributing, or otherwise exploiting the bilibili Index-TTS model.
6. You fully acknowledge, understand, and agree that the bilibili Index-TTS model may contain personal information. You undertake to process personal information in compliance with all applicable laws and regulations, in particular the Personal Information Protection Law of the People's Republic of China. Note that the Licensor's authorization to use the bilibili Index-TTS model does not mean You have obtained a legal basis for processing the related personal information. As an independent personal information processor, You must ensure that Your processing of any personal information that may be contained in the bilibili Index-TTS model fully complies with applicable laws and regulations, including but not limited to obtaining the consent of the personal information subjects, and You alone bear any risks and consequences that may arise.
7. You fully understand and agree that the Licensor has the right, at its reasonable discretion, to act upon conduct that violates applicable laws, regulations, or this agreement, to take appropriate legal action against Your illegal or non-compliant conduct, and to retain relevant information and report it to the competent authorities in accordance with laws and regulations; You alone bear all legal liability arising therefrom.
Part V: Brand Exposure and Prominent Attribution
1. You agree and understand that if You release a Derivative developed from the bilibili Index-TTS model under an open-source license in a domestic or international open-source community, You must prominently state in that community that the Derivative was developed on the basis of the bilibili Index-TTS model, with attribution including but not limited to "bilibili Index" and other brand elements related to the bilibili Index-TTS model.
2. You agree and understand that if You enter a Derivative developed from the bilibili Index-TTS model into any ranking activity held by any domestic or international organization or individual, including but not limited to rankings of model performance, accuracy, algorithms, or compute, You must prominently state in the model description that the Derivative was developed on the basis of the bilibili Index-TTS model, with attribution including but not limited to "bilibili Index Inside" and other brand elements related to the bilibili Index-TTS model.
Part VI: Miscellaneous
1. To the extent permitted by laws and regulations, the Licensor retains the final right of interpretation of this agreement.
2. The formation, validity, interpretation, performance, modification, and termination of this agreement, the use of the bilibili Index-TTS model, and the resolution of disputes are governed by the laws of the mainland of the People's Republic of China (for the purposes of this agreement only, excluding Hong Kong, Macao, and Taiwan), excluding conflict-of-law rules.
3. Any dispute arising from the use of the bilibili Index-TTS model shall first be resolved through friendly negotiation; failing that, it shall be brought before the people's court at the Licensor's place of domicile.
4. If the English version of this agreement conflicts with the Chinese version in interpretation, the Chinese version prevails.
5. If You wish to use the bilibili Index-TTS model or its Derivatives for commercial purposes under the license conditions and restrictions of this agreement, please contact the Licensor as follows to register and apply for written authorization: Email: xuanwu@bilibili.com
Appendix A: Use Restrictions
You agree not to use the Model or Derivatives of the Model for the following purposes or in the following ways:
In any way that violates any applicable national or international law or regulation or infringes the lawful rights and interests of any third party;
For any military purpose;
To exploit, harm, or attempt to exploit or harm minors in any way;
To generate or disseminate verifiably false information and/or content with the intent to harm others;
To generate or disseminate inappropriate content subject to applicable regulatory requirements;
To generate or disseminate personally identifiable information without proper authorization or for unreasonable use;
To defame, disparage, or otherwise harass others;
For fully automated decision-making that adversely affects an individual's legal rights or creates or modifies binding, enforceable obligations;
For any purpose intended to discriminate against or harm individuals or groups based on online or offline social behavior or on known or predicted personal or personality characteristics;
To exploit any vulnerability of a specific group of persons based on their age or their social, physical, or mental characteristics in a manner that causes or is likely to cause physical or psychological harm to individuals in that group, thereby materially distorting their behavior;
For any purpose intended to discriminate against individuals or groups based on legally protected characteristics or categories.
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
<a href="README.md">中文</a><a href="README_EN.md">English</a>
<div align="center">
# IndexTTS-vLLM
</div>
## Introduction
This project re-implements the GPT model's inference from [index-tts](https://github.com/index-tts/index-tts) using the vllm library, accelerating the inference process of index-tts.
Inference speed improvement (Index-TTS-v1/v1.5) on a single RTX 4090:
- RTF (Real-Time Factor) for a single request: ≈0.3 -> ≈0.1
- GPT model decode speed for a single request: ≈90 tokens/s -> ≈280 tokens/s
- Concurrency: With `gpu_memory_utilization` set to 0.25 (about 5 GB of VRAM), it comfortably handles roughly 16 concurrent requests (see `simple_test.py` for the benchmark script).
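For clarity, RTF is wall-clock synthesis time divided by the duration of the generated audio, so lower is better. Below is a minimal sketch of measuring it against the `/tts_url` endpoint; the payload mirrors `api_example.py`, while the local URL and reference audio path are assumptions:

```python
import io
import time

import requests
import soundfile as sf

# Hypothetical local deployment; payload fields mirror api_example.py.
url = "http://127.0.0.1:6006/tts_url"
data = {"text": "还是会想你,还是想登你", "audio_paths": ["assets/jay_promptvn.wav"]}

start = time.perf_counter()
response = requests.post(url, json=data)
elapsed = time.perf_counter() - start

# The server returns WAV bytes; RTF = synthesis time / audio duration.
wav, sr = sf.read(io.BytesIO(response.content))
print(f"RTF = {elapsed / (len(wav) / sr):.3f}")
```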
## Update Log
- **[2025-09-22]** Added support for vllm v1. Compatibility with IndexTTS2 is in progress.
- **[2025-09-28]** Supported web UI inference for IndexTTS2 and reorganized the weight files, so deployment is now much easier! \0.0/ However, the current version does not seem to accelerate IndexTTS2's GPT; this is under investigation.
- **[2025-09-29]** Resolved the issue of ineffective GPT model inference acceleration for IndexTTS2.
- **[2025-10-09]** Added API support for IndexTTS2; please refer to [API](#api). The v1/1.5 APIs and the OpenAI-compatible interface may still have bugs, to be fixed later.
- **[2025-10-19]** Supported vllm inference for qwen0.6bemo4-merge.
## TODO list
- Concurrency optimization for V2 API: Currently, only the gpt2 model inference is parallel, while other modules run serially. The s2mel inference has a large overhead (requiring 25 DiT iterations), which significantly impacts concurrency performance.
- Acceleration of s2mel inference.
## Usage Steps
### 1. Clone this project
```bash
git clone https://github.com/Ksuriuri/index-tts-vllm.git
cd index-tts-vllm
```
### 2. Create and activate a conda environment
```bash
conda create -n index-tts-vllm python=3.12
conda activate index-tts-vllm
```
### 3. Install PyTorch
PyTorch version 2.8.0 is required (corresponding to vllm 0.10.2). For specific installation instructions, please refer to the [PyTorch official website](https://pytorch.org/get-started/locally/).
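A quick sanity check after installation (a sketch; it only assumes both packages expose their versions through standard Python package metadata):

```python
from importlib.metadata import version

# Expect torch 2.8.0 and vllm 0.10.2, per the requirement above.
print("torch:", version("torch"))
print("vllm:", version("vllm"))
```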
### 4. Install dependencies
```bash
pip install -r requirements.txt
```
### 5. Download model weights
#### Automatic Download (Recommended)
Download the corresponding version of the model weights to the `checkpoints/` directory:
```bash
# Index-TTS
modelscope download --model kusuriuri/Index-TTS-vLLM --local_dir ./checkpoints/Index-TTS-vLLM
# IndexTTS-1.5
modelscope download --model kusuriuri/Index-TTS-1.5-vLLM --local_dir ./checkpoints/Index-TTS-1.5-vLLM
# IndexTTS-2
modelscope download --model kusuriuri/IndexTTS-2-vLLM --local_dir ./checkpoints/IndexTTS-2-vLLM
```
#### Manual Download
- ModelScope: [Index-TTS](https://www.modelscope.cn/models/kusuriuri/Index-TTS-vLLM) | [IndexTTS-1.5](https://www.modelscope.cn/models/kusuriuri/Index-TTS-1.5-vLLM) | [IndexTTS-2](https://www.modelscope.cn/models/kusuriuri/IndexTTS-2-vLLM)
#### Convert original weights yourself (Optional, not recommended)
You can use `convert_hf_format.sh` to convert the official weight files yourself:
```bash
bash convert_hf_format.sh /path/to/your/model_dir
```
### 6. Launch the web UI!
Run the corresponding version (the first launch may take longer due to CUDA kernel compilation for bigvgan):
```bash
# Index-TTS 1.0
python webui.py
# IndexTTS-1.5
python webui.py --version 1.5
# IndexTTS-2
python webui_v2.py
```
## API
The API server is built with FastAPI. Start it as follows:
```bash
# Index-TTS-1.0/1.5
python api_server.py
# IndexTTS-2
python api_server_v2.py
```
### Startup Parameters
- `--model_dir`: Required, path to the model weights.
- `--host`: Server IP address, defaults to `0.0.0.0`.
- `--port`: Server port, defaults to `6006`.
- `--gpu_memory_utilization`: vllm GPU memory utilization rate, defaults to `0.25`.
### API Request Examples
- For v1/1.5, please refer to `api_example.py`.
- For v2, please refer to `api_example_v2.py`.
### OpenAI API
- Added `/audio/speech` API path for compatibility with the OpenAI interface.
- Added `/audio/voices` API path to get the list of voices/characters.
For details, see: [createSpeech](https://platform.openai.com/docs/api-reference/audio/createSpeech)
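A minimal sketch of calling the OpenAI-compatible endpoint with plain `requests`; the field names come from `api_server.py` (the `model` field is read but ignored by the server), while the URL and voice name are assumptions based on the default port and `assets/speaker.json`:

```python
import requests

resp = requests.post(
    "http://127.0.0.1:6006/audio/speech",
    json={
        "model": "index-tts",  # accepted but ignored by the server
        "voice": "jay_klee",   # a character registered in assets/speaker.json
        "input": "还是会想你,还是想登你",
    },
)
with open("speech.wav", "wb") as f:
    f.write(resp.content)  # WAV bytes
```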
## New Features
- **v1/v1.5:** Supports multi-character audio mixing: You can input multiple reference audios, and the TTS output voice will be a mix of these reference audios. (Inputting multiple reference audios may lead to an unstable output voice; you can try multiple times to get a satisfactory voice and then use it as a reference audio).
## Performance
Word Error Rate (WER) Results for IndexTTS and Baseline Models on the [**seed-test**](https://github.com/BytedanceSpeech/seed-tts-eval)
| model | zh | en |
| ----------------------- | ----- | ----- |
| Human | 1.254 | 2.143 |
| index-tts (num_beams=3) | 1.005 | 1.943 |
| index-tts (num_beams=1) | 1.107 | 2.032 |
| index-tts-vllm | 1.12 | 1.987 |
Maintains the performance of the original project.
## Concurrency Test
Refer to [`simple_test.py`](simple_test.py). The API service must be started first.
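As a rough stand-in for `simple_test.py` (which is not reproduced here), the sketch below fires a batch of concurrent `/tts` requests; it assumes the `httpx` package, which is not a stated project dependency:

```python
import asyncio
import time

import httpx  # assumed extra dependency, not in requirements.txt

async def one_request(client: httpx.AsyncClient, text: str) -> float:
    start = time.perf_counter()
    resp = await client.post(
        "http://127.0.0.1:6006/tts",
        json={"text": text, "character": "jay_klee"},
        timeout=120.0,
    )
    resp.raise_for_status()
    return time.perf_counter() - start

async def main(concurrency: int = 16) -> None:
    async with httpx.AsyncClient() as client:
        latencies = await asyncio.gather(
            *(one_request(client, "还是会想你,还是想登你") for _ in range(concurrency))
        )
    print(f"avg latency across {concurrency} requests: {sum(latencies) / len(latencies):.2f}s")

asyncio.run(main())
```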
import requests
SERVER_PORT = 6006
# 1. Request with local files; point the paths in audio_paths at local audio files
url = f"http://0.0.0.0:{SERVER_PORT}/tts_url"
data = {
"text": "还是会想你,还是想登你",
"audio_paths": [ # 支持多参考音频
"assets/jay_promptvn.wav",
"assets/vo_card_klee_endOfGame_fail_01.wav"
]
}
response = requests.post(url, json=data)
with open("output.wav", "wb") as f:
f.write(response.content)
# 2. Request by character name; see `assets/speaker.json` for how characters are registered
url = f"http://0.0.0.0:{SERVER_PORT}/tts"
data = {
"text": "还是会想你,还是想登你",
"character": "jay_klee"
}
response = requests.post(url, json=data)
with open("output.wav", "wb") as f:
f.write(response.content)
from dataclasses import asdict, dataclass
import os
from typing import List, Optional
import requests
SERVER_PORT = 6006
output_dir = "outputs"
os.makedirs(output_dir, exist_ok=True)
url = F"http://0.0.0.0:{SERVER_PORT}/tts_url"
@dataclass
class IndexTTS2RequestData:
text: str
spk_audio_path: str
emo_control_method: int = 0
emo_ref_path: Optional[str] = None
emo_weight: float = 1.0
    emo_vec: Optional[List[float]] = None
emo_text: Optional[str] = None
emo_random: bool = False
max_text_tokens_per_sentence: int = 120
def __post_init__(self):
        # default emo_vec to an 8-dimensional zero vector
if self.emo_vec is None:
self.emo_vec = [0.0] * 8
    def to_dict(self) -> dict:
return asdict(self)
# 1. The emotion reference and the timbre reference are the same audio
data = IndexTTS2RequestData(
text="还是会想你,还是想登你",
spk_audio_path="assets/jay_promptvn.wav"
)
response = requests.post(url, json=data.to_dict())
with open(os.path.join(output_dir, "output1.wav"), "wb") as f:
f.write(response.content)
# 2. Use a separate emotion reference audio
data = IndexTTS2RequestData(
text="还是会想你,还是想登你",
spk_audio_path="assets/jay_promptvn.wav",
emo_control_method=1,
emo_ref_path="assets/vo_card_klee_endOfGame_fail_01.wav",
emo_weight=0.6
)
response = requests.post(url, json=data.to_dict())
with open(os.path.join(output_dir, "output2.wav"), "wb") as f:
f.write(response.content)
# 3. Control emotion with an explicit vector
# ["joy", "anger", "sadness", "fear", "disgust", "low/depressed", "surprise", "calm"]
emo_vec = [0, 0, 0.55, 0, 0, 0, 0, 0]
data = IndexTTS2RequestData(
text="还是会想你,还是想登你",
spk_audio_path="assets/jay_promptvn.wav",
emo_control_method=2,
emo_vec=emo_vec
)
response = requests.post(url, json=data.to_dict())
with open(os.path.join(output_dir, "output3.wav"), "wb") as f:
f.write(response.content)
# 4. Control emotion with a descriptive text prompt
data = IndexTTS2RequestData(
text="还是会想你,还是想登你",
spk_audio_path="assets/jay_promptvn.wav",
emo_control_method=3,
emo_text="极度悲伤"
)
response = requests.post(url, json=data.to_dict())
with open(os.path.join(output_dir, "output4.wav"), "wb") as f:
f.write(response.content)
import os
import asyncio
import io
import traceback
from fastapi import FastAPI, Request, Response
from fastapi.responses import JSONResponse, StreamingResponse
from contextlib import asynccontextmanager
from fastapi.middleware.cors import CORSMiddleware
import uvicorn
import argparse
import json
import time
import numpy as np
import soundfile as sf
from indextts.infer_vllm import IndexTTS
tts = None
@asynccontextmanager
async def lifespan(app: FastAPI):
global tts
tts = IndexTTS(model_dir=args.model_dir, gpu_memory_utilization=args.gpu_memory_utilization)
current_file_path = os.path.abspath(__file__)
cur_dir = os.path.dirname(current_file_path)
speaker_path = os.path.join(cur_dir, "assets/speaker.json")
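    # Register every speaker listed in assets/speaker.json so /tts and /audio/speech can address it by name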
if os.path.exists(speaker_path):
speaker_dict = json.load(open(speaker_path, 'r'))
for speaker, audio_paths in speaker_dict.items():
audio_paths_ = []
for audio_path in audio_paths:
audio_paths_.append(os.path.join(cur_dir, audio_path))
tts.registry_speaker(speaker, audio_paths_)
yield
# Clean up the ML models and release the resources
# ml_models.clear()
app = FastAPI(lifespan=lifespan)
# Add CORS middleware configuration
app.add_middleware(
CORSMiddleware,
    allow_origins=["*"],  # allow all origins; restrict to specific domains in production
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
@app.get("/health")
async def health_check():
"""健康检查接口"""
try:
global tts
if tts is None:
return JSONResponse(
status_code=503,
content={
"status": "unhealthy",
"message": "TTS model not initialized"
}
)
return JSONResponse(
status_code=200,
content={
"status": "healthy",
"message": "Service is running",
"timestamp": time.time()
}
)
except Exception as ex:
return JSONResponse(
status_code=503,
content={
"status": "unhealthy",
"error": str(ex)
}
)
@app.post("/tts_url", responses={
200: {"content": {"application/octet-stream": {}}},
500: {"content": {"application/json": {}}}
})
async def tts_api_url(request: Request):
try:
data = await request.json()
text = data["text"]
audio_paths = data["audio_paths"]
seed = data.get("seed", 8)
global tts
sr, wav = await tts.infer(audio_paths, text, seed=seed)
with io.BytesIO() as wav_buffer:
sf.write(wav_buffer, wav, sr, format='WAV')
wav_bytes = wav_buffer.getvalue()
return Response(content=wav_bytes, media_type="audio/wav")
except Exception as ex:
tb_str = ''.join(traceback.format_exception(type(ex), ex, ex.__traceback__))
return JSONResponse(
status_code=500,
content={
"status": "error",
"error": str(tb_str)
}
)
@app.post("/tts", responses={
200: {"content": {"application/octet-stream": {}}},
500: {"content": {"application/json": {}}}
})
async def tts_api(request: Request):
try:
data = await request.json()
text = data["text"]
character = data["character"]
global tts
sr, wav = await tts.infer_with_ref_audio_embed(character, text)
with io.BytesIO() as wav_buffer:
sf.write(wav_buffer, wav, sr, format='WAV')
wav_bytes = wav_buffer.getvalue()
return Response(content=wav_bytes, media_type="audio/wav")
except Exception as ex:
tb_str = ''.join(traceback.format_exception(type(ex), ex, ex.__traceback__))
print(tb_str)
return JSONResponse(
status_code=500,
content={
"status": "error",
"error": str(tb_str)
}
)
@app.get("/audio/voices")
async def tts_voices():
""" additional function to provide the list of available voices, in the form of JSON """
current_file_path = os.path.abspath(__file__)
cur_dir = os.path.dirname(current_file_path)
speaker_path = os.path.join(cur_dir, "assets/speaker.json")
if os.path.exists(speaker_path):
speaker_dict = json.load(open(speaker_path, 'r'))
return speaker_dict
else:
return []
@app.post("/audio/speech", responses={
200: {"content": {"application/octet-stream": {}}},
500: {"content": {"application/json": {}}}
})
async def tts_api_openai(request: Request):
""" OpenAI competible API, see: https://api.openai.com/v1/audio/speech """
try:
data = await request.json()
text = data["input"]
character = data["voice"]
        # the model parameter is accepted but ignored
_model = data["model"]
global tts
sr, wav = await tts.infer_with_ref_audio_embed(character, text)
with io.BytesIO() as wav_buffer:
sf.write(wav_buffer, wav, sr, format='WAV')
wav_bytes = wav_buffer.getvalue()
return Response(content=wav_bytes, media_type="audio/wav")
except Exception as ex:
tb_str = ''.join(traceback.format_exception(type(ex), ex, ex.__traceback__))
print(tb_str)
return JSONResponse(
status_code=500,
content={
"status": "error",
"error": str(tb_str)
}
)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--host", type=str, default="0.0.0.0")
parser.add_argument("--port", type=int, default=6006)
parser.add_argument("--model_dir", type=str, default="/path/to/IndexTeam/Index-TTS")
parser.add_argument("--gpu_memory_utilization", type=float, default=0.25)
args = parser.parse_args()
uvicorn.run(app=app, host=args.host, port=args.port)
import os
import asyncio
import io
import traceback
from fastapi import FastAPI, Request, Response, File, UploadFile, Form
from fastapi.responses import JSONResponse
from contextlib import asynccontextmanager
from fastapi.middleware.cors import CORSMiddleware
import uvicorn
import argparse
import json
import time
import soundfile as sf
from typing import List, Optional, Union
from loguru import logger
logger.add("logs/api_server_v2.log", rotation="10 MB", retention=10, level="DEBUG", enqueue=True)
from indextts.infer_vllm_v2 import IndexTTS2
tts = None
@asynccontextmanager
async def lifespan(app: FastAPI):
global tts
tts = IndexTTS2(
model_dir=args.model_dir,
is_fp16=args.is_fp16,
gpu_memory_utilization=args.gpu_memory_utilization,
qwenemo_gpu_memory_utilization=args.qwenemo_gpu_memory_utilization,
)
yield
app = FastAPI(lifespan=lifespan)
# Add CORS middleware configuration
app.add_middleware(
CORSMiddleware,
allow_origins=["*"], # Allows all origins, change in production for security
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
@app.get("/health")
async def health_check():
"""Health check endpoint"""
if tts is None:
return JSONResponse(
status_code=503,
content={
"status": "unhealthy",
"message": "TTS model not initialized"
}
)
return JSONResponse(
status_code=200,
content={
"status": "healthy",
"message": "Service is running",
"timestamp": time.time()
}
)
@app.post("/tts_url", responses={
200: {"content": {"application/octet-stream": {}}},
500: {"content": {"application/json": {}}}
})
async def tts_api_url(request: Request):
try:
data = await request.json()
emo_control_method = data.get("emo_control_method", 0)
text = data["text"]
spk_audio_path = data["spk_audio_path"]
emo_ref_path = data.get("emo_ref_path", None)
emo_weight = data.get("emo_weight", 1.0)
emo_vec = data.get("emo_vec", [0] * 8)
emo_text = data.get("emo_text", None)
emo_random = data.get("emo_random", False)
max_text_tokens_per_sentence = data.get("max_text_tokens_per_sentence", 120)
global tts
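        # emo_control_method: 0 = emotion follows the timbre reference audio,
        # 1 = separate emotion reference audio, 2 = explicit 8-dim emotion vector,
        # 3 = emotion inferred from descriptive text (see api_example_v2.py)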
        if not isinstance(emo_control_method, int):
            emo_control_method = emo_control_method.value
if emo_control_method == 0:
emo_ref_path = None
emo_weight = 1.0
        if emo_control_method == 1:
            pass  # use the provided emotion reference audio and emo_weight as-is
if emo_control_method == 2:
vec = emo_vec
vec_sum = sum(vec)
if vec_sum > 1.5:
return JSONResponse(
status_code=500,
content={
"status": "error",
"error": "情感向量之和不能超过1.5,请调整后重试。"
}
)
else:
vec = None
# logger.info(f"Emo control mode:{emo_control_method}, vec:{vec}")
sr, wav = await tts.infer(spk_audio_prompt=spk_audio_path, text=text,
output_path=None,
emo_audio_prompt=emo_ref_path, emo_alpha=emo_weight,
emo_vector=vec,
use_emo_text=(emo_control_method==3), emo_text=emo_text,use_random=emo_random,
max_text_tokens_per_sentence=int(max_text_tokens_per_sentence))
with io.BytesIO() as wav_buffer:
sf.write(wav_buffer, wav, sr, format='WAV')
wav_bytes = wav_buffer.getvalue()
return Response(content=wav_bytes, media_type="audio/wav")
except Exception as ex:
tb_str = ''.join(traceback.format_exception(type(ex), ex, ex.__traceback__))
return JSONResponse(
status_code=500,
content={
"status": "error",
"error": str(tb_str)
}
)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--host", type=str, default="0.0.0.0")
parser.add_argument("--port", type=int, default=6006)
parser.add_argument("--model_dir", type=str, default="checkpoints/IndexTTS-2-vLLM", help="Model checkpoints directory")
parser.add_argument("--is_fp16", action="store_true", default=False, help="Fp16 infer")
parser.add_argument("--gpu_memory_utilization", type=float, default=0.25)
parser.add_argument("--qwenemo_gpu_memory_utilization", type=float, default=0.10)
parser.add_argument("--verbose", action="store_true", default=False, help="Enable verbose mode")
args = parser.parse_args()
if not os.path.exists("outputs"):
os.makedirs("outputs")
uvicorn.run(app=app, host=args.host, port=args.port)
{
"jay_klee": [
"assets/jay_promptvn.wav",
"assets/vo_card_klee_endOfGame_fail_01.wav"
]
}
import os
from omegaconf import OmegaConf
import torch
# from indextts.gpt.model import UnifiedVoice
from indextts.gpt.model_v2 import UnifiedVoice
from indextts.utils.checkpoint import load_checkpoint
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--model_dir", type=str, default="")
args = parser.parse_args()
model_dir = args.model_dir
cfg_path = os.path.join(model_dir, "config.yaml")
gpt_save_dir = os.path.join(model_dir, "gpt")
cfg = OmegaConf.load(cfg_path)
gpt = UnifiedVoice(**cfg.gpt)
gpt_path = os.path.join(model_dir, cfg.gpt_checkpoint)
load_checkpoint(gpt, gpt_path)
gpt = gpt.to(device="cuda", dtype=torch.float16)
gpt.eval() # .half()
gpt.post_init_gpt2_config()
print(">> GPT weights restored from:", gpt_path)
gpt.inference_model.save_pretrained(gpt_save_dir, safe_serialization=False)
print(f"GPT transformer saved to {gpt_save_dir}")
# from safetensors.torch import load_file
# # Load the converted model parameters
# model_path = os.path.join(gpt_save_dir, "pytorch_model.bin")
# state_dict = load_file(model_path)
# # Print all parameter names
# for key in state_dict.keys():
#     print(key)
#!/bin/bash
set -e # Exit on any error
MODEL_DIR=$1
GPT_DIR="$MODEL_DIR/gpt"
if [ -z "$MODEL_DIR" ]; then
echo "Error: MODEL_DIR not provided"
exit 1
fi
echo "Creating gpt directory: $GPT_DIR"
mkdir -p "$GPT_DIR"
echo "Downloading tokenizer files..."
if ! wget https://modelscope.cn/models/openai-community/gpt2/resolve/master/tokenizer.json -O "$GPT_DIR/tokenizer.json"; then
echo "Error: Failed to download tokenizer.json"
exit 1
fi
if ! wget https://modelscope.cn/models/openai-community/gpt2/resolve/master/tokenizer_config.json -O "$GPT_DIR/tokenizer_config.json"; then
echo "Error: Failed to download tokenizer_config.json"
exit 1
fi
if ! wget https://modelscope.cn/models/Qwen/Qwen2-Audio-7B-Instruct/resolve/master/preprocessor_config.json -O "$GPT_DIR/preprocessor_config.json"; then
echo "Error: Failed to download tokenizer_config.json"
exit 1
fi
echo "Converting model format..."
if ! python convert_hf_format.py --model_dir "$MODEL_DIR"; then
echo "Error: Model conversion failed"
exit 1
fi
echo "All operations completed successfully!"
exit 0
version: '3.8'
services:
index-tts:
build: .
image: index-tts-server
env_file:
- .env.example
container_name: index-tts-server
ports:
- "${PORT:-8001}:${PORT:-8001}"
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
volumes:
# - ./audio_prompts:/app/audio_prompts
- ./assets:/app/assets
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:${PORT:-8001}/health"]
interval: 30s
timeout: 10s
retries: 3
#!/bin/bash
# python is symlinked to python3 in the Dockerfile; an alias would not take effect in this non-interactive shell
# Set default values if environment variables are not set
MODEL_DIR=${MODEL_DIR:-"checkpoints/"}
MODEL=${MODEL:-"IndexTeam/IndexTTS-1.5"}
VLLM_USE_MODELSCOPE=${VLLM_USE_MODELSCOPE:-1}
DOWNLOAD_MODEL=${DOWNLOAD_MODEL:-1}
CONVERT_MODEL=${CONVERT_MODEL:-1}
PORT=${PORT:-8001}
echo "Starting IndexTTS server..."
echo "Model directory: $MODEL_DIR"
echo "Model: $MODEL"
echo "Use ModelScope: $VLLM_USE_MODELSCOPE"
# Function to check if model directory exists and has required files
check_model_exists() {
if [ ! -d "$MODEL_DIR" ]; then
echo "Model directory $MODEL_DIR does not exist"
return 1
fi
# Check for download completion marker
if [ ! -f "$MODEL_DIR/.download_complete" ]; then
echo "Model download not completed (marker file missing)"
return 1
fi
# Check for essential model files
if [ ! -f "$MODEL_DIR/config.yaml" ] || [ ! -f "$MODEL_DIR/gpt.pth" ] || [ ! -f "$MODEL_DIR/bigvgan_generator.pth" ]; then
echo "Essential model files not found in $MODEL_DIR"
return 1
fi
echo "Model files found in $MODEL_DIR"
return 0
}
# Function to check if model conversion is complete
check_conversion_complete() {
if [ -f "$MODEL_DIR/.conversion_complete" ]; then
echo "Model conversion already completed"
return 0
fi
return 1
}
# Function to download model from HuggingFace
download_from_huggingface() {
echo "Downloading model from HuggingFace: $MODEL"
# Create model directory
mkdir -p "$MODEL_DIR"
# Use huggingface-cli to download the model
if ! huggingface-cli download "$MODEL" --local-dir "$MODEL_DIR" --local-dir-use-symlinks False; then
echo "Error: Failed to download model from HuggingFace"
exit 1
fi
# Create download marker file
touch "$MODEL_DIR/.download_complete"
echo "Download completed successfully!"
}
# Function to download model from ModelScope
download_from_modelscope() {
echo "Downloading model from ModelScope: $MODEL"
# Create model directory
mkdir -p "$MODEL_DIR"
# Use modelscope CLI to download the model
if ! modelscope download --model "$MODEL" --local_dir "$MODEL_DIR"; then
echo "Error: Failed to download model from ModelScope"
exit 1
fi
# Create download marker file
touch "$MODEL_DIR/.download_complete"
echo "Download completed successfully!"
}
# Check if model exists and download if necessary
if [ "$DOWNLOAD_MODEL" = "1" ]; then
if ! check_model_exists; then
echo "Model not found, downloading..."
# Download based on VLLM_USE_MODELSCOPE setting
if [ "$VLLM_USE_MODELSCOPE" = "1" ]; then
download_from_modelscope
else
download_from_huggingface
fi
# Verify download
if ! check_model_exists; then
echo "Error: Model download failed or files are missing"
exit 1
fi
else
echo "Model already exists, skipping download"
fi
else
echo "Model download disabled (DOWNLOAD_MODEL=0)"
if ! check_model_exists; then
echo "Error: Model not found and download is disabled"
exit 1
fi
fi
# Convert model format if requested
if [ "$CONVERT_MODEL" = "1" ]; then
if ! check_conversion_complete; then
echo "Converting model format..."
# Run conversion and capture the exit code
bash convert_hf_format.sh "$MODEL_DIR"
conversion_exit_code=$?
# Check if conversion was successful by verifying the vllm directory exists
if [ $conversion_exit_code -eq 0 ] && [ -d "$MODEL_DIR/vllm" ] && [ -f "$MODEL_DIR/vllm/model.safetensors" ]; then
# Create conversion marker file on success
touch "$MODEL_DIR/.conversion_complete"
echo "Model conversion completed successfully"
else
echo "Error: Model conversion failed (exit code: $conversion_exit_code)"
exit 1
fi
else
echo "Model conversion already completed, skipping"
fi
else
echo "Model conversion disabled (CONVERT_MODEL=0)"
fi
# Start the API server
echo "Starting IndexTTS API server on port $PORT..."
VLLM_USE_V1=0 python3 api_server.py --model_dir "$MODEL_DIR" --port "$PORT" --gpu_memory_utilization="${GPU_MEMORY_UTILIZATION:-0.25}"
{"prompt_audio":"voice_01.wav","text":"Translate for me,what is a surprise!","emo_mode":0}
{"prompt_audio":"voice_02.wav","text":"The palace is strict, no false rumors, Lady Qi!","emo_mode":0}
{"prompt_audio":"voice_03.wav","text":"这个呀,就是我们精心制作准备的纪念品,大家可以看到这个色泽和这个材质啊,哎呀多么的光彩照人。","emo_mode":0}
{"prompt_audio":"voice_04.wav","text":"你就需要我这种专业人士的帮助,就像手无缚鸡之力的人进入雪山狩猎,一定需要最老练的猎人指导。","emo_mode":0}
{"prompt_audio":"voice_05.wav","text":"在真正的日本剑道中,格斗过程极其短暂,常常短至半秒,最长也不超过两秒,利剑相击的转瞬间,已有一方倒在血泊中。但在这电光石火的对决之前,双方都要以一个石雕般凝固的姿势站定,长时间的逼视对方,这一过程可能长达十分钟!","emo_mode":0}
{"prompt_audio":"voice_06.wav","text":"今天呢,咱们开一部新书,叫《赛博朋克二零七七》。这词儿我听着都新鲜。这赛博朋克啊,简单理解就是“高科技,低生活”。这一听,我就明白了,于老师就爱用那高科技的东西,手机都得拿脚纹开,大冬天为了解锁脱得一丝不挂,冻得跟王八蛋似的。","emo_mode":0}
{"prompt_audio":"voice_07.wav","emo_audio":"emo_sad.wav","emo_weight": 0.9, "emo_mode":1,"text":"酒楼丧尽天良,开始借机竞拍房间,哎,一群蠢货。"}
{"prompt_audio":"voice_08.wav","emo_audio":"emo_hate.wav","emo_weight": 0.8, "emo_mode":1,"text":"你看看你,对我还有没有一点父子之间的信任了。"}
{"prompt_audio":"voice_09.wav","emo_vec_3":0.55,"emo_mode":2,"text":"对不起嘛!我的记性真的不太好,但是和你在一起的事情,我都会努力记住的~"}
{"prompt_audio":"voice_10.wav","emo_vec_7":0.45,"emo_mode":2,"text":"哇塞!这个爆率也太高了!欧皇附体了!"}
{"prompt_audio":"voice_11.wav","emo_mode":3,"emo_text":"极度悲伤","text":"这些年的时光终究是错付了... "}
{"prompt_audio":"voice_12.wav","emo_mode":3,"emo_text":"You scared me to death! What are you, a ghost?","text":"快躲起来!是他要来了!他要来抓我们了!"}