"vscode:/vscode.git/clone" did not exist on "0e0ec702007b79e7f45cd8efc48292b087369d3f"
Commit 39ac40a9 authored by chenzk's avatar chenzk
Browse files

v1.0

parents
Pipeline #2747 failed with stages
in 0 seconds
[submodule "third_party/GLM-4-Voice"]
path = third_party/GLM-4-Voice
url = https://github.com/shenyunhang/GLM-4-Voice.git
[submodule "third_party/seed-tts-eval"]
path = third_party/seed-tts-eval
url = https://github.com/shenyunhang/seed-tts-eval.git
---
frameworks:
- Pytorch
license: Apache License 2.0
tasks:
- auto-speech-recognition
#model-type:
##e.g. gpt, phi, llama, chatglm, baichuan, etc.
#- gpt
#domain:
##e.g. nlp, cv, audio, multi-modal
#- nlp
#language:
##language code list: https://help.aliyun.com/document_detail/215387.html?spm=a2c4g.11186623.0.0.9f8d7467kni6Aa
#- cn
#metrics:
##e.g. CIDEr, BLEU, ROUGE, etc.
#- CIDEr
#tags:
##custom tags, e.g. training methods such as pretrained, fine-tuned, instruction-tuned, RL-tuned, etc.
#- pretrained
#tools:
##e.g. vllm, fastchat, llamacpp, AdaSeq, etc.
#- vllm
---
# Highlights
**SenseVoice** focuses on high-accuracy multilingual speech recognition, speech emotion recognition, and audio event detection.
- **Multilingual recognition:** Trained on more than 400,000 hours of data, it supports over 50 languages and outperforms the Whisper model in recognition accuracy.
- **Rich-text transcription:**
  - Excellent emotion recognition, matching or exceeding the best current emotion-recognition models on the test data.
  - Audio event detection, covering common human-machine interaction events such as music, applause, laughter, crying, coughing, and sneezing.
- **Efficient inference:** The SenseVoice-Small model uses a non-autoregressive end-to-end framework with extremely low latency: 10 s of audio is transcribed in only 70 ms, 15x faster than Whisper-Large.
- **Convenient fine-tuning:** Ready-to-use fine-tuning scripts and strategies make it easy to fix long-tail problems for specific business scenarios.
- **Service deployment:** A complete deployment pipeline supports multiple concurrent requests, with client-side support for Python, C++, HTML, Java, C#, and more.
## <strong>[SenseVoice Open-Source Project](https://github.com/FunAudioLLM/SenseVoice)</strong>
The open-source <strong>[SenseVoice](https://github.com/FunAudioLLM/SenseVoice)</strong> model is a multilingual audio understanding model with speech recognition, language identification, speech emotion recognition, and acoustic event detection capabilities.
[**GitHub repository**](https://github.com/FunAudioLLM/SenseVoice)
| [**What's new**](https://github.com/FunAudioLLM/SenseVoice/blob/main/README_zh.md#%E6%9C%80%E6%96%B0%E5%8A%A8%E6%80%81)
| [**Installation**](https://github.com/FunAudioLLM/SenseVoice/blob/main/README_zh.md#%E7%8E%AF%E5%A2%83%E5%AE%89%E8%A3%85)
# Model Architecture
SenseVoice is a multilingual audio understanding model that supports speech recognition, language identification, speech emotion recognition, acoustic event detection, inverse text normalization, and more. It is trained on hundreds of thousands of hours of industrial-grade annotated audio, which guarantees strong general-purpose recognition performance. The model can be applied to Mandarin Chinese, Cantonese, English, Japanese, and Korean audio, and outputs rich-text transcriptions with emotion and event labels.
<p align="center">
<img src="fig/sensevoice.png" alt="SenseVoice模型结构" width="1500" />
</p>
SenseVoice-Small is built on a non-autoregressive end-to-end framework. To specify the task, we prepend four embeddings to the speech features as input to the encoder (a conceptual sketch follows the list):
- LID: predicts the language of the audio.
- SER: predicts the emotion label of the audio.
- AED: predicts the event labels contained in the audio.
- ITN: specifies whether inverse text normalization is applied to the recognized text.
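A minimal PyTorch sketch of this input construction, for illustration only (the embedding dimension, frame count, and variable names are assumptions, not the actual SenseVoice implementation):
```python
import torch
import torch.nn as nn

# Illustrative sketch: prepend the four task embeddings (LID, SER, AED, ITN)
# to the speech features before the encoder. Dimensions are assumptions.
d_model = 512
task_vocab = {"LID": 0, "SER": 1, "AED": 2, "ITN": 3}
task_embed = nn.Embedding(len(task_vocab), d_model)

speech_features = torch.randn(1, 200, d_model)           # (batch, frames, d_model)
task_ids = torch.tensor([[0, 1, 2, 3]])                   # one id per task prompt
prompt = task_embed(task_ids)                             # (batch, 4, d_model)
encoder_input = torch.cat([prompt, speech_features], 1)   # (batch, 4 + frames, d_model)
```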
# Requirements
Before running inference, be sure to update funasr and modelscope to the latest versions:
```shell
pip install -U funasr modelscope
```
# Usage
## Inference
### Inference with the ModelScope pipeline
```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='iic/SenseVoiceSmall',
    model_revision="master",
    device="cuda:0",
)
rec_result = inference_pipeline('https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav')
print(rec_result)
```
### Inference with FunASR
Supports audio input in any format and of any duration.
```python
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess
model_dir = "iic/SenseVoiceSmall"
model = AutoModel(
    model=model_dir,
    trust_remote_code=True,
    remote_code="./model.py",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
)

# en
res = model.generate(
    input=f"{model.model_path}/example/en.mp3",
    cache={},
    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=True,
    batch_size_s=60,
    merge_vad=True,
    merge_length_s=15,
)
text = rich_transcription_postprocess(res[0]["text"])
print(text)
```
Parameter descriptions:
- `model_dir`: model name, or the path to the model on local disk.
- `trust_remote_code`:
  - `True` means the model implementation is loaded from `remote_code`, which specifies the location of the model code (e.g. `model.py` in the current directory); absolute paths, relative paths, and network URLs are supported.
  - `False` means the model implementation is the version integrated in [FunASR](https://github.com/modelscope/FunASR); in that case, editing `model.py` in the current directory has no effect because the FunASR-internal version is loaded. The model code is available [here](https://github.com/modelscope/FunASR/tree/main/funasr/models/sense_voice).
- `vad_model`: enables VAD, which splits long audio into short segments. The reported inference time then includes both VAD and SenseVoice (the full pipeline latency); to benchmark the SenseVoice model alone, disable the VAD model.
- `vad_kwargs`: VAD model configuration; `max_single_segment_time` is the maximum segment length produced by `vad_model`, in milliseconds (ms).
- `use_itn`: whether the output includes punctuation and inverse text normalization.
- `batch_size_s`: dynamic batching; the total audio duration in a batch, in seconds (s).
- `merge_vad`: whether to merge the short segments produced by the VAD model; the merged length is `merge_length_s`, in seconds (s).
- `ban_emo_unk`: disables the emo_unk label so that every sentence is assigned an emotion label; defaults to `False` (see the sketch just below).
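For example, a minimal sketch that sets `ban_emo_unk` when constructing the model (treating it as a model keyword argument is an assumption based on the description above; consult the FunASR documentation for the authoritative usage):
```python
# Assumption: ban_emo_unk is accepted as an AutoModel keyword argument.
model = AutoModel(
    model=model_dir,
    trust_remote_code=True,
    device="cuda:0",
    ban_emo_unk=True,  # every sentence is assigned an emotion label
)
```
If the input audio is already short (under 30 s) and VAD is not needed, the model can also be used directly with a fixed batch size: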
```python
model = AutoModel(model=model_dir, trust_remote_code=True, device="cuda:0")

res = model.generate(
    input=f"{model.model_path}/example/en.mp3",
    cache={},
    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=True,
    batch_size=64,
)
```
For more detailed usage, please refer to the [documentation](https://github.com/modelscope/FunASR/blob/main/docs/tutorial/README.md).
## Model Download
The code above downloads the model automatically. If you prefer to download the model in advance for offline use, run the commands below and then point to the local path (see the sketch after the download commands).
SDK download:
```bash
# install ModelScope
pip install modelscope
```
```python
# download the model via the SDK
from modelscope import snapshot_download
model_dir = snapshot_download('iic/SenseVoiceSmall')
```
Git download:
```
# download the model via git
git clone https://www.modelscope.cn/iic/SenseVoiceSmall.git
```
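After downloading, point the loading code at the local directory instead of the model ID. A minimal sketch (the local path below is an assumption; use whatever `snapshot_download` returned or the directory produced by `git clone`):
```python
from funasr import AutoModel

# Assumption: the model was downloaded to ./SenseVoiceSmall (e.g. by the git clone above).
model = AutoModel(model="./SenseVoiceSmall", trust_remote_code=True, device="cuda:0")
```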
## Service Deployment
A complete service deployment pipeline, with multi-concurrency support and the client SDKs listed above, is provided by the [SenseVoice open-source project](https://github.com/FunAudioLLM/SenseVoice).
# Performance
## Speech Recognition
We compared SenseVoice and Whisper on open-source benchmark datasets (AISHELL-1, AISHELL-2, WenetSpeech, LibriSpeech, and Common Voice) in terms of multilingual recognition accuracy and inference efficiency. SenseVoice-Small shows a clear advantage on Chinese and Cantonese.
<p align="center">
<img src="fig/asr_results.png" alt="SenseVoice模型在开源测试集上的表现" width="2500" />
</p>
## Speech Emotion Recognition
Since there is currently no widely adopted benchmark or metric for speech emotion recognition, we evaluated on multiple test sets with several metrics and compared against a range of recent benchmark results. The selected test sets cover both Chinese and English, and include acted, film/TV, and natural-conversation styles. Without fine-tuning on the target data, SenseVoice matches or exceeds the best current emotion-recognition models on these test sets.
<p align="center">
<img src="fig/ser_table.png" alt="SenseVoice模型SER效果1" width="1500" />
</p>
We also compared several open-source emotion-recognition models on these test sets. The results show that SenseVoice-Large achieves the best performance on almost all datasets, while SenseVoice-Small also surpasses the other open-source models on most of them.
<p align="center">
<img src="fig/ser_figure.png" alt="SenseVoice模型SER效果2" width="500" />
</p>
## Audio Event Detection
Although SenseVoice is trained only on speech data, it can still be used as a standalone event detection model. We compared it with the widely used BEATs and PANNs models on the ESC-50 environmental sound classification dataset. SenseVoice performs well on these tasks, but due to its training data and training approach there is still a gap compared with dedicated event-detection models.
<p align="center">
<img src="fig/aed_figure.png" alt="SenseVoice模型AED效果" width="500" />
</p>
## Inference Efficiency
SenseVoice-Small uses a non-autoregressive end-to-end architecture with extremely low inference latency. With a parameter count comparable to Whisper-Small, it is 7x faster than Whisper-Small and 17x faster than Whisper-Large. In addition, SenseVoice-Small's inference time does not increase noticeably as the audio duration grows.
<p align="center">
<img src="fig/inference.png" alt="SenseVoice模型的推理效率" width="1500" />
</p>
<p style="color: lightgrey;">如果您是本模型的贡献者,我们邀请您根据<a href="https://modelscope.cn/docs/ModelScope%E6%A8%A1%E5%9E%8B%E6%8E%A5%E5%85%A5%E6%B5%81%E7%A8%8B%E6%A6%82%E8%A7%88" style="color: lightgrey; text-decoration: underline;">模型贡献文档</a>,及时完善模型卡片内容。</p>
from io import BytesIO
import sys
import librosa
import numpy as np
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, GenerationConfig
from ..utils.misc import print_once
from .base import BaseModel
from vita_audio.data.processor.audio_processor import add_audio_input_contiguous
from vita_audio.tokenizer import get_audio_tokenizer
chat_template = """
{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- message.content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n
"""
class VITAAudio(BaseModel):
    NAME = 'VITA-Audio'

    def __init__(self,
                 model_path="VITA-MLLM/VITA-Audio-Plus-Boost",
                 device='cuda',
                 torch_dtype=torch.bfloat16,
                 **kwargs):
        self.device = device

        self.config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)

        self.vita_model = AutoModelForCausalLM.from_pretrained(
            model_path,
            trust_remote_code=True,
            torch_dtype=torch_dtype,
            attn_implementation="flash_attention_2",
        ).to(device).eval()

        self.vita_tokenizer = AutoTokenizer.from_pretrained(
            model_path,
            trust_remote_code=True,
            chat_template=chat_template,
        )

        self.vita_model.generation_config = GenerationConfig.from_pretrained(
            model_path, trust_remote_code=True
        )
        self.vita_model.generation_config.max_new_tokens = 2048
        self.vita_model.generation_config.chat_format = "chatml"
        self.vita_model.generation_config.max_window_size = 2048
        self.vita_model.generation_config.use_cache = True
        # self.vita_model.generation_config.use_cache = False
        self.vita_model.generation_config.do_sample = False

        sys.path.append("glm4voice/")
        sys.path.append("glm4voice/cosyvoice/")
        sys.path.append("glm4voice/third_party/Matcha-TTS/")

        audio_tokenizer_path = "/data/models/THUDM/glm-4-voice-tokenizer"
        flow_path = "/data/models/THUDM/glm-4-voice-decoder"
        audio_tokenizer_type = "sensevoice_glm4voice"

        self.audio_tokenizer = get_audio_tokenizer(
            audio_tokenizer_path,
            audio_tokenizer_type,
            flow_path=flow_path,
            # rank=audio_tokenizer_rank,
        )

        self.default_system_message = [
        ]
        self.luke_system_message = [
            {
                "role": "system",
                "content": "Your Name: Luke\nYour Gender: male\n\nRespond in a text-audio interleaved manner.",
            },
        ]

        self.add_generation_prompt = True

        torch.cuda.empty_cache()

    def get_system_message(self, msg: dict):
        # Pick a system prompt based on the task described in msg['meta'].
        meta = msg['meta']
        if meta is None:
            return self.default_system_message
        if meta['task'] == 'ASR':
            return self.default_system_message
        return self.luke_system_message

    def get_task_message(self, msg: dict):
        # Build the user turn; <|audio|> marks where the audio input is injected.
        meta = msg['meta']
        if meta['task'] == 'ASR':
            messages = [
                {
                    "role": "user",
                    "content": "Convert the speech to text.\n<|audio|>",
                },
            ]
        elif meta['interactive'] == 'Audio-QA':
            messages = [
                {
                    "role": "user",
                    "content": "<|audio|>",
                },
            ]
        elif meta['audio_type'] == 'AudioEvent':
            messages = [
                {
                    "role": "user",
                    "content": msg['text'] + "\n<|audio|>",
                },
            ]
        else:
            messages = [
                {
                    "role": "user",
                    "content": msg['text'] + "\n<|audio|>",
                },
            ]
        return messages

    def generate_inner(self, msg: dict):
        audio_path = msg['audio']
        if len(audio_path) == 1:
            audio_path = audio_path[0]
        prompt_audio_path = None

        messages = self.get_task_message(msg)
        system_message = self.get_system_message(msg)

        # only for dump
        messages = system_message + messages
        print_once(f'messages: {messages}')

        if prompt_audio_path is not None:
            if self.audio_tokenizer.apply_to_role("system", is_discrete=True):
                # discrete codec
                prompt_audio_tokens = self.audio_tokenizer.encode(prompt_audio_path)
                prompt_audio_tokens = "".join(f"<|audio_{i}|>" for i in prompt_audio_tokens)
                system_message = [
                    {
                        "role": "system",
                        "content": f"Your Voice: <|begin_of_audio|>{prompt_audio_tokens}<|end_of_audio|>\n",
                    },
                ]
            else:
                # contiguous codec
                system_message = system_message

        if audio_path is not None and self.audio_tokenizer.apply_to_role("user", is_discrete=True):
            # discrete codec
            audio_tokens = self.audio_tokenizer.encode(audio_path)
            audio_tokens = "".join(f"<|audio_{i}|>" for i in audio_tokens)
            messages[-1]["content"] = messages[-1]["content"].replace(
                "<|audio|>", f"<|begin_of_audio|>{audio_tokens}<|end_of_audio|>"
            )

        input_ids = self.vita_tokenizer.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=self.add_generation_prompt,
        )

        if audio_path is not None and self.audio_tokenizer.apply_to_role("user", is_contiguous=True):
            # contiguous codec
            input_ids, audios, audio_indices = add_audio_input_contiguous(
                input_ids, [audio_path], self.vita_tokenizer, self.audio_tokenizer
            )
        else:
            audios = None
            audio_indices = None

        input_ids = torch.tensor([input_ids], dtype=torch.long).to("cuda")

        responses = self.vita_model.generate(
            input_ids,
            audios=audios,
            audio_indices=audio_indices,
        )
        response = responses[0][len(input_ids[0]):]

        # Split the generated ids into audio tokens and text tokens.
        # audio_offset = self.vita_tokenizer.convert_tokens_to_ids("<|audio_0|>")
        audio_offset = self.vita_tokenizer.convert_tokens_to_ids("<|begin_of_audio|>")
        audio_tokens = []
        text_tokens = []
        for token_id in response:
            if token_id >= audio_offset:
                audio_tokens.append(token_id - audio_offset)
            else:
                text_tokens.append(token_id)

        # if len(audio_tokens) > 0:
        #     tts_speech = self.audio_tokenizer.decode(
        #         audio_tokens, source_speech_16k=prompt_audio_path
        #     )
        # else:
        #     tts_speech = None

        out_text = self.vita_tokenizer.decode(
            text_tokens, skip_special_tokens=True,
        )
        # print_once(f'{out_text=}')

        return self.vita_tokenizer.decode(input_ids[0], skip_special_tokens=False), out_text
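# Usage sketch (not part of the original file; illustrative only). Because this module uses
# relative imports, it is normally loaded through the Kimi-Audio-Evalkit runner rather than
# executed directly. The msg layout follows get_system_message()/get_task_message() above;
# the audio path is a placeholder.
#
#   model = VITAAudio(model_path="VITA-MLLM/VITA-Audio-Plus-Boost", device="cuda")
#   msg = {"audio": ["path/to/audio.wav"], "text": "", "meta": {"task": "ASR"}}
#   prompt_dump, transcript = model.generate_inner(msg)
#   print(transcript)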
#!/bin/bash
set -e
set -x
SEQ_LENGTH="$1"
if [ -z "$SEQ_LENGTH" ]
then
SEQ_LENGTH=32768
fi
timestamp="$2"
if [ -z "$timestamp" ]
then
timestamp=`date +'%Y%m%d_%H%M%S'`
fi
######################################################################
export ROOT_PATH=/data/
export CODE_PATH=${ROOT_PATH}/VITA-Audio/
export LOCAL_ROOT_PATH=/data_local/
export LOCAL_CODE_PATH=${LOCAL_ROOT_PATH}/VITA-Audio/
mkdir -p ${LOCAL_ROOT_PATH}
mkdir -p ${LOCAL_CODE_PATH}
apt install -y rsync
mkdir -p ${LOCAL_CODE_PATH}
rsync -a --exclude ".git" --exclude ".gitee" ${CODE_PATH}/ ${LOCAL_CODE_PATH}/
cd ${LOCAL_CODE_PATH}
rm -fr datasets
ln -s ${ROOT_PATH}/data datasets
######################################################################
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
source ${CODE_PATH}/scripts/set_env_ds_gpu.sh
######################################################################
OUTPUT_DIR=${ROOT_PATH}/output/LM/"$0"/${timestamp}/
mkdir -p ${OUTPUT_DIR}
rsync -avh $0 ${OUTPUT_DIR}
export HF_HOME="${ROOT_PATH}/data/HF_HOME/"
mkdir -p ${HF_HOME}
export HF_ENDPOINT=https://hf-mirror.com
export MODELSCOPE_CACHE="${ROOT_PATH}/data/MODELSCOPE_CACHE/"
mkdir -p ${MODELSCOPE_CACHE}
export LC_ALL="en_US.utf8"
######################################################################
LOG=${OUTPUT_DIR}/log_node${INDEX}.txt
exec &> >(tee -a "$LOG")
echo Logging output to "$LOG"
######################################################################
rsync -avh -P ${CODE_PATH}/Kimi-Audio-Evalkit/ /data/Kimi-Audio-Evalkit/
cd /data/Kimi-Audio-Evalkit/
######################################################################
if true
#if false
then
bash run_audio.sh \
--model VITA-Audio \
--data "LibriSpeech AISHELL-1 AISHELL-2 WenetSpeech Fleurs-en Fleurs-zh" \
--work-dir ${OUTPUT_DIR}
fi
if true
#if false
then
bash run_audio.sh \
--model VITA-Audio \
--data "mmsu openbookqa sd-qa advbench alpacaeval_full commoneval ifeval OpenAudioBench" \
--work-dir ${OUTPUT_DIR} \
--skip-eval
export OPENAI_API_KEY=""
bash run_audio.sh \
--model VITA-Audio \
--data "mmsu openbookqa sd-qa advbench alpacaeval_full commoneval ifeval OpenAudioBench" \
--work-dir ${OUTPUT_DIR} \
--reeval
fi
set +x
Copyright (C) 2024 THL A29 Limited, a Tencent company. All rights reserved. The below software and/or models in this distribution may have been modified by THL A29 Limited ("Tencent Modifications"). All Tencent Modifications are Copyright (C) THL A29 Limited.
License Terms of the VITA1.5:
--------------------------------------------------------------------
Permission is hereby granted, free of charge, to any person obtaining a copy of this Software and associated documentation files, to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, and/or sublicense copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
- You agree to use the VITA1.5 only for academic, research and education purposes, and refrain from using it for any commercial or production purposes under any circumstances.
- The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
For avoidance of doubts, "Software" means the VITA1.5 model inference-enabling code, and weights made available under this license.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Other dependencies and licenses:
Open Source Model Licensed under the Apache License Version 2.0:
The below software in this distribution may have been modified by THL A29 Limited ("Tencent Modifications"), as model weights provided for the VITA1.5 Project hereunder is fine-tuned with the assistance of below model.
All Tencent Modifications are Copyright (C) 2024 THL A29 Limited.
--------------------------------------------------------------------
1. Qwen2-7B-Instruct
Copyright 2024 Alibaba Cloud
Terms of the Apache License Version 2.0:
--------------------------------------------------------------------
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
You must give any other recipients of the Work or Derivative Works a copy of this License; and
You must cause any modified files to carry prominent notices stating that You changed the files; and
You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
Open Source Model/Software Licensed under the Apache License Version 2.0:
The below software in this distribution may have been modified by THL A29 Limited ("Tencent Modifications"). All Tencent Modifications are Copyright (C) 2024 THL A29 Limited.
--------------------------------------------------------------------
1. ModelLink
Copyright (c) 2024, HUAWEI CORPORATION. All rights reserved.
A copy of the Apache License Version 2.0 is included in this file.
Open Source Model/Software Licensed under the Apache License Version 2.0 and Other Licenses of the Third-Party Components therein:
--------------------------------------------------------------------
1. opencv
Copyright (C) 2000-2022, Intel Corporation, all rights reserved.
Copyright (C) 2009-2011, Willow Garage Inc., all rights reserved.
Copyright (C) 2009-2016, NVIDIA Corporation, all rights reserved.
Copyright (C) 2010-2013, Advanced Micro Devices, Inc., all rights reserved.
Copyright (C) 2015-2023, OpenCV Foundation, all rights reserved.
Copyright (C) 2008-2016, Itseez Inc., all rights reserved.
Copyright (C) 2019-2023, Xperience AI, all rights reserved.
Copyright (C) 2019-2022, Shenzhen Institute of Artificial Intelligence and Robotics for Society, all rights reserved.
Copyright (C) 2022-2023, Southern University of Science And Technology, all rights reserved.
A copy of the Apache 2.0 is included in this file.
For the license of other third party components, please refer to the following URL:
https://github.com/opencv/opencv/tree/4.10.0/3rdparty
Open Source Model/Software Licensed under the BSD 3-Clause License:
--------------------------------------------------------------------
1. flask
Copyright 2010 Pallets
2. flask-restful
Copyright (c) 2013, Twilio, Inc.
All rights reserved.
Terms of the BSD 3-Clause License:
--------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Open Source Model/Software Licensed under the BSD 3-Clause License and Other Licenses of the Third-Party Components therein:
--------------------------------------------------------------------
1. Megatron-LM
Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
A copy of the BSD 3-Clause is included in this file.
For the license of other third party components, please refer to the following URL:
https://github.com/NVIDIA/Megatron-LM/blob/master/LICENSE
Open Source Model/Software Licensed under the BSD 3-Clause License and Other Licenses of the Third-Party Components therein:
--------------------------------------------------------------------
1. MindSpeed
Copyright (c) 2024, Bytedance Inc.
Copyright (c) 2023, Huawei Technologies Co., Ltd
Copyright (c) 2022, NVIDIA CORPORATION.
All rights reserved.
A copy of the BSD 3-Clause is included in this file.
For the license of other third party components, please refer to the following URL:
https://gitee.com/ascend/MindSpeed/blob/master/LICENSE
Open Source Model/Software Licensed under the MIT License:
--------------------------------------------------------------------
1. natsort
Copyright (c) 2012-2023 Seth M. Morton
Terms of the MIT License:
--------------------------------------------------------------------
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
---
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct/blob/main/LICENSE
language:
- en
pipeline_tag: text-generation
base_model: Qwen/Qwen2.5-7B
tags:
- chat
library_name: transformers
---
# Qwen2.5-7B-Instruct
<a href="https://chat.qwenlm.ai/" target="_blank" style="margin: 2px;">
<img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
</a>
## Introduction
Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2:
- Significantly **more knowledge** and has greatly improved capabilities in **coding** and **mathematics**, thanks to our specialized expert models in these domains.
- Significant improvements in **instruction following**, **generating long texts** (over 8K tokens), **understanding structured data** (e.g., tables), and **generating structured outputs**, especially JSON. **More resilient to the diversity of system prompts**, enhancing role-play implementation and condition-setting for chatbots.
- **Long-context Support** up to 128K tokens and can generate up to 8K tokens.
- **Multilingual support** for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.
**This repo contains the instruction-tuned 7B Qwen2.5 model**, which has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 7.61B
- Number of Parameters (Non-Embedding): 6.53B
- Number of Layers: 28
- Number of Attention Heads (GQA): 28 for Q and 4 for KV
- Context Length: Full 131,072 tokens and generation 8192 tokens
- Please refer to [this section](#processing-long-texts) for detailed instructions on how to deploy Qwen2.5 for handling long texts.
For more details, please refer to our [blog](https://qwenlm.github.io/blog/qwen2.5/), [GitHub](https://github.com/QwenLM/Qwen2.5), and [Documentation](https://qwen.readthedocs.io/en/latest/).
## Requirements
The code for Qwen2.5 is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.
With `transformers<4.37.0`, you will encounter the following error:
```
KeyError: 'qwen2'
```
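If you are not sure which version is installed, a quick check (a trivial sketch):
```python
import transformers

# The qwen2 model type requires transformers >= 4.37.0.
print(transformers.__version__)
```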
## Quickstart
Here is a code snippet showing how to use `apply_chat_template` to load the tokenizer and model and generate content.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Qwen/Qwen2.5-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
### Processing Long Texts
The current `config.json` is set for context length up to 32,768 tokens.
To handle extensive inputs exceeding 32,768 tokens, we utilize [YaRN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.
For supported frameworks, you could add the following to `config.json` to enable YaRN:
```json
{
    ...,
    "rope_scaling": {
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
        "type": "yarn"
    }
}
```
For deployment, we recommend using vLLM.
Please refer to our [Documentation](https://qwen.readthedocs.io/en/latest/deployment/vllm.html) for usage if you are not familiar with vLLM.
Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts**.
We advise adding the `rope_scaling` configuration only when processing long contexts is required.
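If you prefer not to edit `config.json` by hand, the same setting can be applied in code. A minimal sketch with Hugging Face `transformers` (mirroring the JSON block above; treat this as an illustration rather than the officially documented route):
```python
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
# Mirror the rope_scaling block shown above to enable YaRN for long inputs.
config.rope_scaling = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    config=config,
    torch_dtype="auto",
    device_map="auto",
)
```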
## Evaluation & Performance
Detailed evaluation results are reported in this [📑 blog](https://qwenlm.github.io/blog/qwen2.5/).
For requirements on GPU memory and the respective throughput, see results [here](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html).
## Citation
If you find our work helpful, feel free to give us a cite.
```
@misc{qwen2.5,
title = {Qwen2.5: A Party of Foundation Models},
url = {https://qwenlm.github.io/blog/qwen2.5/},
author = {Qwen Team},
month = {September},
year = {2024}
}
@article{qwen2,
title={Qwen2 Technical Report},
author={An Yang and Baosong Yang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Zhou and Chengpeng Li and Chengyuan Li and Dayiheng Liu and Fei Huang and Guanting Dong and Haoran Wei and Huan Lin and Jialong Tang and Jialin Wang and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Ma and Jin Xu and Jingren Zhou and Jinze Bai and Jinzheng He and Junyang Lin and Kai Dang and Keming Lu and Keqin Chen and Kexin Yang and Mei Li and Mingfeng Xue and Na Ni and Pei Zhang and Peng Wang and Ru Peng and Rui Men and Ruize Gao and Runji Lin and Shijie Wang and Shuai Bai and Sinan Tan and Tianhang Zhu and Tianhao Li and Tianyu Liu and Wenbin Ge and Xiaodong Deng and Xiaohuan Zhou and Xingzhang Ren and Xinyu Zhang and Xipin Wei and Xuancheng Ren and Yang Fan and Yang Yao and Yichang Zhang and Yu Wan and Yunfei Chu and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zhihao Fan},
journal={arXiv preprint arXiv:2407.10671},
year={2024}
}
```
# VITA-Audio
VITA-Audio greatly improves responsiveness when generating the first audio segment, removing a key bottleneck for real-time speech, and its overall inference speed is 3-5x higher than models of the same size.
## Paper
`VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model`
- https://arxiv.org/pdf/2505.03739
## Model Architecture
The core components of VITA-Audio are an audio encoder, an audio decoder, an LLM, and ten lightweight MCTP modules.
<div align=center>
<img src="./doc/VITA_MCTP.png"/>
</div>
## Algorithm
Audio tokens are generated autoregressively, step by step, as the language model (LLM) performs its forward passes; batches of generated audio tokens are then collected and fed into the decoder, which synthesizes playable audio. The key innovations are:
1. When the speech model predicts an audio token, the attention weight carried by the hidden states of the corresponding text tokens is significantly higher than at other positions, so audio generation does not require complex modelling of the global semantics of the entire text-audio sequence.
2. Multiple MCTP modules predict several audio tokens in parallel within a single forward pass, greatly reducing the number of autoregressive iterations; this not only speeds up the overall inference pipeline but also markedly lowers the latency of the first audio segment in streaming scenarios (a conceptual sketch follows the figure below).
<div align=center>
<img src="./doc/relative.png"/>
</div>
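The loop below is a conceptual sketch of this interleaved decoding (illustrative pseudocode, not the actual VITA-Audio implementation; object names, method signatures, and the chunk size are assumptions):
```python
# Conceptual pseudocode: one LLM forward pass plus several lightweight MCTP heads per step.
def generate_interleaved(llm, mctp_heads, audio_decoder, input_ids, chunk_size=32):
    pending_audio = []
    while not llm.finished(input_ids):
        hidden, token = llm.forward_one_step(input_ids)   # main autoregressive step
        step_tokens = [token]
        for head in mctp_heads:                           # each MCTP head predicts one more
            step_tokens.append(head.predict(hidden, step_tokens))  # token from the same pass
        input_ids = input_ids + step_tokens
        pending_audio += [t for t in step_tokens if t.is_audio]
        # Streaming: decode and emit a playable chunk as soon as enough audio tokens exist.
        while len(pending_audio) >= chunk_size:
            yield audio_decoder.decode(pending_audio[:chunk_size])
            pending_audio = pending_audio[chunk_size:]
```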
## Environment Setup
```
mv VITA-Audio_pytorch VITA-Audio  # drop the framework suffix from the directory name
```
### Docker (option 1)
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04-py3.10-fixpy
# Replace <your IMAGE ID> with the image ID of the Docker image pulled above; for this image it is 6063b673703a.
docker run -it --shm-size=64G -v $PWD/VITA-Audio:/home/VITA-Audio -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name va <your IMAGE ID> bash
cd /home/VITA-Audio
pip install -r requirements.txt --user -i https://mirrors.aliyun.com/pypi/simple
pip install whl/torchaudio-2.4.1+das.dtk2504-cp310-cp310-manylinux_2_28_x86_64.whl # torchaudio==2.4.1
pip install -e . # vita_audio==0.0.1
```
### Dockerfile (option 2)
```
cd /home/VITA-Audio/docker
docker build --no-cache -t va:latest .
docker run --shm-size=64G --name va -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v $PWD/../../VITA-Audio:/home/VITA-Audio -it va bash
# If installing the environment via the Dockerfile takes too long, comment out the pip installs inside it and install the Python packages after the container starts: pip install -r requirements.txt
cd /home/VITA-Audio
pip install -r requirements.txt --user -i https://mirrors.aliyun.com/pypi/simple
pip install whl/torchaudio-2.4.1+das.dtk2504-cp310-cp310-manylinux_2_28_x86_64.whl # torchaudio==2.4.1
pip install -e . # vita_audio==0.0.1
```
### Anaconda (option 3)
1. The special deep-learning libraries required by this project on DCU cards can be downloaded and installed from the Guanghe (光合) developer community:
- https://developer.sourcefind.cn/tool/
```
DTK driver: dtk2504
python: 3.10
torch: 2.4.1
torchvision: 0.19.1
torchaudio: 2.4.1
triton: 3.0.0
flash-attn: 2.6.1
deepspeed: 0.14.2
apex: 1.4.0
transformers: 4.48.3
```
`Tip: the DTK driver, Python, torch, and other DCU-related tool versions above must correspond exactly, one to one.`
2. Install the other (non-special) libraries according to requirements.txt:
```
cd /home/VITA-Audio
pip install -r requirements.txt --user -i https://mirrors.aliyun.com/pypi/simple
pip install whl/torchaudio-2.4.1+das.dtk2504-cp310-cp310-manylinux_2_28_x86_64.whl # torchaudio==2.4.1
pip install -e . # vita_audio==0.0.1
```
## Dataset
`None`
## Training
## Inference
Directory layout for the pre-trained weights:
```
/home/VITA-Audio/
|── VITA-MLLM/VITA-Audio-Plus-Boost
|── FunAudioLLM/SenseVoiceSmall
|── THUDM/glm-4-voice-tokenizer
└── THUDM/glm-4-voice-decoder
```
Move the glm-4-voice weights into /data/models:
```
mkdir -p /data/models
mv THUDM /data/models/
```
### Single machine, single card
```
cd /home/VITA-Audio
python tools/inference_sts.py
```
For more details, see [`README_origin`](./README_origin.md) from the upstream project.
## Results
`Input:`
```
asset/piano.mp3
asset/介绍一下上海.wav
asset/发表一个悲伤的演讲.wav
asset/发表一个振奋人心的演讲.wav
```
`Output:`
```
/data/output/LM/inference/asset/piano.mp3
/data/output/LM/inference/asset/介绍一下上海.wav
/data/output/LM/inference/asset/发表一个悲伤的演讲.wav
/data/output/LM/inference/asset/发表一个振奋人心的演讲.wav
```
For the official example results, see [`README_origin`](./README_origin.md) from the upstream project.
### Precision
DCU accuracy matches GPU accuracy; inference framework: PyTorch.
## Application Scenarios
### Algorithm Category
`Speech synthesis`
### Key Application Industries
`Broadcast media, film and television, animation, healthcare, smart home, education`
## Pre-trained Weights
Hugging Face download links: [VITA-MLLM/VITA-Audio-Plus-Boost](https://huggingface.co/VITA-MLLM/VITA-Audio-Plus-Boost), [VITA-MLLM/VITA-Audio-Plus-Vanilla](https://huggingface.co/VITA-MLLM/VITA-Audio-Plus-Vanilla), [FunAudioLLM/SenseVoiceSmall](https://huggingface.co/FunAudioLLM/SenseVoiceSmall), [THUDM/glm-4-voice-tokenizer](https://huggingface.co/THUDM/glm-4-voice-tokenizer), [THUDM/glm-4-voice-decoder](https://huggingface.co/THUDM/glm-4-voice-decoder), [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
## Source Repository and Issue Reporting
- http://developer.sourcefind.cn/codes/modelzoo/VITA-Audio_pytorch.git
## References
- https://github.com/VITA-MLLM/VITA-Audio.git
# VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
<p align="center">
<img src="asset/vita-audio_logo.jpg" width="60%" height="60%">
</p>
<font size=7><div align='center' > [[📖 VITA-Audio Paper](https://arxiv.org/abs/2505.03739)] [[🤖 Model Weight](https://huggingface.co/collections/VITA-MLLM/vita-audio-680f036c174441e7cdf02575)] [[💬 WeChat (微信)](./asset/wechat-group.jpg)]</div></font>
## :fire: News
* **`2025.05.07`** 🌟 We are proud to launch VITA-Audio, an end-to-end large speech model with fast audio-text token generation.
## 📄 Contents <!-- omit in toc -->
- [Highlights](#-highlights)
- [Exhibition](#-exhibition)
- [Models](#-models)
- [Experimental Results](#-experimental-results)
- [Training](#-training)
- [Inference](#-inference)
- [Evaluation](#-evaluation)
## ✨ Highlights
- **Low Latency**. VITA-Audio is the first end-to-end speech model capable of generating audio during the initial forward pass. By utilizing a set of 32 prefill tokens, VITA-Audio reduces the time required to generate the first audio token chunk from 236 ms to just 53 ms.
- **Fast Inference**. VITA-Audio achieves an inference speedup of 3-5x at the 7B parameter scale.
- **Open Source**. VITA-Audio is trained on **open-source data** only, consisting of 200k hours of publicly available audio.
- **Strong Performance**. VITA-Audio achieves competitive results on ASR, TTS, and SQA benchmarks among cutting-edge models under 7B parameters.
## 📌 Exhibition
### Inference Acceleration
Model inference speed under different inference modes.
<p align="center">
<img src="./asset/qa_speed.gif" alt="demogif" width="48%" style="display: inline-block; margin-right: 2%;">
<img src="./asset/tts_speed.gif" alt="second_gif" width="48%" style="display: inline-block;">
</p>
### Time to Generate the First Audio Segment In Streaming Inference
<div align="center">
<img width="400" alt="first audio generate time" src="https://github.com/user-attachments/assets/165f943e-ac53-443f-abba-e5eb1e0c0f40" />
</div>
### Generated Audio Case
> 打南边来了个哑巴,腰里别了个喇叭;打北边来了个喇嘛,手里提了个獭犸。
> 提着獭犸的喇嘛要拿獭犸换别着喇叭的哑巴的喇叭;别着喇叭的哑巴不愿拿喇叭换提着獭玛的喇嘛的獭犸。
> 不知是别着喇叭的哑巴打了提着獭玛的喇嘛一喇叭;还是提着獭玛的喇嘛打了别着喇叭的哑巴一獭玛。
> 喇嘛回家炖獭犸;哑巴嘀嘀哒哒吹喇叭。
https://github.com/user-attachments/assets/38da791f-5d72-4d9c-a9b2-cec97c2f2b2b
---
> To be or not to be--to live intensely and richly,
> merely to exist, that depends on ourselves. Let widen and intensify our relations.
> While we live, let live!
https://github.com/user-attachments/assets/fd478065-4041-4eb8-b331-0c03b304d853
---
> The hair has been so little, don't think about it, go to bed early, for your hair. Good night!
https://github.com/user-attachments/assets/4cfe4742-e237-42bd-9f17-7935b2285799
---
> 两个黄鹂鸣翠柳,
> 一行白鹭上青天。
> 窗含西岭千秋雪,
> 门泊东吴万里船。
https://github.com/user-attachments/assets/382620ee-bb2a-488e-9e00-71afd2342b56
---
## :label: TODO
- [x] Release training code and inference code.
- [x] Release checkpoints.
- [x] Release VITA-Audio-Plus.
- [ ] Release the cleaned open-source data JSON and audio.
## 🔔 Models
| Model | LLM Size | Huggingface Weights |
|-------------------------|----------|---------------------------------------------------------------|
| VITA-Audio-Boost | 7B | https://huggingface.co/VITA-MLLM/VITA-Audio-Boost |
| VITA-Audio-Balance | 7B | https://huggingface.co/VITA-MLLM/VITA-Audio-Balance |
| VITA-Audio-Plus-Vanilla | 7B | https://huggingface.co/VITA-MLLM/VITA-Audio-Plus-Vanilla |
| VITA-Audio-Plus-Boost| 7B | https://huggingface.co/VITA-MLLM/VITA-Audio-Plus-Boost |
## 📈 Experimental Results
- **Comparison of Spoken Question Answering**.
![Clipboard_Screenshot_1746531780](https://github.com/user-attachments/assets/3adcad15-0333-4b92-bfdf-b753b330a3e2)
- **Comparison of Text to Speech**.
![image](https://github.com/user-attachments/assets/09cf8fd3-d7a5-4b77-be49-5a0ace308f3f)
- **Comparison of Automatic Speech Recognition**.
![Clipboard_Screenshot_1746532039](https://github.com/user-attachments/assets/d950cae0-c065-4da9-b37a-a471d28158a0)
![Clipboard_Screenshot_1746532022](https://github.com/user-attachments/assets/929f45cd-693a-4ff6-af73-ceec6e875706)
- **Effectiveness of Inference Acceleration**.
![Clipboard_Screenshot_1746532167](https://github.com/user-attachments/assets/ad8b9e90-cd3c-4968-8653-998811a50006)
![Image](https://github.com/user-attachments/assets/4aa5db8c-362d-4152-8090-92292b9a84c0)
## 📔 Requirements and Installation
### Prepare Environment
```
docker pull shenyunhang/pytorch:24.11-py3_2024-1224
```
### Get the Code
```
git clone https://github.com/VITA-MLLM/VITA-Audio.git
cd VITA-Audio
git submodule update --init --recursive
pip install -r requirements_ds_gpu.txt
pip install -e .
```
### Prepare Pre-trained Weight
#### LLM
- Download the LLM from https://huggingface.co/Qwen/Qwen2.5-7B-Instruct.
- Put it into '../models/Qwen/Qwen2.5-7B-Instruct/'
#### Audio Encoder and Audio Decoder
- Download the Audio Encoder from https://huggingface.co/THUDM/glm-4-voice-tokenizer.
- Put it into '../models/THUDM/glm-4-voice-tokenizer'
- Download the Audio Decoder from https://huggingface.co/THUDM/glm-4-voice-decoder.
- Put it into '../models/THUDM/glm-4-voice-decoder'
### Data Format
#### **Speech QA Data Format**
```jsonc
{
"messages": [
{
"content": "<|audio|>",
"role": "user"
},
{
"content": "好的,这样排列更合理:这些生物废弃物如鸡蛋壳、蛤壳、贻贝壳比其他工业废渣更有价值。研究表明,它们在能源、材料、环境保护等领域有广泛应用。高效利用贝壳能提高资源利用效率,减少废弃物,减轻环境负担。特别是在这些领域中,鸡蛋壳因为含有丰富的钙元素,被用于制造医药品和肥料。\n<|audio|>",
"role": "assistant"
}
],
"audios": [
"datasets/VITA-MLLM/AudioQA-1M/QA_1450K_question_tar/question_shuf_part_8/wav/000000200014510ac1fd776006fc66b36f7f3cda76_question.wav",
"datasets/VITA-MLLM/AudioQA-1M/QA_1450K_answer_part1_tar/answer_part1_shuf_part_3/wav/000000200114510ac1fd776006fc66b36f7f3cda76_F10.wav"
]
}
```
#### **ASR Data Format**
```jsonc
{
"messages": [
{
"content": "Convert the speech to text.\n<|audio|>",
"role": "user"
},
{
"content": "没有跟大家说是在做什么",
"role": "assistant"
}
],
"audios": [
"datasets/wenet-e2e/wenetspeech/data/cuts_L_fixed.00000000/X00/X0000016296_135343932_S00019.wav"
]
}
```
#### **TTS Data Format**
```jsonc
{
"messages": [
{
"content": "Convert the text to speech.\n那我情愿无药可救。",
"role": "user"
},
{
"content": "<|audio|>",
"role": "assistant"
}
],
"audios": [
"datasets/Wenetspeech4TTS/WenetSpeech4TTS/Premium/WenetSpeech4TTS_Premium_9/wavs/X0000001735_50639692_S00035.wav"
]
}
```
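A small sanity check over samples in any of these formats can catch mismatched placeholders early. A minimal sketch (assuming the samples are stored one JSON object per line; the file name is a placeholder):
```python
import json

# Check that each sample has exactly as many audio files as <|audio|> placeholders.
with open("path/to/data.jsonl") as f:
    for line in f:
        sample = json.loads(line)
        n_placeholders = sum(m["content"].count("<|audio|>") for m in sample["messages"])
        assert n_placeholders == len(sample["audios"]), sample
```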
## 🎲 Training
The following tutorial will take `VITA-Audio-Boost` as an example.
- To train `VITA-Audio-Balance` and other variants, you should modify the `text-audio-interval-ratio`.
VITA-Audio-Boost:
```
--text-audio-interval-ratio 1 10 4 10 \
```
VITA-Audio-Balance:
```
--text-audio-interval-ratio 1 4 3 8 4 10 \
```
- To train `VITA-Audio-Plus-*`, you should use the script like `scripts/deepspeed/sts_qwen25/finetune_sensevoice_glm4voice...`
### Stage-1 (Audio-Text Alignment)
```
bash scripts/deepspeed/sts_qwen25/finetune_glm4voice_stage1.sh 8192 `date +'%Y%m%d_%H%M%S'`
```
The above script may need some adjustments.
- Set `ROOT_PATH` to your code root folder.
- Set `LOCAL_ROOT_PATH` to a temporary code root folder.
- Modify other variables as needed for your environment.
### Stage-2 (Single MCTP Module Training)
```
bash scripts/deepspeed/sts_qwen25/finetune_glm4voice_mtp1_stage1.sh 8192 `date +'%Y%m%d_%H%M%S'`
```
The above script may need some adjustments.
- Set `ROOT_PATH` to your code root folder.
- Set `LOCAL_ROOT_PATH` to a temporary code root folder.
- Set `MODEL_NAME_OR_PATH` to the path of the model trained in Stage 1.
- Modify other variables as needed for your environment.
### Stage-3 (Multiple MCTP Modules Training)
```
bash scripts/deepspeed/sts_qwen25/finetune_glm4voice_mtp10_stage1.sh 8192 `date +'%Y%m%d_%H%M%S'`
```
The above script may need some adjustments.
- Set `ROOT_PATH` to your code root folder.
- Set `LOCAL_ROOT_PATH` to a temporary code root folder.
- Set `MODEL_NAME_OR_PATH` to the path of the model trained in Stage 2.
- Modify other variables as needed for your environment.
### Stage-4 (Supervised Fine-tuning)
```
bash scripts/deepspeed/sts_qwen25/finetune_glm4voice_mtp10_stage2.sh 2048 `date +'%Y%m%d_%H%M%S'`
```
The above script may need some adjustments.
- Set `ROOT_PATH` to your code root folder.
- Set `LOCAL_ROOT_PATH` to a temporary code root folder.
- Set `MODEL_NAME_OR_PATH` to the path of the model trained in Stage 3.
- Modify other variables as needed for your environment.
## 📐 Inference
Here we implement a simple script for inference.
It includes examples of speech-to-speech, ASR, and TTS tasks, as well as streaming and non-streaming inference speed testing.
```
python tools/inference_sts.py
```
- Set `model_name_or_path` to VITA-Audio weights.
- Set `audio_tokenizer_path` to the path of the audio encoder.
- Set `flow_path` to the path of the audio decoder.
## 🔎 Evaluation
Evaluate SQA, ASR, and TTS benchmarks
```
bash scripts/deepspeed/evaluate_sts.sh
```
## &#x1F4E3; Statement
**VITA-Audio is trained on large-scale open-source corpus, and its output has randomness. Any content generated by VITA-Audio does not represent the views of the model developers. We are not responsible for any problems arising from the use, misuse, and dissemination of VITA-Audio, including but not limited to public opinion risks and data security issues.**
## :black_nib: Citation
If you find our work helpful for your research, please consider citing the following BibTeX entry.
```bibtex
@misc{,
title={VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model},
author={Zuwei Long and Yunhang Shen and Chaoyou Fu and Heting Gao and Lijiang Li and Peixian Chen and Mengdan Zhang and Hang Shao and Jian Li and Jinlong Peng and Haoyu Cao and Ke Li and Rongrong Ji and Xing Sun},
year={2025},
eprint={2505.03739},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.03739},
}
```
---
license: apache-2.0
datasets:
- VITA-MLLM/VITA-Audio-Data
language:
- zh
- en
base_model:
- Qwen/Qwen2.5-7B-Instruct
---
## ACCEPTABLE USE POLICY
Any license on the model is subject to your compliance with the Acceptable Use Policy, and You must not violate (or encourage or permit anyone else to violate) any term of the Acceptable Use Policy. Tencent reserves the right to update this Acceptable Use Policy from time to time.
Tencent endeavors to promote safe and fair use of its tools and features, including VITA. You agree not to use VITA or any of its derivatives:
1. In any way that violates any applicable national, federal, state, local, international or any other law or regulation;
2. To harm Yourself or others;
3. To repurpose or distribute output from VITA or any of its derivatives to harm Yourself or others;
4. To override or circumvent the safety guardrails and safeguards We have put in place;
5. For the purpose of exploiting, harming or attempting to exploit or harm minors in any way;
6. To generate or disseminate verifiably false information and/or content with the purpose of harming others or influencing elections;
7. To generate or facilitate false online engagement, including fake reviews and other means of fake online engagement;
8. To intentionally defame, disparage or otherwise harass others;
9. To generate and/or disseminate malware (including ransomware) or any other content to be used for the purpose of harming electronic systems;
10. To generate or disseminate personal identifiable information with the purpose of harming others;
11. To generate or disseminate information (including images, code, posts, articles), and place the information in any public context (including –through the use of bot generated tweets), without expressly and conspicuously identifying that the information and/or content is machine generated;
12. To impersonate another individual without consent, authorization, or legal right;
13. To make high-stakes automated decisions in domains that affect an individual’s safety, rights or wellbeing (e.g., law enforcement, migration, medicine/health, management of critical infrastructure, safety components of products, essential services, credit, employment, housing, education, social scoring, or insurance);
14. In a manner that violates or disrespects the social ethics and moral standards of other countries or regions;
15. To perform, facilitate, threaten, incite, plan, promote or encourage violent extremism or terrorism;
16. For any use intended to discriminate against or harm individuals or groups based on protected characteristics or categories, online or offline social behavior or known or predicted personal or personality characteristics;
17. To intentionally exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm;
18. For military purposes;
19. To engage in the unauthorized or unlicensed practice of any profession including, but not limited to, financial, legal, medical/health, or other professional practices.