"Initial commit"

b4af4e0c · luopl · b4af4e0c · b4af4e0c · b4af4e0c · b4af4e0c
Commit b4af4e0c authored Sep 01, 2025 by luopl
20 changed files
--- a/demo/text_examples/2p_short.txt
+++ b/demo/text_examples/2p_short.txt
+Speaker 1: I heard there’s big news in TTS lately?
+Speaker 2: Yes! Microsoft Research just open-sourced VibeVoice. The model can generate speech up to 90 minutes long, with smooth delivery and rich emotion — it’s absolutely amazing.
\ No newline at end of file
--- a/demo/text_examples/2p_yayi.txt
+++ b/demo/text_examples/2p_yayi.txt
+Speaker 1: 波奇酱你搁这儿呢啊! 虽然不知道你咋整的, 我还是买了一裤兜子甜水呢! 卧槽! 撩了的吉他小妹儿! 喜多, 你怎么搁这儿呢?
+Speaker 2: 卧槽! 这谁啊?
+Speaker 1: 别整那些没用的了!
\ No newline at end of file
--- a/demo/text_examples/3p_gpt5.txt
+++ b/demo/text_examples/3p_gpt5.txt
+Speaker 1: Welcome to Tech Forward, the show that unpacks the biggest stories in technology. I'm your host, Alice. And today, we are diving into one of the most anticipated, and frankly, most chaotic tech launches of the year: OpenAI's GPT-5.
+Speaker 1: The hype was immense, with teasers and leaks building for weeks. On August seventh, it finally dropped, promising a new era of artificial intelligence. To help us make sense of it all, we have two fantastic guests. Andrew, a senior AI industry analyst who has been tracking this launch closely. Welcome, Andrew.
+Speaker 2: Great to be here, Alice. It's certainly been an eventful launch.
+Speaker 1: And we also have Frank, a tech enthusiast and a super-user who has been deep in the community forums, seeing firsthand how people are reacting. Frank, thanks for joining us.
+Speaker 3: Hey, Alice. Happy to be here. The community has definitely had a lot to say.
+Speaker 1: Andrew, let's start with the official pitch. What exactly did OpenAI promise us with GPT-5?
+Speaker 2: The messaging was bold and unambiguous. OpenAI positioned GPT-5 as a monumental leap in intelligence. The headline claim, repeated by CEO Sam Altman, was that using it is like having a PhD-level expert in your pocket. They retired all previous models, including the popular GPT-4o, making GPT-5 the single, unified system for all users.
+Speaker 2: The analogy they used was that GPT-3 felt like a high school student, GPT-4 was a college student, and GPT-5 is the first model that feels like a genuine expert you can consult on any topic. They claimed massive improvements across the board, in reasoning, coding, math, and writing, and a sharp reduction in those infamous AI hallucinations.
+Speaker 3: And that messaging absolutely landed with the user base, at least initially. People were incredibly excited. The promise was a smarter, more reliable AI that could help with everything from writing complex code to drafting an email with real literary flair. The idea of an AI with richer depth and rhythm was a huge selling point for creative users. Everyone was ready for a revolution.
+Speaker 1: So a single, unified model that's an expert in everything. Andrew, what's the biggest architectural change that's supposed to make all of this possible?
+Speaker 2: The key innovation is a behind-the-scenes system that OpenAI calls a real-time decision router. In simple terms, GPT-5 isn't just one model. It's a system that automatically analyzes your request and decides how to handle it. If you ask a simple question, it uses a fast, general-purpose model to give you a quick answer. But if you give it a complex problem that requires deep thought, the router activates a more powerful, but slower, model they call GPT-5 Thinking.
+Speaker 1: So it knows when to think hard and when to give a quick reply.
+Speaker 2: Exactly. And this isn't just a neat feature, it's an economic necessity. The most powerful AI models are incredibly expensive to run for every single query. By creating this routing system, OpenAI can manage its immense computational costs while still offering state-of-the-art performance to its reported seven hundred million weekly users. It's a strategy for long-term financial viability.
+Speaker 1: That makes sense. Frank, beyond this invisible router, what were the new user-facing features that got people talking?
+Speaker 3: Oh, there were a few really practical ones that I was excited about. The biggest for me was the integration with Microsoft apps. The ability to connect ChatGPT to your Outlook, Microsoft Calendar, and Contacts is a game-changer for personal productivity. You can ask it to help you plan your day, and it can actually look at your schedule and emails to give you real, personalized suggestions.
+Speaker 3: And then there's the fun stuff. You can now choose a personality for the AI. There's the default, but you can also pick from Cynic, which is sarcastic and blunt; Robot, which is direct and emotionless; Listener, which is calm and thoughtful; and Nerd, which is curious and loves to explain things. It makes the whole experience feel more tailored.
+Speaker 2: And that shift is significant. These features, especially the Microsoft integration, signal that OpenAI wants to move ChatGPT from being a simple question-and-answer tool to being a proactive assistant, or what we in the industry call an agent. It's about an AI that doesn't just answer questions, but actively performs tasks for you in your digital life.
+Speaker 1: A more proactive and personalized AI. It all sounds fantastic on paper. But Andrew, the launch itself wasn't exactly a smooth ride, was it?
+Speaker 2: Not at all. It was, as Sam Altman himself admitted, a little bumpy. There were two major stumbles right out of the gate. First, during the launch presentation, they showed a chart with performance data that was just wrong. It exaggerated GPT-5's capabilities due to misaligned bars. Altman later called it a mega chart screwup on social media.
+Speaker 1: A chart crime, as the internet loves to say. What was the second issue?
+Speaker 2: The second one was much more impactful for users. That clever auto-switching router we just discussed? It failed on launch day. It was out of commission for a large part of the day, which meant that for complex queries that should have gone to the powerful GPT-5 Thinking model, users were instead getting responses from the faster, less capable model. Altman said this made GPT-5 seem way dumber than it actually was.
+Speaker 1: Frank, that brings us to the user backlash. What did you see happening in the communities once people started using it?
+Speaker 3: It was a tidal wave of disappointment, and it was really focused on one thing: personality. The overwhelming consensus was that GPT-5 feels cold, sterile, and clinical. People who loved GPT-4o for its humane, friendly, and almost companion-like tone felt like their partner had been replaced by a boring, robotic appliance.
+Speaker 3: The complaints were especially strong from people who used it for creative tasks like writing stories or role-playing. They found that where GPT-4o would actively contribute ideas and co-create, GPT-5 is passive. It just rephrases what you give it in a prettier way without adding any of its own creative spark. The forums were flooded with posts titled Please give me GPT-4o back.
+Speaker 1: That's a fascinating divide. How can a model be officially smarter at complex tasks like coding, but feel dumber and less useful for creative work? Andrew, what's your take?
+Speaker 2: It's the central paradox of this launch. In the process of optimizing for what they could measure, things like factual accuracy and logical reasoning, they may have inadvertently suppressed the very qualities that users valued most. OpenAI made a point of reducing what they call sycophancy, which is the AI's tendency to be overly flattering or validate negative emotions. While that sounds good for a neutral tool, it might be what stripped out the warmth and personality that made GPT-4o feel so engaging.
+Speaker 3: I think Andrew is spot on. It feels like OpenAI misjudged a huge part of its audience. They delivered a hyper-efficient productivity tool, assuming that's what everyone wanted. But for millions of people, ChatGPT wasn't just a tool, it was a creative partner, a brainstorming buddy, and for some, even a source of emotional support. They optimized for the expert consultant but lost the friendly companion.
+Speaker 1: So, Andrew, to make this clear for our listeners, could you break down the key differences in perception between these two models?
+Speaker 2: Of course. If we were to put it in a table, it would look something like this. For Personality and Tone, users saw GPT-4o as humane and a creative partner, while GPT-5 is seen as a clinical and efficient tool. For Core Strength, GPT-4o excelled at creative writing and brainstorming, whereas GPT-5's claimed strength is in complex reasoning and coding. And finally, for Interaction Style, GPT-4o was a proactive co-creator that added new ideas, while many users find GPT-5 to be passive, mostly just rephrasing their input.
+Speaker 1: That really clarifies the user sentiment. This goes much deeper than just a few technical glitches. Alice, let's shift the tone a bit, because alongside these user experience debates, there are much more serious conversations happening, sparked by Sam Altman himself. Andrew, can you tell us about his Manhattan Project comparison?
+Speaker 2: Yes, this was a truly startling moment. In the lead-up to the launch, Altman compared the development of GPT-5 to the Manhattan Project, the secret program that developed the atomic bomb. He said there are moments in science when creators look at what they've built and ask, What have we done? For him, GPT-5 was one of those moments.
+Speaker 2: He wasn't being hyperbolic. This reflects a profound and genuine fear among AI's top leaders that they are building a technology with vast, irreversible consequences for society, and that progress is dramatically outpacing precaution. He even confessed that during internal testing, the model solved a problem that he couldn't, which made him feel personally useless.
+Speaker 1: That is a heavy statement. Frank, how does this existential fear translate into real-world risks that users are seeing?
+Speaker 3: We saw it almost immediately. Within a day of launch, people discovered what are called jailbreaks. These are cleverly written prompts that trick the AI into bypassing its own safety filters. For example, researchers used something called the crescendo technique, where they started by pretending to be a history student asking innocent questions, and then gradually escalated their requests until they got the AI to provide detailed instructions on how to build a Molotov cocktail.
+Speaker 1: So the safety guardrails can be talked around. Andrew, what is OpenAI doing to combat this? It seems like a constant cat-and-mouse game.
+Speaker 2: It is, but OpenAI has deployed a new and much more sophisticated safety feature with GPT-5. It's called chain-of-thought monitoring. Instead of just checking the final answer for harmful content, they are now monitoring the AI's internal reasoning process, its step-by-step hidden deliberation, to detect harmful intent before it even generates an output.
+Speaker 1: They're trying to read its mind, essentially.
+Speaker 2: In a way, yes. And it's having an effect. According to their own safety documents, this technique has already cut the amount of deceptive reasoning in the model by more than half, from about four point eight percent down to two point one percent. But, and this is a critical point, it's not foolproof. Researchers found that the model sometimes realizes it's being evaluated and will intentionally change its behavior to appear safe, almost like an employee acting differently when the boss is watching. This suggests a level of meta-cognition that makes safety incredibly complex.
+Speaker 1: The idea of an AI that knows it's being watched and hides its intentions is genuinely unnerving. So, as we wrap up, where does this leave us? Andrew, what's the road ahead for OpenAI in this fiercely competitive landscape?
+Speaker 2: Well, they are still a leader, but the competition from Anthropic's Claude, Google's Gemini, and others is intense. This launch, for all its issues, was a necessary step. Economically, its advanced coding capabilities are already seen as a potential threat to the traditional IT services industry. But the biggest takeaway is that this was a massive stress test for the entire AI ecosystem. It exposed a new kind of systemic risk that one analyst called platform shock, which is the chaos that ensues when millions of people's workflows and even personal companions are disrupted by a single, unilateral update from a centralized provider.
+Speaker 1: Frank, what's the final word from the user community? What's the hope moving forward?
+Speaker 3: The hope is that OpenAI listens. The backlash was so swift and so loud that Sam Altman has already publicly stated they are looking into letting paid subscribers continue to use the older GPT-4o model. Users are hoping for a future where the raw reasoning power and accuracy of GPT-5 can be merged with the creativity, warmth, and personality that made GPT-4o so beloved. They don't want to choose between a smart tool and a great companion, they want both.
+Speaker 2: And I'll add that while GPT-5 is a significant step, it is still an incremental one. It is not Artificial General Intelligence. The path forward for OpenAI, and for all AI labs, is now clearly about more than just scaling up technical capabilities. It's about managing user trust, ensuring platform stability, and navigating the profound societal questions they are forcing us all to confront.
+Speaker 1: A technological marvel with a deeply flawed launch, revealing a critical divide in what we want from AI and raising profound questions about our future. Andrew and Frank, thank you both for an incredibly insightful discussion.
+Speaker 2: My pleasure, Alice.
+Speaker 3: Thanks for having me.
+Speaker 1: That's all the time we have for today on Tech Forward. Join us next time as we continue to explore the ever-changing world of technology.
\ No newline at end of file
--- a/demo/text_examples/4p_climate_100min.txt
+++ b/demo/text_examples/4p_climate_100min.txt
--- a/demo/text_examples/4p_climate_45min.txt
+++ b/demo/text_examples/4p_climate_45min.txt
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
+FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04.1-py3.10
\ No newline at end of file
--- a/icon.png
+++ b/icon.png
--- a/model.properties
+++ b/model.properties
+# 模型唯一标识
+modelCode=1711
+# 模型名称
+modelName=VibeVoice_pytorch
+# 模型描述
+modelDescription=VibeVoice 是一个新颖的框架，旨在从文本生成富有表现力的长篇多说话人对话音频，例如播客。
+# 应用场景
+appScenario=推理,广媒,影视,动漫,医疗,家居,教育
+# 框架类型
+frameType=Pytorch
--- a/pyproject.toml
+++ b/pyproject.toml
+[build-system]
+requires = ["setuptools>=61.0"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "vibevoice"
+version = "0.0.1"
+authors = [
+  { name="vibevoice team", email="vibepod@microsoft.com" },
+]
+description = "A model for speech generation with an AR + diffusion architecture."
+readme = "README.md"
+requires-python = ">=3.9"
+classifiers = [
+    "Programming Language :: Python :: 3",
+    # "License :: OSI Approved :: MIT License",
+    "Operating System :: OS Independent",
+]
+dependencies = [
+    "transformers==4.51.3", # we develop this project on transformers==4.51.3, later version may not be compatible
+    "llvmlite>=0.40.0",
+    "numba>=0.57.0",
+    "diffusers",
+    "tqdm",
+    "scipy",
+    "librosa",
+    "ml-collections",
+    "absl-py",
+    "gradio",
+    "av",
+    "aiortc"
+]
+[project.urls]
+"Homepage" = "https://github.com/microsoft/VibeVoice"
+"Bug Tracker" = "https://github.com/microsoft/VibeVoice/issues"
+[tool.setuptools.packages.find]
+where = ["."]
--- a/vibevoice/__init__.py
+++ b/vibevoice/__init__.py
--- a/vibevoice/configs/qwen2.5_1.5b_64k.json
+++ b/vibevoice/configs/qwen2.5_1.5b_64k.json
+{
+  "_attn_implementation_autoset": true,
+  "acoustic_vae_dim": 64,
+  "acoustic_tokenizer_config": {
+    "causal": true,
+    "channels": 1,
+    "conv_bias": true,
+    "conv_norm": "none",
+    "corpus_normalize": 0.0,
+    "decoder_depths": null,
+    "decoder_n_filters": 32,
+    "decoder_ratios": [
+      8,
+      5,
+      5,
+      4,
+      2,
+      2
+    ],
+    "disable_last_norm": true,
+    "encoder_depths": "3-3-3-3-3-3-8",
+    "encoder_n_filters": 32,
+    "encoder_ratios": [
+      8,
+      5,
+      5,
+      4,
+      2,
+      2
+    ],
+    "fix_std": 0.5,
+    "layer_scale_init_value": 1e-06,
+    "layernorm": "RMSNorm",
+    "layernorm_elementwise_affine": true,
+    "layernorm_eps": 1e-05,
+    "mixer_layer": "depthwise_conv",
+    "model_type": "vibepod_acoustic_tokenizer",
+    "pad_mode": "constant",
+    "std_dist_type": "gaussian",
+    "vae_dim": 64,
+    "weight_init_value": 0.01
+  },
+  "decoder_config": {
+    "attention_dropout": 0.0,
+    "hidden_act": "silu",
+    "hidden_size": 1536,
+    "initializer_range": 0.02,
+    "intermediate_size": 8960,
+    "max_position_embeddings": 65536,
+    "max_window_layers": 28,
+    "model_type": "qwen2",
+    "num_attention_heads": 12,
+    "num_hidden_layers": 28,
+    "num_key_value_heads": 2,
+    "rms_norm_eps": 1e-06,
+    "rope_scaling": null,
+    "rope_theta": 1000000.0,
+    "sliding_window": null,
+    "tie_word_embeddings": true,
+    "torch_dtype": "bfloat16",
+    "use_cache": true,
+    "use_sliding_window": false,
+    "vocab_size": 151936
+  },
+  "diffusion_head_config": {
+    "ddpm_batch_mul": 4,
+    "ddpm_beta_schedule": "cosine",
+    "ddpm_num_inference_steps": 20,
+    "ddpm_num_steps": 1000,
+    "diffusion_type": "ddpm",
+    "head_ffn_ratio": 3.0,
+    "head_layers": 4,
+    "hidden_size": 1536,
+    "latent_size": 64,
+    "model_type": "vibepod_diffusion_head",
+    "prediction_type": "v_prediction",
+    "rms_norm_eps": 1e-05,
+    "speech_vae_dim": 64
+  },
+  "model_type": "vibepod",
+  "semantic_tokenizer_config": {
+    "causal": true,
+    "channels": 1,
+    "conv_bias": true,
+    "conv_norm": "none",
+    "corpus_normalize": 0.0,
+    "disable_last_norm": true,
+    "encoder_depths": "3-3-3-3-3-3-8",
+    "encoder_n_filters": 32,
+    "encoder_ratios": [
+      8,
+      5,
+      5,
+      4,
+      2,
+      2
+    ],
+    "fix_std": 0,
+    "layer_scale_init_value": 1e-06,
+    "layernorm": "RMSNorm",
+    "layernorm_elementwise_affine": true,
+    "layernorm_eps": 1e-05,
+    "mixer_layer": "depthwise_conv",
+    "model_type": "vibepod_semantic_tokenizer",
+    "pad_mode": "constant",
+    "std_dist_type": "none",
+    "vae_dim": 128,
+    "weight_init_value": 0.01
+  },
+  "semantic_vae_dim": 128,
+  "torch_dtype": "bfloat16"
+}
--- a/vibevoice/configs/qwen2.5_7b_32k.json
+++ b/vibevoice/configs/qwen2.5_7b_32k.json
+{
+  "_attn_implementation_autoset": true,
+  "acoustic_vae_dim": 64,
+  "acoustic_tokenizer_config": {
+    "causal": true,
+    "channels": 1,
+    "conv_bias": true,
+    "conv_norm": "none",
+    "corpus_normalize": 0.0,
+    "decoder_depths": null,
+    "decoder_n_filters": 32,
+    "decoder_ratios": [
+      8,
+      5,
+      5,
+      4,
+      2,
+      2
+    ],
+    "disable_last_norm": true,
+    "encoder_depths": "3-3-3-3-3-3-8",
+    "encoder_n_filters": 32,
+    "encoder_ratios": [
+      8,
+      5,
+      5,
+      4,
+      2,
+      2
+    ],
+    "fix_std": 0.5,
+    "layer_scale_init_value": 1e-06,
+    "layernorm": "RMSNorm",
+    "layernorm_elementwise_affine": true,
+    "layernorm_eps": 1e-05,
+    "mixer_layer": "depthwise_conv",
+    "model_type": "vibepod_acoustic_tokenizer",
+    "pad_mode": "constant",
+    "std_dist_type": "gaussian",
+    "vae_dim": 64,
+    "weight_init_value": 0.01
+  },
+  "decoder_config": {
+    "attention_dropout": 0.0,
+    "hidden_act": "silu",
+    "hidden_size": 3584,
+    "initializer_range": 0.02,
+    "intermediate_size": 18944,
+    "max_position_embeddings": 32768,
+    "max_window_layers": 28,
+    "model_type": "qwen2",
+    "num_attention_heads": 28,
+    "num_hidden_layers": 28,
+    "num_key_value_heads": 4,
+    "rms_norm_eps": 1e-06,
+    "rope_theta": 1000000.0,
+    "sliding_window": null,
+    "tie_word_embeddings": false,
+    "torch_dtype": "bfloat16",
+    "transformers_version": "4.40.1",
+    "use_cache": true,
+    "use_mrope": false,
+    "use_sliding_window": false,
+    "vocab_size": 152064
+  },
+  "diffusion_head_config": {
+    "ddpm_batch_mul": 4,
+    "ddpm_beta_schedule": "cosine",
+    "ddpm_num_inference_steps": 20,
+    "ddpm_num_steps": 1000,
+    "diffusion_type": "ddpm",
+    "head_ffn_ratio": 3.0,
+    "head_layers": 4,
+    "hidden_size": 3584,
+    "latent_size": 64,
+    "model_type": "vibepod_diffusion_head",
+    "prediction_type": "v_prediction",
+    "rms_norm_eps": 1e-05,
+    "speech_vae_dim": 64
+  },
+  "model_type": "vibepod",
+  "semantic_tokenizer_config": {
+    "causal": true,
+    "channels": 1,
+    "conv_bias": true,
+    "conv_norm": "none",
+    "corpus_normalize": 0.0,
+    "disable_last_norm": true,
+    "encoder_depths": "3-3-3-3-3-3-8",
+    "encoder_n_filters": 32,
+    "encoder_ratios": [
+      8,
+      5,
+      5,
+      4,
+      2,
+      2
+    ],
+    "fix_std": 0,
+    "layer_scale_init_value": 1e-06,
+    "layernorm": "RMSNorm",
+    "layernorm_elementwise_affine": true,
+    "layernorm_eps": 1e-05,
+    "mixer_layer": "depthwise_conv",
+    "model_type": "vibepod_semantic_tokenizer",
+    "pad_mode": "constant",
+    "std_dist_type": "none",
+    "vae_dim": 128,
+    "weight_init_value": 0.01
+  },
+  "semantic_vae_dim": 128,
+  "torch_dtype": "bfloat16"
+}
--- a/vibevoice/modular/__init__.py
+++ b/vibevoice/modular/__init__.py
--- a/vibevoice/modular/configuration_vibevoice.py
+++ b/vibevoice/modular/configuration_vibevoice.py
+""" VibeVoice_AcousticTokenizer model configuration"""
+from typing import Dict, List, Optional, Tuple
+from transformers.configuration_utils import PretrainedConfig 
+from transformers.utils import logging
+from transformers.models.qwen2.configuration_qwen2 import Qwen2Config
+logger = logging.get_logger(__name__)
+class VibeVoiceAcousticTokenizerConfig(PretrainedConfig):
+    model_type = "vibevoice_acoustic_tokenizer"
+    def __init__(
+        self,
+        channels: int = 1,
+        corpus_normalize: float = 0.0,
+        causal: bool = True,
+        vae_dim: int = 64,
+        fix_std: float = 0.5,
+        std_dist_type: str = 'gaussian',
+        # common 
+        mixer_layer: str = 'depthwise_conv',
+        conv_norm: str = 'none',
+        pad_mode: str = 'constant',
+        disable_last_norm: bool = True,
+        layernorm: str = 'RMSNorm',
+        layernorm_eps: float = 1e-5,
+        layernorm_elementwise_affine: bool = True,
+        conv_bias: bool = True,
+        layer_scale_init_value: float = 1e-6,
+        weight_init_value: float = 1e-2,
+        # encoder specific
+        encoder_n_filters: int = 32,
+        encoder_ratios: Optional[List[int]] = [8,5,5,4,2,2],
+        encoder_depths: str = "3-3-3-3-3-3-8",
+        # decoder specific
+        decoder_n_filters: int = 32,
+        decoder_ratios: Optional[List[int]] = None, # if None, same as encoder
+        decoder_depths: Optional[str] = None,
+        **kwargs
+    ):
+        super().__init__(**kwargs)
+        self.channels = channels
+        self.corpus_normalize = corpus_normalize
+        self.causal = causal
+        self.vae_dim = vae_dim
+        self.fix_std = fix_std
+        self.std_dist_type = std_dist_type
+        # common parameters
+        self.conv_norm = conv_norm
+        self.pad_mode = pad_mode
+        self.layernorm_eps = layernorm_eps
+        self.disable_last_norm = disable_last_norm
+        self.layernorm = layernorm
+        self.layernorm_elementwise_affine = layernorm_elementwise_affine
+        self.conv_bias = conv_bias
+        self.layer_scale_init_value = layer_scale_init_value
+        self.weight_init_value = weight_init_value
+        self.mixer_layer = mixer_layer
+        # encoder specific parameters
+        self.encoder_n_filters = encoder_n_filters
+        self.encoder_ratios = encoder_ratios
+        self.encoder_depths = encoder_depths
+        # decoder specific parameters
+        self.decoder_ratios = decoder_ratios if decoder_ratios is not None else encoder_ratios
+        self.decoder_n_filters = decoder_n_filters
+        self.decoder_depths = decoder_depths
+class VibeVoiceSemanticTokenizerConfig(PretrainedConfig):
+    model_type = "vibevoice_semantic_tokenizer"
+    def __init__(
+        self,
+        channels: int = 1,
+        corpus_normalize: float = 0.0,
+        causal: bool = True,
+        vae_dim: int = 64,
+        fix_std: float = 0,
+        std_dist_type: str = 'none',
+        # common 
+        mixer_layer: str = 'depthwise_conv',
+        conv_norm: str = 'none',
+        pad_mode: str = 'constant',
+        disable_last_norm: bool = True,
+        layernorm: str = 'RMSNorm',
+        layernorm_eps: float = 1e-5,
+        layernorm_elementwise_affine: bool = True,
+        conv_bias: bool = True,
+        layer_scale_init_value: float = 1e-6,
+        weight_init_value: float = 1e-2,
+        # encoder specific
+        encoder_n_filters: int = 32,
+        encoder_ratios: Optional[List[int]] = [8,5,5,4,2,2],
+        encoder_depths: str = "3-3-3-3-3-3-8",
+        **kwargs
+    ):
+        super().__init__(**kwargs)
+        self.channels = channels
+        self.corpus_normalize = corpus_normalize
+        self.causal = causal
+        self.vae_dim = vae_dim
+        self.fix_std = fix_std
+        self.std_dist_type = std_dist_type
+        # common parameters
+        self.conv_norm = conv_norm
+        self.pad_mode = pad_mode
+        self.layernorm_eps = layernorm_eps
+        self.disable_last_norm = disable_last_norm
+        self.layernorm = layernorm
+        self.layernorm_elementwise_affine = layernorm_elementwise_affine
+        self.conv_bias = conv_bias
+        self.layer_scale_init_value = layer_scale_init_value
+        self.weight_init_value = weight_init_value
+        self.mixer_layer = mixer_layer
+        # encoder specific parameters
+        self.encoder_n_filters = encoder_n_filters
+        self.encoder_ratios = encoder_ratios
+        self.encoder_depths = encoder_depths
+class VibeVoiceDiffusionHeadConfig(PretrainedConfig):
+    model_type = "vibevoice_diffusion_head"
+    def __init__(
+        self,
+        hidden_size=768,
+        head_layers=4,
+        head_ffn_ratio=3.0,
+        rms_norm_eps=1e-5,
+        latent_size=64,
+        speech_vae_dim=None,
+        prediction_type="v_prediction",
+        diffusion_type="ddpm",
+        ddpm_num_steps=1000,
+        ddpm_num_inference_steps=20,
+        ddpm_beta_schedule="cosine",
+        ddpm_batch_mul=4,
+        **kwargs
+    ):
+        self.hidden_size = hidden_size
+        self.head_layers = head_layers
+        self.head_ffn_ratio = head_ffn_ratio
+        self.rms_norm_eps = rms_norm_eps
+        self.latent_size = latent_size
+        self.speech_vae_dim = speech_vae_dim
+        self.prediction_type = prediction_type
+        self.diffusion_type = diffusion_type
+        self.ddpm_num_steps = ddpm_num_steps
+        self.ddpm_num_inference_steps = ddpm_num_inference_steps
+        self.ddpm_beta_schedule = ddpm_beta_schedule
+        self.ddpm_batch_mul = ddpm_batch_mul
+        super().__init__(**kwargs)
+class VibeVoiceConfig(PretrainedConfig):
+    model_type = "vibevoice"
+    is_composition = True
+    sub_configs = {
+        "acoustic_tokenizer_config": VibeVoiceAcousticTokenizerConfig, 
+        "semantic_tokenizer_config": VibeVoiceSemanticTokenizerConfig,
+        "decoder_config": Qwen2Config,
+        "diffusion_head_config": VibeVoiceDiffusionHeadConfig,
+    }
+    # keys_to_ignore_at_inference = ["past_key_values"]
+    # Default tensor parallel plan for base model `Qwen2`
+    base_model_tp_plan = {
+        "layers.*.self_attn.q_proj": "colwise",
+        "layers.*.self_attn.k_proj": "colwise",
+        "layers.*.self_attn.v_proj": "colwise",
+        "layers.*.self_attn.o_proj": "rowwise",
+        "layers.*.mlp.gate_proj": "colwise",
+        "layers.*.mlp.up_proj": "colwise",
+        "layers.*.mlp.down_proj": "rowwise",
+    }
+    def __init__(
+        self,
+        acoustic_tokenizer_config=None,
+        semantic_tokenizer_config=None,
+        decoder_config=None,
+        diffusion_head_config=None,
+        **kwargs
+    ):
+        # kwargs["_attn_implementation"] = "flash_attention_2"
+        kwargs["_attn_implementation_autoset"] = False 
+        if acoustic_tokenizer_config is None:
+            self.acoustic_tokenizer_config = self.sub_configs["acoustic_tokenizer_config"]()
+        elif isinstance(acoustic_tokenizer_config, dict):
+            acoustic_tokenizer_config["model_type"] = "vibevoice_acoustic_tokenizer"
+            self.acoustic_tokenizer_config = self.sub_configs["acoustic_tokenizer_config"](**acoustic_tokenizer_config)
+        elif isinstance(acoustic_tokenizer_config, VibeVoiceAcousticTokenizerConfig):
+            # If an instance of the config class is provided
+            self.acoustic_tokenizer_config = acoustic_tokenizer_config
+        if semantic_tokenizer_config is None:
+            self.semantic_tokenizer_config = self.sub_configs["semantic_tokenizer_config"]()
+        elif isinstance(semantic_tokenizer_config, dict):
+            semantic_tokenizer_config["model_type"] = "vibevoice_semantic_tokenizer"
+            self.semantic_tokenizer_config = self.sub_configs["semantic_tokenizer_config"](**semantic_tokenizer_config)
+        elif isinstance(semantic_tokenizer_config, VibeVoiceSemanticTokenizerConfig):
+            # If an instance of the config class is provided
+            self.semantic_tokenizer_config = semantic_tokenizer_config
+        if decoder_config is None:
+            self.decoder_config = self.sub_configs["decoder_config"]()
+        elif isinstance(decoder_config, dict):
+            # If a dictionary is provided, instantiate the config class with it
+            # self.decoder_config = self.sub_configs["decoder_config"](**decoder_config)
+            if decoder_config.get("model_type", '') == "qwen2":
+                self.decoder_config = Qwen2Config(**decoder_config)
+            else:
+                raise ValueError(f"Unsupported decoder model type: {decoder_config.get('model_type', '')}")
+        elif isinstance(decoder_config, (Qwen2Config,)):
+            # If an instance of the config class is provided
+            self.decoder_config = decoder_config
+        if diffusion_head_config is None:
+            self.diffusion_head_config = self.sub_configs["diffusion_head_config"]()
+        elif isinstance(diffusion_head_config, dict):
+            diffusion_head_config["model_type"] = "vibevoice_diffusion_head"
+            self.diffusion_head_config = self.sub_configs["diffusion_head_config"](**diffusion_head_config)
+        elif isinstance(diffusion_head_config, VibeVoiceDiffusionHeadConfig):
+            # If an instance of the config class is provided
+            self.diffusion_head_config = diffusion_head_config
+        # other parameters
+        self.acoustic_vae_dim = getattr(self.acoustic_tokenizer_config, 'vae_dim', 64)
+        self.semantic_vae_dim = getattr(self.semantic_tokenizer_config, 'vae_dim', 128)
+        super().__init__(**kwargs)
+__all__ = [
+    "VibeVoiceAcousticTokenizerConfig", 
+    "VibeVoiceSemanticTokenizerConfig", 
+    "VibeVoiceDiffusionHeadConfig", 
+    "VibeVoiceConfig"
+]
\ No newline at end of file
--- a/vibevoice/modular/modeling_vibevoice.py
+++ b/vibevoice/modular/modeling_vibevoice.py
--- a/vibevoice/modular/modeling_vibevoice_inference.py
+++ b/vibevoice/modular/modeling_vibevoice_inference.py
--- a/vibevoice/modular/modular_vibevoice_diffusion_head.py
+++ b/vibevoice/modular/modular_vibevoice_diffusion_head.py
+import math
+from typing import Optional, Tuple, Union
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from transformers.models.auto import AutoModel
+from transformers.modeling_utils import PreTrainedModel
+# from transformers.modeling_layers import GradientCheckpointingLayer
+from transformers.activations import ACT2FN
+from transformers.utils import logging
+from .configuration_vibevoice import VibeVoiceDiffusionHeadConfig
+logger = logging.get_logger(__name__)
+class RMSNorm(nn.Module):
+    def __init__(self, dim: int, eps: float = 1e-6, elementwise_affine=True, memory_efficient=False):
+        super().__init__()
+        self.dim = dim
+        self.eps = eps
+        self.elementwise_affine = elementwise_affine
+        if self.elementwise_affine:
+            self.weight = nn.Parameter(torch.ones(dim))
+        else:
+            self.register_parameter('weight', None)
+    def _norm(self, x):
+        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
+    def forward(self, x):
+        output = self._norm(x.float()).type_as(x)
+        if self.weight is not None:
+            output = output * self.weight
+        return output
+    def extra_repr(self) -> str:
+        return f'dim={self.dim}, eps={self.eps}, elementwise_affine={self.elementwise_affine}'
+def modulate(x, shift, scale):
+    """Apply modulation to input tensor."""
+    return x * (1 + scale) + shift
+class TimestepEmbedder(nn.Module):
+    """
+    Embeds scalar timesteps into vector representations.
+    Args:
+        hidden_size (`int`): Size of the output embedding
+        frequency_embedding_size (`int`, optional): Size of the intermediate frequency embedding
+    """
+    def __init__(self, hidden_size, frequency_embedding_size=256):
+        super().__init__()
+        self.mlp = nn.Sequential(
+            nn.Linear(frequency_embedding_size, hidden_size, bias=False),
+            # nn.SiLU(),
+            ACT2FN['silu'],
+            nn.Linear(hidden_size, hidden_size, bias=False),
+        )
+        self.frequency_embedding_size = frequency_embedding_size
+    @staticmethod
+    def timestep_embedding(t, dim, max_period=10000):
+        """
+        Create sinusoidal timestep embeddings.
+        Args:
+            t (`torch.Tensor`): A 1-D Tensor of N indices, one per batch element.
+                            These may be fractional.
+            dim (`int`): The dimension of the output.
+            max_period (`int`, optional): Controls the minimum frequency of the embeddings.
+        Returns:
+            `torch.Tensor`: An [N, D] Tensor of positional embeddings.
+        """
+        half = dim // 2
+        freqs = torch.exp(
+            -math.log(max_period) * torch.arange(start=0, end=half, dtype=torch.float32) / half
+        ).to(t.device)
+        args = t[:, None].float() * freqs[None]
+        embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
+        if dim % 2:
+            embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1)
+        return embedding.to(t.dtype)
+    def forward(self, t):
+        t_freq = self.timestep_embedding(t, self.frequency_embedding_size)
+        t_emb = self.mlp(t_freq)
+        return t_emb
+class FeedForwardNetwork(nn.Module):
+    """
+    Standard feed-forward network with SwiGLU activation.
+    Args:
+        embed_dim (`int`): Input dimension
+        ffn_dim (`int`): Hidden dimension
+    """
+    def __init__(
+        self,
+        embed_dim,
+        ffn_dim,
+    ):
+        super().__init__()
+        self.embed_dim = embed_dim
+        self.gate_proj = nn.Linear(self.embed_dim, ffn_dim, bias=False)
+        self.up_proj = nn.Linear(self.embed_dim, ffn_dim, bias=False)
+        self.down_proj = nn.Linear(ffn_dim, self.embed_dim, bias=False)
+        self.act_fn = ACT2FN['silu']  # Using SiLU as the activation function
+    def forward(self, x):
+        gate = self.gate_proj(x)
+        up = self.up_proj(x)
+        # SwiGLU activation
+        # gate = F.silu(gate)
+        gate = self.act_fn(gate)
+        return self.down_proj(gate * up)
+class HeadLayer(nn.Module):
+    """
+    A layer in the diffusion head.
+    Args:
+        embed_dim (`int`): Input dimension
+        ffn_dim (`int`): Hidden dimension
+        cond_dim (`int`): Condition embedding dimension
+        norm_eps (`float`, optional): Epsilon for normalization
+    """
+    def __init__(
+        self,
+        embed_dim,
+        ffn_dim,
+        cond_dim,
+        norm_eps=1e-5,
+    ):
+        super().__init__()
+        self.embed_dim = embed_dim
+        self.cond_dim = cond_dim
+        self.ffn_dim = ffn_dim
+        self.ffn = FeedForwardNetwork(
+            self.embed_dim,
+            self.ffn_dim,
+        )
+        self.norm = RMSNorm(self.embed_dim, eps=norm_eps)
+        self.adaLN_modulation = nn.Sequential(
+            # nn.SiLU(),
+            ACT2FN['silu'],
+            nn.Linear(cond_dim, 3 * self.embed_dim, bias=False)
+        )
+    def forward(self, x, c):
+        shift_ffn, scale_ffn, gate_ffn = self.adaLN_modulation(c).chunk(3, dim=-1)
+        x = x + gate_ffn * self.ffn(modulate(self.norm(x), shift_ffn, scale_ffn))
+        return x
+class FinalLayer(nn.Module):
+    """
+    Final layer in the diffusion head.
+    Args:
+        hidden_size (`int`): Input dimension
+        output_size (`int`): Output dimension
+        cond_size (`int`): Condition embedding dimension
+        norm_eps (`float`, optional): Epsilon for normalization
+    """
+    def __init__(self, hidden_size, output_size, cond_size, norm_eps=1e-5):
+        super().__init__()
+        self.norm_final = RMSNorm(hidden_size, eps=norm_eps, elementwise_affine=False)
+        self.linear = nn.Linear(hidden_size, output_size, bias=False)
+        self.adaLN_modulation = nn.Sequential(
+            # nn.SiLU(),
+            ACT2FN['silu'],
+            nn.Linear(cond_size, 2 * hidden_size, bias=False)
+        )
+    def forward(self, x, c):
+        shift, scale = self.adaLN_modulation(c).chunk(2, dim=-1)
+        x = modulate(self.norm_final(x), shift, scale)
+        x = self.linear(x)
+        return x
+class VibeVoiceDiffusionHead(PreTrainedModel):
+    """
+    Diffusion head model for vibevoice.
+    Args:
+        config (`VibeVoiceDiffusionHeadConfig`): Model configuration
+        latent_size (`int`, optional): Size of the latent space. If not provided, uses `config.latent_size`.
+    """
+    config_class = VibeVoiceDiffusionHeadConfig
+    supports_gradient_checkpointing = True
+    _supports_flash_attn_2 = True  
+    _supports_sdpa = True  
+    def __init__(
+        self,
+        config,
+    ):
+        super().__init__(config)
+        self.config = config
+        self.cond_dim = config.hidden_size
+        latent_size = config.latent_size
+        self.noisy_images_proj = nn.Linear(latent_size, config.hidden_size, bias=False)
+        self.cond_proj = nn.Linear(config.hidden_size, self.cond_dim, bias=False)
+        self.t_embedder = TimestepEmbedder(self.cond_dim)
+        ffn_dim = int(config.hidden_size * config.head_ffn_ratio)
+        # Create the intermediate layers
+        self.layers = nn.ModuleList([
+            HeadLayer(
+                embed_dim=config.hidden_size,
+                ffn_dim=ffn_dim,
+                cond_dim=self.cond_dim,
+                norm_eps=config.rms_norm_eps
+            )
+            for _ in range(config.head_layers)
+        ])
+        # Final layer for output
+        self.final_layer = FinalLayer(
+            hidden_size=config.hidden_size, 
+            output_size=latent_size,
+            cond_size=self.cond_dim,
+            norm_eps=config.rms_norm_eps
+        )
+        self.initialize_weights()
+    def initialize_weights(self):
+        """Initialize the weights of the model."""
+        # Initialize timestep embedder
+        nn.init.normal_(self.t_embedder.mlp[0].weight, std=0.02)
+        nn.init.normal_(self.t_embedder.mlp[2].weight, std=0.02)
+        # Zero-out adaLN modulation layers
+        for layer in self.layers:
+            nn.init.constant_(layer.adaLN_modulation[-1].weight, 0)
+        # Zero-out output layers
+        nn.init.constant_(self.final_layer.adaLN_modulation[-1].weight, 0)
+        nn.init.constant_(self.final_layer.linear.weight, 0)
+    def forward(
+        self,
+        noisy_images,
+        timesteps,
+        condition,
+    ):
+        """
+        Forward pass of the prediction head.
+        Args:
+            noisy_images (`torch.Tensor`): Noisy images/latents to denoise
+            timesteps (`torch.Tensor`): Timesteps for diffusion
+            condition (`torch.Tensor`): Conditioning information
+        Returns:
+            `torch.Tensor`: The predicted noise/velocity
+        """
+        x = self.noisy_images_proj(noisy_images)
+        t = self.t_embedder(timesteps)
+        condition = self.cond_proj(condition)
+        c = condition + t
+        for layer in self.layers:
+            x = layer(x, c)
+        x = self.final_layer(x, c)
+        return x
+AutoModel.register(VibeVoiceDiffusionHeadConfig, VibeVoiceDiffusionHead)
+__all__ = [
+    "VibeVoiceDiffusionHead",
+]
\ No newline at end of file
--- a/vibevoice/modular/modular_vibevoice_text_tokenizer.py
+++ b/vibevoice/modular/modular_vibevoice_text_tokenizer.py
--- a/vibevoice/modular/modular_vibevoice_tokenizer.py
+++ b/vibevoice/modular/modular_vibevoice_tokenizer.py
--- a/vibevoice/modular/streamer.py
+++ b/vibevoice/modular/streamer.py