Commit b4af4e0c authored by luopl's avatar luopl
Browse files

"Initial commit"

parents
Speaker 1: I heard there’s big news in TTS lately?
Speaker 2: Yes! Microsoft Research just open-sourced VibeVoice. The model can generate speech up to 90 minutes long, with smooth delivery and rich emotion — it’s absolutely amazing.
\ No newline at end of file
Speaker 1: 波奇酱你搁这儿呢啊! 虽然不知道你咋整的, 我还是买了一裤兜子甜水呢! 卧槽! 撩了的吉他小妹儿! 喜多, 你怎么搁这儿呢?
Speaker 2: 卧槽! 这谁啊?
Speaker 1: 别整那些没用的了!
\ No newline at end of file
Speaker 1: Welcome to Tech Forward, the show that unpacks the biggest stories in technology. I'm your host, Alice. And today, we are diving into one of the most anticipated, and frankly, most chaotic tech launches of the year: OpenAI's GPT-5.
Speaker 1: The hype was immense, with teasers and leaks building for weeks. On August seventh, it finally dropped, promising a new era of artificial intelligence. To help us make sense of it all, we have two fantastic guests. Andrew, a senior AI industry analyst who has been tracking this launch closely. Welcome, Andrew.
Speaker 2: Great to be here, Alice. It's certainly been an eventful launch.
Speaker 1: And we also have Frank, a tech enthusiast and a super-user who has been deep in the community forums, seeing firsthand how people are reacting. Frank, thanks for joining us.
Speaker 3: Hey, Alice. Happy to be here. The community has definitely had a lot to say.
Speaker 1: Andrew, let's start with the official pitch. What exactly did OpenAI promise us with GPT-5?
Speaker 2: The messaging was bold and unambiguous. OpenAI positioned GPT-5 as a monumental leap in intelligence. The headline claim, repeated by CEO Sam Altman, was that using it is like having a PhD-level expert in your pocket. They retired all previous models, including the popular GPT-4o, making GPT-5 the single, unified system for all users.
Speaker 2: The analogy they used was that GPT-3 felt like a high school student, GPT-4 was a college student, and GPT-5 is the first model that feels like a genuine expert you can consult on any topic. They claimed massive improvements across the board, in reasoning, coding, math, and writing, and a sharp reduction in those infamous AI hallucinations.
Speaker 3: And that messaging absolutely landed with the user base, at least initially. People were incredibly excited. The promise was a smarter, more reliable AI that could help with everything from writing complex code to drafting an email with real literary flair. The idea of an AI with richer depth and rhythm was a huge selling point for creative users. Everyone was ready for a revolution.
Speaker 1: So a single, unified model that's an expert in everything. Andrew, what's the biggest architectural change that's supposed to make all of this possible?
Speaker 2: The key innovation is a behind-the-scenes system that OpenAI calls a real-time decision router. In simple terms, GPT-5 isn't just one model. It's a system that automatically analyzes your request and decides how to handle it. If you ask a simple question, it uses a fast, general-purpose model to give you a quick answer. But if you give it a complex problem that requires deep thought, the router activates a more powerful, but slower, model they call GPT-5 Thinking.
Speaker 1: So it knows when to think hard and when to give a quick reply.
Speaker 2: Exactly. And this isn't just a neat feature, it's an economic necessity. The most powerful AI models are incredibly expensive to run for every single query. By creating this routing system, OpenAI can manage its immense computational costs while still offering state-of-the-art performance to its reported seven hundred million weekly users. It's a strategy for long-term financial viability.
Speaker 1: That makes sense. Frank, beyond this invisible router, what were the new user-facing features that got people talking?
Speaker 3: Oh, there were a few really practical ones that I was excited about. The biggest for me was the integration with Microsoft apps. The ability to connect ChatGPT to your Outlook, Microsoft Calendar, and Contacts is a game-changer for personal productivity. You can ask it to help you plan your day, and it can actually look at your schedule and emails to give you real, personalized suggestions.
Speaker 3: And then there's the fun stuff. You can now choose a personality for the AI. There's the default, but you can also pick from Cynic, which is sarcastic and blunt; Robot, which is direct and emotionless; Listener, which is calm and thoughtful; and Nerd, which is curious and loves to explain things. It makes the whole experience feel more tailored.
Speaker 2: And that shift is significant. These features, especially the Microsoft integration, signal that OpenAI wants to move ChatGPT from being a simple question-and-answer tool to being a proactive assistant, or what we in the industry call an agent. It's about an AI that doesn't just answer questions, but actively performs tasks for you in your digital life.
Speaker 1: A more proactive and personalized AI. It all sounds fantastic on paper. But Andrew, the launch itself wasn't exactly a smooth ride, was it?
Speaker 2: Not at all. It was, as Sam Altman himself admitted, a little bumpy. There were two major stumbles right out of the gate. First, during the launch presentation, they showed a chart with performance data that was just wrong. It exaggerated GPT-5's capabilities due to misaligned bars. Altman later called it a mega chart screwup on social media.
Speaker 1: A chart crime, as the internet loves to say. What was the second issue?
Speaker 2: The second one was much more impactful for users. That clever auto-switching router we just discussed? It failed on launch day. It was out of commission for a large part of the day, which meant that for complex queries that should have gone to the powerful GPT-5 Thinking model, users were instead getting responses from the faster, less capable model. Altman said this made GPT-5 seem way dumber than it actually was.
Speaker 1: Frank, that brings us to the user backlash. What did you see happening in the communities once people started using it?
Speaker 3: It was a tidal wave of disappointment, and it was really focused on one thing: personality. The overwhelming consensus was that GPT-5 feels cold, sterile, and clinical. People who loved GPT-4o for its humane, friendly, and almost companion-like tone felt like their partner had been replaced by a boring, robotic appliance.
Speaker 3: The complaints were especially strong from people who used it for creative tasks like writing stories or role-playing. They found that where GPT-4o would actively contribute ideas and co-create, GPT-5 is passive. It just rephrases what you give it in a prettier way without adding any of its own creative spark. The forums were flooded with posts titled Please give me GPT-4o back.
Speaker 1: That's a fascinating divide. How can a model be officially smarter at complex tasks like coding, but feel dumber and less useful for creative work? Andrew, what's your take?
Speaker 2: It's the central paradox of this launch. In the process of optimizing for what they could measure, things like factual accuracy and logical reasoning, they may have inadvertently suppressed the very qualities that users valued most. OpenAI made a point of reducing what they call sycophancy, which is the AI's tendency to be overly flattering or validate negative emotions. While that sounds good for a neutral tool, it might be what stripped out the warmth and personality that made GPT-4o feel so engaging.
Speaker 3: I think Andrew is spot on. It feels like OpenAI misjudged a huge part of its audience. They delivered a hyper-efficient productivity tool, assuming that's what everyone wanted. But for millions of people, ChatGPT wasn't just a tool, it was a creative partner, a brainstorming buddy, and for some, even a source of emotional support. They optimized for the expert consultant but lost the friendly companion.
Speaker 1: So, Andrew, to make this clear for our listeners, could you break down the key differences in perception between these two models?
Speaker 2: Of course. If we were to put it in a table, it would look something like this. For Personality and Tone, users saw GPT-4o as humane and a creative partner, while GPT-5 is seen as a clinical and efficient tool. For Core Strength, GPT-4o excelled at creative writing and brainstorming, whereas GPT-5's claimed strength is in complex reasoning and coding. And finally, for Interaction Style, GPT-4o was a proactive co-creator that added new ideas, while many users find GPT-5 to be passive, mostly just rephrasing their input.
Speaker 1: That really clarifies the user sentiment. This goes much deeper than just a few technical glitches. Alice, let's shift the tone a bit, because alongside these user experience debates, there are much more serious conversations happening, sparked by Sam Altman himself. Andrew, can you tell us about his Manhattan Project comparison?
Speaker 2: Yes, this was a truly startling moment. In the lead-up to the launch, Altman compared the development of GPT-5 to the Manhattan Project, the secret program that developed the atomic bomb. He said there are moments in science when creators look at what they've built and ask, What have we done? For him, GPT-5 was one of those moments.
Speaker 2: He wasn't being hyperbolic. This reflects a profound and genuine fear among AI's top leaders that they are building a technology with vast, irreversible consequences for society, and that progress is dramatically outpacing precaution. He even confessed that during internal testing, the model solved a problem that he couldn't, which made him feel personally useless.
Speaker 1: That is a heavy statement. Frank, how does this existential fear translate into real-world risks that users are seeing?
Speaker 3: We saw it almost immediately. Within a day of launch, people discovered what are called jailbreaks. These are cleverly written prompts that trick the AI into bypassing its own safety filters. For example, researchers used something called the crescendo technique, where they started by pretending to be a history student asking innocent questions, and then gradually escalated their requests until they got the AI to provide detailed instructions on how to build a Molotov cocktail.
Speaker 1: So the safety guardrails can be talked around. Andrew, what is OpenAI doing to combat this? It seems like a constant cat-and-mouse game.
Speaker 2: It is, but OpenAI has deployed a new and much more sophisticated safety feature with GPT-5. It's called chain-of-thought monitoring. Instead of just checking the final answer for harmful content, they are now monitoring the AI's internal reasoning process, its step-by-step hidden deliberation, to detect harmful intent before it even generates an output.
Speaker 1: They're trying to read its mind, essentially.
Speaker 2: In a way, yes. And it's having an effect. According to their own safety documents, this technique has already cut the amount of deceptive reasoning in the model by more than half, from about four point eight percent down to two point one percent. But, and this is a critical point, it's not foolproof. Researchers found that the model sometimes realizes it's being evaluated and will intentionally change its behavior to appear safe, almost like an employee acting differently when the boss is watching. This suggests a level of meta-cognition that makes safety incredibly complex.
Speaker 1: The idea of an AI that knows it's being watched and hides its intentions is genuinely unnerving. So, as we wrap up, where does this leave us? Andrew, what's the road ahead for OpenAI in this fiercely competitive landscape?
Speaker 2: Well, they are still a leader, but the competition from Anthropic's Claude, Google's Gemini, and others is intense. This launch, for all its issues, was a necessary step. Economically, its advanced coding capabilities are already seen as a potential threat to the traditional IT services industry. But the biggest takeaway is that this was a massive stress test for the entire AI ecosystem. It exposed a new kind of systemic risk that one analyst called platform shock, which is the chaos that ensues when millions of people's workflows and even personal companions are disrupted by a single, unilateral update from a centralized provider.
Speaker 1: Frank, what's the final word from the user community? What's the hope moving forward?
Speaker 3: The hope is that OpenAI listens. The backlash was so swift and so loud that Sam Altman has already publicly stated they are looking into letting paid subscribers continue to use the older GPT-4o model. Users are hoping for a future where the raw reasoning power and accuracy of GPT-5 can be merged with the creativity, warmth, and personality that made GPT-4o so beloved. They don't want to choose between a smart tool and a great companion, they want both.
Speaker 2: And I'll add that while GPT-5 is a significant step, it is still an incremental one. It is not Artificial General Intelligence. The path forward for OpenAI, and for all AI labs, is now clearly about more than just scaling up technical capabilities. It's about managing user trust, ensuring platform stability, and navigating the profound societal questions they are forcing us all to confront.
Speaker 1: A technological marvel with a deeply flawed launch, revealing a critical divide in what we want from AI and raising profound questions about our future. Andrew and Frank, thank you both for an incredibly insightful discussion.
Speaker 2: My pleasure, Alice.
Speaker 3: Thanks for having me.
Speaker 1: That's all the time we have for today on Tech Forward. Join us next time as we continue to explore the ever-changing world of technology.
\ No newline at end of file
This diff is collapsed.
This diff is collapsed.
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04.1-py3.10
\ No newline at end of file
icon.png

64.4 KB

# 模型唯一标识
modelCode=1711
# 模型名称
modelName=VibeVoice_pytorch
# 模型描述
modelDescription=VibeVoice 是一个新颖的框架,旨在从文本生成富有表现力的长篇多说话人对话音频,例如播客。
# 应用场景
appScenario=推理,广媒,影视,动漫,医疗,家居,教育
# 框架类型
frameType=Pytorch
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"
[project]
name = "vibevoice"
version = "0.0.1"
authors = [
{ name="vibevoice team", email="vibepod@microsoft.com" },
]
description = "A model for speech generation with an AR + diffusion architecture."
readme = "README.md"
requires-python = ">=3.9"
classifiers = [
"Programming Language :: Python :: 3",
# "License :: OSI Approved :: MIT License",
"Operating System :: OS Independent",
]
dependencies = [
"transformers==4.51.3", # we develop this project on transformers==4.51.3, later version may not be compatible
"llvmlite>=0.40.0",
"numba>=0.57.0",
"diffusers",
"tqdm",
"scipy",
"librosa",
"ml-collections",
"absl-py",
"gradio",
"av",
"aiortc"
]
[project.urls]
"Homepage" = "https://github.com/microsoft/VibeVoice"
"Bug Tracker" = "https://github.com/microsoft/VibeVoice/issues"
[tool.setuptools.packages.find]
where = ["."]
{
"_attn_implementation_autoset": true,
"acoustic_vae_dim": 64,
"acoustic_tokenizer_config": {
"causal": true,
"channels": 1,
"conv_bias": true,
"conv_norm": "none",
"corpus_normalize": 0.0,
"decoder_depths": null,
"decoder_n_filters": 32,
"decoder_ratios": [
8,
5,
5,
4,
2,
2
],
"disable_last_norm": true,
"encoder_depths": "3-3-3-3-3-3-8",
"encoder_n_filters": 32,
"encoder_ratios": [
8,
5,
5,
4,
2,
2
],
"fix_std": 0.5,
"layer_scale_init_value": 1e-06,
"layernorm": "RMSNorm",
"layernorm_elementwise_affine": true,
"layernorm_eps": 1e-05,
"mixer_layer": "depthwise_conv",
"model_type": "vibepod_acoustic_tokenizer",
"pad_mode": "constant",
"std_dist_type": "gaussian",
"vae_dim": 64,
"weight_init_value": 0.01
},
"decoder_config": {
"attention_dropout": 0.0,
"hidden_act": "silu",
"hidden_size": 1536,
"initializer_range": 0.02,
"intermediate_size": 8960,
"max_position_embeddings": 65536,
"max_window_layers": 28,
"model_type": "qwen2",
"num_attention_heads": 12,
"num_hidden_layers": 28,
"num_key_value_heads": 2,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": null,
"tie_word_embeddings": true,
"torch_dtype": "bfloat16",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 151936
},
"diffusion_head_config": {
"ddpm_batch_mul": 4,
"ddpm_beta_schedule": "cosine",
"ddpm_num_inference_steps": 20,
"ddpm_num_steps": 1000,
"diffusion_type": "ddpm",
"head_ffn_ratio": 3.0,
"head_layers": 4,
"hidden_size": 1536,
"latent_size": 64,
"model_type": "vibepod_diffusion_head",
"prediction_type": "v_prediction",
"rms_norm_eps": 1e-05,
"speech_vae_dim": 64
},
"model_type": "vibepod",
"semantic_tokenizer_config": {
"causal": true,
"channels": 1,
"conv_bias": true,
"conv_norm": "none",
"corpus_normalize": 0.0,
"disable_last_norm": true,
"encoder_depths": "3-3-3-3-3-3-8",
"encoder_n_filters": 32,
"encoder_ratios": [
8,
5,
5,
4,
2,
2
],
"fix_std": 0,
"layer_scale_init_value": 1e-06,
"layernorm": "RMSNorm",
"layernorm_elementwise_affine": true,
"layernorm_eps": 1e-05,
"mixer_layer": "depthwise_conv",
"model_type": "vibepod_semantic_tokenizer",
"pad_mode": "constant",
"std_dist_type": "none",
"vae_dim": 128,
"weight_init_value": 0.01
},
"semantic_vae_dim": 128,
"torch_dtype": "bfloat16"
}
{
"_attn_implementation_autoset": true,
"acoustic_vae_dim": 64,
"acoustic_tokenizer_config": {
"causal": true,
"channels": 1,
"conv_bias": true,
"conv_norm": "none",
"corpus_normalize": 0.0,
"decoder_depths": null,
"decoder_n_filters": 32,
"decoder_ratios": [
8,
5,
5,
4,
2,
2
],
"disable_last_norm": true,
"encoder_depths": "3-3-3-3-3-3-8",
"encoder_n_filters": 32,
"encoder_ratios": [
8,
5,
5,
4,
2,
2
],
"fix_std": 0.5,
"layer_scale_init_value": 1e-06,
"layernorm": "RMSNorm",
"layernorm_elementwise_affine": true,
"layernorm_eps": 1e-05,
"mixer_layer": "depthwise_conv",
"model_type": "vibepod_acoustic_tokenizer",
"pad_mode": "constant",
"std_dist_type": "gaussian",
"vae_dim": 64,
"weight_init_value": 0.01
},
"decoder_config": {
"attention_dropout": 0.0,
"hidden_act": "silu",
"hidden_size": 3584,
"initializer_range": 0.02,
"intermediate_size": 18944,
"max_position_embeddings": 32768,
"max_window_layers": 28,
"model_type": "qwen2",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_theta": 1000000.0,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.40.1",
"use_cache": true,
"use_mrope": false,
"use_sliding_window": false,
"vocab_size": 152064
},
"diffusion_head_config": {
"ddpm_batch_mul": 4,
"ddpm_beta_schedule": "cosine",
"ddpm_num_inference_steps": 20,
"ddpm_num_steps": 1000,
"diffusion_type": "ddpm",
"head_ffn_ratio": 3.0,
"head_layers": 4,
"hidden_size": 3584,
"latent_size": 64,
"model_type": "vibepod_diffusion_head",
"prediction_type": "v_prediction",
"rms_norm_eps": 1e-05,
"speech_vae_dim": 64
},
"model_type": "vibepod",
"semantic_tokenizer_config": {
"causal": true,
"channels": 1,
"conv_bias": true,
"conv_norm": "none",
"corpus_normalize": 0.0,
"disable_last_norm": true,
"encoder_depths": "3-3-3-3-3-3-8",
"encoder_n_filters": 32,
"encoder_ratios": [
8,
5,
5,
4,
2,
2
],
"fix_std": 0,
"layer_scale_init_value": 1e-06,
"layernorm": "RMSNorm",
"layernorm_elementwise_affine": true,
"layernorm_eps": 1e-05,
"mixer_layer": "depthwise_conv",
"model_type": "vibepod_semantic_tokenizer",
"pad_mode": "constant",
"std_dist_type": "none",
"vae_dim": 128,
"weight_init_value": 0.01
},
"semantic_vae_dim": 128,
"torch_dtype": "bfloat16"
}
""" VibeVoice_AcousticTokenizer model configuration"""
from typing import Dict, List, Optional, Tuple
from transformers.configuration_utils import PretrainedConfig
from transformers.utils import logging
from transformers.models.qwen2.configuration_qwen2 import Qwen2Config
logger = logging.get_logger(__name__)
class VibeVoiceAcousticTokenizerConfig(PretrainedConfig):
model_type = "vibevoice_acoustic_tokenizer"
def __init__(
self,
channels: int = 1,
corpus_normalize: float = 0.0,
causal: bool = True,
vae_dim: int = 64,
fix_std: float = 0.5,
std_dist_type: str = 'gaussian',
# common
mixer_layer: str = 'depthwise_conv',
conv_norm: str = 'none',
pad_mode: str = 'constant',
disable_last_norm: bool = True,
layernorm: str = 'RMSNorm',
layernorm_eps: float = 1e-5,
layernorm_elementwise_affine: bool = True,
conv_bias: bool = True,
layer_scale_init_value: float = 1e-6,
weight_init_value: float = 1e-2,
# encoder specific
encoder_n_filters: int = 32,
encoder_ratios: Optional[List[int]] = [8,5,5,4,2,2],
encoder_depths: str = "3-3-3-3-3-3-8",
# decoder specific
decoder_n_filters: int = 32,
decoder_ratios: Optional[List[int]] = None, # if None, same as encoder
decoder_depths: Optional[str] = None,
**kwargs
):
super().__init__(**kwargs)
self.channels = channels
self.corpus_normalize = corpus_normalize
self.causal = causal
self.vae_dim = vae_dim
self.fix_std = fix_std
self.std_dist_type = std_dist_type
# common parameters
self.conv_norm = conv_norm
self.pad_mode = pad_mode
self.layernorm_eps = layernorm_eps
self.disable_last_norm = disable_last_norm
self.layernorm = layernorm
self.layernorm_elementwise_affine = layernorm_elementwise_affine
self.conv_bias = conv_bias
self.layer_scale_init_value = layer_scale_init_value
self.weight_init_value = weight_init_value
self.mixer_layer = mixer_layer
# encoder specific parameters
self.encoder_n_filters = encoder_n_filters
self.encoder_ratios = encoder_ratios
self.encoder_depths = encoder_depths
# decoder specific parameters
self.decoder_ratios = decoder_ratios if decoder_ratios is not None else encoder_ratios
self.decoder_n_filters = decoder_n_filters
self.decoder_depths = decoder_depths
class VibeVoiceSemanticTokenizerConfig(PretrainedConfig):
model_type = "vibevoice_semantic_tokenizer"
def __init__(
self,
channels: int = 1,
corpus_normalize: float = 0.0,
causal: bool = True,
vae_dim: int = 64,
fix_std: float = 0,
std_dist_type: str = 'none',
# common
mixer_layer: str = 'depthwise_conv',
conv_norm: str = 'none',
pad_mode: str = 'constant',
disable_last_norm: bool = True,
layernorm: str = 'RMSNorm',
layernorm_eps: float = 1e-5,
layernorm_elementwise_affine: bool = True,
conv_bias: bool = True,
layer_scale_init_value: float = 1e-6,
weight_init_value: float = 1e-2,
# encoder specific
encoder_n_filters: int = 32,
encoder_ratios: Optional[List[int]] = [8,5,5,4,2,2],
encoder_depths: str = "3-3-3-3-3-3-8",
**kwargs
):
super().__init__(**kwargs)
self.channels = channels
self.corpus_normalize = corpus_normalize
self.causal = causal
self.vae_dim = vae_dim
self.fix_std = fix_std
self.std_dist_type = std_dist_type
# common parameters
self.conv_norm = conv_norm
self.pad_mode = pad_mode
self.layernorm_eps = layernorm_eps
self.disable_last_norm = disable_last_norm
self.layernorm = layernorm
self.layernorm_elementwise_affine = layernorm_elementwise_affine
self.conv_bias = conv_bias
self.layer_scale_init_value = layer_scale_init_value
self.weight_init_value = weight_init_value
self.mixer_layer = mixer_layer
# encoder specific parameters
self.encoder_n_filters = encoder_n_filters
self.encoder_ratios = encoder_ratios
self.encoder_depths = encoder_depths
class VibeVoiceDiffusionHeadConfig(PretrainedConfig):
model_type = "vibevoice_diffusion_head"
def __init__(
self,
hidden_size=768,
head_layers=4,
head_ffn_ratio=3.0,
rms_norm_eps=1e-5,
latent_size=64,
speech_vae_dim=None,
prediction_type="v_prediction",
diffusion_type="ddpm",
ddpm_num_steps=1000,
ddpm_num_inference_steps=20,
ddpm_beta_schedule="cosine",
ddpm_batch_mul=4,
**kwargs
):
self.hidden_size = hidden_size
self.head_layers = head_layers
self.head_ffn_ratio = head_ffn_ratio
self.rms_norm_eps = rms_norm_eps
self.latent_size = latent_size
self.speech_vae_dim = speech_vae_dim
self.prediction_type = prediction_type
self.diffusion_type = diffusion_type
self.ddpm_num_steps = ddpm_num_steps
self.ddpm_num_inference_steps = ddpm_num_inference_steps
self.ddpm_beta_schedule = ddpm_beta_schedule
self.ddpm_batch_mul = ddpm_batch_mul
super().__init__(**kwargs)
class VibeVoiceConfig(PretrainedConfig):
model_type = "vibevoice"
is_composition = True
sub_configs = {
"acoustic_tokenizer_config": VibeVoiceAcousticTokenizerConfig,
"semantic_tokenizer_config": VibeVoiceSemanticTokenizerConfig,
"decoder_config": Qwen2Config,
"diffusion_head_config": VibeVoiceDiffusionHeadConfig,
}
# keys_to_ignore_at_inference = ["past_key_values"]
# Default tensor parallel plan for base model `Qwen2`
base_model_tp_plan = {
"layers.*.self_attn.q_proj": "colwise",
"layers.*.self_attn.k_proj": "colwise",
"layers.*.self_attn.v_proj": "colwise",
"layers.*.self_attn.o_proj": "rowwise",
"layers.*.mlp.gate_proj": "colwise",
"layers.*.mlp.up_proj": "colwise",
"layers.*.mlp.down_proj": "rowwise",
}
def __init__(
self,
acoustic_tokenizer_config=None,
semantic_tokenizer_config=None,
decoder_config=None,
diffusion_head_config=None,
**kwargs
):
# kwargs["_attn_implementation"] = "flash_attention_2"
kwargs["_attn_implementation_autoset"] = False
if acoustic_tokenizer_config is None:
self.acoustic_tokenizer_config = self.sub_configs["acoustic_tokenizer_config"]()
elif isinstance(acoustic_tokenizer_config, dict):
acoustic_tokenizer_config["model_type"] = "vibevoice_acoustic_tokenizer"
self.acoustic_tokenizer_config = self.sub_configs["acoustic_tokenizer_config"](**acoustic_tokenizer_config)
elif isinstance(acoustic_tokenizer_config, VibeVoiceAcousticTokenizerConfig):
# If an instance of the config class is provided
self.acoustic_tokenizer_config = acoustic_tokenizer_config
if semantic_tokenizer_config is None:
self.semantic_tokenizer_config = self.sub_configs["semantic_tokenizer_config"]()
elif isinstance(semantic_tokenizer_config, dict):
semantic_tokenizer_config["model_type"] = "vibevoice_semantic_tokenizer"
self.semantic_tokenizer_config = self.sub_configs["semantic_tokenizer_config"](**semantic_tokenizer_config)
elif isinstance(semantic_tokenizer_config, VibeVoiceSemanticTokenizerConfig):
# If an instance of the config class is provided
self.semantic_tokenizer_config = semantic_tokenizer_config
if decoder_config is None:
self.decoder_config = self.sub_configs["decoder_config"]()
elif isinstance(decoder_config, dict):
# If a dictionary is provided, instantiate the config class with it
# self.decoder_config = self.sub_configs["decoder_config"](**decoder_config)
if decoder_config.get("model_type", '') == "qwen2":
self.decoder_config = Qwen2Config(**decoder_config)
else:
raise ValueError(f"Unsupported decoder model type: {decoder_config.get('model_type', '')}")
elif isinstance(decoder_config, (Qwen2Config,)):
# If an instance of the config class is provided
self.decoder_config = decoder_config
if diffusion_head_config is None:
self.diffusion_head_config = self.sub_configs["diffusion_head_config"]()
elif isinstance(diffusion_head_config, dict):
diffusion_head_config["model_type"] = "vibevoice_diffusion_head"
self.diffusion_head_config = self.sub_configs["diffusion_head_config"](**diffusion_head_config)
elif isinstance(diffusion_head_config, VibeVoiceDiffusionHeadConfig):
# If an instance of the config class is provided
self.diffusion_head_config = diffusion_head_config
# other parameters
self.acoustic_vae_dim = getattr(self.acoustic_tokenizer_config, 'vae_dim', 64)
self.semantic_vae_dim = getattr(self.semantic_tokenizer_config, 'vae_dim', 128)
super().__init__(**kwargs)
__all__ = [
"VibeVoiceAcousticTokenizerConfig",
"VibeVoiceSemanticTokenizerConfig",
"VibeVoiceDiffusionHeadConfig",
"VibeVoiceConfig"
]
\ No newline at end of file
This diff is collapsed.
This diff is collapsed.
import math
from typing import Optional, Tuple, Union
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers.models.auto import AutoModel
from transformers.modeling_utils import PreTrainedModel
# from transformers.modeling_layers import GradientCheckpointingLayer
from transformers.activations import ACT2FN
from transformers.utils import logging
from .configuration_vibevoice import VibeVoiceDiffusionHeadConfig
logger = logging.get_logger(__name__)
class RMSNorm(nn.Module):
def __init__(self, dim: int, eps: float = 1e-6, elementwise_affine=True, memory_efficient=False):
super().__init__()
self.dim = dim
self.eps = eps
self.elementwise_affine = elementwise_affine
if self.elementwise_affine:
self.weight = nn.Parameter(torch.ones(dim))
else:
self.register_parameter('weight', None)
def _norm(self, x):
return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
def forward(self, x):
output = self._norm(x.float()).type_as(x)
if self.weight is not None:
output = output * self.weight
return output
def extra_repr(self) -> str:
return f'dim={self.dim}, eps={self.eps}, elementwise_affine={self.elementwise_affine}'
def modulate(x, shift, scale):
"""Apply modulation to input tensor."""
return x * (1 + scale) + shift
class TimestepEmbedder(nn.Module):
"""
Embeds scalar timesteps into vector representations.
Args:
hidden_size (`int`): Size of the output embedding
frequency_embedding_size (`int`, optional): Size of the intermediate frequency embedding
"""
def __init__(self, hidden_size, frequency_embedding_size=256):
super().__init__()
self.mlp = nn.Sequential(
nn.Linear(frequency_embedding_size, hidden_size, bias=False),
# nn.SiLU(),
ACT2FN['silu'],
nn.Linear(hidden_size, hidden_size, bias=False),
)
self.frequency_embedding_size = frequency_embedding_size
@staticmethod
def timestep_embedding(t, dim, max_period=10000):
"""
Create sinusoidal timestep embeddings.
Args:
t (`torch.Tensor`): A 1-D Tensor of N indices, one per batch element.
These may be fractional.
dim (`int`): The dimension of the output.
max_period (`int`, optional): Controls the minimum frequency of the embeddings.
Returns:
`torch.Tensor`: An [N, D] Tensor of positional embeddings.
"""
half = dim // 2
freqs = torch.exp(
-math.log(max_period) * torch.arange(start=0, end=half, dtype=torch.float32) / half
).to(t.device)
args = t[:, None].float() * freqs[None]
embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
if dim % 2:
embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1)
return embedding.to(t.dtype)
def forward(self, t):
t_freq = self.timestep_embedding(t, self.frequency_embedding_size)
t_emb = self.mlp(t_freq)
return t_emb
class FeedForwardNetwork(nn.Module):
"""
Standard feed-forward network with SwiGLU activation.
Args:
embed_dim (`int`): Input dimension
ffn_dim (`int`): Hidden dimension
"""
def __init__(
self,
embed_dim,
ffn_dim,
):
super().__init__()
self.embed_dim = embed_dim
self.gate_proj = nn.Linear(self.embed_dim, ffn_dim, bias=False)
self.up_proj = nn.Linear(self.embed_dim, ffn_dim, bias=False)
self.down_proj = nn.Linear(ffn_dim, self.embed_dim, bias=False)
self.act_fn = ACT2FN['silu'] # Using SiLU as the activation function
def forward(self, x):
gate = self.gate_proj(x)
up = self.up_proj(x)
# SwiGLU activation
# gate = F.silu(gate)
gate = self.act_fn(gate)
return self.down_proj(gate * up)
class HeadLayer(nn.Module):
"""
A layer in the diffusion head.
Args:
embed_dim (`int`): Input dimension
ffn_dim (`int`): Hidden dimension
cond_dim (`int`): Condition embedding dimension
norm_eps (`float`, optional): Epsilon for normalization
"""
def __init__(
self,
embed_dim,
ffn_dim,
cond_dim,
norm_eps=1e-5,
):
super().__init__()
self.embed_dim = embed_dim
self.cond_dim = cond_dim
self.ffn_dim = ffn_dim
self.ffn = FeedForwardNetwork(
self.embed_dim,
self.ffn_dim,
)
self.norm = RMSNorm(self.embed_dim, eps=norm_eps)
self.adaLN_modulation = nn.Sequential(
# nn.SiLU(),
ACT2FN['silu'],
nn.Linear(cond_dim, 3 * self.embed_dim, bias=False)
)
def forward(self, x, c):
shift_ffn, scale_ffn, gate_ffn = self.adaLN_modulation(c).chunk(3, dim=-1)
x = x + gate_ffn * self.ffn(modulate(self.norm(x), shift_ffn, scale_ffn))
return x
class FinalLayer(nn.Module):
"""
Final layer in the diffusion head.
Args:
hidden_size (`int`): Input dimension
output_size (`int`): Output dimension
cond_size (`int`): Condition embedding dimension
norm_eps (`float`, optional): Epsilon for normalization
"""
def __init__(self, hidden_size, output_size, cond_size, norm_eps=1e-5):
super().__init__()
self.norm_final = RMSNorm(hidden_size, eps=norm_eps, elementwise_affine=False)
self.linear = nn.Linear(hidden_size, output_size, bias=False)
self.adaLN_modulation = nn.Sequential(
# nn.SiLU(),
ACT2FN['silu'],
nn.Linear(cond_size, 2 * hidden_size, bias=False)
)
def forward(self, x, c):
shift, scale = self.adaLN_modulation(c).chunk(2, dim=-1)
x = modulate(self.norm_final(x), shift, scale)
x = self.linear(x)
return x
class VibeVoiceDiffusionHead(PreTrainedModel):
"""
Diffusion head model for vibevoice.
Args:
config (`VibeVoiceDiffusionHeadConfig`): Model configuration
latent_size (`int`, optional): Size of the latent space. If not provided, uses `config.latent_size`.
"""
config_class = VibeVoiceDiffusionHeadConfig
supports_gradient_checkpointing = True
_supports_flash_attn_2 = True
_supports_sdpa = True
def __init__(
self,
config,
):
super().__init__(config)
self.config = config
self.cond_dim = config.hidden_size
latent_size = config.latent_size
self.noisy_images_proj = nn.Linear(latent_size, config.hidden_size, bias=False)
self.cond_proj = nn.Linear(config.hidden_size, self.cond_dim, bias=False)
self.t_embedder = TimestepEmbedder(self.cond_dim)
ffn_dim = int(config.hidden_size * config.head_ffn_ratio)
# Create the intermediate layers
self.layers = nn.ModuleList([
HeadLayer(
embed_dim=config.hidden_size,
ffn_dim=ffn_dim,
cond_dim=self.cond_dim,
norm_eps=config.rms_norm_eps
)
for _ in range(config.head_layers)
])
# Final layer for output
self.final_layer = FinalLayer(
hidden_size=config.hidden_size,
output_size=latent_size,
cond_size=self.cond_dim,
norm_eps=config.rms_norm_eps
)
self.initialize_weights()
def initialize_weights(self):
"""Initialize the weights of the model."""
# Initialize timestep embedder
nn.init.normal_(self.t_embedder.mlp[0].weight, std=0.02)
nn.init.normal_(self.t_embedder.mlp[2].weight, std=0.02)
# Zero-out adaLN modulation layers
for layer in self.layers:
nn.init.constant_(layer.adaLN_modulation[-1].weight, 0)
# Zero-out output layers
nn.init.constant_(self.final_layer.adaLN_modulation[-1].weight, 0)
nn.init.constant_(self.final_layer.linear.weight, 0)
def forward(
self,
noisy_images,
timesteps,
condition,
):
"""
Forward pass of the prediction head.
Args:
noisy_images (`torch.Tensor`): Noisy images/latents to denoise
timesteps (`torch.Tensor`): Timesteps for diffusion
condition (`torch.Tensor`): Conditioning information
Returns:
`torch.Tensor`: The predicted noise/velocity
"""
x = self.noisy_images_proj(noisy_images)
t = self.t_embedder(timesteps)
condition = self.cond_proj(condition)
c = condition + t
for layer in self.layers:
x = layer(x, c)
x = self.final_layer(x, c)
return x
AutoModel.register(VibeVoiceDiffusionHeadConfig, VibeVoiceDiffusionHead)
__all__ = [
"VibeVoiceDiffusionHead",
]
\ No newline at end of file
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment