<h1 align="center">
<br/>  
TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization 
<br/>
✨✨✨


</h1>

<div align="center">
  <img src="assets/tf_teaser.png" alt="TangoFlux" width="1000" />

<br/>

[![arXiv](https://img.shields.io/badge/Read_the_Paper-blue?link=https%3A%2F%2Fopenreview.net%2Fattachment%3Fid%3DtpJPlFTyxd%26name%3Dpdf)](https://arxiv.org/abs/2412.21037) [![Static Badge](https://img.shields.io/badge/TangoFlux-Huggingface-violet?logo=huggingface&link=https%3A%2F%2Fhuggingface.co%2Fdeclare-lab%2FTangoFlux)](https://huggingface.co/declare-lab/TangoFlux) [![Static Badge](https://img.shields.io/badge/Demos-declare--lab-brightred?style=flat)](https://tangoflux.github.io/) [![Static Badge](https://img.shields.io/badge/TangoFlux-Huggingface_Space-8A2BE2?logo=huggingface&link=https%3A%2F%2Fhuggingface.co%2Fspaces%2Fdeclare-lab%2FTangoFlux)](https://huggingface.co/spaces/declare-lab/TangoFlux) [![Static Badge](https://img.shields.io/badge/TangoFlux_Dataset-Huggingface-red?logo=huggingface&link=https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fdeclare-lab%2FTangoFlux)](https://huggingface.co/datasets/declare-lab/CRPO) [![Replicate](https://replicate.com/chenxwh/tangoflux/badge)](https://replicate.com/chenxwh/tangoflux)





</div>

## Quickstart on Google Colab

| Colab |
| --- |
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1j__4fl_BlaVS_225M34d-EKxsVDJPRiR?usp=sharing) 

## Overall Pipeline
TangoFlux consists of FluxTransformer blocks, which are Diffusion Transformers (DiT) and Multimodal Diffusion Transformers (MMDiT) conditioned on a textual prompt and a duration embedding, to generate 44.1 kHz audio up to 30 seconds long. TangoFlux learns a rectified flow trajectory to an audio latent representation encoded by a variational autoencoder (VAE). The TangoFlux training pipeline consists of three stages: pre-training, fine-tuning, and preference optimization with CRPO. CRPO, in particular, iteratively generates new synthetic data and constructs preference pairs for preference optimization using a DPO loss for flow matching.
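
The pair-construction step of CRPO can be illustrated with a minimal sketch. It assumes that each prompt has several generated candidates, that each candidate already carries a CLAP alignment score, and that the highest-scoring clip is treated as preferred and the lowest-scoring clip as rejected. The `Candidate` class and `build_preference_pair` function are hypothetical names for illustration and are not part of the TangoFlux codebase.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One generated audio clip for a prompt (hypothetical container)."""
    audio_id: str
    clap_score: float  # text-audio alignment score; higher is better


def build_preference_pair(candidates):
    """Rank candidates by CLAP score and return (winner, loser).

    Assumption: the best-scoring clip is the preferred sample and the
    worst-scoring clip is the rejected one.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(candidates, key=lambda c: c.clap_score, reverse=True)
    return ranked[0], ranked[-1]


# Made-up scores, for illustration only.
candidates = [
    Candidate("clip_a", 0.41),
    Candidate("clip_b", 0.52),
    Candidate("clip_c", 0.33),
]
winner, loser = build_preference_pair(candidates)
print(winner.audio_id, loser.audio_id)
```

In an iterative setup, pairs like these would be regenerated from the current model at each round before the next preference-optimization pass.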

![cover-photo](assets/tangoflux.png)


🚀 **TangoFlux can generate up to 30 seconds of 44.1 kHz stereo audio in about 3 seconds.**

## Training TangoFlux
We use the `accelerate` package from Hugging Face for multi-GPU training. Run `accelerate config` from a terminal and set up your run configuration by answering the questions asked. We have placed the default accelerator config in the `configs` folder. Please specify the path to your training files in `configs/tangoflux_config.yaml`. Samples of `train.json` and `val.json` have been provided. Replace them with your own audio.

`tangoflux_config.yaml` defines the training file paths and model hyperparameters. To launch training:
```bash
CUDA_VISIBLE_DEVICES=0,1 accelerate launch --config_file='configs/accelerator_config.yaml' src/train.py --checkpointing_steps="best" --save_every=5 --config='configs/tangoflux_config.yaml'
```
## Inference with TangoFlux
Download the TangoFlux model and generate audio from a text prompt.
TangoFlux can generate audio up to 30 seconds long by passing a `duration` argument to the `model.generate` function. Note that `duration` should be strictly greater than 1 and less than 30.
```python
import torchaudio
from tangoflux import TangoFluxInference
from IPython.display import Audio

# Download and load the TangoFlux model.
model = TangoFluxInference(name='declare-lab/TangoFlux')
# Generate a 10-second clip using 50 sampling steps.
audio = model.generate('Hammer slowly hitting the wooden table', steps=50, duration=10)

# Play the result in a notebook at 44.1 kHz.
Audio(data=audio, rate=44100)
```
Our evaluation shows that inference with 50 steps yields the best results. CFG scales of 3.5, 4, and 4.5 yield similar output quality.
For faster inference, consider setting `steps` to 25, which yields similar audio quality.
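
The constraints above can be checked before calling `model.generate`. The following is a minimal sketch; `check_generation_args` is a hypothetical helper written for illustration, not part of the `tangoflux` package. It only encodes the stated rule that `duration` must be strictly greater than 1 and less than 30, plus the assumption that `steps` is a positive integer.

```python
def check_generation_args(duration, steps=50):
    """Validate generation arguments (hypothetical helper, not a tangoflux API)."""
    # The README states that duration must be strictly between 1 and 30 seconds.
    if not (1 < duration < 30):
        raise ValueError(
            f"duration must be strictly between 1 and 30 seconds, got {duration}"
        )
    # Assumption: the number of sampling steps is a positive integer.
    if not isinstance(steps, int) or steps <= 0:
        raise ValueError(f"steps must be a positive integer, got {steps}")
    return {"duration": duration, "steps": steps}


# 25 steps trades a little quality for faster inference.
kwargs = check_generation_args(duration=10, steps=25)
print(kwargs)
```

The returned dictionary can then be unpacked into the call, for example `model.generate(prompt, **kwargs)`.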

## Evaluation Scripts

## TangoFlux vs. Other Audio Generation Models

The key comparison metrics include:

- **Output Length**: Represents the duration of the generated audio.
- **FD**<sub>openl3</sub>: Fréchet Distance.
- **KL**<sub>passt</sub>: KL divergence.
- **CLAP**<sub>score</sub>: Alignment score.


All inference times were measured on the same A40 GPU. The counts of trainable parameters are reported in the **\#Params** column.
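
As a rough way to compare speed across models that produce clips of different lengths, the reported inference time can be normalized by the output duration. The sketch below copies the **Duration** and **Inference Time (s)** values from the table below. The "seconds of audio per second of inference" figure is a derived quantity introduced here for illustration, not a metric reported by the authors.

```python
# (output duration in seconds, inference time in seconds), copied from the table.
table = {
    "AudioLDM 2-large": (10, 24.8),
    "Stable Audio Open": (47, 8.6),
    "Tango 2": (10, 22.8),
    "TangoFlux-base": (30, 3.7),
    "TangoFlux": (30, 3.7),
}

# Seconds of audio produced per second of inference (higher is faster).
throughput = {
    name: duration / seconds for name, (duration, seconds) in table.items()
}

# Print models from fastest to slowest by this derived measure.
for name, value in sorted(throughput.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.2f}")
```

With these numbers, TangoFlux produces roughly 8.1 seconds of audio per second of inference, compared with roughly 0.44 for Tango 2.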

| Model                           | \#Params  | Duration | Steps | FD<sub>openl3</sub> ↓ | KL<sub>passt</sub> ↓ | CLAP<sub>score</sub> ↑ | IS ↑ | Inference Time (s) |
|---------------------------------|-----------|----------|-------|-----------------------|----------------------|------------------------|------|--------------------|
| **AudioLDM 2-large**            | 712M      | 10 sec   | 200   | 108.3                | 1.81                 | 0.419                  | 7.9  | 24.8               |
| **Stable Audio Open**           | 1056M     | 47 sec   | 100   | 89.2                 | 2.58                 | 0.291                  | 9.9  | 8.6                |
| **Tango 2**                     | 866M      | 10 sec   | 200   | 108.4                | **1.11**             | 0.447                  | 9.0  | 22.8               |
| **TangoFlux-base**              | **515M**  | 30 sec   | 50    | 80.2                 | 1.22                 | 0.431                  | 11.7 | **3.7**            |
| **TangoFlux**                   | **515M**  | 30 sec   | 50    | **75.1**             | 1.15                 | **0.480**              | **12.2** | **3.7**         |



## Citation

```bibtex
@misc{hung2024tangofluxsuperfastfaithful,
      title={TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization}, 
      author={Chia-Yu Hung and Navonil Majumder and Zhifeng Kong and Ambuj Mehrish and Rafael Valle and Bryan Catanzaro and Soujanya Poria},
      year={2024},
      eprint={2412.21037},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2412.21037}, 
}
```