<h1 align="center">
<br/>  
TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization 
<br/>
✨✨✨


</h1>

<div align="center">
  <img src="assets/tf_teaser.png" alt="TangoFlux" width="1000" />

<br/>

[![arXiv](https://img.shields.io/badge/Read_the_Paper-blue?link=https%3A%2F%2Fopenreview.net%2Fattachment%3Fid%3DtpJPlFTyxd%26name%3Dpdf)](https://arxiv.org/abs/2412.21037) [![Static Badge](https://img.shields.io/badge/TangoFlux-Huggingface-violet?logo=huggingface&link=https%3A%2F%2Fhuggingface.co%2Fdeclare-lab%2FTangoFlux)](https://huggingface.co/declare-lab/TangoFlux) [![Static Badge](https://img.shields.io/badge/Demos-declare--lab-brightred?style=flat)](https://tangoflux.github.io/) [![Static Badge](https://img.shields.io/badge/TangoFlux-Huggingface_Space-8A2BE2?logo=huggingface&link=https%3A%2F%2Fhuggingface.co%2Fspaces%2Fdeclare-lab%2FTangoFlux)](https://huggingface.co/spaces/declare-lab/TangoFlux) [![Static Badge](https://img.shields.io/badge/TangoFlux_Dataset-Huggingface-red?logo=huggingface&link=https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fdeclare-lab%2FTangoFlux)](https://huggingface.co/datasets/declare-lab/CRPO) [![Replicate](https://replicate.com/chenxwh/tangoflux/badge)](https://replicate.com/chenxwh/tangoflux)





</div>

## Quickstart on Google Colab

| Colab |
| --- |
| [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1j__4fl_BlaVS_225M34d-EKxsVDJPRiR?usp=sharing) |

## Overall Pipeline
TangoFlux consists of FluxTransformer blocks, which are Diffusion Transformers (DiT) and Multimodal Diffusion Transformers (MMDiT), conditioned on a textual prompt and a duration embedding to generate 44.1 kHz audio up to 30 seconds long. TangoFlux learns a rectified flow trajectory to an audio latent representation encoded by a variational autoencoder (VAE). The TangoFlux training pipeline consists of three stages: pre-training, fine-tuning, and preference optimization with CRPO. In particular, CRPO iteratively generates new synthetic data and constructs preference pairs for preference optimization using a DPO loss for flow matching.
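
To make the rectified flow idea concrete, here is a minimal, dependency-free sketch of a straight-line path between noise and a latent, its constant target velocity, and Euler sampling along it. It is not the TangoFlux implementation. The time direction (noise at `t=0`, latent at `t=1`), the helper names, the toy 3-dimensional "latent", and the oracle that stands in for the trained transformer are all assumptions made for illustration.

```python
import random

def interpolate(noise, latent, t):
    """Point on the straight path from noise (t=0) to latent (t=1)."""
    return [(1.0 - t) * n + t * x for n, x in zip(noise, latent)]

def target_velocity(noise, latent):
    """Constant velocity of the straight path; the regression target in this sketch."""
    return [x - n for n, x in zip(noise, latent)]

def euler_sample(velocity_fn, noise, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

random.seed(0)
latent = [0.5, -1.0, 2.0]  # hypothetical stand-in for a VAE audio latent
noise = [random.gauss(0.0, 1.0) for _ in latent]

# Oracle velocity: returns the true target instead of a learned prediction.
oracle = lambda x, t: target_velocity(noise, latent)
sample = euler_sample(oracle, noise, steps=50)
print(sample)
```

With a perfect velocity field the path is straight, so the Euler steps land on the latent regardless of the step count. A learned model only approximates that field, which is why the number of sampling steps can still affect quality.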

![cover-photo](assets/tangoflux.png)


🚀 **TangoFlux can generate up to 30 seconds of 44.1 kHz stereo audio in about 3 seconds.**

## Training TangoFlux
We use the `accelerate` package from Hugging Face for multi-GPU training. Run `accelerate config` from a terminal and set up your run configuration by answering the questions asked. We have placed the default accelerator config in the `configs` folder. Specify the path to your training files in `configs/tangoflux_config.yaml`. Samples of `train.json` and `val.json` are provided; replace them with your own audio data.

`tangoflux_config.yaml` defines the training file paths and model hyperparameters. Launch training with:
```bash
CUDA_VISIBLE_DEVICES=0,1 accelerate launch --config_file='configs/accelerator_config.yaml' src/train.py   --checkpointing_steps="best" --save_every=5 --config='configs/tangoflux_config.yaml'
```

To perform DPO training, modify the training files so that each data point contains the fields `chosen`, `reject`, `caption`, and `duration`. Specify the path to your training files in `configs/tangoflux_config.yaml`. An example is provided in `train_dpo.json`; replace it with your own audio data.

```bash
CUDA_VISIBLE_DEVICES=0,1 accelerate launch --config_file='configs/accelerator_config.yaml' src/train_dpo.py   --checkpointing_steps="best" --save_every=5 --config='configs/tangoflux_config.yaml'
```
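
As a rough illustration of the four DPO fields, the sketch below builds one preference record from several scored candidates for the same caption. Everything beyond the field names is an assumption: the helper names, the example paths and scores, the use of file paths for `chosen` and `reject`, and the rule of pairing the highest-scoring candidate against the lowest-scoring one. Check the provided `train_dpo.json` for the layout the training script actually expects.

```python
import json

REQUIRED_KEYS = ("chosen", "reject", "caption", "duration")

def make_preference_record(caption, duration, scored_candidates):
    """Build one record from (audio_path, score) pairs generated for one caption."""
    if len(scored_candidates) < 2:
        raise ValueError("need at least two candidates to form a preference pair")
    # Rank candidates from best to worst by their (hypothetical) alignment score.
    ranked = sorted(scored_candidates, key=lambda item: item[1], reverse=True)
    return {
        "chosen": ranked[0][0],   # highest-scoring candidate
        "reject": ranked[-1][0],  # lowest-scoring candidate
        "caption": caption,
        "duration": duration,
    }

def missing_keys(record):
    """Return the required fields that a record lacks, in a fixed order."""
    return [key for key in REQUIRED_KEYS if key not in record]

# Hypothetical candidates and scores, for illustration only.
candidates = [("out/a.wav", 0.41), ("out/b.wav", 0.52), ("out/c.wav", 0.33)]
record = make_preference_record("Hammer slowly hitting the wooden table", 10, candidates)
print(json.dumps(record, indent=2))
```

Running `missing_keys` over every record before launching a job is a cheap way to catch malformed data points early.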
## Inference with TangoFlux
Download the TangoFlux model and generate audio from a text prompt.
TangoFlux can generate audio up to 30 seconds long; pass a `duration` argument to `model.generate`. Note that `duration` should be strictly greater than 1 and less than 30.
```python
import torchaudio
from tangoflux import TangoFluxInference
from IPython.display import Audio

# Load the TangoFlux model (downloaded on first use)
model = TangoFluxInference(name='declare-lab/TangoFlux')
# Generate a 10-second clip with 50 sampling steps
audio = model.generate('Hammer slowly hitting the wooden table', steps=50, duration=10)

# Play the result in a notebook at the 44.1 kHz output rate
Audio(data=audio, rate=44100)
```
Our evaluation shows that inference with 50 steps yields the best results. CFG scales of 3.5, 4, and 4.5 yield similar output quality.
For faster inference, consider setting `steps` to 25, which yields similar audio quality.
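
The duration limits above can be checked before calling the model. The helper below is a hypothetical convenience, not part of the `tangoflux` package. It treats both bounds as exclusive, following the wording above; relax the upper bound if your installed version accepts exactly 30.

```python
def generation_kwargs(duration, steps=50):
    """Validate the settings and return keyword arguments for generation."""
    if not 1 < duration < 30:
        raise ValueError(
            f"duration must be strictly between 1 and 30 seconds, got {duration}"
        )
    if steps < 1:
        raise ValueError(f"steps must be a positive integer, got {steps}")
    return {"steps": int(steps), "duration": duration}

# Quality-oriented default (50 steps) and a faster setting (25 steps).
best = generation_kwargs(duration=10)
fast = generation_kwargs(duration=10, steps=25)
print(best, fast)

# Intended use with a loaded model (not executed here):
# audio = model.generate('Hammer slowly hitting the wooden table', **fast)
```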

## Evaluation Scripts

## TangoFlux vs. Other Audio Generation Models

The key comparison metrics include:

- **Output Length**: The duration of the generated audio (the **Duration** column).
- **FD**<sub>openl3</sub>: Fréchet Distance computed on OpenL3 embeddings (lower is better).
- **KL**<sub>passt</sub>: KL divergence computed with the PaSST classifier (lower is better).
- **CLAP**<sub>score</sub>: Alignment score between the text prompt and the generated audio (higher is better).
- **IS**: Inception Score (higher is better).


All inference times were measured on the same A40 GPU. The number of trainable parameters is reported in the **\#Params** column.

| Model                           | \#Params  | Duration | Steps | FD<sub>openl3</sub> ↓ | KL<sub>passt</sub> ↓ | CLAP<sub>score</sub> ↑ | IS ↑ | Inference Time (s) |
|---------------------------------|-----------|----------|-------|-----------------------|----------------------|------------------------|------|--------------------|
| **AudioLDM 2-large**            | 712M      | 10 sec   | 200   | 108.3                | 1.81                 | 0.419                  | 7.9  | 24.8               |
| **Stable Audio Open**           | 1056M     | 47 sec   | 100   | 89.2                 | 2.58                 | 0.291                  | 9.9  | 8.6                |
| **Tango 2**                     | 866M      | 10 sec   | 200   | 108.4                | **1.11**             | 0.447                  | 9.0  | 22.8               |
| **TangoFlux-base**              | **515M**  | 30 sec   | 50    | 80.2                 | 1.22                 | 0.431                  | 11.7 | **3.7**            |
| **TangoFlux**                   | **515M**  | 30 sec   | 50    | **75.1**             | 1.15                 | **0.480**              | **12.2** | **3.7**         |
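
One way to read the table is as a real-time factor: seconds of audio produced per second of inference. The sketch below derives it from the **Duration** and **Inference Time** columns. It assumes the reported time corresponds to generating a clip of the listed duration; the function and variable names are illustrative.

```python
# (model, generated audio in seconds, inference time in seconds), from the table above
ROWS = [
    ("AudioLDM 2-large", 10, 24.8),
    ("Stable Audio Open", 47, 8.6),
    ("Tango 2", 10, 22.8),
    ("TangoFlux", 30, 3.7),
]

def realtime_factor(audio_seconds, inference_seconds):
    """Seconds of audio generated per second of compute; above 1 is faster than real time."""
    return audio_seconds / inference_seconds

factors = {name: realtime_factor(audio, secs) for name, audio, secs in ROWS}
for name, factor in sorted(factors.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {factor:5.2f}x real time")
```

Under that assumption, TangoFlux produces roughly 8 seconds of audio per second of compute, while the 200-step models run slower than real time.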



## Citation

```bibtex
@misc{hung2024tangofluxsuperfastfaithful,
      title={TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization}, 
      author={Chia-Yu Hung and Navonil Majumder and Zhifeng Kong and Ambuj Mehrish and Rafael Valle and Bryan Catanzaro and Soujanya Poria},
      year={2024},
      eprint={2412.21037},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2412.21037}, 
}
```