# TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization 

<div align="center">
  <img src="assets/tf_teaser.png" alt="TangoFlux" width="1000" />
  <br/>
  
  [![arXiv](https://img.shields.io/badge/Read_the_Paper-blue?link=https%3A%2F%2Fopenreview.net%2Fattachment%3Fid%3DtpJPlFTyxd%26name%3Dpdf)](https://arxiv.org/abs/2412.21037) [![Static Badge](https://img.shields.io/badge/TangoFlux-Hugging_Face-violet?logo=huggingface&link=https%3A%2F%2Fhuggingface.co%2Fdeclare-lab%2FTangoFlux)](https://huggingface.co/declare-lab/TangoFlux) [![Static Badge](https://img.shields.io/badge/Demos-declare--lab-brightred?style=flat)](https://tangoflux.github.io/) [![Static Badge](https://img.shields.io/badge/TangoFlux-Hugging_Face_Space-8A2BE2?logo=huggingface&link=https%3A%2F%2Fhuggingface.co%2Fspaces%2Fdeclare-lab%2FTangoFlux)](https://huggingface.co/spaces/declare-lab/TangoFlux) [![Static Badge](https://img.shields.io/badge/TangoFlux_Dataset-Hugging_Face-red?logo=huggingface&link=https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fdeclare-lab%2FTangoFlux)](https://huggingface.co/datasets/declare-lab/CRPO) [![Replicate](https://replicate.com/chenxwh/tangoflux/badge)](https://replicate.com/chenxwh/tangoflux)

</div>

## Demos

[![Hugging Face Space](https://img.shields.io/badge/Hugging_Face_Space-TangoFlux-blue?logo=huggingface&link=https%3A%2F%2Fhuggingface.co%2Fspaces%2Fdeclare-lab%2FTangoFlux)](https://huggingface.co/spaces/declare-lab/TangoFlux)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1j__4fl_BlaVS_225M34d-EKxsVDJPRiR?usp=sharing)

## Overall Pipeline

TangoFlux consists of FluxTransformer blocks, which are Diffusion Transformers (DiT) and Multimodal Diffusion Transformers (MMDiT) conditioned on a textual prompt and a duration embedding, to generate 44.1kHz audio up to 30 seconds long. TangoFlux learns a rectified flow trajectory to an audio latent representation encoded by a variational autoencoder (VAE). The TangoFlux training pipeline consists of three stages: pre-training, fine-tuning, and preference optimization with CRPO. CRPO, in particular, iteratively generates new synthetic data and constructs preference pairs for preference optimization using a DPO loss for flow matching.
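
The pair-construction step of CRPO can be illustrated with a minimal sketch. It assumes that, for each prompt, the candidate with the highest CLAP score becomes the preferred sample and the candidate with the lowest score becomes the rejected one; this README only states that pairs are built from a CLAP ranking, so treat that selection rule as an assumption. The `Candidate` class, the `build_preference_pair` helper, the file names, and the scores are all hypothetical and are not part of the TangoFlux package.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """One generated clip and its CLAP alignment score (illustrative values)."""
    path: str
    clap_score: float


def build_preference_pair(candidates: List[Candidate]) -> Tuple[Candidate, Candidate]:
    """Return (chosen, rejected) for a single prompt.

    Assumption: the best-scoring clip is preferred and the worst-scoring
    clip is rejected. The README only says that pairs are CLAP-ranked.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    # Sort from highest to lowest CLAP score.
    ranked = sorted(candidates, key=lambda c: c.clap_score, reverse=True)
    return ranked[0], ranked[-1]


# Hypothetical candidates generated for one prompt.
candidates = [
    Candidate("sample_0.wav", 0.41),
    Candidate("sample_1.wav", 0.52),
    Candidate("sample_2.wav", 0.33),
]
chosen, rejected = build_preference_pair(candidates)
print(chosen.path, rejected.path)
```

In an iterative setup, this step would be repeated on freshly generated candidates at each round before running preference optimization.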

![cover-photo](assets/tangoflux.png)

🚀 **TangoFlux can generate up to 30 seconds of 44.1kHz stereo audio in ~3 seconds on a single A40 GPU.**

## Installation

```bash
pip install git+https://github.com/declare-lab/TangoFlux
```

## Inference

TangoFlux can generate audio up to 30 seconds long. When using the Python API, you must pass a duration to the `model.generate` function. The duration must be between 1 and 30 seconds.
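
As a minimal sketch of the duration constraint, a caller-side guard might look like the following. The helpers `check_duration` and `expected_num_samples` are hypothetical and not part of the TangoFlux package, which may perform its own validation. The sample count assumes the 44.1kHz rate stated in this README.

```python
SAMPLE_RATE = 44_100   # Hz, the output rate stated in this README
MIN_DURATION_S = 1     # lower bound stated in this README
MAX_DURATION_S = 30    # upper bound stated in this README


def check_duration(duration: float) -> float:
    """Raise ValueError if the duration is outside the supported range."""
    if not MIN_DURATION_S <= duration <= MAX_DURATION_S:
        raise ValueError(
            f"duration must be between {MIN_DURATION_S} and "
            f"{MAX_DURATION_S} seconds, got {duration}"
        )
    return duration


def expected_num_samples(duration: float) -> int:
    """Number of samples per channel for a clip of the given length."""
    return int(round(check_duration(duration) * SAMPLE_RATE))


print(expected_num_samples(10))  # 441000 samples per channel
```

Validating before calling the model avoids spending GPU time on a request that falls outside the supported range.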

### Web Interface

Run the following command to start the web interface:

```bash
tangoflux-demo
```

### CLI

Use the CLI to generate audio from text.

```bash
tangoflux "Hammer slowly hitting the wooden table" output.wav --duration 10 --steps 50
```

### Python API

```python
import torchaudio
from tangoflux import TangoFluxInference

model = TangoFluxInference(name='declare-lab/TangoFlux')
audio = model.generate('Hammer slowly hitting the wooden table', steps=50, duration=10)

torchaudio.save('output.wav', audio, 44100)
```

Our evaluation shows that inference with 50 steps yields the best results. CFG scales of 3.5, 4, and 4.5 yield output of similar quality. Inference with 25 steps yields similar audio quality at a faster speed.

## Training

We use the `accelerate` package from Hugging Face for multi-GPU training. Run `accelerate config` to set up your run configuration. The default accelerate config is in the `configs` folder. Specify the path to your training files in `configs/tangoflux_config.yaml`. Samples of `train.json` and `val.json` have been provided. Replace them with your own audio.

`tangoflux_config.yaml` defines the training file paths and model hyperparameters. To start training, run:

```bash
CUDA_VISIBLE_DEVICES=0,1 accelerate launch --config_file='configs/accelerator_config.yaml' src/train.py   --checkpointing_steps="best" --save_every=5 --config='configs/tangoflux_config.yaml'
```

To perform DPO training, modify the training files such that each data point contains "chosen", "reject", "caption" and "duration" fields. Please specify the path to your training files in `configs/tangoflux_config.yaml`. An example has been provided in `train_dpo.json`. Replace it with your own audio.

```bash
CUDA_VISIBLE_DEVICES=0,1 accelerate launch --config_file='configs/accelerator_config.yaml' src/train_dpo.py   --checkpointing_steps="best" --save_every=5 --config='configs/tangoflux_config.yaml'
```
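
A minimal sketch of a DPO training record with the four fields named above follows. It assumes the training file is a JSON array of objects and that `chosen` and `reject` hold audio file paths; neither detail is confirmed here, so check the provided `train_dpo.json` for the authoritative layout. The helper `make_dpo_record` and the file paths are hypothetical.

```python
import json

# Field names taken from the DPO training description in this README.
REQUIRED_FIELDS = ("chosen", "reject", "caption", "duration")


def make_dpo_record(chosen: str, reject: str, caption: str, duration: float) -> dict:
    """Build one preference record and verify that every field is present."""
    record = {
        "chosen": chosen,      # assumed: path to the preferred audio clip
        "reject": reject,      # assumed: path to the rejected audio clip
        "caption": caption,    # text prompt shared by both clips
        "duration": duration,  # clip length in seconds
    }
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return record


# Hypothetical paths, for illustration only.
records = [
    make_dpo_record(
        "audio/hammer_chosen.wav",
        "audio/hammer_reject.wav",
        "Hammer slowly hitting the wooden table",
        10,
    )
]
print(json.dumps(records, indent=2))
```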

## Evaluation Scripts

## TangoFlux vs. Other Audio Generation Models

The key comparison metrics include:

- **Output Length**: The duration of the generated audio, shown in the **Duration** column.
- **FD**<sub>openl3</sub>: Fréchet Distance.
- **KL**<sub>passt</sub>: KL divergence.
- **CLAP**<sub>score</sub>: Alignment score.


All inference times were measured on the same A40 GPU. The counts of trainable parameters are reported in the **Params** column.

| Model | Params | Duration | Steps | FD<sub>openl3</sub> ↓ | KL<sub>passt</sub> ↓ | CLAP<sub>score</sub> ↑ | IS ↑ | Inference Time (s) |
|---|---|---|---|---|---|---|---|---|
| **AudioLDM 2 (Large)** | 712M | 10 sec | 200 | 108.3 | 1.81 | 0.419 | 7.9 | 24.8 |
| **Stable Audio Open** | 1056M | 47 sec | 100 | 89.2 | 2.58 | 0.291 | 9.9 | 8.6 |
| **Tango 2** | 866M | 10 sec | 200 | 108.4 | 1.11 | 0.447 | 9.0 | 22.8 |
| **TangoFlux (Base)** | 515M | 30 sec | 50 | 80.2 | 1.22 | 0.431 | 11.7 | 3.7 |
| **TangoFlux** | 515M | 30 sec | 50 | 75.1 | 1.15 | 0.480 | 12.2 | 3.7 |
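
Because the models produce clips of different lengths, it can help to normalize inference time by output duration. The sketch below computes seconds of audio generated per second of compute from the **Duration** and **Inference Time (s)** columns above. This derived ratio is not reported in the table, and the helper name `audio_seconds_per_second` is hypothetical.

```python
# (model, output duration in seconds, inference time in seconds),
# copied from the comparison table above.
rows = [
    ("AudioLDM 2 (Large)", 10, 24.8),
    ("Stable Audio Open", 47, 8.6),
    ("Tango 2", 10, 22.8),
    ("TangoFlux", 30, 3.7),
]


def audio_seconds_per_second(duration_s: float, inference_s: float) -> float:
    """Seconds of generated audio per second of inference time."""
    return duration_s / inference_s


for name, duration_s, inference_s in rows:
    ratio = audio_seconds_per_second(duration_s, inference_s)
    print(f"{name}: {ratio:.1f} s of audio per s of compute")
```

A ratio above 1 means the model generates audio faster than real time on the A40 used for these measurements.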

## Citation

```bibtex
@misc{hung2024tangofluxsuperfastfaithful,
      title={TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization}, 
      author={Chia-Yu Hung and Navonil Majumder and Zhifeng Kong and Ambuj Mehrish and Rafael Valle and Bryan Catanzaro and Soujanya Poria},
      year={2024},
      eprint={2412.21037},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2412.21037}, 
}
```

## License

TangoFlux is licensed under the MIT License. See the `LICENSE` file for more details.