# Parler-TTS

[[Paper we reproduce]](https://arxiv.org/abs/2402.01912)
[[Models]](https://huggingface.co/parler-tts)
[[Training Code]](training)
[[Interactive Demo]](https://huggingface.co/spaces/parler-tts/parler_tts_mini)

> [!IMPORTANT]
> We're proud to release Parler-TTS v0.1, our first 300M-parameter model, trained on 10.5K hours of audio data.

Parler-TTS is a reproduction of the text-to-speech (TTS) model from the paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://www.text-description-to-speech.com)
by Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively. 

Unlike standard TTS models, Parler-TTS lets you directly describe the speaker's characteristics with a plain-text description, controlling gender, pitch, speaking style, accent, and more.
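
The description is free-form text, so attributes can be freely composed. As a quick sketch (these example strings are our own illustrations, not taken from the paper or the model card), you might build descriptions like this:

```python
# Illustrative only: example description strings we made up to show the kinds
# of speaker attributes a Parler-TTS description can combine.
quality = "clear audio quality"

descriptions = {
    "calm_male": f"A male speaker with a low-pitched voice speaks slowly and calmly, with {quality}.",
    "fast_female": f"A female speaker with a slightly high-pitched voice speaks very fast, with {quality}.",
    "expressive": f"A female speaker delivers her words quite expressively in a confined-sounding environment, with {quality}.",
}

for name, desc in descriptions.items():
    print(f"{name}: {desc}")
```

Each of these strings can be passed as the `description` input in the inference snippet below, alongside the transcript you want spoken.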

## Usage

> [!TIP]
> You can directly try it out in an interactive demo [here](https://huggingface.co/spaces/parler-tts/parler_tts_mini)!

Using Parler-TTS is as simple as "bonjour". Just run the following inference snippet:

```py
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_300M_v0.1")
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_300M_v0.1")

prompt = "Hey, how are you doing today?"
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."

input_ids = tokenizer(description, return_tensors="pt").input_ids
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
```


## Installation steps

Parler-TTS has lightweight dependencies and can be installed in one line:
```sh
pip install parler-tts
```

## Gradio demo

You can host your own Parler-TTS demo. First, install [`gradio`](https://www.gradio.app/) with:

```sh
pip install gradio
```

Then, run:

```sh
python helpers/gradio_demo/app.py
```

## Acknowledgements

This library builds on top of a number of open-source giants, to whom we'd like to extend our warmest thanks for providing these tools!

Special thanks to:
- Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively, for publishing such a promising and clear research paper: [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://arxiv.org/abs/2402.01912).
- The many open-source libraries used, namely [datasets](https://huggingface.co/docs/datasets/v2.17.0/en/index), [accelerate](https://huggingface.co/docs/accelerate/en/index), [jiwer](https://github.com/jitsi/jiwer), [wandb](https://wandb.ai/), and [transformers](https://huggingface.co/docs/transformers/index).

## Contribution

Contributions are welcome, as the project offers many possibilities for improvement and exploration.

Namely, we're looking at ways to improve both quality and speed:
- Datasets:
    - Train on more data
    - Add more features such as accents
- Training:
    - Add PEFT compatibility for LoRA fine-tuning.
    - Add the option to train without a description column.
    - Explore multilingual training.
    - Explore single-speaker fine-tuning.
    - Explore more architectures.
- Optimization:
    - Compilation and static cache
    - Support for Flash Attention 2 (FA2) and SDPA
- Evaluation:
    - Add more evaluation metrics

## Citation
```bibtex
@misc{lacombe-etal-2024-parler-tts,
  author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
  title = {Parler-TTS},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ylacombe/dataspeech}}
}
```