# Parler-TTS

[[Paper we reproduce]](https://arxiv.org/abs/2402.01912)
[[Models]](https://huggingface.co/parler-tts)
[[Training Code]](training)
[[Interactive Demo]](TODO - linked to spaces)

> [!IMPORTANT]
> We're proud to release Parler-TTS v0.1, our first 300M-parameter Parler-TTS model, trained on 10.5K hours of audio data.

Parler-TTS is a reproduction of the text-to-speech (TTS) model from the paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://www.text-description-to-speech.com)
by Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively. 

Unlike standard TTS models, Parler-TTS lets you directly describe the speaker's characteristics with a plain-text description, in which you can modulate gender, pitch, speaking style, accent, and more.

## Usage

> [!TIP]
> You can directly try it out in an interactive demo [here](TODO: add link to spaces)!

Using Parler-TTS is as simple as "bonjour": just run the following inference snippet.

```py
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

# TODO: change repo id

model = ParlerTTSForConditionalGeneration.from_pretrained("ylacombe/parler_tts_300M_v0.09")
tokenizer = AutoTokenizer.from_pretrained("ylacombe/parler_tts_300M_v0.09")

prompt = "Hey, how are you doing today?"
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."

input_ids = tokenizer(description, return_tensors="pt").input_ids  # the description conditions the voice
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids  # the prompt is the text to be spoken

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
```
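The snippet above writes the generated audio to a WAV file via `soundfile`. If `soundfile` isn't available, Python's stdlib `wave` module can write the same 16-bit PCM output. A minimal sketch, with a synthetic 440 Hz tone standing in for `audio_arr` and an assumed 44100 Hz rate (in practice, read it from `model.config.sampling_rate`):

```python
import math
import struct
import wave

# Assumed sampling rate; with the real model, use model.config.sampling_rate.
sampling_rate = 44_100

# Synthetic stand-in for audio_arr: one second of a 440 Hz sine in [-1.0, 1.0].
audio_arr = [math.sin(2 * math.pi * 440 * t / sampling_rate) for t in range(sampling_rate)]

with wave.open("parler_tts_out.wav", "wb") as f:
    f.setnchannels(1)              # mono
    f.setsampwidth(2)              # 16-bit PCM
    f.setframerate(sampling_rate)
    # Scale floats in [-1, 1] to int16 and pack little-endian.
    f.writeframes(struct.pack(f"<{len(audio_arr)}h", *(int(s * 32767) for s in audio_arr)))
```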


## Installation steps

Parler-TTS has lightweight dependencies and can be installed in one line:
```sh
pip install parler-tts
```

## Gradio demo

You can host your own Parler-TTS demo. First, install [`gradio`](https://www.gradio.app/) with:

```sh
pip install gradio
```

Then, run:

```sh
python helpers/gradio_demo/app.py
```

## Acknowledgements

This library builds on top of a number of open-source giants, to whom we'd like to extend our warmest thanks for providing these tools!

Special thanks to:
- Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively, for publishing such a promising and clear research paper: [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://arxiv.org/abs/2402.01912).
- and the many libraries used, namely [datasets](https://huggingface.co/docs/datasets/v2.17.0/en/index), [accelerate](https://huggingface.co/docs/accelerate/en/index), [jiwer](https://github.com/jitsi/jiwer), [wandb](https://wandb.ai/), and [transformers](https://huggingface.co/docs/transformers/index).

## Contribution

Contributions are welcome, as the project offers many possibilities for improvement and exploration.

Namely, we're looking at ways to improve both quality and speed:
- Datasets:
    - Train on more data
    - Add more features such as accents
- Training:
    - Add PEFT compatibility for LoRA fine-tuning.
    - Add the possibility to train without a description column.
    - Explore multilingual training.
    - Explore mono-speaker finetuning.
    - Explore more architectures.
- Optimization:
    - Compilation and static cache
    - Support for Flash Attention 2 (FA2) and SDPA
- Evaluation:
    - Add more evaluation metrics

## Citation
```
@misc{lacombe-etal-2024-parler-tts,
  author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
  title = {Parler-TTS},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/parler-tts}}
}
```