> [!WARNING]
> Don't forget to add `group_by_length` in configs.

# Parler-TTS

[[Paper we reproduce]](https://arxiv.org/abs/2402.01912)
[[Models]](https://huggingface.co/parler-tts)
[[Training Code]](training)
[[Interactive Demo]](TODO - linked to spaces)

> [!IMPORTANT]
> We're proud to release Parler-TTS v0.1, our first 300M-parameter Parler-TTS model, trained on 10.5k hours of audio data.

Parler-TTS is a reproduction of the text-to-speech (TTS) model from the paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://www.text-description-to-speech.com)
by Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively. 

Unlike standard TTS models, Parler-TTS lets you describe the speaker's characteristics directly with a simple text description, modulating attributes such as gender, pitch, speaking style, and accent.
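For illustration, here are a few hypothetical description strings (the exact wording is free-form text, not a fixed vocabulary) showing how the same prompt can be paired with different voices:

```python
# Hypothetical description strings -- free-form text, not a fixed schema.
# Each one targets different speaker attributes for the same prompt.
prompt = "Hey, how are you doing today?"

descriptions = [
    "A male speaker with a deep, low-pitched voice speaks slowly and calmly.",
    "A female speaker with a high-pitched voice delivers her words quickly and expressively.",
    "A speaker with a British accent talks in a monotone voice in a noisy environment.",
]

for description in descriptions:
    # Pair each description with the same prompt to get a different voice.
    print(f"{description!r} -> {prompt!r}")
```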

## Usage

> [!TIP]
> You can directly try it out in an interactive demo [here](TODO: add link to spaces)!

Using Parler-TTS is as simple as "bonjour". Simply use the following inference snippet.

```py
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

# TODO: change repo id

model = ParlerTTSForConditionalGeneration.from_pretrained("ylacombe/parler_tts_300M_v0.09")
tokenizer = AutoTokenizer.from_pretrained("ylacombe/parler_tts_300M_v0.09")

prompt = "Hey, how are you doing today?"
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."

input_ids = tokenizer(description, return_tensors="pt").input_ids
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
```
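The generation comes back as a float waveform. As an optional post-processing step (not part of the library, just a plain NumPy sketch), you can peak-normalize the array before writing it, so quiet generations stay audible:

```python
import numpy as np

def peak_normalize(audio: np.ndarray, headroom: float = 0.95) -> np.ndarray:
    """Scale audio so its largest absolute sample equals `headroom`."""
    peak = np.max(np.abs(audio))
    if peak == 0:
        return audio  # silent clip: nothing to scale
    return audio * (headroom / peak)

# Example on a tiny dummy waveform; with Parler-TTS you would pass `audio_arr`.
audio = np.array([0.1, -0.5, 0.25], dtype=np.float32)
normalized = peak_normalize(audio)
# The loudest sample is now at -0.95; pass `normalized` to sf.write instead.
```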


## Installation steps

Parler-TTS has lightweight dependencies and can be installed in one line:
```sh
pip install parler-tts
```

## Gradio demo

You can host your own Parler-TTS demo. First, install [`gradio`](https://www.gradio.app/) with:

```sh
pip install gradio
```

Then, run:

```sh
python helpers/gradio_demo/app.py
```

## Acknowledgements

This library builds on top of a number of open-source giants, to whom we'd like to extend our warmest thanks for providing these tools!

Special thanks to:
- Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively, for publishing such a promising and clear research paper: [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://arxiv.org/abs/2402.01912).
- The many open-source libraries we build on, namely [datasets](https://huggingface.co/docs/datasets/v2.17.0/en/index), [accelerate](https://huggingface.co/docs/accelerate/en/index), [jiwer](https://github.com/jitsi/jiwer), [wandb](https://wandb.ai/), and [transformers](https://huggingface.co/docs/transformers/index).

## Contribution

Contributions are welcome, as the project offers many possibilities for improvement and exploration.

Namely, we're looking at ways to improve both quality and speed:
- Datasets:
    - Train on more data
    - Add more features such as accents
- Training:
    - Add PEFT compatibility for LoRA fine-tuning.
    - Support training without a description column.
    - Explore multilingual training.
    - Explore mono-speaker finetuning.
    - Explore more architectures.
- Optimization:
    - Compilation and static cache
    - Support for Flash Attention 2 (FA2) and SDPA

## Citation
```
@misc{lacombe-etal-2024-parler-tts,
  author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
  title = {Parler-TTS},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ylacombe/dataspeech}}
}
```