Unverified commit c40c6de2 authored by Yoach Lacombe, committed by GitHub

Apply suggestions from code review


Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
parent ac9c881d
@@ -6,12 +6,15 @@
[[Interactive Demo]](https://huggingface.co/spaces/parler-tts/parler_tts_mini)
> [!IMPORTANT]
-> We're proud to release Parler-TTS v0.1, our first 300M-parameters Parler-TTS model, trained on 10.5K hours of audio data.
+> We're proud to release Parler-TTS v0.1, our first 300M parameter model, trained on 10.5K hours of audio data.
> In the coming weeks, we'll be working on scaling up to 50k hours of data, in preparation for the v1 model.
-Parler-TTS is a reproduction of the text-to-speech (TTS) model from the paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://www.text-description-to-speech.com)
+Parler-TTS is a lightweight text-to-speech (TTS) model that can generate high-quality, natural sounding speech in the style of a given speaker (gender, pitch, speaking style, etc). It is a reproduction of work from the paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://www.text-description-to-speech.com)
by Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively.
-Contrarily to standard TTS models, Parler-TTS allows you to directly describe the speaker characteristics with a simple text description where you can modulate gender, pitch, speaking style, accent, etc.
+Contrarily to other TTS models, Parler-TTS is a **fully open-source** release. All of the datasets, pre-processing, training code and weights are released publicly under permissive license, enabling the community to build on our work and develop their own powerful TTS models.
This repository contains the inference and training code for Parler-TTS. It is designed to accompany the [Data-Speech](https://github.com/ylacombe/dataspeech) repository for dataset annotation.
## Usage
@@ -22,7 +25,7 @@ Using Parler-TTS is as simple as "bonjour". Simply use the following inference s
```py
from parler_tts import ParlerTTSForConditionalGeneration
-from transformers import AutoTokenizer, AutoFeatureExtractor
+from transformers import AutoTokenizer
import soundfile as sf
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_300M_v0.1")
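(The hunk above cuts off mid-snippet. For readability, here is a minimal end-to-end sketch of the inference flow this snippet belongs to, following the pattern of the upstream README; the `generate` argument names are assumptions based on the v0.1 API rather than content shown in this diff.)
```py
# Sketch of the full inference flow (not part of the diff itself).
# The text description conditions the voice; prompt_input_ids carry
# the transcript to be spoken.
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_300M_v0.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_300M_v0.1")

prompt = "Hey, how are you doing today?"
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, with clear audio quality."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Generate audio tokens and decode them back to a waveform.
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
```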
@@ -68,7 +71,8 @@ This library builds on top of a number of open-source giants, to whom we'd like
Special thanks to:
- Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively, for publishing such a promising and clear research paper: [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://arxiv.org/abs/2402.01912).
- the many libraries used, namely [🤗 datasets](https://huggingface.co/docs/datasets/v2.17.0/en/index), [🤗 accelerate](https://huggingface.co/docs/accelerate/en/index), [jiwer](https://github.com/jitsi/jiwer), [wandb](https://wandb.ai/), and [🤗 transformers](https://huggingface.co/docs/transformers/index).
-- HuggingFace 🤗 for providing compute resources and time to explore!
+- Descript for the [DAC codec model](https://github.com/descriptinc/descript-audio-codec)
+- Hugging Face 🤗 for providing compute resources and time to explore!
## Contribution
@@ -92,6 +96,7 @@ Namely, we're looking at ways to improve both quality and speed:
- Add more evaluation metrics
## Citation
If you found this repository useful, please consider citing this work and also the original Stability AI paper:
```
@misc{lacombe-etal-2024-parler-tts,
author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
@@ -1704,7 +1704,7 @@ class ParlerTTSForConditionalGeneration(PreTrainedModel):
Example:
```python
->>> from transformers import ParlerTTSForConditionalGeneration
+>>> from parler_tts import ParlerTTSForConditionalGeneration
>>> model = ParlerTTSForConditionalGeneration.from_pretrained("facebook/parler_tts-small")
```"""
@@ -1783,7 +1783,7 @@ class ParlerTTSForConditionalGeneration(PreTrainedModel):
Example:
```python
->>> from transformers import ParlerTTSForConditionalGeneration
+>>> from parler_tts import ParlerTTSForConditionalGeneration
>>> # initialize a parler_tts model from a t5 text encoder, encodec audio encoder, and parler_tts decoder
>>> model = ParlerTTSForConditionalGeneration.from_sub_models_pretrained(
# Training Parler-TTS
-This sub-folder contains all the information to train or finetune you own Parler-TTS model. It consists in:
-- [A. An introduction to Parler-TTS architecture](#a-architecture)
+This sub-folder contains all the information to train or fine-tune your own Parler-TTS model. It consists of:
+- [A. An introduction to the Parler-TTS architecture](#a-architecture)
- [B. First steps to get started](#b-getting-started)
- [C. Training guide](#c-training)
- [E. Scaling up to 10.5K hours](#d-scaling-up---discussions-and-tips)
@@ -9,10 +9,10 @@ This sub-folder contains all the information to train or finetune you own Parler
## A. Architecture
-At the moment, Parler-TTS architecture is a carbon copy of [Musicgen architecture](https://huggingface.co/docs/transformers/v4.39.3/en/model_doc/musicgen#model-structure) and can be decomposed into three distinct stages:
->1. Text encoder: maps the text inputs to a sequence of hidden-state representations. The pre-trained MusicGen models use a frozen text encoder from either T5 or Flan-T5
+At the moment, Parler-TTS architecture is a carbon copy of the [MusicGen architecture](https://huggingface.co/docs/transformers/v4.39.3/en/model_doc/musicgen#model-structure) and can be decomposed into three distinct stages:
+>1. Text encoder: maps the text descriptions to a sequence of hidden-state representations. Parler-TTS uses a frozen text encoder initialised entirely from Flan-T5
>2. Parler-TTS decoder: a language model (LM) that auto-regressively generates audio tokens (or codes) conditional on the encoder hidden-state representations
->3. Audio encoder: used to recover the audio waveform from the audio tokens predicted by the decoder
+>3. Audio codec: used to recover the audio waveform from the audio tokens predicted by the decoder. We use the [DAC model](https://github.com/descriptinc/descript-audio-codec) from Descript, although other codec models, such as [EnCodec](https://huggingface.co/facebook/encodec_48khz), can also be used
Parler-TTS however introduces some small tweaks:
- The text **description** is passed through the text encoder and used in the cross-attention layers of the decoder.
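(As a side note to the three-stage description above: a tiny, hypothetical inspection sketch of how the stages surface on the composite model. The attribute names `text_encoder`, `decoder` and `audio_encoder` mirror the MusicGen-style layout the text references and should be treated as assumptions, not as content of this diff.)
```py
from parler_tts import ParlerTTSForConditionalGeneration

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_300M_v0.1")

# Stage 1: frozen Flan-T5 text encoder (description -> hidden states)
print(type(model.text_encoder))
# Stage 2: Parler-TTS decoder LM (hidden states -> audio tokens)
print(type(model.decoder))
# Stage 3: DAC audio codec (audio tokens -> waveform)
print(type(model.audio_encoder))
```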
@@ -38,13 +38,13 @@ git clone https://github.com/huggingface/parler-tts.git
cd parler-tts
```
-... And then to install requirements.
+... And then install the requirements:
```bash
pip install -e .[train]
```
-Optionnally, you can create a wandb account and login to it by following [this guide](https://docs.wandb.ai/quickstart). [`wandb`](https://docs.wandb.ai/) allows for better tracking of the experiments metrics and losses.
+Optionally, you can create a wandb account and login to it by following [this guide](https://docs.wandb.ai/quickstart). [`wandb`](https://docs.wandb.ai/) allows for better tracking of the experiments metrics and losses.
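(For reference, logging in is the standard stock `wandb` CLI one-liner below; it is shown for convenience and is not part of this diff.)
```bash
wandb login
```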
You also have the option to configure Accelerate by running the following command. Note that you should set the number of GPUs you wish to use for training, and also the data type (dtype) to your preferred dtype for training/inference (e.g. `bfloat16` on A100 GPUs, `float16` on V100 GPUs, etc.):
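(The command block that follows this sentence is collapsed out of the diff; it is presumably the standard interactive Accelerate setup call shown below.)
```bash
accelerate config
```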
@@ -164,7 +164,7 @@ accelerate launch ./training/run_parler_tts_training.py \
> For example: `--model_name_or_path parler-tts/parler_tts_300M_v0.1`.
-Additionnally, you can also write a JSON config file. Here, [librispeech_tts_r_300M_dummy.json](/helpers/training_configs/librispeech_tts_r_300M_dummy.json) contains the exact same hyper-parameters than above and can be launched like that:
+Additionally, you can also write a JSON config file. Here, [librispeech_tts_r_300M_dummy.json](/helpers/training_configs/librispeech_tts_r_300M_dummy.json) contains the exact same hyper-parameters than above and can be launched like that:
```sh
accelerate launch ./training/run_parler_tts_training.py ./helpers/training_configs/librispeech_tts_r_300M_dummy.json
```
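(For illustration only: such a JSON config maps the training script's CLI flag names to values. A minimal hypothetical excerpt might look like the following; the field names follow the flags used above, but the values are illustrative and are not taken from the linked `librispeech_tts_r_300M_dummy.json`.)
```json
{
  "model_name_or_path": "parler-tts/parler_tts_300M_v0.1",
  "output_dir": "./output_dir_training",
  "per_device_train_batch_size": 2,
  "learning_rate": 1e-4,
  "num_train_epochs": 3
}
```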