# Parler-TTS
Work in-progress reproduction of the text-to-speech (TTS) model from the paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://www.text-description-to-speech.com)
by Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively.
Reproducing the TTS model requires the following 5 steps to be completed in order:
1. Train the Accent Classifier
2. Annotate the Training Set
3. Aggregate Statistics
4. Create Descriptions
5. Train the Model
## Step 1: Train the Accent Classifier
The script [`run_audio_classification.py`](run_audio_classification.py) can be used to train an audio encoder model from
the [Transformers library](https://github.com/huggingface/transformers) (e.g. Wav2Vec2, MMS, or Whisper) for the accent
classification task.
Starting with a pre-trained audio encoder model, a simple linear classifier is appended to the last hidden layer to map the
audio embeddings to class-label predictions. The audio encoder can either be frozen (`--freeze_base_model`) or trained.
The linear classifier is randomly initialised, and is thus always trained.
The script can be used to train on a single accent dataset, or a combination of datasets, which should be specified by
separating dataset names, configs and splits by the `+` character in the launch command (see below for an example).
In the following example, we follow Stability's approach by taking audio embeddings from a frozen [MMS-LID](https://huggingface.co/facebook/mms-lid-126)
model, and training the linear classifier on a combination of three open-source datasets:
1. The English Accented (`en_accented`) subset of [Voxpopuli](https://huggingface.co/datasets/facebook/voxpopuli)
2. The train split of [VCTK](https://huggingface.co/datasets/vctk)
3. The dev split of [EdAcc](https://huggingface.co/datasets/sanchit-gandhi/edacc)
The model is subsequently evaluated on the test split of [EdAcc](https://huggingface.co/datasets/sanchit-gandhi/edacc)
to give the final classification accuracy.
```bash
#!/usr/bin/env bash
# Fine-tune a frozen MMS-LID encoder with a linear classification head on a
# combination of three accent datasets, evaluating on the EdAcc test split.
python run_audio_classification.py \
--model_name_or_path "facebook/mms-lid-126" \
--train_dataset_name "vctk+facebook/voxpopuli+sanchit-gandhi/edacc" \
--train_dataset_config_name "main+en_accented+default" \
--train_split_name "train+test+validation" \
--train_label_column_name "accent+accent+accent" \
--eval_dataset_name "sanchit-gandhi/edacc" \
--eval_dataset_config_name "default" \
--eval_split_name "test" \
--eval_label_column_name "accent" \
--output_dir "./" \
--do_train \
--do_eval \
--overwrite_output_dir \
--remove_unused_columns False \
--fp16 \
--learning_rate 1e-4 \
--max_length_seconds 20 \
--attention_mask False \
--warmup_ratio 0.1 \
--num_train_epochs 5 \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 32 \
--preprocessing_num_workers 16 \
--dataloader_num_workers 4 \
--logging_strategy "steps" \
--logging_steps 10 \
--evaluation_strategy "epoch" \
--save_strategy "epoch" \
--load_best_model_at_end True \
--metric_for_best_model "accuracy" \
--save_total_limit 3 \
--freeze_base_model \
--push_to_hub \
--trust_remote_code
```

Tips:
1. **Number of labels:** normalisation should be applied to the target class labels to group linguistically similar accents together (e.g. "Southern Irish" and "Irish" should both be mapped to "Irish"). This helps _balance_ the dataset by removing labels with very few examples. You can modify the function `preprocess_labels` to implement any custom normalisation strategy, as in the sketch below.
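For illustration, here is a minimal sketch of the kind of normalisation `preprocess_labels` might apply (the mapping below is invented for the example, not the one shipped in the script):

```py
# Hypothetical accent-normalisation table: collapse linguistically similar
# accents onto a single canonical label to balance the class distribution.
ACCENT_MAPPING = {
    "southern irish": "irish",
    "northern irish": "irish",
    "scottish highlands": "scottish",
}

def preprocess_labels(label: str) -> str:
    """Lower-case the raw label and collapse similar accents together."""
    label = label.lower().strip()
    return ACCENT_MAPPING.get(label, label)

assert preprocess_labels("Southern Irish") == "irish"
```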
## Step 2: Annotate the Training Set

Annotate the training dataset with information on: SNR (signal-to-noise ratio), C50 (speech clarity), pitch and speaking rate.
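To give a flavour of what this annotation looks like, here is a hedged sketch that estimates speaking rate from a transcript and the raw audio; the real pipeline relies on dedicated estimators for SNR, C50 and pitch, and the helper below is purely illustrative:

```py
import numpy as np

def speaking_rate(transcript: str, audio: np.ndarray, sampling_rate: int) -> float:
    """Rough words-per-second estimate from a transcript and its waveform."""
    duration_s = len(audio) / sampling_rate
    return len(transcript.split()) / max(duration_s, 1e-6)

# 6 words over 3 seconds of (dummy) 16 kHz audio -> 2.0 words per second
rate = speaking_rate("hey how are you doing today", np.zeros(3 * 16_000), 16_000)
```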
## Step 3: Aggregate Statistics

Aggregate the statistics from Step 2 and convert the continuous values to discrete labels.
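One plausible way to discretise a continuous annotation is to bin each value against percentiles computed over the full corpus (a sketch under that assumption; the bin names and boundaries are invented):

```py
import numpy as np

LABELS = ("very low", "low", "moderate", "high", "very high")

def to_label(value: float, corpus_values: np.ndarray) -> str:
    """Map a continuous value to a discrete label via corpus percentiles."""
    bins = np.percentile(corpus_values, [20, 40, 60, 80])
    return LABELS[int(np.digitize(value, bins))]

corpus_pitches = np.random.default_rng(0).normal(180, 40, 10_000)  # dummy pitch stats (Hz)
print(to_label(230.0, corpus_pitches))  # -> "very high"
```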
## Step 4: Create Descriptions

Convert each sequence of discrete labels to a natural-language description (using an LLM).
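For example, the discrete labels can be slotted into a prompt for any instruction-tuned LLM (the prompt wording below is illustrative, not the authors' exact template):

```py
labels = {
    "gender": "female",
    "pitch": "slightly low",
    "speaking rate": "very fast",
    "reverberation": "very confined sounding",
    "noise": "clear audio quality",
}

prompt = (
    "Write a one-sentence description of a speaker with these characteristics: "
    + ", ".join(f"{k}: {v}" for k, v in labels.items())
)
# Feeding `prompt` to an LLM should yield something like the description used
# in the inference example below, e.g. "A female speaker with a slightly
# low-pitched voice delivers her words quite expressively ..."
```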
## Step 5: Train the Model

Train a MusicGen-style model on the TTS task. This step needs DAC (the Descript Audio Codec), which provides the discrete audio tokens that the model is trained to predict.

> [!IMPORTANT]
> Don't forget to add `group_by_length` in the training configs.
## Inference
> [!TIP]
> You can directly try it out in an interactive demo [here](TODO: add link to spaces)!
Using Parler-TTS is as simple as "bonjour". Just run the following inference snippet.
```py
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf
# TODO: change repo id
model = ParlerTTSForConditionalGeneration.from_pretrained("ylacombe/parler_tts_300M_v0.09")
tokenizer = AutoTokenizer.from_pretrained("ylacombe/parler_tts_300M_v0.09")
prompt = "Hey, how are you doing today?"
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."
# Tokenise the voice description and the transcript prompt separately
input_ids = tokenizer(description, return_tensors="pt").input_ids
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Generate the waveform and write it to disk at the model's sampling rate
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
```
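If you have a GPU, you can move the model and the tokenised inputs onto it before generating; this is standard PyTorch device placement, nothing Parler-TTS-specific:

```py
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = model.to(device)

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
```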
## Installation

Parler-TTS has lightweight dependencies and can be installed in one line:
```sh
pip install parler-tts
```
## Gradio demo

You can host your own Parler-TTS demo. First, install [`gradio`](https://www.gradio.app/) with:
```sh
pip install gradio
```
Then, run:
```sh
python helpers/gradio_demo/app.py
```
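If you would rather prototype your own demo, a minimal Gradio app built on the inference snippet above might look like this (a sketch only; `helpers/gradio_demo/app.py` is the maintained version, and the repo id is the placeholder used earlier):

```py
import gradio as gr
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

model = ParlerTTSForConditionalGeneration.from_pretrained("ylacombe/parler_tts_300M_v0.09")
tokenizer = AutoTokenizer.from_pretrained("ylacombe/parler_tts_300M_v0.09")

def synthesise(prompt: str, description: str):
    input_ids = tokenizer(description, return_tensors="pt").input_ids
    prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    # Gradio's Audio component accepts a (sampling_rate, waveform) tuple
    return model.config.sampling_rate, generation.cpu().numpy().squeeze()

gr.Interface(
    fn=synthesise,
    inputs=[gr.Textbox(label="Prompt"), gr.Textbox(label="Description")],
    outputs=gr.Audio(label="Generated speech"),
).launch()
```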
## Acknowledgements

This library builds on top of a number of open-source giants, to whom we'd like to extend our warmest thanks for providing these tools!
Special thanks to:
- Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively, for publishing such a promising and clear research paper: [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://arxiv.org/abs/2402.01912).
- and the many libraries used, namely [datasets](https://huggingface.co/docs/datasets/v2.17.0/en/index), [accelerate](https://huggingface.co/docs/accelerate/en/index), [jiwer](https://github.com/jitsi/jiwer), [wandb](https://wandb.ai/), and [transformers](https://huggingface.co/docs/transformers/index).
## Citation
```
@misc{lacombe-etal-2024-parler-tts,
  author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
  title = {Parler-TTS},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ylacombe/dataspeech}}
}
```
As part of this update, `DACConfig` also gains a `sampling_rate` field:

```diff
@@ -12,6 +12,7 @@ class DACConfig(PretrainedConfig):
         codebook_size: int = 1024,
         latent_dim: int = 1024,
         frame_rate: int = 86,
+        sampling_rate: int = 44100,
         **kwargs,
     ):
         self.codebook_size = codebook_size
@@ -19,5 +20,6 @@ class DACConfig(PretrainedConfig):
         self.latent_dim = latent_dim
         self.num_codebooks = num_codebooks
         self.frame_rate = frame_rate
+        self.sampling_rate = sampling_rate
         super().__init__(**kwargs)
```
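For reference, a short sketch of how the new field surfaces on the config object; the import path below is a guess for illustration, so adjust it to wherever `DACConfig` actually lives in the package:

```py
# Assumed import path -- not confirmed by this change
from parler_tts.dac_wrapper import DACConfig

config = DACConfig()
print(config.frame_rate)     # 86
print(config.sampling_rate)  # 44100, the newly added field
```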