_abstract_: Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity. In this study, we cast the problem of emotion conversion as a spoken language translation task. We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion. First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units. Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder. Such a paradigm allows us to go beyond spectral and parametric changes of the signal, and model non-verbal vocalizations, such as laughter insertion, yawning removal, etc. We demonstrate objectively and subjectively that the proposed method is superior to the baselines in terms of perceived emotion and audio quality. We rigorously evaluate all components of such a complex system and conclude with an extensive model analysis and ablation study to better emphasize the architectural choices, strengths and weaknesses of the proposed method. Samples and code will be publicly available under the following link: https://speechbot.github.io/emotion.
## Installation
First, create a conda virtual environment and activate it:
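For example (the environment name and Python version below are placeholders; adjust them to your setup):

```
conda create -n emotion_conversion python=3.8
conda activate emotion_conversion
```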
* Modify `--source-lang` and `--target-lang` to control the source and target emotions (see the example below).
* See [fairseq documentation](https://fairseq.readthedocs.io/en/latest/command_line_tools.html#fairseq-generate) for a full overview of generation parameters (e.g., top-k/top-p sampling).
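For example, a generation call might look like the following; the data directory, checkpoint path, and the emotion codes used as `--source-lang`/`--target-lang` are placeholders, not the repository's actual names:

```
fairseq-generate data-bin/emov \
    --path checkpoints/translation_model.pt \
    --source-lang neutral --target-lang amused \
    --gen-subset test \
    --beam 1 --sampling --sampling-topk 10 \
    --batch-size 32 \
    --results-path outputs/units
```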
## Waveform Synthesis
Using the output of the above command, the HiFiGAN vocoder, and the prosody prediction modules (F0 and duration), we can now generate the output waveforms:
* Please make sure the source and target emotions here match those of the previous command.
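As a purely illustrative sketch of this step (the script name and every flag below are hypothetical placeholders rather than the repository's actual interface), the synthesis stage ties together the translated units, the prosody predictors, and the vocoder:

```
# Hypothetical invocation: substitute the repository's actual synthesis entry point,
# checkpoints, and flag names; the emotion names are placeholders as well.
python synthesize.py \
    --units outputs/units/generate-test.txt \
    --f0-model checkpoints/f0_predictor.pt \
    --duration-model checkpoints/duration_predictor.pt \
    --vocoder checkpoints/hifigan.pt \
    --source-emotion neutral \
    --target-emotion amused \
    --output-dir outputs/wavs
```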
# Citation
If you find this useful in your research, please use the following BibTeX entry for citation.
```
@article{kreuk2021textless,
title={Textless speech emotion conversion using decomposed and discrete representations},
author={Kreuk, Felix and Polyak, Adam and Copet, Jade and Kharitonov, Eugene and Nguyen, Tu-Anh and Rivi{\`e}re, Morgane and Hsu, Wei-Ning and Mohamed, Abdelrahman and Dupoux, Emmanuel and Adi, Yossi},
journal={Conference on Empirical Methods in Natural Language Processing (EMNLP)},
}
```
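The dataset-preparation script that builds translation manifests from the EmoV data accepts the following command-line options; below is a self-contained sketch of its argument parser: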
parser.add_argument("data",type=Path,help="path to a dir containing .tsv and .km files containing emov dataset")
parser.add_argument("output_path",type=Path,help="output directory with the manifests will be created")
parser.add_argument("-cs","--cross-speaker",action='store_true',help="if set then translation will occur also between speakers, meaning the same sentence can be translated between different speakers (default: false)")
parser.add_argument("-sh","--shuffle",action='store_true',help="shuffle the data")
parser.add_argument("-ae","--autoencode",action='store_true',help="include training pairs from the same emotion (this includes examples of the same sentence uttered by different people and examples where the src and trg are the exact same seq)")
parser.add_argument("-dr","--dry-run",action='store_true',help="don't write anything to disk")
parser.add_argument("-zs","--zero-shot",action='store_true',help="if true, the denoising task will train on the same splits as the translation task (split by utterance id). if false, the denoising task will train on randomly sampled splits (not split by utterance id)")
# Language Models not just for Pre-training: Fast Online Neural Noisy Channel Modeling
## Introduction
- [Yee et al. (2019)](https://www.aclweb.org/anthology/D19-1571.pdf) introduce a simple and effective noisy channel modeling approach for neural machine translation. However, the online noisy channel decoding approach introduced in that paper is too slow to be practical.
- To address this, [Bhosale et al. (2020)](http://www.statmt.org/wmt20/pdf/2020.wmt-1.68.pdf) introduce three simple approximations that make this approach fast and practical without much loss in accuracy.
- This README provides instructions on how to run online decoding or generation with the noisy channel modeling approach, including ways to make it very fast without much loss in accuracy.
## Noisy Channel Modeling
[Yee et al. (2019)](https://www.aclweb.org/anthology/D19-1571.pdf) apply Bayes' rule to decompose `P(y|x)`, the probability of the target `y` given the source `x`:
```P(y|x) = P(x|y) * P(y) / P(x)```
- `P(x|y)` predicts the source `x` given the target `y` and is referred to as the **channel model**
- `P(y)` is a **language model** over the target `y`
- `P(x)` is generally not modeled since it is constant for all `y`.
We use Transformer models to parameterize the direct model `P(y|x)`, the channel model `P(x|y)` and the language model `P(y)`.
During online decoding with beam search, we generate the top `K2` candidates per beam and score them with the following linear combination of the channel model, language model, and direct model scores.
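Concretely, following [Yee et al. (2019)](https://www.aclweb.org/anthology/D19-1571.pdf), the combined score takes roughly the form below, where `t` is the length of the target prefix, `s` is the source length, and `λ1`, `λ2` are the channel model and language model weights:

```
(1 / t) * log P(y|x) + (1 / s) * ( λ1 * log P(x|y) + λ2 * log P(y) )
```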
The top `beam_size` candidates according to this combined score are kept to continue the beams; in beam search with a direct model alone, only the scores from the direct model `P(y|x)` are used to choose the top candidates.
This framework provides an effective way to utilize strong target-side language models trained on large amounts of unlabeled data. Language models can prefer targets unrelated to the source, so we also need a channel model, whose role is to ensure that the target preferred by the language model also translates back to the source.
### Training Translation Models and Language Models
For training Transformer models in fairseq for machine translation, refer to the instructions [here](https://github.com/pytorch/fairseq/tree/main/examples/translation).
For training Transformer models in fairseq for language modeling, refer to the instructions [here](https://github.com/pytorch/fairseq/tree/main/examples/language_model).
### Generation with Language Model for German-English translation with fairseq
Here are instructions to generate using a direct model and a target-side language model.
Note:
- Download and install fairseq as per instructions [here](https://github.com/pytorch/fairseq)
- Preprocess and binarize the dataset as per instructions in section [Test Data Preprocessing](#test-data-preprocessing)
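A sketch of such a run is shown below. The data directory, checkpoint paths, and weight values are placeholders, and the flags specific to this example (`--lm-model`, `--lm-data`, `--lm-wt`, `--combine-method`, `--k2`) are assumptions about the `noisy_channel_translation` task registered in `examples/fast_noisy_channel`, so verify them against the task's argument parser before running:

```
fairseq-generate data-bin/wmt19.de-en \
    --user-dir examples/fast_noisy_channel \
    --task noisy_channel_translation \
    --path checkpoints/direct_de_en.pt \
    --lm-model checkpoints/en_lm.pt \
    --lm-data data-bin/en_lm \
    --lm-wt 0.14 \
    --combine-method lm_only \
    --k2 10 \
    --beam 5 \
    --gen-subset test \
    --remove-bpe \
    --batch-size 16
```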
### Noisy Channel Generation for German-English translation with fairseq
Here are instructions for noisy channel generation with a direct model, channel model and language model as explained in section [Noisy Channel Modeling](#noisy-channel-modeling).
Note:
- Download and install fairseq as per instructions [here](https://github.com/pytorch/fairseq)
- Preprocess and binarize the dataset as per instructions in section [Test Data Preprocessing](#test-data-preprocessing)
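Below is a sketch analogous to the one above, now adding a channel model. As before, paths and weight values are placeholders, and the example-specific flags (`--channel-model`, `--ch-wt`, `--combine-method noisy_channel`) are assumptions to be checked against the task's argument parser:

```
fairseq-generate data-bin/wmt19.de-en \
    --user-dir examples/fast_noisy_channel \
    --task noisy_channel_translation \
    --path checkpoints/direct_de_en.pt \
    --channel-model checkpoints/channel_en_de.pt \
    --lm-model checkpoints/en_lm.pt \
    --lm-data data-bin/en_lm \
    --combine-method noisy_channel \
    --ch-wt 1.0 \
    --lm-wt 0.5 \
    --lenpen 0.2 \
    --k2 10 \
    --beam 5 \
    --gen-subset test \
    --remove-bpe \
    --batch-size 1
```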
## Fast Noisy Channel Modeling

[Bhosale et al. (2020)](http://www.statmt.org/wmt20/pdf/2020.wmt-1.68.pdf) introduce three approximations that speed up online noisy channel decoding:
- Smaller channel models (`Transformer Base` with 1 encoder and decoder layer each vs. `Transformer Big`)
  - This involves training a channel model that is possibly smaller and less accurate in terms of BLEU than a channel model of the same size as the direct model.
  - Since the role of the channel model is mainly to assign low scores to generations from the language model if they don't translate back to the source, we may not need the most accurate channel model for this purpose.
- Smaller output vocabulary size for the channel model (~30,000 -> ~1,000)
  - The channel model doesn't need to score the full output vocabulary; it only needs to score the source tokens, which are known in advance.
  - This is specified using the arguments `--channel-scoring-type src_vocab --top-k-vocab 500`.
  - This means that the output vocabulary for the channel model will be the source tokens for all examples in the batch plus the top-K most frequent tokens in the vocabulary.
  - This significantly reduces the memory needed to store channel model scores.
- A smaller number of candidates (`k2`) scored per beam
  - This is specified by reducing the argument `--k2`.
### Fast Noisy Channel Generation for German-English translation with fairseq
Here are instructions for **fast** noisy channel generation with a direct model, channel model and language model as explained in section [Fast Noisy Channel Modeling](#fast-noisy-channel-modeling). The main differences are that we use a smaller channel model, reduce `--k2`, set `--channel-scoring-type src_vocab --top-k-vocab 500` and increase the `--batch-size`.
Note:
- Download and install fairseq as per instructions [here](https://github.com/pytorch/fairseq)
- Preprocess and binarize the dataset as per instructions in section [Test Data Preprocessing](#test-data-preprocessing)
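The sketch below applies these settings on top of the noisy channel command above; paths and weights remain placeholders, and the non-standard flags are the same assumptions as before, with `--channel-scoring-type src_vocab --top-k-vocab 500` taken directly from the list of approximations:

```
fairseq-generate data-bin/wmt19.de-en \
    --user-dir examples/fast_noisy_channel \
    --task noisy_channel_translation \
    --path checkpoints/direct_de_en.pt \
    --channel-model checkpoints/channel_en_de_base_1_1.pt \
    --channel-scoring-type src_vocab --top-k-vocab 500 \
    --lm-model checkpoints/en_lm.pt \
    --lm-data data-bin/en_lm \
    --combine-method noisy_channel \
    --ch-wt 1.0 --lm-wt 0.5 --lenpen 0.2 \
    --k2 3 \
    --beam 5 \
    --gen-subset test \
    --remove-bpe \
    --batch-size 50
```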
The direct and channel models are trained using bitext data (WMT16) combined with backtranslated data; the monolingual data used for backtranslation comes from http://data.statmt.org/rsennrich/wmt16_backtranslations/ (Sennrich et al., 2016c).
The backtranslated data is generated using an ensemble of 3 English-Romanian models trained on bitext training data (WMT16) with unrestricted sampling.
### BPE Codes and Dictionary
We learn a joint BPE vocabulary of 18K types on the bitext training data, which is used for both the source and target.
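A minimal sketch of learning and applying such a joint vocabulary with `subword-nmt` is shown below; the original example may have used a different BPE implementation, and the file names are placeholders:

```
cat corpus/train.ro corpus/train.en > corpus/train.joint
subword-nmt learn-bpe -s 18000 < corpus/train.joint > bpe.codes
subword-nmt apply-bpe -c bpe.codes < corpus/train.ro > corpus/train.bpe.ro
subword-nmt apply-bpe -c bpe.codes < corpus/train.en > corpus/train.bpe.en
```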
For channel models, we follow the same steps as for the direct models, but the backtranslated data is generated in the opposite direction, using [this Romanian monolingual data](http://data.statmt.org/rsennrich/wmt16_backtranslations/).
The best lenpen, LM weight and CH weight are obtained by sweeping over the validation set (wmt16/dev) using beam 5.
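A simple way to run this sweep is a small grid over the three values, re-running the noisy channel generation command shown earlier on the validation set for each configuration; the grid values below are purely illustrative:

```
for lenpen in 0.1 0.2 0.3; do
  for lm_wt in 0.3 0.5 0.7; do
    for ch_wt in 0.5 1.0 2.0; do
      echo "lenpen=${lenpen} lm_wt=${lm_wt} ch_wt=${ch_wt}"
      # fairseq-generate ... --lenpen ${lenpen} --lm-wt ${lm_wt} --ch-wt ${ch_wt} --gen-subset valid --beam 5
    done
  done
done
```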
For German-English, the direct and channel models are trained on WMT'19 training data. Following [Ng et al., 2019](http://statmt.org/wmt19/pdf/53/WMT33.pdf), we apply language identification filtering and remove sentences longer than 250 tokens as well as sentence pairs with a source/target length ratio exceeding 1.5. This results in 26.8M sentence pairs.
We use the Transformer-Big architecture for the direct model.
The English language model is trained on de-duplicated English Newscrawl data from 2007-2018, comprising 186 million sentences or 4.5B words after normalization and tokenization.
## Citation

```
@inproceedings{bhosale2020language,
title={Language Models not just for Pre-training: Fast Online Neural Noisy Channel Modeling},
author={Shruti Bhosale and Kyra Yee and Sergey Edunov and Michael Auli},
booktitle={Proceedings of the Fifth Conference on Machine Translation (WMT)},
year={2020},
}

@inproceedings{yee2019simple,
title={Simple and Effective Noisy Channel Modeling for Neural Machine Translation},
author={Yee, Kyra and Dauphin, Yann and Auli, Michael},
booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
year={2019}
}
```