# Discriminative Reranking for Neural Machine Translation
https://aclanthology.org/2021.acl-long.563/
This folder contains source code for training DrNMT, a discriminatively trained reranker for neural machine translation.
## Data preparation
1. Follow the instructions under `examples/translation` to build a base MT model. Prepare three files: one with source sentences, one with ground-truth target sentences, and one with hypotheses generated by the base MT model. Each line of each file contains one sentence in raw text (i.e. no sentencepiece, etc.). Below is an example of the files with _N_ hypotheses for each source sentence; a consistency-check sketch follows the example.
```
# Example of the source sentence file: (The file should contain L lines.)
source_sentence_1
source_sentence_2
source_sentence_3
...
source_sentence_L
# Example of the target sentence file: (The file should contain L lines.)
target_sentence_1
target_sentence_2
target_sentence_3
...
target_sentence_L
# Example of the hypotheses file: (The file should contain L*N lines.)
source_sentence_1_hypo_1
source_sentence_1_hypo_2
...
source_sentence_1_hypo_N
source_sentence_2_hypo_1
...
source_sentence_2_hypo_N
...
source_sentence_L_hypo_1
...
source_sentence_L_hypo_N
```
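Because the hypotheses file must stay aligned with the source file (exactly _N_ consecutive hypotheses per source sentence), it can be worth verifying the line counts before moving on. Below is a minimal sketch of such a check; the script name, paths, and command-line handling are illustrative and not part of this repository.
```python
# check_files.py -- illustrative sanity check for the three prepared files.
import sys

def check(src_path: str, tgt_path: str, hypo_path: str, n: int) -> None:
    with open(src_path) as f:
        num_src = sum(1 for _ in f)   # L source sentences
    with open(tgt_path) as f:
        num_tgt = sum(1 for _ in f)   # should also be L
    with open(hypo_path) as f:
        num_hypo = sum(1 for _ in f)  # should be L * N

    assert num_src == num_tgt, f"source lines ({num_src}) != target lines ({num_tgt})"
    assert num_hypo == num_src * n, (
        f"hypothesis lines ({num_hypo}) != L * N ({num_src} * {n})"
    )
    print(f"OK: L={num_src}, N={n}, hypotheses={num_hypo}")

if __name__ == "__main__":
    # Usage: python check_files.py SRC_FILE TGT_FILE HYPO_FILE N
    check(sys.argv[1], sys.argv[2], sys.argv[3], int(sys.argv[4]))
```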
2. Download the [XLMR model](https://github.com/fairinternal/fairseq-py/tree/main/examples/xlmr#pre-trained-models). An optional load check is sketched after the note below.
```
# The folder should contain dict.txt, model.pt and sentencepiece.bpe.model.
```
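To confirm the download is usable, the checkpoint can be loaded through fairseq's XLM-R API, as in the fairseq XLM-R examples. This is an optional sanity check rather than a required step; `/path/to/xlmr` is a placeholder for the downloaded folder.
```python
# Optional: verify that the downloaded XLM-R folder loads correctly.
from fairseq.models.roberta import XLMRModel

# The directory must contain model.pt, dict.txt, and sentencepiece.bpe.model.
xlmr = XLMRModel.from_pretrained("/path/to/xlmr", checkpoint_file="model.pt")
xlmr.eval()  # disable dropout

# Round-trip a sample sentence through the sentencepiece vocabulary.
tokens = xlmr.encode("Hello world!")
print(tokens)
print(xlmr.decode(tokens))  # expected: "Hello world!"
```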
3. Prepare scores and BPE data.
* `N`: Number of hypotheses per source sentence. We use 50 in the paper.
* `SPLIT`: Name of the data split, e.g. train, valid, test. Use split_name, split_name1, split_name2, ..., if there are multiple datasets for a split, e.g. train, train1, valid, valid1.
* `NUM_SHARDS`: Number of shards. Set this to 1 for non-train splits.
* `METRIC`: The metric for DrNMT to optimize. We support either `bleu` or `ter`; a scoring sketch follows this list.
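For intuition about the "scores" part of this step, the sketch below computes a sentence-level BLEU or TER score for each hypothesis against its reference using `sacrebleu`. It is only an illustration under assumed inputs (reference and hypothesis lists held in memory, hypotheses grouped by source sentence); it is not the repository's preparation script.
```python
# Illustrative only: sentence-level scoring of hypotheses with sacrebleu.
import sacrebleu

def score_hypotheses(targets, hypos, n, metric="bleu"):
    """targets: L reference sentences; hypos: L*N hypotheses, grouped by source."""
    scores = []
    for i, hypo in enumerate(hypos):
        ref = targets[i // n]  # each block of N hypotheses shares one reference
        if metric == "bleu":
            scores.append(sacrebleu.sentence_bleu(hypo, [ref]).score)
        elif metric == "ter":
            scores.append(sacrebleu.sentence_ter(hypo, [ref]).score)
        else:
            raise ValueError(f"unsupported metric: {metric}")
    return scores

# Example: with N=2, hypotheses come in consecutive pairs per reference.
refs = ["the cat sat on the mat"]
hyps = ["the cat sat on the mat", "a cat is on the mat"]
print(score_hypotheses(refs, hyps, n=2, metric="bleu"))
```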
```
# For each data split, e.g. train, valid, test, etc., run the following: