[Doc] explaining romanian postprocessing for MBART BLEU hacking (#5943)

95d1962b · Sam Shleifer · GitHub · 604a2355 · 95d1962b
Unverified Commit 95d1962b authored Jul 21, 2020 by Sam Shleifer Committed by GitHub Jul 21, 2020

Showing with 65 additions and 0 deletions

examples/seq2seq/romanian_postprocessing.md examples/seq2seq/romanian_postprocessing.md +65 -0

--- a/examples/seq2seq/romanian_postprocessing.md
+++ b/examples/seq2seq/romanian_postprocessing.md
+### Motivation
+Without post-processing, the English->Romanian model mbart-large-en-ro gets a BLEU score of 26.8 on the WMT data.
+With post-processing, it can score 37.
+Here is the post-processing code, stolen from @mjpost in this [issue](https://github.com/pytorch/fairseq/issues/1758).
+### Instructions
+Note: You need to have your test_generations.txt before you start this process.
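Before going further, it can help to confirm that the generations and the reference line up. The sketch below is not part of the original instructions: `check_parallel_files` is a hypothetical helper name, and it only assumes that both files hold one sentence per line.

```bash
# Hypothetical helper (not from the original issue): verify that the system
# output and the reference both exist and have the same number of lines.
check_parallel_files () {
  local sys=$1
  local ref=$2
  local f
  for f in "$sys" "$ref"; do
    if [ ! -f "$f" ]; then
      echo "missing file: $f" >&2
      return 1
    fi
  done
  # Arithmetic expansion strips the padding some wc implementations add.
  local n_sys=$(( $(wc -l < "$sys") ))
  local n_ref=$(( $(wc -l < "$ref") ))
  if [ "$n_sys" -ne "$n_ref" ]; then
    echo "line count mismatch: $n_sys vs $n_ref" >&2
    return 1
  fi
  echo "ok: $n_sys lines in each file"
}
```

For instance, `check_parallel_files enro_finetune/test_generations.txt wmt_en_ro/test.target` should print an `ok` line before you move on to step (1).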
+(1) Set up `mosesdecoder` and `wmt16-scripts`
+```bash
+cd $HOME
+git clone git@github.com:moses-smt/mosesdecoder.git
+cd mosesdecoder
+git clone git@github.com:rsennrich/wmt16-scripts.git
+```
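A quick way to confirm the clones worked is to look for the scripts that the post-processing function in step (2) expects. This is a sketch only: `check_moses_setup` is a hypothetical helper, and it assumes `wmt16-scripts` was cloned inside `mosesdecoder`, as the paths in step (2) suggest.

```bash
# Hypothetical helper: report any script the post-processing pipeline needs
# that is missing under the given mosesdecoder checkout.
check_moses_setup () {
  local moses_path=${1:-$HOME/mosesdecoder}
  local missing=0
  local rel
  for rel in \
    scripts/tokenizer/replace-unicode-punctuation.perl \
    scripts/tokenizer/normalize-punctuation.perl \
    scripts/tokenizer/remove-non-printing-char.perl \
    scripts/tokenizer/tokenizer.perl \
    wmt16-scripts/preprocess/remove-diacritics.py \
    wmt16-scripts/preprocess/normalise-romanian.py
  do
    if [ ! -f "$moses_path/$rel" ]; then
      echo "missing: $moses_path/$rel" >&2
      missing=1
    fi
  done
  # Non-zero exit status if anything was missing.
  return "$missing"
}
```

Calling `check_moses_setup` with no argument checks `$HOME/mosesdecoder`. It prints nothing and returns 0 when every script is in place.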
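+(2) Define a function for post-processing.
+ It runs both files through the same pipeline: replace unicode punctuation, normalize punctuation, remove non-printing characters, normalize Romanian characters, remove diacritics, and tokenize.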
+```bash
+ro_post_process () {
+  sys=$1
+  ref=$2
+  export MOSES_PATH=$HOME/mosesdecoder
+  REPLACE_UNICODE_PUNCT=$MOSES_PATH/scripts/tokenizer/replace-unicode-punctuation.perl
+  NORM_PUNC=$MOSES_PATH/scripts/tokenizer/normalize-punctuation.perl
+  REM_NON_PRINT_CHAR=$MOSES_PATH/scripts/tokenizer/remove-non-printing-char.perl
+  REMOVE_DIACRITICS=$MOSES_PATH/wmt16-scripts/preprocess/remove-diacritics.py
+  NORMALIZE_ROMANIAN=$MOSES_PATH/wmt16-scripts/preprocess/normalise-romanian.py
+  TOKENIZER=$MOSES_PATH/scripts/tokenizer/tokenizer.perl
+  lang=ro
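+  # Apply the same normalization and tokenization to both files.
+  # Variables are quoted so paths containing spaces still work.
+  for file in "$sys" "$ref"; do
+    cat "$file" \
+    | "$REPLACE_UNICODE_PUNCT" \
+    | "$NORM_PUNC" -l "$lang" \
+    | "$REM_NON_PRINT_CHAR" \
+    | "$NORMALIZE_ROMANIAN" \
+    | "$REMOVE_DIACRITICS" \
+    | "$TOKENIZER" -no-escape -l "$lang" \
+    > "$(basename "$file").tok"
+  done
+  # compute BLEU on the already-tokenized files (no extra tokenization or smoothing)
+  cat "$(basename "$sys").tok" | sacrebleu -tok none -s none -b "$(basename "$ref").tok"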
+}
+```
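To give a feel for the diacritics step in the function above, here is a rough stand-in written with `sed`. It is not the `remove-diacritics.py` script itself: `strip_ro_diacritics` is a hypothetical name, and the character list is an assumption covering the five Romanian letters with diacritics, so the real script may handle more cases.

```bash
# Illustration only: map Romanian letters with diacritics to plain ASCII.
# The real pipeline uses wmt16-scripts/preprocess/remove-diacritics.py.
strip_ro_diacritics () {
  sed -e 's/ă/a/g' -e 's/Ă/A/g' \
      -e 's/â/a/g' -e 's/Â/A/g' \
      -e 's/î/i/g' -e 's/Î/I/g' \
      -e 's/ș/s/g' -e 's/Ș/S/g' \
      -e 's/ț/t/g' -e 's/Ț/T/g'
}

echo "Știință și artă" | strip_ro_diacritics
# prints: Stiinta si arta
```

Because the same transformation is applied to both the system output and the reference, differences that are only in diacritics no longer count against the BLEU score.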
+(3) Call the function on test_generations.txt and test.target
+For example,
+```bash
+ro_post_process enro_finetune/test_generations.txt wmt_en_ro/test.target
+```
+This will print a new BLEU score and write a new file called `test_generations.txt.tok` to the current directory with the post-processed outputs (along with `test.target.tok` for the reference).