Skip to content
GitLab
Menu
Projects
Groups
Snippets
Loading...
Help
Help
Support
Community forum
Keyboard shortcuts
?
Submit feedback
Contribute to GitLab
Sign in / Register
Toggle navigation
Menu
Open sidebar
chenpangpang
transformers
Commits
95d1962b
Unverified
Commit
95d1962b
authored
Jul 21, 2020
by
Sam Shleifer
Committed by
GitHub
Jul 21, 2020
Browse files
[Doc] explaining romanian postprocessing for MBART BLEU hacking (#5943)
parent
604a2355
Changes
1
Show whitespace changes
Inline
Side-by-side
Showing
1 changed file
with
65 additions
and
0 deletions
+65
-0
examples/seq2seq/romanian_postprocessing.md
examples/seq2seq/romanian_postprocessing.md
+65
-0
No files found.
examples/seq2seq/romanian_postprocessing.md
0 → 100644
View file @
95d1962b
### Motivation
Without processing, english-> romanian mbart-large-en-ro gets BLEU score 26.8 on the WMT data.
With post processing, it can score 37..
Here is the postprocessing code, stolen from @mjpost in this
[
issue
](
https://github.com/pytorch/fairseq/issues/1758
)
### Instructions
Note: You need to have your test_generations.txt before you start this process.
(1) Setup
`mosesdecoder`
and
`wmt16-scripts`
```
bash
cd
$HOME
git clone git@github.com:moses-smt/mosesdecoder.git
cd
mosesdecoder
git@github.com:rsennrich/wmt16-scripts.git
```
(2) define a function for post processing.
It removes diacritics and does other things I don't understand
```
bash
ro_post_process
()
{
sys
=
$1
ref
=
$2
export
MOSES_PATH
=
$HOME
/mosesdecoder
REPLACE_UNICODE_PUNCT
=
$MOSES_PATH
/scripts/tokenizer/replace-unicode-punctuation.perl
NORM_PUNC
=
$MOSES_PATH
/scripts/tokenizer/normalize-punctuation.perl
REM_NON_PRINT_CHAR
=
$MOSES_PATH
/scripts/tokenizer/remove-non-printing-char.perl
REMOVE_DIACRITICS
=
$MOSES_PATH
/wmt16-scripts/preprocess/remove-diacritics.py
NORMALIZE_ROMANIAN
=
$MOSES_PATH
/wmt16-scripts/preprocess/normalise-romanian.py
TOKENIZER
=
$MOSES_PATH
/scripts/tokenizer/tokenizer.perl
lang
=
ro
for
file
in
$sys
$ref
;
do
cat
$file
\
|
$REPLACE_UNICODE_PUNCT
\
|
$NORM_PUNC
-l
$lang
\
|
$REM_NON_PRINT_CHAR
\
|
$NORMALIZE_ROMANIAN
\
|
$REMOVE_DIACRITICS
\
|
$TOKENIZER
-no-escape
-l
$lang
\
>
$(
basename
$file
)
.tok
done
# compute BLEU
cat
$(
basename
$sys
)
.tok | sacrebleu
-tok
none
-s
none
-b
$(
basename
$ref
)
.tok
}
```
(3) Call the function on test_generations.txt and test.target
For example,
```
bash
ro_post_process enro_finetune/test_generations.txt wmt_en_ro/test.target
```
This will split out a new blue score and write a new fine called
`test_generations.tok`
with post-processed outputs.
```
Write
Preview
Markdown
is supported
0%
Try again
or
attach a new file
.
Attach a file
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Cancel
Please
register
or
sign in
to comment