chenpangpang / transformers
Commit 95d1962b, authored Jul 21, 2020 by Sam Shleifer, committed by GitHub Jul 21, 2020
[Doc] explaining romanian postprocessing for MBART BLEU hacking (#5943)
Parent: 604a2355
examples/seq2seq/romanian_postprocessing.md
### Motivation
Without processing, English-to-Romanian `mbart-large-en-ro` gets BLEU score 26.8 on the WMT data.
With post processing, it can score 37.
Here is the postprocessing code, stolen from @mjpost in this [issue](https://github.com/pytorch/fairseq/issues/1758).
### Instructions
Note: You need to have your test_generations.txt before you start this process.
(1) Set up `mosesdecoder` and `wmt16-scripts`
```bash
cd $HOME
git clone git@github.com:moses-smt/mosesdecoder.git
cd mosesdecoder
git clone git@github.com:rsennrich/wmt16-scripts.git
```
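Before moving on, it can help to confirm that the clones put the scripts where step (2) expects them. The sketch below is not part of the original instructions: `check_ro_scripts` is a hypothetical helper name, and the list of paths is copied from the function in step (2).

```bash
# Hypothetical helper (not from the original doc): report which of the
# scripts used in step (2) are missing under a given mosesdecoder checkout.
check_ro_scripts () {
  root=${1:-$HOME/mosesdecoder}
  missing=0
  for rel in \
    scripts/tokenizer/replace-unicode-punctuation.perl \
    scripts/tokenizer/normalize-punctuation.perl \
    scripts/tokenizer/remove-non-printing-char.perl \
    scripts/tokenizer/tokenizer.perl \
    wmt16-scripts/preprocess/remove-diacritics.py \
    wmt16-scripts/preprocess/normalise-romanian.py
  do
    if [ ! -f "$root/$rel" ]; then
      echo "missing: $rel"
      missing=1
    fi
  done
  # Exit status 0 when everything was found, 1 otherwise.
  return $missing
}
```

Running `check_ro_scripts && echo "all scripts found"` uses the default location `$HOME/mosesdecoder`. Pass a different directory as the first argument if you cloned elsewhere.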
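(2) Define a function for post processing.
It removes diacritics and, going by the scripts it calls, also replaces Unicode punctuation, normalizes punctuation, strips non-printing characters, normalizes Romanian characters, and tokenizes the text.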
```bash
ro_post_process () {
  sys=$1
  ref=$2
  export MOSES_PATH=$HOME/mosesdecoder
  REPLACE_UNICODE_PUNCT=$MOSES_PATH/scripts/tokenizer/replace-unicode-punctuation.perl
  NORM_PUNC=$MOSES_PATH/scripts/tokenizer/normalize-punctuation.perl
  REM_NON_PRINT_CHAR=$MOSES_PATH/scripts/tokenizer/remove-non-printing-char.perl
  REMOVE_DIACRITICS=$MOSES_PATH/wmt16-scripts/preprocess/remove-diacritics.py
  NORMALIZE_ROMANIAN=$MOSES_PATH/wmt16-scripts/preprocess/normalise-romanian.py
  TOKENIZER=$MOSES_PATH/scripts/tokenizer/tokenizer.perl

  lang=ro
  for file in $sys $ref; do
    cat $file \
    | $REPLACE_UNICODE_PUNCT \
    | $NORM_PUNC -l $lang \
    | $REM_NON_PRINT_CHAR \
    | $NORMALIZE_ROMANIAN \
    | $REMOVE_DIACRITICS \
    | $TOKENIZER -no-escape -l $lang \
    > $(basename $file).tok
  done
  # compute BLEU
  cat $(basename $sys).tok | sacrebleu -tok none -s none -b $(basename $ref).tok
}
```
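To get a feel for the two Romanian-specific steps, here is a rough `sed` stand-in. It is an illustration only and an assumption about what those steps do: the function names `normalise_ro_sketch` and `strip_ro_diacritics_sketch` are made up, and the real `normalise-romanian.py` and `remove-diacritics.py` scripts may cover more cases than this.

```bash
# Illustration only: a sed stand-in for the two wmt16-scripts steps.
# The pipeline above calls the real Python scripts, not these functions.

# Map cedilla forms (ş, ţ) to the comma-below forms (ș, ț).
normalise_ro_sketch () {
  sed -e 's/ş/ș/g' -e 's/ţ/ț/g' -e 's/Ş/Ș/g' -e 's/Ţ/Ț/g'
}

# Replace Romanian diacritic letters with their plain base letters.
strip_ro_diacritics_sketch () {
  sed -e 's/ă/a/g' -e 's/â/a/g' -e 's/î/i/g' -e 's/ș/s/g' -e 's/ț/t/g' \
      -e 's/Ă/A/g' -e 's/Â/A/g' -e 's/Î/I/g' -e 's/Ș/S/g' -e 's/Ț/T/g'
}

# Two spellings of the same word collapse to one form.
printf 'ştiinţă\nștiință\n' | normalise_ro_sketch | strip_ro_diacritics_sketch
```

Both input lines come out as `stiinta`. Because the same processing is applied to the system output and the reference, spelling variants of this kind stop counting as mismatches when BLEU is computed.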
(3) Call the function on test_generations.txt and test.target
For example,
```bash
ro_post_process enro_finetune/test_generations.txt wmt_en_ro/test.target
```
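This will spit out a new BLEU score and write a new file called `test_generations.txt.tok` with post-processed outputs. The reference gets the same treatment and is written to `test.target.tok`. Both files land in the current directory, because the function names them with `basename`.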
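Since the score compares system output and reference line by line, it is worth checking that the two files have the same number of lines before trusting the result. The helper below is a sketch, not part of the original instructions, and `check_same_length` is a hypothetical name.

```bash
# Hypothetical helper (not from the original doc): succeed only if two
# files have the same number of lines.
check_same_length () {
  # Arithmetic expansion strips any padding that wc adds on some systems.
  n_sys=$(($(wc -l < "$1")))
  n_ref=$(($(wc -l < "$2")))
  if [ "$n_sys" -ne "$n_ref" ]; then
    echo "line count mismatch: $1 has $n_sys, $2 has $n_ref"
    return 1
  fi
  echo "ok: $n_sys lines each"
}
```

For example, `check_same_length enro_finetune/test_generations.txt wmt_en_ro/test.target` can be run before calling `ro_post_process` on the same two files.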