Unverified Commit 0e24e4c1 authored by Stas Bekman, committed by GitHub

[s2s] create doc for pegasus/fsmt replication (#7934)

parent 96f4828a
@@ -15,7 +15,8 @@ For `bertabs` instructions, see [`bertabs/README.md`](bertabs/README.md).
 ## Datasets
-#### XSUM:
+#### XSUM
 ```bash
 cd examples/seq2seq
 wget https://cdn-datasets.huggingface.co/summarization/xsum.tar.gz
@@ -26,6 +27,7 @@ this should make a directory called `xsum/` with files like `test.source`.
 To use your own data, copy that files format. Each article to be summarized is on its own line.
 #### CNN/DailyMail
 ```bash
 cd examples/seq2seq
 wget https://cdn-datasets.huggingface.co/summarization/cnn_dm_v2.tgz
@@ -35,7 +37,8 @@ export CNN_DIR=${PWD}/cnn_dm
 ```
 this should make a directory called `cnn_dm/` with 6 files.
-#### WMT16 English-Romanian Translation Data:
+#### WMT16 English-Romanian Translation Data
 download with this command:
 ```bash
 wget https://cdn-datasets.huggingface.co/translation/wmt_en_ro.tar.gz
@@ -44,13 +47,25 @@ export ENRO_DIR=${PWD}/wmt_en_ro
 ```
 this should make a directory called `wmt_en_ro/` with 6 files.
-#### WMT English-German:
+#### WMT English-German
 ```bash
 wget https://cdn-datasets.huggingface.co/translation/wmt_en_de.tgz
 tar -xzvf wmt_en_de.tgz
 export DATA_DIR=${PWD}/wmt_en_de
 ```
+#### FSMT datasets (wmt)
+Refer to the scripts starting with `eval_` under:
+https://github.com/huggingface/transformers/tree/master/scripts/fsmt
+#### Pegasus (multiple datasets)
+Multiple eval datasets are available for download from:
+https://github.com/stas00/porting/tree/master/datasets/pegasus
 #### Private Data
 If you are using your own data, it must be formatted as one directory with 6 files:
@@ -64,7 +79,6 @@ test.target
 ```
 The `.source` files are the input, the `.target` files are the desired output.
 ### Tips and Tricks
 General Tips:
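As a side note on the `Private Data` requirement in this diff (one directory with 6 files, `.source` as input and `.target` as desired output), the layout can be sanity-checked before training. The sketch below is not part of the commit. It assumes the splits are named `train`, `val`, and `test` (only `test.source` and `test.target` are visible in the diff) and that line N of a `.source` file pairs with line N of the matching `.target` file; `check_seq2seq_dir` is a hypothetical helper name.

```bash
# Sketch: verify a seq2seq data directory before training.
# ASSUMPTION: splits are named train/val/test; the diff only shows test.*.
# ASSUMPTION: examples are paired by line number across .source/.target.
check_seq2seq_dir() {
  local dir="$1" split src tgt n_src n_tgt
  for split in train val test; do
    src="$dir/$split.source"
    tgt="$dir/$split.target"
    # Both files of each pair must exist.
    if [ ! -f "$src" ] || [ ! -f "$tgt" ]; then
      echo "missing $split.source or $split.target in $dir" >&2
      return 1
    fi
    # wc -l counts newline characters; the arithmetic expansion strips
    # any padding that some wc implementations add around the number.
    n_src=$(( $(wc -l < "$src") ))
    n_tgt=$(( $(wc -l < "$tgt") ))
    if [ "$n_src" -ne "$n_tgt" ]; then
      echo "$split: $n_src source lines vs $n_tgt target lines" >&2
      return 1
    fi
  done
  echo "ok: $dir"
}
```

For example, `check_seq2seq_dir "$ENRO_DIR"` would check the `wmt_en_ro/` directory from the WMT16 instructions; it prints `ok: <dir>` on success and otherwise returns a non-zero status with a message on stderr.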