Unverified commit f90bc44d authored by Sam Shleifer, committed by GitHub

[examples] Cleanup summarization docs (#4876)

parent 2cfb947f
### Get CNN Data
Both types of models require the CNN/Daily Mail data, but each follows a different procedure to obtain it.
#### For BART models
To reproduce the authors' results on the CNN/Daily Mail dataset, first download both the CNN and Daily Mail datasets [from Kyunghyun Cho's website](https://cs.nyu.edu/~kcho/DMQA/) (the links next to "Stories") into the same folder. Then uncompress the archives by running:
```bash
tar -xzvf cnn_dm.tgz
```
This should create a directory called `cnn_dm/` with files like `test.source`.
To use your own data, copy that file format: each article to be summarized is on its own line.
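A custom dataset can be written in this one-article-per-line format with plain Python. The filename and articles below are illustrative placeholders, not part of the example scripts:

```python
# Write articles in the one-article-per-line format the example scripts expect.
# "custom.source" and the article texts are hypothetical stand-ins.
articles = [
    "First article text, with internal newlines removed.",
    "Second article text.",
]

with open("custom.source", "w", encoding="utf-8") as f:
    for article in articles:
        # A newline inside an article would break the one-line-per-article format.
        f.write(article.replace("\n", " ") + "\n")

with open("custom.source", encoding="utf-8") as f:
    lines = f.read().splitlines()

print(len(lines))  # one line per article
```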
#### For T5 models
First, you need to download the CNN data. It's about 400 MB and can be downloaded by running:
```bash
python download_cnn_daily_mail.py cnn_articles_input_data.txt cnn_articles_reference_summaries.txt
```
You should confirm that each file has 11490 lines:
```bash
wc -l cnn_articles_input_data.txt # should print 11490
wc -l cnn_articles_reference_summaries.txt # should print 11490
```
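The same check can be done in Python, which also makes it easy to confirm that the two files have matching line counts (line i of one file must correspond to line i of the other). The small stand-in files here are illustrative; the real files would be `cnn_articles_input_data.txt` and `cnn_articles_reference_summaries.txt`:

```python
# Sanity check: the articles file and the reference-summaries file must have
# the same number of lines, since they are aligned line for line.

def count_lines(path):
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

# Demonstration with two tiny stand-in files.
with open("articles.txt", "w", encoding="utf-8") as f:
    f.write("article one\narticle two\n")
with open("summaries.txt", "w", encoding="utf-8") as f:
    f.write("summary one\nsummary two\n")

assert count_lines("articles.txt") == count_lines("summaries.txt") == 2
```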
### Evaluation
To create summaries for each article in the dataset, run:
```bash
python evaluate_cnn.py <path_to_test.source> test_generations.txt <model-name> --score_path rouge_scores.txt
```
The default batch size, 8, fits in 16GB GPU memory, but may need to be adjusted to fit your system.
### Training
Run or modify `finetune_bart.sh` or `finetune_t5.sh`.
## (WIP) Rouge Scores
To create summaries for each article in the dataset and also calculate ROUGE scores, run:
```bash
python evaluate_cnn.py <path_to_test.source> test_generations.txt <model-name> --reference_path <path_to_correct_summaries> --score_path <path_to_save_rouge_scores>
```
The ROUGE scores `rouge1`, `rouge2`, and `rougeL` are computed automatically and saved in `<path_to_save_rouge_scores>`.
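As a rough intuition for what these scores measure, ROUGE-N is based on n-gram overlap between the generated and reference summaries. The sketch below is a deliberately simplified ROUGE-N recall (full implementations such as the `rouge_score` package typically also apply stemming and report precision and F-measure):

```python
from collections import Counter

def rouge_n_recall(reference: str, generated: str, n: int = 1) -> float:
    """Simplified ROUGE-N recall: the fraction of reference n-grams that also
    appear in the generated text (no stemming or other normalization)."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, gen = ngrams(reference), ngrams(generated)
    if not ref:
        return 0.0
    overlap = sum(min(count, gen[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

# 5 of the 6 reference unigrams appear in the generated text.
print(rouge_n_recall("the cat sat on the mat", "the cat lay on the mat"))
```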
### Stanford CoreNLP Setup
```
ptb_tokenize () {
...
```
# -*- coding: utf-8 -*-
import argparse
from pathlib import Path

import tensorflow_datasets as tfds


def main(input_path, reference_path, data_dir):
    # Load the CNN/DailyMail test split in a deterministic order.
    cnn_ds = tfds.load("cnn_dailymail", split="test", shuffle_files=False, data_dir=data_dir)
    cnn_ds_iter = tfds.as_numpy(cnn_ds)
    # Context managers ensure both output files are flushed and closed.
    with Path(input_path).open("w", encoding="utf-8") as test_articles_file, Path(reference_path).open(
        "w", encoding="utf-8"
    ) as test_summaries_file:
        for example in cnn_ds_iter:
            # One article per line; newlines inside highlights are replaced with
            # spaces so each reference summary also occupies exactly one line.
            test_articles_file.write(example["article"].decode("utf-8") + "\n")
            test_summaries_file.write(example["highlights"].decode("utf-8").replace("\n", " ") + "\n")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("input_path", type=str, help="where to save the articles input data")
    parser.add_argument("reference_path", type=str, help="where to save the reference summaries")
    parser.add_argument(
        "--data_dir", type=str, default="~/tensorflow_datasets", help="where to save the tensorflow datasets."
    )
    args = parser.parse_args()
    main(args.input_path, args.reference_path, args.data_dir)
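The `replace("\n", " ")` call above is the important formatting detail: highlights in the dataset contain embedded newlines, and flattening them keeps line i of the reference file aligned with line i of the articles file. A minimal illustration with a made-up highlights string:

```python
# CNN/DailyMail "highlights" are multi-line; each must become exactly one
# output line so the reference file stays aligned with the articles file.
highlights = "New law passed.\nCritics disagree.\nVote was close."
flattened = highlights.replace("\n", " ")

assert "\n" not in flattened
print(flattened)
```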
@@ -6,7 +6,7 @@ export OUTPUT_DIR=${CURRENT_DIR}/${OUTPUT_DIR_NAME}
 mkdir -p $OUTPUT_DIR
 # Add parent directory to python path to access lightning_base.py
-export PYTHONPATH="../../":"${PYTHONPATH}"
+export PYTHONPATH="../":"${PYTHONPATH}"
 python finetune.py \
 --data_dir=./cnn-dailymail/cnn_dm \
...
@@ -13,7 +13,7 @@ export OUTPUT_DIR=${CURRENT_DIR}/${OUTPUT_DIR_NAME}
 mkdir -p $OUTPUT_DIR
 # Add parent directory to python path to access lightning_base.py and utils.py
-export PYTHONPATH="../../":"${PYTHONPATH}"
+export PYTHONPATH="../":"${PYTHONPATH}"
 python finetune.py \
 --data_dir=cnn_tiny/ \
 --model_type=bart \
...
@@ -6,7 +6,7 @@ export OUTPUT_DIR=${CURRENT_DIR}/${OUTPUT_DIR_NAME}
 mkdir -p $OUTPUT_DIR
 # Add parent directory to python path to access lightning_base.py
-export PYTHONPATH="../../":"${PYTHONPATH}"
+export PYTHONPATH="../":"${PYTHONPATH}"
 python finetune.py \
 --data_dir=./cnn-dailymail/cnn_dm \
...