This folder contains part of the code necessary to reproduce the results on abstractive summarization from the article [Text Summarization with Pretrained Encoders](https://arxiv.org/pdf/1908.08345.pdf) by [Yang Liu](https://nlp-yang.github.io/) and [Mirella Lapata](https://homepages.inf.ed.ac.uk/mlap/). It can also be used to summarize any document.
The original code can be found in Yang Liu's [GitHub repository](https://github.com/nlpyang/PreSumm).
The model is loaded with the pre-trained weights for the abstractive summarization model, which was trained on the CNN/Daily Mail dataset first on an extractive task and then on an abstractive task.
## Setup
```
git clone https://github.com/huggingface/transformers && cd transformers
pip install [--editable] .
pip install nltk py-rouge
cd examples/summarization
```
## Reproduce the authors' results on ROUGE
To reproduce the authors' results on the CNN/Daily Mail dataset, first download both the CNN and Daily Mail datasets [from Kyunghyun Cho's website](https://cs.nyu.edu/~kcho/DMQA/) (the links next to "Stories") into the same folder. Then uncompress both archives and move all the stories into a single folder.
We refer to the path of that folder as `$DATA_PATH`. Run the following from the folder that contains `run_summarization.py`:
```bash
# --summaries_output_dir is optional
python run_summarization.py \
--documents_dir $DATA_PATH \
--summaries_output_dir $SUMMARIES_PATH \
--to_cpu false \
--batch_size 4 \
--min_length 50 \
--max_length 200 \
--beam_size 5 \
--alpha 0.95 \
--block_trigram true \
--compute_rouge true
```
The script runs on the GPU if one is available and `to_cpu` is not set to `true`. Inference on multiple GPUs is not supported yet. The ROUGE scores are displayed in the console at the end of the evaluation and written to a `rouge_scores.txt` file. The script takes 30 hours to run on a single Tesla V100 GPU with a batch size of 10 (300,000 texts to summarize).
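For intuition about the reported metric, here is a minimal sketch of ROUGE-N computed from clipped n-gram overlap. It is an illustration only, not the implementation provided by the `py-rouge` package installed during setup: the function names `ngram_counts` and `rouge_n` are hypothetical, tokenization is plain whitespace splitting, and no stemming or sentence splitting is applied, so its numbers will differ from those the script writes to `rouge_scores.txt`.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) of n-gram overlap between two strings."""
    # Assumption: lowercase + whitespace tokenization, no stemming.
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    # Clipped overlap: an n-gram counts at most as often as it occurs in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    p, r, f = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=2)
    print(f"ROUGE-2 precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Precision measures how much of the generated summary appears in the reference, recall how much of the reference is covered by the summary; the F1 value combines the two.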
## Summarize any text
Put the documents that you would like to summarize in a folder (the path to which is referred to as `$DATA_PATH` below) and run the following in the same folder as `run_summarization.py`:
```bash
# --summaries_output_dir is optional
python run_summarization.py \
--documents_dir $DATA_PATH \
--summaries_output_dir $SUMMARIES_PATH \
--to_cpu false \
--batch_size 4 \
--min_length 50 \
--max_length 200 \
--beam_size 5 \
--alpha 0.95 \
--block_trigram true
```
You may want to play around with `min_length`, `max_length` and `alpha` to suit your use case. If you want to compute ROUGE on another dataset you will need to tweak the stories/summaries import in `utils_summarization.py` and tell it where to fetch the reference summaries.
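To build intuition for `alpha` and `block_trigram`, the sketch below shows one common way these two beam-search options are defined. It assumes that the length penalty follows the GNMT form `((5 + length) / 6) ** alpha` and that trigram blocking discards any hypothesis in which a trigram occurs twice. Neither detail is stated in this README, and the function names `length_penalty`, `has_repeated_trigram` and `rescore` are hypothetical, so check the script before relying on this.

```python
def length_penalty(length, alpha):
    """GNMT-style penalty (assumed form): grows with length, faster for larger alpha."""
    return ((5.0 + length) / 6.0) ** alpha


def has_repeated_trigram(tokens):
    """Return True if any trigram occurs more than once in the token list."""
    seen = set()
    for i in range(len(tokens) - 2):
        trigram = tuple(tokens[i:i + 3])
        if trigram in seen:
            return True
        seen.add(trigram)
    return False


def rescore(log_prob, tokens, alpha=0.95, block_trigram=True):
    """Score a hypothesis: length-normalized log-probability, or -inf if blocked."""
    if block_trigram and has_repeated_trigram(tokens):
        return float("-inf")
    return log_prob / length_penalty(len(tokens), alpha)


if __name__ == "__main__":
    short = "the senate passed the bill".split()
    looping = "the senate passed the senate passed the bill".split()
    print(rescore(-4.0, short))
    print(rescore(-6.0, looping))  # blocked: "the senate passed" occurs twice
```

Under this assumed form, a larger `alpha` divides the (negative) log-probability of long hypotheses by a larger number, which moves their score closer to zero and so favors longer summaries.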
"question":"In what country is Normandy located?",
"id":"56ddde6b9a695914005b9628",
"answers":[{
"text":"France",
"answer_start":159
}],
"is_impossible":false
},{
"question":"When were the Normans in Normandy?",
"id":"56ddde6b9a695914005b9629",
"answers":[{
"text":"10th and 11th centuries",
"answer_start":94
}],
"is_impossible":false
},{
"question":"From which countries did the Norse originate?",
"id":"56ddde6b9a695914005b962a",
"answers":[{
"text":"Denmark, Iceland and Norway",
"answer_start":256
}],
"is_impossible":false
},{
"plausible_answers":[{
"text":"Rollo",
"answer_start":308
}],
"question":"Who did King Charles III swear fealty to?",
"id":"5ad39d53604f3c001a3fe8d3",
"answers":[],
"is_impossible":true
},{
"plausible_answers":[{
"text":"10th century",
"answer_start":671
}],
"question":"When did the Frankish identity emerge?",
"id":"5ad39d53604f3c001a3fe8d4",
"answers":[],
"is_impossible":true
}],
"context":"The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (\"Norman\" comes from \"Norseman\") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries."
},{
"qas":[{
"question":"Who was the duke in the battle of Hastings?",
"id":"56dddf4066d3e219004dad5f",
"answers":[{
"text":"William the Conqueror",
"answer_start":1022
}],
"is_impossible":false
},{
"plausible_answers":[{
"text":"Antioch",
"answer_start":1295
}],
"question":"What principality did William the conquerer found?",
"id":"5ad3a266604f3c001a3fea2b",
"answers":[],
"is_impossible":true
}],
"context":"The Norman dynasty had a major political, cultural and military impact on medieval Europe and even the Near East. The Normans were famed for their martial spirit and eventually for their Christian piety, becoming exponents of the Catholic orthodoxy into which they assimilated. They adopted the Gallo-Romance language of the Frankish land they settled, their dialect becoming known as Norman, Normaund or Norman French, an important literary language. The Duchy of Normandy, which they formed by treaty with the French crown, was a great fief of medieval France, and under Richard I of Normandy was forged into a cohesive and formidable principality in feudal tenure. The Normans are noted both for their culture, such as their unique Romanesque architecture and musical traditions, and for their significant military accomplishments and innovations. Norman adventurers founded the Kingdom of Sicily under Roger II after conquering southern Italy on the Saracens and Byzantines, and an expedition on behalf of their duke, William the Conqueror, led to the Norman conquest of England at the Battle of Hastings in 1066. Norman cultural and military influence spread from these new European centres to the Crusader states of the Near East, where their prince Bohemond I founded the Principality of Antioch in the Levant, to Scotland and Wales in Great Britain, to Ireland, and to the coasts of north Africa and the Canary Islands."
}]
},{
"title":"Computational_complexity_theory",
"paragraphs":[{
"qas":[{
"question":"What branch of theoretical computer science deals with broadly classifying computational problems by difficulty and class of relationship?",
"id":"56e16182e3433e1400422e28",
"answers":[{
"text":"Computational complexity theory",
"answer_start":0
}],
"is_impossible":false
},{
"plausible_answers":[{
"text":"algorithm",
"answer_start":472
}],
"question":"What is a manual application of mathematical steps?",
"id":"5ad5316b5b96ef001a10ab76",
"answers":[],
"is_impossible":true
}],
"context":"Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty, and relating those classes to each other. A computational problem is understood to be a task that is in principle amenable to being solved by a computer, which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps, such as an algorithm."
},{
"qas":[{
"question":"What measure of a computational problem broadly defines the inherent difficulty of the solution?",
"id":"56e16839cd28a01900c67887",
"answers":[{
"text":"if its solution requires significant resources",
"answer_start":46
}],
"is_impossible":false
},{
"question":"What method is used to intuitively assess or quantify the amount of resources required to solve a computational problem?",
"id":"56e16839cd28a01900c67888",
"answers":[{
"text":"mathematical models of computation",
"answer_start":176
}],
"is_impossible":false
},{
"question":"What are two basic primary resources used to guage complexity?",
"id":"56e16839cd28a01900c67889",
"answers":[{
"text":"time and storage",
"answer_start":305
}],
"is_impossible":false
},{
"plausible_answers":[{
"text":"the number of gates in a circuit",
"answer_start":436
}],
"question":"What unit is measured to determine circuit simplicity?",
"id":"5ad532575b96ef001a10ab7f",
"answers":[],
"is_impossible":true
},{
"plausible_answers":[{
"text":"the number of processors",
"answer_start":502
}],
"question":"What number is used in perpendicular computing?",
"id":"5ad532575b96ef001a10ab80",
"answers":[],
"is_impossible":true
}],
"context":"A problem is regarded as inherently difficult if its solution requires significant resources, whatever the algorithm used. The theory formalizes this intuition, by introducing mathematical models of computation to study these problems and quantifying the amount of resources needed to solve them, such as time and storage. Other complexity measures are also used, such as the amount of communication (used in communication complexity), the number of gates in a circuit (used in circuit complexity) and the number of processors (used in parallel computing). One of the roles of computational complexity theory is to determine the practical limits on what computers can and cannot do."
author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
author_email="thomas@huggingface.co",
description="State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch",
...
...
@@ -61,8 +67,11 @@ setup(
"transformers=transformers.__main__:main",
]
},
extras_require=extras,
scripts=[
'transformers-cli'
],
# python_requires='>=3.5.0',
tests_require=['pytest'],
classifiers=[
'Intended Audience :: Science/Research',
'License :: OSI Approved :: Apache Software License',
# How to add a new example script in 🤗Transformers
This folder provides a template for adding a new example script that implements a training or inference task with the models in the 🤗Transformers library.
Currently, only PyTorch examples are provided. They are adaptations of the library's SQuAD examples and implement single-GPU and distributed training with gradient accumulation and mixed precision (using NVIDIA's apex library) to cover a reasonable range of use cases.
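As a reminder of what gradient accumulation does, here is a minimal sketch in plain Python rather than PyTorch. It does not reproduce the template's training loop: the model (`y = w * x`), the loss (mean squared error) and the function names `grad_mse` and `accumulated_grad` are all hypothetical, and the batch size is assumed to be divisible by the number of accumulation steps.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of the model y = w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, accumulation_steps):
    """Accumulate scaled micro-batch gradients to match the full-batch gradient."""
    # Assumption: len(batch) is divisible by accumulation_steps.
    size = len(batch) // accumulation_steps
    total = 0.0
    for step in range(accumulation_steps):
        micro = batch[step * size:(step + 1) * size]
        # Dividing by the number of steps makes the sum equal the full-batch mean.
        total += grad_mse(w, micro) / accumulation_steps
    return total


if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
    print(grad_mse(0.5, data))
    print(accumulated_grad(0.5, data, accumulation_steps=2))
```

Both calls yield the same gradient up to floating-point rounding, which is why accumulation lets a small per-step batch emulate a larger effective batch when memory is limited.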