This folder contains part of the code needed to reproduce the abstractive summarization results of the article [Text Summarization with Pretrained Encoders](https://arxiv.org/pdf/1908.08345.pdf) by [Yang Liu](https://nlp-yang.github.io/) and [Mirella Lapata](https://homepages.inf.ed.ac.uk/mlap/). It can also be used to summarize any document.
The original code can be found in Yang Liu's [GitHub repository](https://github.com/nlpyang/PreSumm).
The model is loaded with pre-trained weights for abstractive summarization, trained on the CNN/Daily Mail dataset first on the extractive and then on the abstractive task.
## Setup
```bash
git clone https://github.com/huggingface/transformers && cd transformers
pip install [--editable] .
pip install nltk py-rouge
cd examples/summarization
```
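To quickly check that the dependencies are installed correctly, the following should run without errors (note that the py-rouge package is imported as `rouge`):

```bash
python -c "import transformers, nltk, rouge; print(transformers.__version__)"
```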
## Reproduce the authors' results on ROUGE
To be able to reproduce the authors' results on the CNN/Daily Mail dataset you first need to download both the CNN and Daily Mail datasets [from Kyunghyun Cho's website](https://cs.nyu.edu/~kcho/DMQA/) (the links next to "Stories") into the same folder. Then uncompress the archives by running the following (a sketch, assuming the default archive names from the download page):
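```bash
tar -xvf cnn_stories.tgz        # extracts to cnn/stories/
tar -xvf dailymail_stories.tgz  # extracts to dailymail/stories/
```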
Then move all the stories to the same folder; we will refer to this folder's path as `$DATA_PATH`. Now run the following in the same folder as `run_summarization.py` (the `--summaries_output_dir` argument is optional):
```bash
python run_summarization.py \
    --documents_dir $DATA_PATH \
    --summaries_output_dir $SUMMARIES_PATH \
    --to_cpu false \
    --batch_size 4 \
    --min_length 50 \
    --max_length 200 \
    --beam_size 5 \
    --alpha 0.95 \
    --block_trigram true \
    --compute_rouge true
```
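Here `$DATA_PATH` and `$SUMMARIES_PATH` are ordinary shell variables pointing at folders of your choosing, for instance (the paths below are hypothetical):

```bash
export DATA_PATH=/path/to/stories        # folder containing the uncompressed stories
export SUMMARIES_PATH=/path/to/summaries # folder where the generated summaries will be written
```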
The script executes on GPU if one is available and if `to_cpu` is not set to `true`. Inference on multiple GPUs is not supported yet. The ROUGE scores will be displayed in the console at the end of evaluation and written to a `rouge_scores.txt` file. The script takes 30 hours to run on a single Tesla V100 GPU with a batch size of 10 (there are 300,000 texts to summarize).
## Summarize any text
Put the documents that you would like to summarize in a folder (the path to which is referred to as `$DATA_PATH` below). For instance, with a couple of plain-text files (the file and folder names here are hypothetical):
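```bash
mkdir -p texts summaries
cp report.txt press_release.txt texts/   # assuming one document per plain-text file
export DATA_PATH=$(pwd)/texts
export SUMMARIES_PATH=$(pwd)/summaries
```

Then run the following in the same folder as `run_summarization.py` (again, `--summaries_output_dir` is optional):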
```bash
python run_summarization.py \
    --documents_dir $DATA_PATH \
    --summaries_output_dir $SUMMARIES_PATH \
    --to_cpu false \
    --batch_size 4 \
    --min_length 50 \
    --max_length 200 \
    --beam_size 5 \
    --alpha 0.95 \
    --block_trigram true
```
You may want to play around with `min_length`, `max_length` and `alpha` to suit your use case. If you want to compute ROUGE on another dataset, you will need to tweak how the stories and summaries are loaded in `utils_summarization.py` and tell it where to fetch the reference summaries.
"question":"In what country is Normandy located?",
"id":"56ddde6b9a695914005b9628",
"answers":[{
"text":"France",
"answer_start":159
}],
"is_impossible":false
},{
"question":"When were the Normans in Normandy?",
"id":"56ddde6b9a695914005b9629",
"answers":[{
"text":"10th and 11th centuries",
"answer_start":94
}],
"is_impossible":false
},{
"question":"From which countries did the Norse originate?",
"id":"56ddde6b9a695914005b962a",
"answers":[{
"text":"Denmark, Iceland and Norway",
"answer_start":256
}],
"is_impossible":false
},{
"plausible_answers":[{
"text":"Rollo",
"answer_start":308
}],
"question":"Who did King Charles III swear fealty to?",
"id":"5ad39d53604f3c001a3fe8d3",
"answers":[],
"is_impossible":true
},{
"plausible_answers":[{
"text":"10th century",
"answer_start":671
}],
"question":"When did the Frankish identity emerge?",
"id":"5ad39d53604f3c001a3fe8d4",
"answers":[],
"is_impossible":true
}],
"context":"The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (\"Norman\" comes from \"Norseman\") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries."
},{
"qas":[{
"question":"Who was the duke in the battle of Hastings?",
"id":"56dddf4066d3e219004dad5f",
"answers":[{
"text":"William the Conqueror",
"answer_start":1022
}],
"is_impossible":false
},{
"plausible_answers":[{
"text":"Antioch",
"answer_start":1295
}],
"question":"What principality did William the conquerer found?",
"id":"5ad3a266604f3c001a3fea2b",
"answers":[],
"is_impossible":true
}],
"context":"The Norman dynasty had a major political, cultural and military impact on medieval Europe and even the Near East. The Normans were famed for their martial spirit and eventually for their Christian piety, becoming exponents of the Catholic orthodoxy into which they assimilated. They adopted the Gallo-Romance language of the Frankish land they settled, their dialect becoming known as Norman, Normaund or Norman French, an important literary language. The Duchy of Normandy, which they formed by treaty with the French crown, was a great fief of medieval France, and under Richard I of Normandy was forged into a cohesive and formidable principality in feudal tenure. The Normans are noted both for their culture, such as their unique Romanesque architecture and musical traditions, and for their significant military accomplishments and innovations. Norman adventurers founded the Kingdom of Sicily under Roger II after conquering southern Italy on the Saracens and Byzantines, and an expedition on behalf of their duke, William the Conqueror, led to the Norman conquest of England at the Battle of Hastings in 1066. Norman cultural and military influence spread from these new European centres to the Crusader states of the Near East, where their prince Bohemond I founded the Principality of Antioch in the Levant, to Scotland and Wales in Great Britain, to Ireland, and to the coasts of north Africa and the Canary Islands."
}]
},{
"title":"Computational_complexity_theory",
"paragraphs":[{
"qas":[{
"question":"What branch of theoretical computer science deals with broadly classifying computational problems by difficulty and class of relationship?",
"id":"56e16182e3433e1400422e28",
"answers":[{
"text":"Computational complexity theory",
"answer_start":0
}],
"is_impossible":false
},{
"plausible_answers":[{
"text":"algorithm",
"answer_start":472
}],
"question":"What is a manual application of mathematical steps?",
"id":"5ad5316b5b96ef001a10ab76",
"answers":[],
"is_impossible":true
}],
"context":"Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty, and relating those classes to each other. A computational problem is understood to be a task that is in principle amenable to being solved by a computer, which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps, such as an algorithm."
},{
"qas":[{
"question":"What measure of a computational problem broadly defines the inherent difficulty of the solution?",
"id":"56e16839cd28a01900c67887",
"answers":[{
"text":"if its solution requires significant resources",
"answer_start":46
}],
"is_impossible":false
},{
"question":"What method is used to intuitively assess or quantify the amount of resources required to solve a computational problem?",
"id":"56e16839cd28a01900c67888",
"answers":[{
"text":"mathematical models of computation",
"answer_start":176
}],
"is_impossible":false
},{
"question":"What are two basic primary resources used to guage complexity?",
"id":"56e16839cd28a01900c67889",
"answers":[{
"text":"time and storage",
"answer_start":305
}],
"is_impossible":false
},{
"plausible_answers":[{
"text":"the number of gates in a circuit",
"answer_start":436
}],
"question":"What unit is measured to determine circuit simplicity?",
"id":"5ad532575b96ef001a10ab7f",
"answers":[],
"is_impossible":true
},{
"plausible_answers":[{
"text":"the number of processors",
"answer_start":502
}],
"question":"What number is used in perpendicular computing?",
"id":"5ad532575b96ef001a10ab80",
"answers":[],
"is_impossible":true
}],
"context":"A problem is regarded as inherently difficult if its solution requires significant resources, whatever the algorithm used. The theory formalizes this intuition, by introducing mathematical models of computation to study these problems and quantifying the amount of resources needed to solve them, such as time and storage. Other complexity measures are also used, such as the amount of communication (used in communication complexity), the number of gates in a circuit (used in circuit complexity) and the number of processors (used in parallel computing). One of the roles of computational complexity theory is to determine the practical limits on what computers can and cannot do."
author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
author_email="thomas@huggingface.co",
author_email="thomas@huggingface.co",
description="State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch",
description="State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch",
...
@@ -61,8 +67,11 @@ setup(
...
@@ -61,8 +67,11 @@ setup(
"transformers=transformers.__main__:main",
"transformers=transformers.__main__:main",
]
]
},
},
extras_require=extras,
scripts=[
'transformers-cli'
],
# python_requires='>=3.5.0',
# python_requires='>=3.5.0',
tests_require=['pytest'],
classifiers=[
classifiers=[
'Intended Audience :: Science/Research',
'Intended Audience :: Science/Research',
'License :: OSI Approved :: Apache Software License',
'License :: OSI Approved :: Apache Software License',
@@ -7,7 +7,7 @@ The library is designed to incorporate a variety of models and code bases. As su
...
@@ -7,7 +7,7 @@ The library is designed to incorporate a variety of models and code bases. As su
One important point though is that the library has the following goals impacting the way models are incorporated:
One important point though is that the library has the following goals impacting the way models are incorporated:
- one specific feature of the API is the capability to run the model and tokenizer inline. The tokenization code thus often have to be slightly adapted to allow for running in the python interpreter.
- one specific feature of the API is the capability to run the model and tokenizer inline. The tokenization code thus often have to be slightly adapted to allow for running in the python interpreter.
- the package is also designed to be as self-consistent and with a small and reliable set of packages dependencies. In consequence, additional dependencies are usually not allowed when adding a model but can be allowed for the inclusion of a new tokenizer (recent examples of dependencies added for tokenizer specificites includes`sentencepiece` and `sacremoses`). Please make sure to check the existing dependencies when possible before adding a new one.
- the package is also designed to be as self-consistent and with a small and reliable set of packages dependencies. In consequence, additional dependencies are usually not allowed when adding a model but can be allowed for the inclusion of a new tokenizer (recent examples of dependencies added for tokenizer specificities include `sentencepiece` and `sacremoses`). Please make sure to check the existing dependencies when possible before adding a new one.
For a quick overview of the library organization, please check the [QuickStart section of the documentation](https://huggingface.co/transformers/quickstart.html).
For a quick overview of the library organization, please check the [QuickStart section of the documentation](https://huggingface.co/transformers/quickstart.html).
...
@@ -20,7 +20,7 @@ Here an overview of the general workflow:
...
@@ -20,7 +20,7 @@ Here an overview of the general workflow:
@@ -28,16 +28,16 @@ Here is the workflow for adding model/configuration/tokenization classes:
...
@@ -28,16 +28,16 @@ Here is the workflow for adding model/configuration/tokenization classes:
- [ ] copy the python files from the present folder to the main folder and rename them, replacing `xxx` with your model name,
- [ ] copy the python files from the present folder to the main folder and rename them, replacing `xxx` with your model name,
- [ ] edit the files to replace `XXX` (with various casing) with your model name
- [ ] edit the files to replace `XXX` (with various casing) with your model name
- [ ] copy-past or create a simple configuration class for your model in the `configuration_...` file
- [ ] copy-paste or create a simple configuration class for your model in the `configuration_...` file
- [ ] copy-past or create the code for your model in the `modeling_...` files (PyTorch and TF 2.0)
- [ ] copy-paste or create the code for your model in the `modeling_...` files (PyTorch and TF 2.0)
- [ ] copy-past or create a tokenizer class for your model in the `tokenization_...` file
- [ ] copy-paste or create a tokenizer class for your model in the `tokenization_...` file
# Adding conversion scripts
# Adding conversion scripts
Here is the workflow for the conversion scripts:
Here is the workflow for the conversion scripts:
- [ ] copy the conversion script (`convert_...`) from the present folder to the main folder.
- [ ] copy the conversion script (`convert_...`) from the present folder to the main folder.
- [ ] edit this scipt to convert your original checkpoint weights to the current pytorch ones.
- [ ] edit this script to convert your original checkpoint weights to the current pytorch ones.
# Adding tests:
# Adding tests:
...
@@ -58,5 +58,5 @@ You can then finish the addition step by adding imports for your classes in the
...
@@ -58,5 +58,5 @@ You can then finish the addition step by adding imports for your classes in the
- [ ] add your models and tokenizer to `pipeline.py`
- [ ] add your models and tokenizer to `pipeline.py`
- [ ] add a link to your conversion script in the main conversion utility (currently in `__main__` but will be moved to the `commands` subfolder in the near future)
- [ ] add a link to your conversion script in the main conversion utility (currently in `__main__` but will be moved to the `commands` subfolder in the near future)
- [ ] edit the PyTorch to TF 2.0 conversion script to add your model in the `convert_pytorch_checkpoint_to_tf2.py` file
- [ ] edit the PyTorch to TF 2.0 conversion script to add your model in the `convert_pytorch_checkpoint_to_tf2.py` file
- [ ] add a mention of your model in the doc: `README.md` and the documentation it-self at `docs/source/pretrained_models.rst`.
- [ ] add a mention of your model in the doc: `README.md` and the documentation itself at `docs/source/pretrained_models.rst`.
- [ ] upload the pretrained weigths, configurations and vocabulary files.
- [ ] upload the pretrained weigths, configurations and vocabulary files.