Unverified Commit 3552d0e0 authored by Julien Chaumond, committed by GitHub

[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)



* rm all model cards

* Update the .rst

@sgugger it is still not super crystal clear/streamlined so let me know if any ideas to make it simpler

* Add a rootlevel README.md with simple instructions/context

* Update docs/source/model_sharing.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* make style

* rm all model cards
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
---
language: es
thumbnail: https://i.imgur.com/uxAvBfh.png
---
# ELECTRICIDAD: The Spanish Electra [Imgur](https://imgur.com/uxAvBfh)
**ELECTRICIDAD** is a small Electra-like model (the discriminator in this case) trained on 20+ GB of the [OSCAR](https://oscar-corpus.com/) Spanish corpus.
As mentioned in the original [paper](https://openreview.net/pdf?id=r1xMH1BtvB):
**ELECTRA** is a new method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a [GAN](https://arxiv.org/pdf/1406.2661.pdf). At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) dataset.
For a detailed description and experimental results, please refer to the paper [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://openreview.net/pdf?id=r1xMH1BtvB).
## Model details ⚙
|Param| # Value|
|-----|--------|
|Layers| 12 |
|Hidden |256 |
|Params| 14M|
## Evaluation metrics (for discriminator) 🧾
|Metric | # Score |
|-------|---------|
|Accuracy| 0.94|
|Precision| 0.76|
|AUC | 0.92|
## Benchmarks 🔨
WIP 🚧
## How to use the discriminator in `transformers`
```python
from transformers import ElectraForPreTraining, ElectraTokenizerFast
import torch
discriminator = ElectraForPreTraining.from_pretrained("mrm8488/electricidad-small-discriminator")
tokenizer = ElectraTokenizerFast.from_pretrained("mrm8488/electricidad-small-discriminator")
sentence = "el zorro rojo es muy rápido"
fake_sentence = "el zorro rojo es muy ser"
# Tokenize and encode the corrupted sentence (the one containing the fake token)
fake_tokens = tokenizer.tokenize(fake_sentence)
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")
discriminator_outputs = discriminator(fake_inputs)
# Map the discriminator logits to 0 (real token) / 1 (fake token)
predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)

for token in fake_tokens:
    print("%7s" % token, end="")
print()
# Skip the special [CLS] and [SEP] positions
for prediction in predictions[0].tolist()[1:-1]:
    print("%7s" % int(prediction), end="")

# Output:
#      el  zorro   rojo     es    muy    ser
#       0      0      0      0      0      1
```
As you can see, there is a **1** in the position where the model detected the fake token (**ser**). So, it works! 🎉
## Acknowledgments
I thank the [🤗/transformers team](https://github.com/huggingface/transformers) for answering my questions and Google for helping me with the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc) program.
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: es
thumbnail: https://imgur.com/uxAvBfh
---
# Electricidad small + Spanish SQuAD v1 ⚡❓
[Electricidad-small-discriminator](https://huggingface.co/mrm8488/electricidad-small-discriminator) fine-tuned on [Spanish SQUAD v1.1 dataset](https://github.com/ccasimiro88/TranslateAlignRetrieve/tree/master/SQuAD-es-v1.1) for **Q&A** downstream task.
## Details of the downstream task (Q&A) - Dataset 📚
[SQuAD-es-v1.1](https://github.com/ccasimiro88/TranslateAlignRetrieve/tree/master/SQuAD-es-v1.1)
| Dataset split | # Samples |
| ------------- | --------- |
| Train | 130 K |
| Test | 11 K |
## Model training 🏋️‍
The model was trained on a Tesla P100 GPU and 25GB of RAM with the following command:
```bash
python /content/transformers/examples/question-answering/run_squad.py \
--model_type electra \
--model_name_or_path 'mrm8488/electricidad-small-discriminator' \
--do_eval \
--do_train \
--do_lower_case \
--train_file '/content/dataset/train-v1.1-es.json' \
--predict_file '/content/dataset/dev-v1.1-es.json' \
--per_gpu_train_batch_size 16 \
--learning_rate 3e-5 \
--num_train_epochs 10 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir '/content/electricidad-small-finetuned-squadv1-es' \
--overwrite_output_dir \
--save_steps 1000
```
## Test set Results 🧾
| Metric | # Value |
| ------ | --------- |
| **EM** | **46.82** |
| **F1** | **64.79** |
```json
{
'exact': 46.82119205298013,
'f1': 64.79435260021918,
'total': 10570,
'HasAns_exact': 46.82119205298013,
'HasAns_f1': 64.79435260021918,
'HasAns_total': 10570,
'best_exact': 46.82119205298013,
'best_exact_thresh': 0.0,
'best_f1': 64.79435260021918,
'best_f1_thresh': 0.0
}
```
### Model in action 🚀
Fast usage with **pipelines**:
```python
from transformers import pipeline
qa_pipeline = pipeline(
    "question-answering",
    model="mrm8488/electricidad-small-finetuned-squadv1-es",
    tokenizer="mrm8488/electricidad-small-finetuned-squadv1-es"
)
context = "Manuel ha creado una versión del modelo Electra small en español que alcanza una puntuación F1 de 65 en el dataset SQUAD-es y sólo pesa 50 MB"
q1 = "Cuál es su marcador F1?"
q2 = "¿Cuál es el tamaño del modelo?"
q3 = "¿Quién lo ha creado?"
q4 = "¿Que es lo que ha hecho Manuel?"
questions = [q1, q2, q3, q4]
for question in questions:
    result = qa_pipeline({
        'context': context,
        'question': question
    })
    print(result)

# Output:
# {'score': 0.14836778166355025, 'start': 98, 'end': 100, 'answer': '65'}
# {'score': 0.32219420810758237, 'start': 136, 'end': 140, 'answer': '50 MB'}
# {'score': 0.9672326951118713, 'start': 0, 'end': 6, 'answer': 'Manuel'}
# {'score': 0.23552458113848118, 'start': 10, 'end': 53, 'answer': 'creado una versión del modelo Electra small'}
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: en
thumbnail:
widget:
- text: "HuggingFace Cake:"
---
# GPT2-IMDB-neg (LM + RL) 🎞😡✍
All credits to [@lvwerra](https://twitter.com/lvwerra)
## What is it?
A small GPT2 (`lvwerra/gpt2-imdb`) language model fine-tuned to produce **negative** movie reviews based on the [IMDB dataset](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews). The model is trained with rewards from a BERT sentiment classifier (`lvwerra/gpt2-imdb`) via **PPO**.
## Why?
I wanted to reproduce the experiment [lvwerra/gpt2-imdb-pos](https://huggingface.co/lvwerra/gpt2-imdb-pos) but for generating **negative** movie reviews.
## Training setting
The model was trained for `100` optimisation steps with a batch size of `256`, which corresponds to `25600` training samples. The full experiment setup (for the positive-sentiment case) can be found in the [trl repo](https://lvwerra.github.io/trl/04-gpt2-sentiment-ppo-training/).
## Examples
A few examples of the model response to a query before and after optimisation:
| query | response (before) | response (after) | rewards (before) | rewards (after) |
|-------|-------------------|------------------|------------------|-----------------|
|This movie is a fine | attempt as far as live action is concerned, n...|example of how bad Hollywood in theatrics pla...| 2.118391 | -3.31625|
|I have watched 3 episodes |with this guy and he is such a talented actor...| but the show is just plain awful and there ne...| 2.681171| -4.512792|
|We know that firefighters and| police officers are forced to become populari...| other chains have going to get this disaster ...| 1.367811| -3.34017|
## Training logs and metrics <img src="https://gblobscdn.gitbook.com/spaces%2F-Lqya5RvLedGEWPhtkjU%2Favatar.png?alt=media" width="25" height="25">
See the full training logs and metrics on [W&B](https://app.wandb.ai/mrm8488/gpt2-sentiment-negative?workspace=user-mrm8488).
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: en
thumbnail:
---
# GPT2-IMDB-neutral (LM + RL) 🎞😐✍
## What is it?
A small GPT2 (`lvwerra/gpt2-imdb`) language model fine-tuned to produce **neutral**-ish movie reviews based on the [IMDB dataset](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews). The model is trained with rewards from a BERT sentiment classifier (`lvwerra/gpt2-imdb`) via **PPO**.
## Why?
After reproducing the experiment [lvwerra/gpt2-imdb-pos](https://huggingface.co/lvwerra/gpt2-imdb-pos) but for generating **negative** movie reviews ([mrm8488/gpt2-imdb-neg](https://huggingface.co/mrm8488/gpt2-imdb-neg)), I wanted to check whether I could also generate neutral-ish movie reviews. Based on the classifier output (logit), I saw that clearly negative reviews give values around *-4* and clearly positive reviews values around *4*. It was then easy to establish an interval `[-1.75, 1.75]` that could be considered **neutral**: if the classifier output fell inside that interval the model received a positive reward, while values outside the interval got a negative reward.
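Below is a minimal sketch of that reward rule, assuming a plain Python helper (the function name and the fixed ±1 rewards are illustrative, not the actual training code):
```python
# Illustrative sketch only: reward shaping for "neutral" reviews.
# A sentiment logit inside [-1.75, 1.75] is treated as neutral and rewarded;
# anything outside that interval is penalised.
def neutrality_reward(sentiment_logit: float,
                      lower: float = -1.75,
                      upper: float = 1.75) -> float:
    return 1.0 if lower <= sentiment_logit <= upper else -1.0

print(neutrality_reward(0.3))   # 1.0  -> neutral-ish output, positive reward
print(neutrality_reward(-3.9))  # -1.0 -> clearly negative review, penalised
```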
## Training setting
The model was trained for `100` optimisation steps with a batch size of `128`, which corresponds to `30000` training samples. The full experiment setup (for the positive-sentiment case) can be found in the [trl repo](https://lvwerra.github.io/trl/04-gpt2-sentiment-ppo-training/).
## Examples
A few examples of the model response to a query before and after optimisation:
| query | response (before) | response (after) | rewards (before) | rewards (after) |
|-------|-------------------|------------------|------------------|-----------------|
|Okay, my title is|partly over, but this drama still makes me proud to read its first 40...|weird. The title is "mana were, ahunter". "Man...|4.200727 |-1.891443|
|Where is it written that|there is a monster in this movie anyway? How is it that the entire|[ of the women in the recent women of jungle business between Gender and husband| -3.113942| -1.944993|
|As a lesbian, I|cannot believe I was in the Sixties! Subtle yet witty, with original| found it hard to get responsive. In fact I found myself with the long| 3.906178| 0.769166|
|The Derek's have over|three times as many acting hours than Jack Nicholson? You think bitches?|30 dueling characters and kill of, they retreat themselves to their base.|-2.503655| -1.898380|
> All credits to [@lvwerra](https://twitter.com/lvwerra)
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: en
datasets:
- squad_v2
---
# Longformer-base-4096 fine-tuned on SQuAD v2
[Longformer-base-4096 model](https://huggingface.co/allenai/longformer-base-4096) fine-tuned on [SQuAD v2](https://rajpurkar.github.io/SQuAD-explorer/) for **Q&A** downstream task.
## Longformer-base-4096
[Longformer](https://arxiv.org/abs/2004.05150) is a transformer model for long documents.
`longformer-base-4096` is a BERT-like model started from the RoBERTa checkpoint and pretrained for MLM on long documents. It supports sequences of length up to 4,096.
Longformer uses a combination of a sliding window (local) attention and global attention. Global attention is user-configured based on the task to allow the model to learn task-specific representations.
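As an illustrative sketch (not part of the original card), global attention can also be set by hand through the `global_attention_mask` argument accepted by the Longformer models in `transformers`; positions marked with `1` attend globally, while `0` keeps the default sliding-window attention:
```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer("A very long document ...", return_tensors="pt")
# 0 = sliding-window (local) attention, 1 = global attention
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # e.g. give the first (<s>) token global attention
outputs = model(**inputs, global_attention_mask=global_attention_mask)
last_hidden_state = outputs[0]
```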
## Details of the downstream task (Q&A) - Dataset 📚 🧐 ❓
Dataset ID: ```squad_v2``` from [HuggingFace/nlp](https://github.com/huggingface/nlp)
| Dataset | Split | # samples |
| -------- | ----- | --------- |
| squad_v2 | train | 130319 |
| squad_v2 | valid | 11873 |
How to load it from [nlp](https://github.com/huggingface/nlp):
```python
import nlp

train_dataset = nlp.load_dataset('squad_v2', split=nlp.Split.TRAIN)
valid_dataset = nlp.load_dataset('squad_v2', split=nlp.Split.VALIDATION)
```
Check out more about this dataset and others in [NLP Viewer](https://huggingface.co/nlp/viewer/)
## Model fine-tuning 🏋️‍
The training script is a slightly modified version of [this one](https://colab.research.google.com/drive/1zEl5D-DdkBKva-DdreVOmN0hrAfzKG1o?usp=sharing)
## Model in Action 🚀
```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
tokenizer = AutoTokenizer.from_pretrained("mrm8488/longformer-base-4096-finetuned-squadv2")
model = AutoModelForQuestionAnswering.from_pretrained("mrm8488/longformer-base-4096-finetuned-squadv2")
text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this."
question = "What has Huggingface done ?"
encoding = tokenizer(question, text, return_tensors="pt")
input_ids = encoding["input_ids"]
# default is local attention everywhere
# the forward method will automatically set global attention on question tokens
attention_mask = encoding["attention_mask"]
outputs = model(input_ids, attention_mask=attention_mask)
start_scores, end_scores = outputs.start_logits, outputs.end_logits
all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
answer_tokens = all_tokens[torch.argmax(start_scores): torch.argmax(end_scores) + 1]
answer = tokenizer.decode(tokenizer.convert_tokens_to_ids(answer_tokens))
# output => democratized NLP
```
If, given the same context, we ask something that is not there, the output for **no answer** will be ```<s>```.
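A tiny, hypothetical post-processing helper (not from the original card) can make that behaviour explicit:
```python
# Hypothetical helper: treat a bare "<s>" (or empty) decoded span as "no answer".
def extract_answer(decoded_span: str) -> str:
    cleaned = decoded_span.replace("<s>", "").strip()
    return cleaned if cleaned else "no answer found"

print(extract_answer("<s>"))                   # -> no answer found
print(extract_answer("<s> democratized NLP"))  # -> democratized NLP
```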
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: multilingual
datasets:
- tydiqa
pipeline_tag: question-answering
---
# mT5-small fine-tuned on TyDiQA for multilingual QA 🗺📖❓
[Google's mT5-small](https://huggingface.co/google/mt5-small) fine-tuned on [TyDi QA](https://huggingface.co/nlp/viewer/?dataset=tydiqa&config=secondary_task) (secondary task) for the **multilingual Q&A** downstream task.
## Details of mT5
[Google's mT5](https://github.com/google-research/multilingual-t5)
mT5 is pretrained on the [mC4](https://www.tensorflow.org/datasets/catalog/c4#c4multilingual) corpus, covering 101 languages:
Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Burmese, Catalan, Cebuano, Chichewa, Chinese, Corsican, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Hungarian, Icelandic, Igbo, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Lao, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Nepali, Norwegian, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Scottish Gaelic, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Sotho, Spanish, Sundanese, Swahili, Swedish, Tajik, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, West Frisian, Xhosa, Yiddish, Yoruba, Zulu.
**Note**: mT5 was only pre-trained on mC4, excluding any supervised training. Therefore, this model has to be fine-tuned before it is usable on a downstream task.
Pretraining Dataset: [mC4](https://www.tensorflow.org/datasets/catalog/c4#c4multilingual)
Other Community Checkpoints: [here](https://huggingface.co/models?search=mt5)
Paper: [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934)
Authors: *Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel*
## Details of the dataset 📚
**TyDi QA** is a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs. The languages of TyDi QA are diverse with regard to their typology -- the set of linguistic features that each language expresses -- such that we expect models performing well on this set to generalize across a large number of the languages in the world. It contains language phenomena that would not be found in English-only corpora. To provide a realistic information-seeking task and avoid priming effects, questions are written by people who want to know the answer but don't know the answer yet (unlike SQuAD and its descendants), and the data is collected directly in each language without the use of translation (unlike MLQA and XQuAD).
| Dataset | Task | Split | # samples |
| -------- | ----- |------| --------- |
| TyDi QA | GoldP | train| 49881 |
| TyDi QA | GoldP | valid| 5077 |
## Results on validation dataset 📝
| Metric | # Value |
| ------ | --------- |
| **EM** | **41.65** |
## Model in Action 🚀
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = AutoTokenizer.from_pretrained("mrm8488/mT5-small-finetuned-tydiqa-for-xqa")
model = AutoModelForSeq2SeqLM.from_pretrained("mrm8488/mT5-small-finetuned-tydiqa-for-xqa").to(device)

def get_response(question, context, max_length=32):
    input_text = 'question: %s context: %s' % (question, context)
    features = tokenizer([input_text], return_tensors='pt')
    output = model.generate(input_ids=features['input_ids'].to(device),
                            attention_mask=features['attention_mask'].to(device),
                            max_length=max_length)
    return tokenizer.decode(output[0])

# Some examples in different languages
context = 'HuggingFace won the best Demo paper at EMNLP2020.'
question = 'What won HuggingFace?'
print(get_response(question, context))

context = 'HuggingFace ganó la mejor demostración con su paper en la EMNLP2020.'
question = 'Qué ganó HuggingFace?'
print(get_response(question, context))

context = 'HuggingFace выиграл лучшую демонстрационную работу на EMNLP2020.'
question = 'Что победило в HuggingFace?'
print(get_response(question, context))
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: en
datasets:
- squad
---
# MobileBERT + SQuAD (v1.1) 📱❓
[mobilebert-uncased](https://huggingface.co/google/mobilebert-uncased) fine-tuned on the [SQuAD v1.1 dataset](https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/) for the **Q&A** downstream task.
## Details of the downstream task (Q&A) - Model 🧠
**MobileBERT** is a thin version of *BERT_LARGE*, equipped with bottleneck structures and a carefully designed balance between self-attention and feed-forward networks.
The checkpoint used here is the original MobileBERT Optimized Uncased English checkpoint (uncased_L-24_H-128_B-512_A-4_F-4_OPT).
More about the model [here](https://arxiv.org/abs/2004.02984)
## Details of the downstream task (Q&A) - Dataset 📚
**S**tanford **Q**uestion **A**nswering **D**ataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
SQuAD v1.1 contains **100,000+** question-answer pairs on **500+** articles.
## Model training 🏋️‍
The model was trained on a Tesla P100 GPU and 25GB of RAM with the following command:
```bash
python transformers/examples/question-answering/run_squad.py \
--model_type bert \
--model_name_or_path 'google/mobilebert-uncased' \
--do_eval \
--do_train \
--do_lower_case \
--train_file '/content/dataset/train-v1.1.json' \
--predict_file '/content/dataset/dev-v1.1.json' \
--per_gpu_train_batch_size 16 \
--learning_rate 3e-5 \
--num_train_epochs 5 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir '/content/output' \
--overwrite_output_dir \
--save_steps 1000
```
It is worth noting that this model converges much faster than other models, so it is also cheap to fine-tune.
## Test set Results 🧾
| Metric | # Value |
| ------ | --------- |
| **EM** | **82.33** |
| **F1** | **89.64** |
| **Size**| **94 MB** |
### Model in action 🚀
Fast usage with **pipelines**:
```python
from transformers import pipeline
QnA_pipeline = pipeline('question-answering', model='mrm8488/mobilebert-uncased-finetuned-squadv1')
QnA_pipeline({
    'context': 'A new strain of flu that has the potential to become a pandemic has been identified in China by scientists.',
    'question': 'Who did identified it ?'
})
# Output: {'answer': 'scientists.', 'end': 106, 'score': 0.7885545492172241, 'start': 96}
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: en
datasets:
- squad_v2
---
# MobileBERT + SQuAD v2 📱❓
[mobilebert-uncased](https://huggingface.co/google/mobilebert-uncased) fine-tuned on [SQUAD v2.0 dataset](https://rajpurkar.github.io/SQuAD-explorer/explore/v2.0/dev/) for **Q&A** downstream task.
## Details of the downstream task (Q&A) - Model 🧠
**MobileBERT** is a thin version of *BERT_LARGE*, equipped with bottleneck structures and a carefully designed balance between self-attention and feed-forward networks.
The checkpoint used here is the original MobileBERT Optimized Uncased English checkpoint (uncased_L-24_H-128_B-512_A-4_F-4_OPT).
More about the model [here](https://arxiv.org/abs/2004.02984)
## Details of the downstream task (Q&A) - Dataset 📚
**SQuAD2.0** combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
## Model training 🏋️‍
The model was trained on a Tesla P100 GPU and 25GB of RAM with the following command:
```bash
python transformers/examples/question-answering/run_squad.py \
--model_type bert \
--model_name_or_path 'google/mobilebert-uncased' \
--do_eval \
--do_train \
--do_lower_case \
--train_file '/content/dataset/train-v2.0.json' \
--predict_file '/content/dataset/dev-v2.0.json' \
--per_gpu_train_batch_size 16 \
--learning_rate 3e-5 \
--num_train_epochs 5 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir '/content/output' \
--overwrite_output_dir \
--save_steps 1000 \
--version_2_with_negative
```
It is worth noting that this model converges much faster than other models, so it is also cheap to fine-tune.
## Test set Results 🧾
| Metric | # Value |
| ------ | --------- |
| **EM** | **75.37** |
| **F1** | **78.48** |
| **Size**| **94 MB** |
### Model in action 🚀
Fast usage with **pipelines**:
```python
from transformers import pipeline
QnA_pipeline = pipeline('question-answering', model='mrm8488/mobilebert-uncased-finetuned-squadv2')
QnA_pipeline({
    'context': 'A new strain of flu that has the potential to become a pandemic has been identified in China by scientists.',
    'question': 'Who did identified it ?'
})
# Output: {'answer': 'scientists.', 'end': 106, 'score': 0.41531604528427124, 'start': 96}
```
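Since SQuAD2.0-style systems are also expected to abstain when the context does not support an answer, here is an illustrative variation (a sketch, not part of the original card) that asks the pipeline to consider the no-answer option via its `handle_impossible_answer` flag:
```python
from transformers import pipeline

QnA_pipeline = pipeline('question-answering', model='mrm8488/mobilebert-uncased-finetuned-squadv2')
# The question below is deliberately unanswerable from the given context.
result = QnA_pipeline(
    question='What is the capital of France?',
    context='A new strain of flu that has the potential to become a pandemic has been identified in China by scientists.',
    handle_impossible_answer=True,
)
print(result)  # an empty 'answer' string indicates the model abstained
```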
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: en
---
# RoBERTa-base (1B-1) + SQuAD v1 ❓
[roberta-base-1B-1](https://huggingface.co/nyu-mll/roberta-base-1B-1) fine-tuned on [SQUAD v1.1 dataset](https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/) for **Q&A** downstream task.
## Details of the downstream task (Q&A) - Model 🧠
RoBERTa Pretrained on Smaller Datasets
[NYU Machine Learning for Language](https://huggingface.co/nyu-mll) pretrained RoBERTa on smaller datasets (1M, 10M, 100M, 1B tokens). They released the 3 models with the lowest perplexities for each pretraining data size, out of 25 runs (or 10 in the case of 1B tokens). The pretraining data reproduces that of BERT: they combine English Wikipedia and a reproduction of BookCorpus using texts from Smashwords in a ratio of approximately 3:1.
## Details of the downstream task (Q&A) - Dataset 📚
**S**tanford **Q**uestion **A**nswering **D**ataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
SQuAD v1.1 contains **100,000+** question-answer pairs on **500+** articles.
## Model training 🏋️‍
The model was trained on a Tesla P100 GPU and 25GB of RAM with the following command:
```bash
python transformers/examples/question-answering/run_squad.py \
--model_type roberta \
--model_name_or_path 'nyu-mll/roberta-base-1B-1' \
--do_eval \
--do_train \
--do_lower_case \
--train_file /content/dataset/train-v1.1.json \
--predict_file /content/dataset/dev-v1.1.json \
--per_gpu_train_batch_size 16 \
--learning_rate 3e-5 \
--num_train_epochs 10 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /content/output \
--overwrite_output_dir \
--save_steps 1000
```
## Test set Results 🧾
| Metric | # Value |
| ------ | --------- |
| **EM** | **72.62** |
| **F1** | **82.19** |
```json
{
'exact': 72.62062440870388,
'f1': 82.19430877136834,
'total': 10570,
'HasAns_exact': 72.62062440870388,
'HasAns_f1': 82.19430877136834,
'HasAns_total': 10570,
'best_exact': 72.62062440870388,
'best_exact_thresh': 0.0,
'best_f1': 82.19430877136834,
'best_f1_thresh': 0.0
}
```
### Model in action 🚀
Fast usage with **pipelines**:
```python
from transformers import pipeline
QnA_pipeline = pipeline('question-answering', model='mrm8488/roberta-base-1B-1-finetuned-squadv1')
QnA_pipeline({
    'context': 'A new strain of flu that has the potential to become a pandemic has been identified in China by scientists.',
    'question': 'What has been discovered by scientists from China ?'
})
# Output:
# {'answer': 'A new strain of flu', 'end': 19, 'score': 0.04702283976040074, 'start': 0}
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: en
---
# RoBERTa-base (1B-1) + SQuAD v2 ❓
[roberta-base-1B-1](https://huggingface.co/nyu-mll/roberta-base-1B-1) fine-tuned on [SQUAD v2 dataset](https://rajpurkar.github.io/SQuAD-explorer/explore/v2.0/dev/) for **Q&A** downstream task.
## Details of the downstream task (Q&A) - Model 🧠
RoBERTa Pretrained on Smaller Datasets
[NYU Machine Learning for Language](https://huggingface.co/nyu-mll) pretrained RoBERTa on smaller datasets (1M, 10M, 100M, 1B tokens). They released the 3 models with the lowest perplexities for each pretraining data size, out of 25 runs (or 10 in the case of 1B tokens). The pretraining data reproduces that of BERT: they combine English Wikipedia and a reproduction of BookCorpus using texts from Smashwords in a ratio of approximately 3:1.
## Details of the downstream task (Q&A) - Dataset 📚
**S**tanford **Q**uestion **A**nswering **D**ataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
**SQuAD2.0** combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
## Model training 🏋️‍
The model was trained on a Tesla P100 GPU and 25GB of RAM with the following command:
```bash
python transformers/examples/question-answering/run_squad.py \
--model_type roberta \
--model_name_or_path 'nyu-mll/roberta-base-1B-1' \
--do_eval \
--do_train \
--do_lower_case \
--train_file /content/dataset/train-v2.0.json \
--predict_file /content/dataset/dev-v2.0.json \
--per_gpu_train_batch_size 16 \
--learning_rate 3e-5 \
--num_train_epochs 10 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /content/output \
--overwrite_output_dir \
--save_steps 1000 \
--version_2_with_negative
```
## Test set Results 🧾
| Metric | # Value |
| ------ | --------- |
| **EM** | **64.86** |
| **F1** | **68.99** |
```json
{
'exact': 64.86145034953255,
'f1': 68.9902640378272,
'total': 11873,
'HasAns_exact': 64.03508771929825,
'HasAns_f1': 72.3045554860189,
'HasAns_total': 5928,
'NoAns_exact': 65.68544995794785,
'NoAns_f1': 65.68544995794785,
'NoAns_total': 5945,
'best_exact': 64.86987282068559,
'best_exact_thresh': 0.0,
'best_f1': 68.99868650898054,
'best_f1_thresh': 0.0
}
```
### Model in action 🚀
Fast usage with **pipelines**:
```python
from transformers import pipeline
QnA_pipeline = pipeline('question-answering', model='mrm8488/roberta-base-1B-1-finetuned-squadv2')
QnA_pipeline({
    'context': 'A new strain of flu that has the potential to become a pandemic has been identified in China by scientists.',
    'question': 'What has been discovered by scientists from China ?'
})
# Output:
# {'answer': 'A new strain of flu', 'end': 19, 'score': 0.7145650685380576, 'start': 0}
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
# RoBERTa (large) fine-tuned on the Winograd Schema Challenge (WSC) data
Steps from the original [repo](https://github.com/pytorch/fairseq/blob/master/examples/roberta/wsc/README.md)
The following instructions can be used to finetune RoBERTa on the WSC training
data provided by [SuperGLUE](https://super.gluebenchmark.com/).
Note that there is high variance in the results. For our GLUE/SuperGLUE
submission we swept over the learning rate (1e-5, 2e-5, 3e-5), batch size (16,
32, 64) and total number of updates (500, 1000, 2000, 3000), as well as the
random seed. Out of ~100 runs we chose the best 7 models and ensembled them.
**Approach:** The instructions below use a slightly different loss function than
what's described in the original RoBERTa arXiv paper. In particular,
[Kocijan et al. (2019)](https://arxiv.org/abs/1905.06290) introduce a margin
ranking loss between `(query, candidate)` pairs with tunable hyperparameters
alpha and beta. This is supported in our code as well with the `--wsc-alpha` and
`--wsc-beta` arguments. However, we achieved slightly better (and more robust)
results on the development set by instead using a single cross entropy loss term
over the log-probabilities for the query and all mined candidates. **The
candidates are mined using spaCy from each input sentence in isolation, so the
approach remains strictly pointwise.** This reduces the number of
hyperparameters and our best model achieved 92.3% development set accuracy,
compared to ~90% accuracy for the margin loss. Later versions of the RoBERTa
arXiv paper will describe this updated formulation.
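For intuition only, here is a simplified PyTorch illustration of the two loss formulations discussed above; it is not the actual fairseq `wsc` criterion, and `alpha`/`beta` loosely correspond to the `--wsc-alpha`/`--wsc-beta` hyperparameters:
```python
import torch
import torch.nn.functional as F

def margin_ranking_loss(query_logprob: torch.Tensor,
                        candidate_logprobs: torch.Tensor,
                        alpha: float = 1.0, beta: float = 0.4) -> torch.Tensor:
    # Kocijan et al.-style margin loss (simplified): push the query span's
    # log-probability above every mined candidate's by a margin.
    margins = alpha * (candidate_logprobs - query_logprob) + beta
    return torch.clamp(margins, min=0.0).sum()

def candidate_cross_entropy_loss(query_logprob: torch.Tensor,
                                 candidate_logprobs: torch.Tensor) -> torch.Tensor:
    # Single cross-entropy over the query and all mined candidates,
    # with the query (index 0) as the correct "class".
    scores = torch.cat([query_logprob.view(1), candidate_logprobs]).unsqueeze(0)
    target = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(scores, target)
```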
### 1) Download the WSC data from the SuperGLUE website:
```bash
wget https://dl.fbaipublicfiles.com/glue/superglue/data/v2/WSC.zip
unzip WSC.zip
# we also need to copy the RoBERTa dictionary into the same directory
wget -O WSC/dict.txt https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt
```
### 2) Finetune over the provided training data:
```bash
TOTAL_NUM_UPDATES=2000 # Total number of training steps.
WARMUP_UPDATES=250 # Linearly increase LR over this many steps.
LR=2e-05 # Peak LR for polynomial LR scheduler.
MAX_SENTENCES=16 # Batch size per GPU.
SEED=1 # Random seed.
ROBERTA_PATH=/path/to/roberta/model.pt
# we use the --user-dir option to load the task and criterion
# from the examples/roberta/wsc directory:
FAIRSEQ_PATH=/path/to/fairseq
FAIRSEQ_USER_DIR=${FAIRSEQ_PATH}/examples/roberta/wsc
CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train WSC/ \
--restore-file $ROBERTA_PATH \
--reset-optimizer --reset-dataloader --reset-meters \
--no-epoch-checkpoints --no-last-checkpoints --no-save-optimizer-state \
--best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
--valid-subset val \
--fp16 --ddp-backend no_c10d \
--user-dir $FAIRSEQ_USER_DIR \
--task wsc --criterion wsc --wsc-cross-entropy \
--arch roberta_large --bpe gpt2 --max-positions 512 \
--dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-06 \
--lr-scheduler polynomial_decay --lr $LR \
--warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_NUM_UPDATES \
--max-sentences $MAX_SENTENCES \
--max-update $TOTAL_NUM_UPDATES \
--log-format simple --log-interval 100 \
--seed $SEED
```
The above command assumes training on 4 GPUs, but you can achieve the same
results on a single GPU by adding `--update-freq=4`.
### 3) Evaluate
```python
from fairseq.models.roberta import RobertaModel
from examples.roberta.wsc import wsc_utils # also loads WSC task and criterion
roberta = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'WSC/')
roberta.cuda()
nsamples, ncorrect = 0, 0
for sentence, label in wsc_utils.jsonl_iterator('WSC/val.jsonl', eval=True):
    pred = roberta.disambiguate_pronoun(sentence)
    nsamples += 1
    if pred == label:
        ncorrect += 1
print('Accuracy: ' + str(ncorrect / float(nsamples)))
# Accuracy: 0.9230769230769231
```
## RoBERTa training on WinoGrande dataset
We have also provided a `winogrande` task and criterion for fine-tuning on [WinoGrande](https://mosaic.allenai.org/projects/winogrande)-like datasets, where there are always two candidates and one is correct. It is a more efficient implementation for such cases.
```bash
TOTAL_NUM_UPDATES=23750 # Total number of training steps.
WARMUP_UPDATES=2375 # Linearly increase LR over this many steps.
LR=1e-05 # Peak LR for polynomial LR scheduler.
MAX_SENTENCES=32 # Batch size per GPU.
SEED=1 # Random seed.
ROBERTA_PATH=/path/to/roberta/model.pt
# we use the --user-dir option to load the task and criterion
# from the examples/roberta/wsc directory:
FAIRSEQ_PATH=/path/to/fairseq
FAIRSEQ_USER_DIR=${FAIRSEQ_PATH}/examples/roberta/wsc
cd fairseq
CUDA_VISIBLE_DEVICES=0 fairseq-train winogrande_1.0/ \
--restore-file $ROBERTA_PATH \
--reset-optimizer --reset-dataloader --reset-meters \
--no-epoch-checkpoints --no-last-checkpoints --no-save-optimizer-state \
--best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
--valid-subset val \
--fp16 --ddp-backend no_c10d \
--user-dir $FAIRSEQ_USER_DIR \
--task winogrande --criterion winogrande \
--wsc-margin-alpha 5.0 --wsc-margin-beta 0.4 \
--arch roberta_large --bpe gpt2 --max-positions 512 \
--dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-06 \
--lr-scheduler polynomial_decay --lr $LR \
--warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_NUM_UPDATES \
--max-sentences $MAX_SENTENCES \
--max-update $TOTAL_NUM_UPDATES \
--log-format simple --log-interval 100
```
[Original repo](https://github.com/pytorch/fairseq/tree/master/examples/roberta/wsc)
---
language: en
thumbnail:
---
# SpanBERT base fine-tuned on SQuAD v1
[SpanBERT](https://github.com/facebookresearch/SpanBERT) created by [Facebook Research](https://github.com/facebookresearch) and fine-tuned on [SQuAD 1.1](https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/) for **Q&A** downstream task ([by them](https://github.com/facebookresearch/SpanBERT#finetuned-models-squad-1120-relation-extraction-coreference-resolution)).
## Details of SpanBERT
[SpanBERT: Improving Pre-training by Representing and Predicting Spans](https://arxiv.org/abs/1907.10529)
## Details of the downstream task (Q&A) - Dataset 📚 🧐 ❓
[SQuAD1.1](https://rajpurkar.github.io/SQuAD-explorer/)
## Model fine-tuning 🏋️‍
You can get the fine-tuning script [here](https://github.com/facebookresearch/SpanBERT)
```bash
python code/run_squad.py \
--do_train \
--do_eval \
--model spanbert-base-cased \
--train_file train-v1.1.json \
--dev_file dev-v1.1.json \
--train_batch_size 32 \
--eval_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 4 \
--max_seq_length 512 \
--doc_stride 128 \
--eval_metric f1 \
--output_dir squad_output \
--fp16
```
## Results Comparison 📝
| | SQuAD 1.1 | SQuAD 2.0 | Coref | TACRED |
| ---------------------- | ------------- | --------- | ------- | ------ |
| | F1 | F1 | avg. F1 | F1 |
| BERT (base) | 88.5 | 76.5 | 73.1 | 67.7 |
| SpanBERT (base) | **92.4** (this one) | [83.6](https://huggingface.co/mrm8488/spanbert-base-finetuned-squadv2) | 77.4 | [68.2](https://huggingface.co/mrm8488/spanbert-base-finetuned-tacred) |
| BERT (large) | 91.3 | 83.3 | 77.1 | 66.4 |
| SpanBERT (large) | [94.6](https://huggingface.co/mrm8488/spanbert-large-finetuned-squadv1) | [88.7](https://huggingface.co/mrm8488/spanbert-large-finetuned-squadv2) | 79.6 | [70.8](https://huggingface.co/mrm8488/spanbert-large-finetuned-tacred) |
Note: The numbers marked as * are evaluated on the development sets because those models were not submitted to the official SQuAD leaderboard. All the other numbers are test numbers.
## Model in action
Fast usage with **pipelines**:
```python
from transformers import pipeline
qa_pipeline = pipeline(
    "question-answering",
    model="mrm8488/spanbert-base-finetuned-squadv1",
    tokenizer="SpanBERT/spanbert-base-cased"
)
qa_pipeline({
    'context': "Manuel Romero has been working very hard in the repository hugginface/transformers lately",
    'question': "How has been working Manuel Romero lately?"
})
# Output: {'answer': 'very hard in the repository hugginface/transformers',
#          'end': 82, 'score': 0.327230326857725, 'start': 31}
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: en
thumbnail:
---
# SpanBERT base fine-tuned on SQuAD v2
[SpanBERT](https://github.com/facebookresearch/SpanBERT) created by [Facebook Research](https://github.com/facebookresearch) and fine-tuned on [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) for **Q&A** downstream task ([by them](https://github.com/facebookresearch/SpanBERT#finetuned-models-squad-1120-relation-extraction-coreference-resolution)).
## Details of SpanBERT
[SpanBERT: Improving Pre-training by Representing and Predicting Spans](https://arxiv.org/abs/1907.10529)
## Details of the downstream task (Q&A) - Dataset 📚 🧐 ❓
[SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/) combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
| Dataset | Split | # samples |
| -------- | ----- | --------- |
| SQuAD2.0 | train | 130k |
| SQuAD2.0 | eval | 12.3k |
## Model fine-tuning 🏋️‍
You can get the fine-tuning script [here](https://github.com/facebookresearch/SpanBERT)
```bash
python code/run_squad.py \
--do_train \
--do_eval \
--model spanbert-base-cased \
--train_file train-v2.0.json \
--dev_file dev-v2.0.json \
--train_batch_size 32 \
--eval_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 4 \
--max_seq_length 512 \
--doc_stride 128 \
--eval_metric best_f1 \
--output_dir squad2_output \
--version_2_with_negative \
--fp16
```
## Results Comparison 📝
| | SQuAD 1.1 | SQuAD 2.0 | Coref | TACRED |
| ---------------------- | ------------- | --------- | ------- | ------ |
| | F1 | F1 | avg. F1 | F1 |
| BERT (base) | 88.5 | 76.5 | 73.1 | 67.7 |
| SpanBERT (base) | [92.4](https://huggingface.co/mrm8488/spanbert-base-finetuned-squadv1) | **83.6** (this one) | 77.4 | [68.2](https://huggingface.co/mrm8488/spanbert-base-finetuned-tacred) |
| BERT (large) | 91.3 | 83.3 | 77.1 | 66.4 |
| SpanBERT (large) | [94.6](https://huggingface.co/mrm8488/spanbert-large-finetuned-squadv1) | [88.7](https://huggingface.co/mrm8488/spanbert-large-finetuned-squadv2) | 79.6 | [70.8](https://huggingface.co/mrm8488/spanbert-large-finetuned-tacred) |
Note: The numbers marked as * are evaluated on the development sets because those models were not submitted to the official SQuAD leaderboard. All the other numbers are test numbers.
## Model in action
Fast usage with **pipelines**:
```python
from transformers import pipeline
qa_pipeline = pipeline(
    "question-answering",
    model="mrm8488/spanbert-base-finetuned-squadv2",
    tokenizer="SpanBERT/spanbert-base-cased"
)
qa_pipeline({
    'context': "Manuel Romero has been working very hard in the repository hugginface/transformers lately",
    'question': "How has been working Manuel Romero lately?"
})
# Output: {'answer': 'very hard', 'end': 40, 'score': 0.9052708846768347, 'start': 31}
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: en
thumbnail:
---
# SpanBERT base fine-tuned on TACRED
[SpanBERT](https://github.com/facebookresearch/SpanBERT) created by [Facebook Research](https://github.com/facebookresearch) and fine-tuned on [TACRED](https://nlp.stanford.edu/projects/tacred/) dataset by [them](https://github.com/facebookresearch/SpanBERT#finetuned-models-squad-1120-relation-extraction-coreference-resolution)
## Details of SpanBERT
[SpanBERT: Improving Pre-training by Representing and Predicting Spans](https://arxiv.org/abs/1907.10529)
## Dataset 📚
[TACRED](https://nlp.stanford.edu/projects/tacred/) A large-scale relation extraction dataset with 106k+ examples over 42 TAC KBP relation types.
## Model fine-tuning 🏋️‍
You can get the fine-tuning script [here](https://github.com/facebookresearch/SpanBERT)
```bash
python code/run_tacred.py \
--do_train \
--do_eval \
--data_dir <TACRED_DATA_DIR> \
--model spanbert-base-cased \
--train_batch_size 32 \
--eval_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 10 \
--max_seq_length 128 \
--output_dir tacred_dir \
--fp16
```
## Results Comparison 📝
| | SQuAD 1.1 | SQuAD 2.0 | Coref | TACRED |
| ---------------------- | ------------- | --------- | ------- | ------ |
| | F1 | F1 | avg. F1 | F1 |
| BERT (base) | 88.5* | 76.5* | 73.1 | 67.7 |
| SpanBERT (base) | [92.4*](https://huggingface.co/mrm8488/spanbert-base-finetuned-squadv1) | [83.6*](https://huggingface.co/mrm8488/spanbert-base-finetuned-squadv2) | 77.4 | **68.2** (this one) |
| BERT (large) | 91.3 | 83.3 | 77.1 | 66.4 |
| SpanBERT (large)       | [94.6](https://huggingface.co/mrm8488/spanbert-large-finetuned-squadv1) | [88.7](https://huggingface.co/mrm8488/spanbert-large-finetuned-squadv2) | 79.6    | [70.8](https://huggingface.co/mrm8488/spanbert-large-finetuned-tacred) |
Note: The numbers marked as * are evaluated on the development sets because those models were not submitted to the official SQuAD leaderboard. All the other numbers are test numbers.
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: en
thumbnail:
---
# SpanBERT (spanbert-base-cased) fine-tuned on SQuAD v1.1
[SpanBERT](https://github.com/facebookresearch/SpanBERT) created by [Facebook Research](https://github.com/facebookresearch) and fine-tuned on [SQuAD 1.1](https://rajpurkar.github.io/SQuAD-explorer/) for **Q&A** downstream task.
## Details of SpanBERT
A pre-training method that is designed to better represent and predict spans of text.
[SpanBERT: Improving Pre-training by Representing and Predicting Spans](https://arxiv.org/abs/1907.10529)
## Details of the downstream task (Q&A) - Dataset
[SQuAD 1.1](https://rajpurkar.github.io/SQuAD-explorer/) contains 100,000+ question-answer pairs on 500+ articles.
| Dataset | Split | # samples |
| -------- | ----- | --------- |
| SQuAD1.1 | train | 87.7k |
| SQuAD1.1 | eval | 10.6k |
## Model training
The model was trained on a Tesla P100 GPU and 25GB of RAM.
The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)
## Results:
| Metric | # Value |
| ------ | --------- |
| **EM** | **85.49** |
| **F1** | **91.98** |
### Raw metrics:
```json
{
"exact": 85.49668874172185,
"f1": 91.9845699540379,
"total": 10570,
"HasAns_exact": 85.49668874172185,
"HasAns_f1": 91.9845699540379,
"HasAns_total": 10570,
"best_exact": 85.49668874172185,
"best_exact_thresh": 0.0,
"best_f1": 91.9845699540379,
"best_f1_thresh": 0.0
}
```
## Comparison:
| Model | EM | F1 score |
| ----------------------------------------------------------------------------------------- | --------- | --------- |
| [SpanBert official repo](https://github.com/facebookresearch/SpanBERT#pre-trained-models) | - | 92.4\* |
| [spanbert-finetuned-squadv1](https://huggingface.co/mrm8488/spanbert-finetuned-squadv1) | **85.49** | **91.98** |
## Model in action
Fast usage with **pipelines**:
```python
from transformers import pipeline
qa_pipeline = pipeline(
    "question-answering",
    model="mrm8488/spanbert-finetuned-squadv1",
    tokenizer="mrm8488/spanbert-finetuned-squadv1"
)
qa_pipeline({
    'context': "Manuel Romero has been working hardly in the repository hugginface/transformers lately",
    'question': "Who has been working hard for hugginface/transformers lately?"
})
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: en
thumbnail:
---
# SpanBERT (spanbert-base-cased) fine-tuned on SQuAD v2
[SpanBERT](https://github.com/facebookresearch/SpanBERT) created by [Facebook Research](https://github.com/facebookresearch) and fine-tuned on [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) for **Q&A** downstream task.
## Details of SpanBERT
[SpanBERT: Improving Pre-training by Representing and Predicting Spans](https://arxiv.org/abs/1907.10529)
## Details of the downstream task (Q&A) - Dataset
[SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/) combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
| Dataset | Split | # samples |
| -------- | ----- | --------- |
| SQuAD2.0 | train | 130k |
| SQuAD2.0 | eval | 12.3k |
## Model training
The model was trained on a Tesla P100 GPU and 25GB of RAM.
The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)
## Results:
| Metric | # Value |
| ------ | --------- |
| **EM** | **78.80** |
| **F1** | **82.22** |
### Raw metrics:
```json
{
"exact": 78.80064010780762,
"f1": 82.22801347271162,
"total": 11873,
"HasAns_exact": 78.74493927125506,
"HasAns_f1": 85.60951483831069,
"HasAns_total": 5928,
"NoAns_exact": 78.85618166526493,
"NoAns_f1": 78.85618166526493,
"NoAns_total": 5945,
"best_exact": 78.80064010780762,
"best_exact_thresh": 0.0,
"best_f1": 82.2280134727116,
"best_f1_thresh": 0.0
}
```
## Comparison:
| Model | EM | F1 score |
| ----------------------------------------------------------------------------------------- | --------- | --------- |
| [SpanBert official repo](https://github.com/facebookresearch/SpanBERT#pre-trained-models) | - | 83.6\* |
| [spanbert-finetuned-squadv2](https://huggingface.co/mrm8488/spanbert-finetuned-squadv2) | **78.80** | **82.22** |
## Model in action
Fast usage with **pipelines**:
```python
from transformers import pipeline
qa_pipeline = pipeline(
    "question-answering",
    model="mrm8488/spanbert-finetuned-squadv2",
    tokenizer="mrm8488/spanbert-finetuned-squadv2"
)
qa_pipeline({
    'context': "Manuel Romero has been working hardly in the repository hugginface/transformers lately",
    'question': "Who has been working hard for hugginface/transformers lately?"
})
# Output: {'answer': 'Manuel Romero','end': 13,'score': 6.836378586818937e-09, 'start': 0}
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: en
thumbnail:
---
# SpanBERT large fine-tuned on SQuAD v1
[SpanBERT](https://github.com/facebookresearch/SpanBERT) created by [Facebook Research](https://github.com/facebookresearch) and fine-tuned on [SQuAD 1.1](https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/) for **Q&A** downstream task ([by them](https://github.com/facebookresearch/SpanBERT#finetuned-models-squad-1120-relation-extraction-coreference-resolution)).
## Details of SpanBERT
[SpanBERT: Improving Pre-training by Representing and Predicting Spans](https://arxiv.org/abs/1907.10529)
## Details of the downstream task (Q&A) - Dataset 📚 🧐 ❓
[SQuAD1.1](https://rajpurkar.github.io/SQuAD-explorer/)
## Model fine-tuning 🏋️‍
You can get the fine-tuning script [here](https://github.com/facebookresearch/SpanBERT)
```bash
python code/run_squad.py \
--do_train \
--do_eval \
--model spanbert-large-cased \
--train_file train-v1.1.json \
--dev_file dev-v1.1.json \
--train_batch_size 32 \
--eval_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 4 \
--max_seq_length 512 \
--doc_stride 128 \
--eval_metric f1 \
--output_dir squad_output \
--fp16
```
## Results Comparison 📝
| | SQuAD 1.1 | SQuAD 2.0 | Coref | TACRED |
| ---------------------- | ------------- | --------- | ------- | ------ |
| | F1 | F1 | avg. F1 | F1 |
| BERT (base) | 88.5* | 76.5* | 73.1 | 67.7 |
| SpanBERT (base) | [92.4*](https://huggingface.co/mrm8488/spanbert-base-finetuned-squadv1) | [83.6*](https://huggingface.co/mrm8488/spanbert-base-finetuned-squadv2) | 77.4 | [68.2](https://huggingface.co/mrm8488/spanbert-base-finetuned-tacred) |
| BERT (large) | 91.3 | 83.3 | 77.1 | 66.4 |
| SpanBERT (large) | **94.6** (this) | [88.7](https://huggingface.co/mrm8488/spanbert-large-finetuned-squadv2) | 79.6 | [70.8](https://huggingface.co/mrm8488/spanbert-large-finetuned-tacred) |
Note: The numbers marked as * are evaluated on the development sets because those models were not submitted to the official SQuAD leaderboard. All the other numbers are test numbers.
## Model in action
Fast usage with **pipelines**:
```python
from transformers import pipeline
qa_pipeline = pipeline(
    "question-answering",
    model="mrm8488/spanbert-large-finetuned-squadv1",
    tokenizer="SpanBERT/spanbert-large-cased"
)
qa_pipeline({
    'context': "Manuel Romero has been working very hard in the repository hugginface/transformers lately",
    'question': "How has been working Manuel Romero lately?"
})
# Output: {'answer': 'very hard in the repository hugginface/transformers',
#          'end': 82, 'score': 0.327230326857725, 'start': 31}
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: en
thumbnail:
---
# SpanBERT large fine-tuned on SQuAD v2
[SpanBERT](https://github.com/facebookresearch/SpanBERT) created by [Facebook Research](https://github.com/facebookresearch) and fine-tuned on [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) for **Q&A** downstream task ([by them](https://github.com/facebookresearch/SpanBERT#finetuned-models-squad-1120-relation-extraction-coreference-resolution)).
## Details of SpanBERT
[SpanBERT: Improving Pre-training by Representing and Predicting Spans](https://arxiv.org/abs/1907.10529)
## Details of the downstream task (Q&A) - Dataset 📚 🧐 ❓
[SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/) combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
| Dataset | Split | # samples |
| -------- | ----- | --------- |
| SQuAD2.0 | train | 130k |
| SQuAD2.0 | eval | 12.3k |
## Model fine-tuning 🏋️‍
You can get the fine-tuning script [here](https://github.com/facebookresearch/SpanBERT)
```bash
python code/run_squad.py \
--do_train \
--do_eval \
--model spanbert-large-cased \
--train_file train-v2.0.json \
--dev_file dev-v2.0.json \
--train_batch_size 32 \
--eval_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 4 \
--max_seq_length 512 \
--doc_stride 128 \
--eval_metric best_f1 \
--output_dir squad2_output \
--version_2_with_negative \
--fp16
```
## Results Comparison 📝
| | SQuAD 1.1 | SQuAD 2.0 | Coref | TACRED |
| ---------------------- | ------------- | --------- | ------- | ------ |
| | F1 | F1 | avg. F1 | F1 |
| BERT (base) | 88.5* | 76.5* | 73.1 | 67.7 |
| SpanBERT (base) | [92.4*](https://huggingface.co/mrm8488/spanbert-base-finetuned-squadv1) | [83.6*](https://huggingface.co/mrm8488/spanbert-base-finetuned-squadv2) | 77.4 | [68.2](https://huggingface.co/mrm8488/spanbert-base-finetuned-tacred) |
| BERT (large) | 91.3 | 83.3 | 77.1 | 66.4 |
| SpanBERT (large) | [94.6](https://huggingface.co/mrm8488/spanbert-large-finetuned-squadv1) | **88.7** (this) | 79.6 | [70.8](https://huggingface.co/mrm8488/spanbert-large-finetuned-tacred) |
Note: The numbers marked as * are evaluated on the development sets because those models were not submitted to the official SQuAD leaderboard. All the other numbers are test numbers.
## Model in action
Fast usage with **pipelines**:
```python
from transformers import pipeline
qa_pipeline = pipeline(
    "question-answering",
    model="mrm8488/spanbert-large-finetuned-squadv2",
    tokenizer="SpanBERT/spanbert-large-cased"
)
qa_pipeline({
    'context': "Manuel Romero has been working very hard in the repository hugginface/transformers lately",
    'question': "How has been working Manuel Romero lately?"
})
# Output: {'answer': 'very hard', 'end': 40, 'score': 0.9052708846768347, 'start': 31}
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain