Commit 3552d0e0 authored by Julien Chaumond, committed by GitHub

[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)



* rm all model cards

* Update the .rst

@sgugger it is still not super crystal clear/streamlined so let me know if any ideas to make it simpler

* Add a rootlevel README.md with simple instructions/context

* Update docs/source/model_sharing.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* make style

* rm all model cards
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
parent 29e45979
---
language: eu
widget:
- text: "Euskara da Euskal Herriko <mask> ofiziala"
- text: "Gaur egun, Euskadik Espainia osoko ekonomia <mask> du"
---
# RoBasquERTa: RoBERTa-like Language model trained on OSCAR Basque corpus
---
language: es
thumbnail:
---
# RuPERTa-base (Spanish RoBERTa) + NER 🎃🏷
This model is a version of [RuPERTa-base](https://huggingface.co/mrm8488/RuPERTa-base) fine-tuned on [NER-C](https://www.kaggle.com/nltkdata/conll-corpora) for the **NER** downstream task.
## Details of the downstream task (NER) - Dataset
- [Dataset: CONLL Corpora ES](https://www.kaggle.com/nltkdata/conll-corpora) 📚
| Dataset | # Examples |
| ---------------------- | ----- |
| Train | 329 K |
| Dev | 40 K |
- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner_old.py)
- Labels covered:
```
B-LOC
B-MISC
B-ORG
B-PER
I-LOC
I-MISC
I-ORG
I-PER
O
```
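The labels above appear to follow the common BIO tagging convention, where `B-` opens an entity, `I-` continues it, and `O` marks tokens outside any entity. A minimal sketch of grouping per-word tags into entity spans follows; `group_entities` and the sample tags are hypothetical illustrations, not part of the model or the fine-tuning script.

```python
def group_entities(words, tags):
    """Group (word, tag) pairs into (entity_text, entity_type) spans."""
    entities = []
    current_words, current_type = [], None
    for word, tag in zip(words, tags):
        prefix, _, ent_type = tag.partition("-")
        # An "I-" tag whose type differs from the open span is treated as a new entity
        starts_new = prefix == "B" or (prefix == "I" and ent_type != current_type)
        if prefix == "O" or starts_new:
            # Close any open span before handling this word
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if prefix in ("B", "I"):
            current_words.append(word)
            current_type = ent_type
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# Hypothetical tags, chosen only to illustrate the grouping
words = ["Julien", "nació", "en", "Nueva", "York"]
tags = ["B-PER", "O", "O", "B-LOC", "I-LOC"]
print(group_entities(words, tags))
# → [('Julien', 'PER'), ('Nueva York', 'LOC')]
```

The sketch works on whole words; aligning model subword tokens back to words is a separate step that it does not cover.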
## Metrics on evaluation set 🧾
| Metric | # score |
| :------------------------------------------------------------------------------------: | :-------: |
| F1 | **77.55** |
| Precision | **75.53** |
| Recall | **79.68** |
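F1 is the harmonic mean of precision and recall, so the three values in the table can be cross-checked against each other. A small sketch, assuming all three scores were computed at the same aggregation level; `f1_score` is a hypothetical helper name.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the table above, in percent
print(round(f1_score(75.53, 79.68), 2))  # → 77.55
```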
## Model in action 🔨
Example of usage:
```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer
id2label = {
"0": "B-LOC",
"1": "B-MISC",
"2": "B-ORG",
"3": "B-PER",
"4": "I-LOC",
"5": "I-MISC",
"6": "I-ORG",
"7": "I-PER",
"8": "O"
}
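tokenizer = AutoTokenizer.from_pretrained('mrm8488/RuPERTa-base-finetuned-ner')
model = AutoModelForTokenClassification.from_pretrained('mrm8488/RuPERTa-base-finetuned-ner')
text = "Julien, CEO de HF, nació en Francia."
input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
outputs = model(input_ids)
logits = outputs[0]  # per-token classification scores
words = text.split(" ")
# Naive alignment: token position i (after the start token) is paired with the i-th word
for m in logits:
    for index, n in enumerate(m):
        if 0 < index <= len(words):
            print(words[index - 1] + ": " + id2label[str(torch.argmax(n).item())])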
'''
Output:
--------
Julien,: I-PER
CEO: O
de: O
HF,: B-ORG
nació: I-PER
en: I-PER
Francia.: I-LOC
'''
```
Yeah! Not too bad 🎉
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: es
datasets:
- xtreme
widget:
- text: "En 2009 se mudó a Filadelfia y en la actualidad vive en Nueva York. Se mudó nuevamente a Filadelfia en 2009 y ahora vive en la ciudad de Nueva York."
---
# RuPERTa-base fine-tuned on PAWS-X-es for Paraphrase Identification
---
language: es
thumbnail:
---
# RuPERTa-base (Spanish RoBERTa) + POS 🎃🏷
This model is a version of [RuPERTa-base](https://huggingface.co/mrm8488/RuPERTa-base) fine-tuned on [CONLL CORPORA](https://www.kaggle.com/nltkdata/conll-corpora) for the **POS** downstream task.
## Details of the downstream task (POS) - Dataset
- [Dataset: CONLL Corpora ES](https://www.kaggle.com/nltkdata/conll-corpora) 📚
| Dataset | # Examples |
| ---------------------- | ----- |
| Train | 445 K |
| Dev | 55 K |
- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner_old.py)
- Labels covered:
```
ADJ
ADP
ADV
AUX
CCONJ
DET
INTJ
NOUN
NUM
PART
PRON
PROPN
PUNCT
SCONJ
SYM
VERB
```
## Metrics on evaluation set 🧾
| Metric | # score |
| :------------------------------------------------------------------------------------: | :-------: |
| F1 | **97.39** |
| Precision | **97.47** |
| Recall | **97.32** |
## Model in action 🔨
Example of usage
```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('mrm8488/RuPERTa-base-finetuned-pos')
model = AutoModelForTokenClassification.from_pretrained('mrm8488/RuPERTa-base-finetuned-pos')
id2label = {
"0": "O",
"1": "ADJ",
"2": "ADP",
"3": "ADV",
"4": "AUX",
"5": "CCONJ",
"6": "DET",
"7": "INTJ",
"8": "NOUN",
"9": "NUM",
"10": "PART",
"11": "PRON",
"12": "PROPN",
"13": "PUNCT",
"14": "SCONJ",
"15": "SYM",
"16": "VERB"
}
text = "Mis amigos están pensando viajar a Londres este verano."
input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
outputs = model(input_ids)
logits = outputs[0]  # per-token classification scores
words = text.split(" ")
# Naive alignment: token position i (after the start token) is paired with the i-th word
for m in logits:
    for index, n in enumerate(m):
        if 0 < index <= len(words):
            print(words[index - 1] + ": " + id2label[str(torch.argmax(n).item())])
'''
Output:
--------
Mis: NUM
amigos: PRON
están: AUX
pensando: ADV
viajar: VERB
a: ADP
Londres: PROPN
este: DET
verano..: NOUN
'''
```
Yeah! Not too bad 🎉
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: es
thumbnail: https://i.imgur.com/DUlT077.jpg
widget:
- text: "España es un país muy <mask> en la UE"
---
# RuPERTa: the Spanish RoBERTa 🎃<img src="https://abs-0.twimg.com/emoji/v2/svg/1f1ea-1f1f8.svg" alt="spain flag" width="25"/>
RuPERTa-base (uncased) is a [RoBERTa model](https://github.com/pytorch/fairseq/tree/master/examples/roberta) trained on an *uncased* version of a [big Spanish corpus](https://github.com/josecannete/spanish-corpora).
RoBERTa iterates on BERT's pretraining procedure, including training the model longer, with bigger batches over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data.
The architecture is the same as `roberta-base`:
`roberta.base:` **RoBERTa** using the **BERT-base architecture 125M** params
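The dynamic masking mentioned above means that a new set of masked positions is sampled each time a sentence is seen, instead of fixing the mask once during preprocessing. The following is a simplified sketch of that idea in plain Python, not the actual RoBERTa implementation: it only replaces tokens with `<mask>` and omits the random-token and keep-as-is variants. The function name `dynamic_mask` and the 15% masking rate used as default are assumptions for illustration.

```python
import random

MASK = "<mask>"


def dynamic_mask(tokens, rng, mask_prob=0.15):
    """Return a copy of tokens with a freshly sampled subset replaced by MASK."""
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


tokens = "españa es un país muy importante en la ue".split()
rng = random.Random(0)
for epoch in range(3):
    # A new mask pattern is drawn every time the sentence is seen
    print(epoch, dynamic_mask(tokens, rng))
```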
## Benchmarks 🧾
WIP (I continue working on it) 🚧
| Task/Dataset | F1 | Precision | Recall | Fine-tuned model | Reproduce it |
| -------- | ----: | --------: | -----: | --------------------------------------------------------------------------------------: | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| POS | 97.39 | 97.47 | 97.32 | [RuPERTa-base-finetuned-pos](https://huggingface.co/mrm8488/RuPERTa-base-finetuned-pos) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mrm8488/shared_colab_notebooks/blob/master/RuPERTa_base_finetuned_POS.ipynb) |
| NER | 77.55 | 75.53 | 79.68 | [RuPERTa-base-finetuned-ner](https://huggingface.co/mrm8488/RuPERTa-base-finetuned-ner) | |
| SQUAD-es v1 | to-do | | | [RuPERTa-base-finetuned-squadv1](https://huggingface.co/mrm8488/RuPERTa-base-finetuned-squadv1) | |
| SQUAD-es v2 | to-do | | | [RuPERTa-base-finetuned-squadv2](https://huggingface.co/mrm8488/RuPERTa-base-finetuned-squadv2) | |
## Model in action 🔨
### Usage for POS and NER 🏷
```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer
id2label = {
"0": "B-LOC",
"1": "B-MISC",
"2": "B-ORG",
"3": "B-PER",
"4": "I-LOC",
"5": "I-MISC",
"6": "I-ORG",
"7": "I-PER",
"8": "O"
}
tokenizer = AutoTokenizer.from_pretrained('mrm8488/RuPERTa-base-finetuned-ner')
model = AutoModelForTokenClassification.from_pretrained('mrm8488/RuPERTa-base-finetuned-ner')
text = "Julien, CEO de HF, nació en Francia."
input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
outputs = model(input_ids)
logits = outputs[0]  # per-token classification scores
words = text.split(" ")
# Naive alignment: token position i (after the start token) is paired with the i-th word
for m in logits:
    for index, n in enumerate(m):
        if 0 < index <= len(words):
            print(words[index - 1] + ": " + id2label[str(torch.argmax(n).item())])
# Output:
'''
Julien,: I-PER
CEO: O
de: O
HF,: B-ORG
nació: I-PER
en: I-PER
Francia.: I-LOC
'''
```
For **POS** just change the `id2label` dictionary and the model path to [mrm8488/RuPERTa-base-finetuned-pos](https://huggingface.co/mrm8488/RuPERTa-base-finetuned-pos)
### Fast usage for LM with `pipelines` 🧪
```python
from transformers import AutoModelWithLMHead, AutoTokenizer
model = AutoModelWithLMHead.from_pretrained('mrm8488/RuPERTa-base')
tokenizer = AutoTokenizer.from_pretrained("mrm8488/RuPERTa-base", do_lower_case=True)
from transformers import pipeline
pipeline_fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
pipeline_fill_mask("España es un país muy <mask> en la UE")
```
```json
[
{
"score": 0.1814306527376175,
"sequence": "<s> españa es un país muy importante en la ue</s>",
"token": 1560
},
{
"score": 0.024842597544193268,
"sequence": "<s> españa es un país muy fuerte en la ue</s>",
"token": 2854
},
{
"score": 0.02473250962793827,
"sequence": "<s> españa es un país muy pequeño en la ue</s>",
"token": 2948
},
{
"score": 0.023991240188479424,
"sequence": "<s> españa es un país muy antiguo en la ue</s>",
"token": 5240
},
{
"score": 0.0215945765376091,
"sequence": "<s> españa es un país muy popular en la ue</s>",
"token": 5782
}
]
```
## Acknowledgments
I thank [🤗/transformers team](https://github.com/huggingface/transformers) for answering my doubts and Google for helping me with the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc) program.
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: es
thumbnail:
---
# Spanish TinyBERT + NER
This model is a version of the [Spanish Tiny Bert](https://huggingface.co/mrm8488/es-tinybert-v1-1) model, which I created using *distillation*, fine-tuned on [NER-C](https://www.kaggle.com/nltkdata/conll-corpora) for the **NER** downstream task. The **size** of the model is **55MB**.
## Details of the downstream task (NER) - Dataset
- [Dataset: CONLL Corpora ES](https://www.kaggle.com/nltkdata/conll-corpora)
I preprocessed the dataset and split it as train / dev (80/20)
| Dataset | # Examples |
| ---------------------- | ----- |
| Train | 8.7 K |
| Dev | 2.2 K |
- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner_old.py)
- Labels covered:
```
B-LOC
B-MISC
B-ORG
B-PER
I-LOC
I-MISC
I-ORG
I-PER
O
```
## Metrics on evaluation set:
| Metric | # score |
| :------------------------------------------------------------------------------------: | :-------: |
| F1 | **70.00** |
| Precision | **67.83** |
| Recall | **71.46** |
## Comparison:
| Model | # F1 score |Size(MB)|
| :--------------------------------------------------------------------------------------------------------------: | :-------: |:------|
| bert-base-spanish-wwm-cased (BETO) | 88.43 | 421 |
| [bert-spanish-cased-finetuned-ner](https://huggingface.co/mrm8488/bert-spanish-cased-finetuned-ner) | **90.17** | 420 |
| Best Multilingual BERT | 87.38 | 681 |
|TinyBERT-spanish-uncased-finetuned-ner (this one) | 70.00 | **55** |
## Model in action
Example of usage:
```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer
id2label = {
"0": "B-LOC",
"1": "B-MISC",
"2": "B-ORG",
"3": "B-PER",
"4": "I-LOC",
"5": "I-MISC",
"6": "I-ORG",
"7": "I-PER",
"8": "O"
}
tokenizer = AutoTokenizer.from_pretrained('mrm8488/TinyBERT-spanish-uncased-finetuned-ner')
model = AutoModelForTokenClassification.from_pretrained('mrm8488/TinyBERT-spanish-uncased-finetuned-ner')
text = "Mis amigos están pensando viajar a Londres este verano."
input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
outputs = model(input_ids)
logits = outputs[0]  # per-token classification scores
words = text.split(" ")
# Naive alignment: token position i (after the start token) is paired with the i-th word
for m in logits:
    for index, n in enumerate(m):
        if 0 < index <= len(words):
            print(words[index - 1] + ": " + id2label[str(torch.argmax(n).item())])
'''
Output:
--------
Mis: O
amigos: O
están: O
pensando: O
viajar: O
a: O
Londres: B-LOC
este: O
verano.: O
'''
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: de
datasets:
- xtreme
widget:
- text: "Winarsky ist Mitglied des IEEE, Phi Beta Kappa, des ACM und des Sigma Xi. Winarsky ist Mitglied des ACM, des IEEE, der Phi Beta Kappa und der Sigma Xi."
---
# bert-base-german-dbmdz-cased fine-tuned on PAWS-X-de for Paraphrase Identification
---
language: de
---
# German BERT + LER (Legal Entity Recognition) ⚖️
German BERT ([BERT-base-german-cased](https://huggingface.co/bert-base-german-cased)) fine-tuned on [Legal-Entity-Recognition](https://github.com/elenanereiss/Legal-Entity-Recognition) dataset for **LER** (NER) downstream task.
## Details of the downstream task (NER) - Dataset
[Legal-Entity-Recognition](https://github.com/elenanereiss/Legal-Entity-Recognition): Fine-grained Named Entity Recognition in Legal Documents.
Court decisions from 2017 and 2018 were selected for the dataset, published online by the [Federal Ministry of Justice and Consumer Protection](http://www.rechtsprechung-im-internet.de). The documents originate from seven federal courts: Federal Labour Court (BAG), Federal Fiscal Court (BFH), Federal Court of Justice (BGH), Federal Patent Court (BPatG), Federal Social Court (BSG), Federal Constitutional Court (BVerfG) and Federal Administrative Court (BVerwG).
| Split | # Samples |
| ---------------------- | ----- |
| Train | 1657048 |
| Eval | 500000 |
- Training script: [Fine-tuning script for NER provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner_old.py)
Colab: [How to fine-tune a model for NER using HF scripts](https://colab.research.google.com/drive/156Qrd7NsUHwA3nmQ6gXdZY0NzOvqk9AT?usp=sharing)
- Labels covered (and their distribution):
```
107 B-AN
918 B-EUN
2238 B-GRT
13282 B-GS
1113 B-INN
704 B-LD
151 B-LDS
2490 B-LIT
282 B-MRK
890 B-ORG
1374 B-PER
1480 B-RR
10046 B-RS
401 B-ST
68 B-STR
1011 B-UN
282 B-VO
391 B-VS
2648 B-VT
46 I-AN
6925 I-EUN
1957 I-GRT
70257 I-GS
2931 I-INN
153 I-LD
26 I-LDS
28881 I-LIT
383 I-MRK
1185 I-ORG
330 I-PER
106 I-RR
138938 I-RS
34 I-ST
55 I-STR
1259 I-UN
1572 I-VO
2488 I-VS
11121 I-VT
1348525 O
```
- [Annotation Guidelines (German)](https://github.com/elenanereiss/Legal-Entity-Recognition/blob/master/docs/Annotationsrichtlinien.pdf)
## Metrics on evaluation set
| Metric | # score |
| :------------------------------------------------------------------------------------: | :-------: |
| F1 | **85.67** |
| Precision | **84.35** |
| Recall | **87.04** |
| Accuracy | **98.46** |
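Token-level accuracy can sit well above F1 when most tokens carry the `O` label. A quick sketch of the majority-class baseline, using the counts from the label distribution listed earlier; those counts sum to the size of the train split, so this assumes the distribution describes that split, and the evaluation split may differ.

```python
# Token counts copied from the label distribution listed above
o_tokens = 1348525
total_tokens = 1657048  # matches the size of the train split

majority_baseline = 100 * o_tokens / total_tokens
print(f"Accuracy of always predicting 'O': {majority_baseline:.1f}%")
# → Accuracy of always predicting 'O': 81.4%
```

Under that assumption, a tagger that never predicts an entity would already reach roughly 81% accuracy, which is why F1 is the more informative number here.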
## Model in action
Fast usage with **pipelines**:
```python
from transformers import pipeline
nlp_ler = pipeline(
    "ner",
    model="mrm8488/bert-base-german-finetuned-ler",
    tokenizer="mrm8488/bert-base-german-finetuned-ler"
)
text = "Your German legal text here"
nlp_ler(text)
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: es
thumbnail: https://i.imgur.com/jgBdimh.png
---
# BETO (Spanish BERT) + Spanish SQuAD2.0
This is the BETO model provided by the [BETO team](https://github.com/dccuchile/beto), fine-tuned on [SQuAD-es-v2.0](https://github.com/ccasimiro88/TranslateAlignRetrieve) for the **Q&A** downstream task.
## Details of the language model ('dccuchile/bert-base-spanish-wwm-cased')
Language model ([**'dccuchile/bert-base-spanish-wwm-cased'**](https://github.com/dccuchile/beto/blob/master/README.md)):
BETO is a [BERT model](https://github.com/google-research/bert) trained on a [big Spanish corpus](https://github.com/josecannete/spanish-corpora). BETO is of size similar to a BERT-Base and was trained with the Whole Word Masking technique. Below you find Tensorflow and Pytorch checkpoints for the uncased and cased versions, as well as some results for Spanish benchmarks comparing BETO with [Multilingual BERT](https://github.com/google-research/bert/blob/master/multilingual.md) as well as other (not BERT-based) models.
## Details of the downstream task (Q&A) - Dataset
[SQuAD-es-v2.0](https://github.com/ccasimiro88/TranslateAlignRetrieve)
| Dataset | # Q&A |
| ---------------------- | ----- |
| SQuAD2.0 Train | 130 K |
| SQuAD2.0-es-v2.0 | 111 K |
| SQuAD2.0 Dev | 12 K |
| SQuAD-es-v2.0-small Dev| 69 K |
## Model training
The model was trained on a Tesla P100 GPU and 25GB of RAM with the following command:
```bash
export SQUAD_DIR=path/to/nl_squad
python transformers/examples/question-answering/run_squad.py \
--model_type bert \
--model_name_or_path dccuchile/bert-base-spanish-wwm-cased \
--do_train \
--do_eval \
--do_lower_case \
--train_file $SQUAD_DIR/train_nl-v2.0.json \
--predict_file $SQUAD_DIR/dev_nl-v2.0.json \
--per_gpu_train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /content/model_output \
--save_steps 5000 \
--threads 4 \
--version_2_with_negative
```
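The `--max_seq_length` and `--doc_stride` flags control how contexts longer than the model input are split into overlapping windows. The following is a simplified sketch of that sliding-window idea, not the script's actual code: it ignores the room reserved for the question and special tokens, and `split_into_windows` is a hypothetical helper name.

```python
def split_into_windows(n_tokens, window_size, stride):
    """Return (start, end) token ranges covering a context of n_tokens."""
    windows = []
    start = 0
    while True:
        end = min(start + window_size, n_tokens)
        windows.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return windows


# Hypothetical 600-token context; 384 and 128 mirror --max_seq_length and --doc_stride
print(split_into_windows(600, 384, 128))
# → [(0, 384), (128, 512), (256, 600)]
```

Because consecutive windows overlap, an answer that falls near the edge of one window is usually fully contained in a neighbouring one.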
## Results:
| Metric | # Value |
| ---------------------- | ----- |
| **Exact** | **76.5050** |
| **F1** | **86.0781** |
```json
{
"exact": 76.50501430594491,
"f1": 86.07818773108252,
"total": 69202,
"HasAns_exact": 67.93020719738277,
"HasAns_f1": 82.37912207996466,
"HasAns_total": 45850,
"NoAns_exact": 93.34104145255225,
"NoAns_f1": 93.34104145255225,
"NoAns_total": 23352,
"best_exact": 76.51223953064941,
"best_exact_thresh": 0.0,
"best_f1": 86.08541295578848,
"best_f1_thresh": 0.0
}
```
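The overall `exact` and `f1` values are example-weighted averages of the `HasAns` and `NoAns` subsets, which can be checked directly from the numbers above. A short sketch; the `combine` helper is a hypothetical name.

```python
# Values copied from the raw metrics above
has_ans = {"exact": 67.93020719738277, "f1": 82.37912207996466, "total": 45850}
no_ans = {"exact": 93.34104145255225, "f1": 93.34104145255225, "total": 23352}


def combine(metric):
    """Example-weighted average of a metric over both subsets."""
    total = has_ans["total"] + no_ans["total"]
    weighted = has_ans[metric] * has_ans["total"] + no_ans[metric] * no_ans["total"]
    return weighted / total


print(f"{combine('exact'):.4f}")  # → 76.5050
print(f"{combine('f1'):.4f}")     # → 86.0782
```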
### Model in action (in a Colab Notebook)
<details>
1. Set the context and ask some questions:
![Set context and questions](https://media.giphy.com/media/mCIaBpfN0LQcuzkA2F/giphy.gif)
2. Run predictions:
![Run the model](https://media.giphy.com/media/WT453aptcbCP7hxWTZ/giphy.gif)
</details>
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: it
thumbnail:
---
# Italian BERT fine-tuned on SQuAD_it v1
[Italian BERT base cased](https://huggingface.co/dbmdz/bert-base-italian-cased) fine-tuned on [italian SQuAD](https://github.com/crux82/squad-it) for **Q&A** downstream task.
## Details of Italian BERT
The source data for the Italian BERT model consists of a recent Wikipedia dump and various texts from the OPUS corpora collection. The final training corpus has a size of 13GB and 2,050,057,573 tokens.
For sentence splitting, we use NLTK (faster compared to spacy). Our cased and uncased models are trained with an initial sequence length of 512 subwords for ~2-3M steps.
For the XXL Italian models, we use the same training data from OPUS and extend it with data from the Italian part of the OSCAR corpus. Thus, the final training corpus has a size of 81GB and 13,138,379,147 tokens.
More in its official [model card](https://huggingface.co/dbmdz/bert-base-italian-cased)
Created by [Stefan](https://huggingface.co/stefan-it) at [MDZ](https://huggingface.co/dbmdz)
## Details of the downstream task (Q&A) - Dataset 📚 🧐 ❓
[Italian SQuAD v1.1](https://rajpurkar.github.io/SQuAD-explorer/) is derived from the SQuAD dataset through semi-automatic translation into Italian. It is a large-scale dataset for open question answering on factoid questions in Italian.
**The dataset contains more than 60,000 question/answer pairs derived from the original English dataset.** The dataset is split into training and test sets to support the replicability of the benchmarking of QA systems:
- `SQuAD_it-train.json`: it contains training examples derived from the original SQuAD 1.1 training material.
- `SQuAD_it-test.json`: it contains test/benchmarking examples derived from the original SQuAD 1.1 development material.
More details about SQuAD-it can be found in [Croce et al. 2018]. The original paper can be found at this [link](https://link.springer.com/chapter/10.1007/978-3-030-03840-3_29).
## Model training 🏋️‍
The model was trained on a Tesla P100 GPU and 25GB of RAM.
The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)
## Results 📝
| Metric | # Value |
| ------ | --------- |
| **EM** | **62.51** |
| **F1** | **74.16** |
### Raw metrics
```json
{
"exact": 62.5180707057432,
"f1": 74.16038329042492,
"total": 7609,
"HasAns_exact": 62.5180707057432,
"HasAns_f1": 74.16038329042492,
"HasAns_total": 7609,
"best_exact": 62.5180707057432,
"best_exact_thresh": 0.0,
"best_f1": 74.16038329042492,
"best_f1_thresh": 0.0
}
```
## Comparison ⚖️
| Model | EM | F1 score |
| -------------------------------------------------------------------------------------------------------------------------------- | --------- | --------- |
| [DrQA-it trained on SQuAD-it ](https://github.com/crux82/squad-it/blob/master/README.md#evaluating-a-neural-model-over-squad-it) | 56.1 | 65.9 |
| This one | **62.51** | **74.16** |
## Model in action 🚀
Fast usage with **pipelines** 🧪
```python
from transformers import pipeline
nlp_qa = pipeline(
    'question-answering',
    model='mrm8488/bert-italian-finedtuned-squadv1-it-alfa',
    tokenizer='mrm8488/bert-italian-finedtuned-squadv1-it-alfa'
)
nlp_qa(
    {
        'question': 'Per quale lingua stai lavorando?',
        'context': 'Manuel Romero è colaborando attivamente con HF / trasformatori per il trader del poder de las últimas ' +
                   'técnicas di procesamiento de lenguaje natural al idioma español'
    }
)
# Output: {'answer': 'español', 'end': 174, 'score': 0.9925341537498156, 'start': 168}
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
Dataset citation
<details>
@InProceedings{10.1007/978-3-030-03840-3_29,
author="Croce, Danilo and Zelenanska, Alexandra and Basili, Roberto",
editor="Ghidini, Chiara and Magnini, Bernardo and Passerini, Andrea and Traverso, Paolo",
title="Neural Learning for Question Answering in Italian",
booktitle="AI*IA 2018 -- Advances in Artificial Intelligence",
year="2018",
publisher="Springer International Publishing",
address="Cham",
pages="389--402",
isbn="978-3-030-03840-3"
}
</details>
---
language: en
thumbnail:
---
# BERT-Medium fine-tuned on SQuAD v2
[BERT-Medium](https://github.com/google-research/bert/) created by [Google Research](https://github.com/google-research) and fine-tuned on [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) for **Q&A** downstream task.
**Model size** (after training): **157.46 MB**
## Details of BERT-Medium and its 'family' (from their documentation)
Released on March 11th, 2020
This model is part of a set of 24 smaller BERT models (English only, uncased, trained with WordPiece masking) referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962).
The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
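In the distillation setting described above, the student is trained against the teacher's output distribution rather than hard labels. A minimal sketch of such a soft-label loss in plain Python; the logits, the temperature value, and the function names are hypothetical and are not taken from the paper or from this model's training.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's distribution against the teacher's soft labels."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


# Hypothetical logits for a 3-class problem
teacher = [4.0, 1.0, 0.5]
print(distillation_loss([3.5, 1.2, 0.3], teacher))  # student close to the teacher
print(distillation_loss([0.2, 3.0, 1.0], teacher))  # student far from the teacher
```

The loss is smaller for the student whose predictions are closer to the teacher's, which is the signal a distillation objective of this kind optimizes.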
## Details of the downstream task (Q&A) - Dataset
[SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/) combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
| Dataset | Split | # samples |
| -------- | ----- | --------- |
| SQuAD2.0 | train | 130k |
| SQuAD2.0 | eval | 12.3k |
## Model training
The model was trained on a Tesla P100 GPU and 25GB of RAM.
The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)
## Results:
| Metric | # Value |
| ------ | --------- |
| **EM** | **65.95** |
| **F1** | **70.11** |
### Raw metrics from benchmark included in training script:
```json
{
"exact": 65.95637159942727,
"f1": 70.11632254245896,
"total": 11873,
"HasAns_exact": 67.79689608636977,
"HasAns_f1": 76.12872765631123,
"HasAns_total": 5928,
"NoAns_exact": 64.12111017661901,
"NoAns_f1": 64.12111017661901,
"NoAns_total": 5945,
"best_exact": 65.96479407058031,
"best_exact_thresh": 0.0,
"best_f1": 70.12474501361196,
"best_f1_thresh": 0.0
}
```
## Comparison:
| Model | EM | F1 score | SIZE (MB) |
| --------------------------------------------------------------------------------------------- | --------- | --------- | --------- |
| [bert-tiny-finetuned-squadv2](https://huggingface.co/mrm8488/bert-tiny-finetuned-squadv2) | 48.60 | 49.73 | **16.74** |
| [bert-tiny-5-finetuned-squadv2](https://huggingface.co/mrm8488/bert-tiny-5-finetuned-squadv2) | 57.12 | 60.86 | 24.34 |
| [bert-mini-finetuned-squadv2](https://huggingface.co/mrm8488/bert-mini-finetuned-squadv2) | 56.31 | 59.65 | 42.63 |
| [bert-mini-5-finetuned-squadv2](https://huggingface.co/mrm8488/bert-mini-5-finetuned-squadv2) | 63.51 | 66.78 | 66.76 |
| [bert-small-finetuned-squadv2](https://huggingface.co/mrm8488/bert-small-finetuned-squadv2) | 60.49 | 64.21 | 109.74 |
| [bert-medium-finetuned-squadv2](https://huggingface.co/mrm8488/bert-medium-finetuned-squadv2) | **65.95** | **70.11** | 157.46 |
## Model in action
Fast usage with **pipelines**:
```python
from transformers import pipeline
qa_pipeline = pipeline(
    "question-answering",
    model="mrm8488/bert-medium-finetuned-squadv2",
    tokenizer="mrm8488/bert-medium-finetuned-squadv2"
)
qa_pipeline({
    'context': "Manuel Romero has been working hardly in the repository hugginface/transformers lately",
    'question': "Who has been working hard for hugginface/transformers lately?"
})
# Output:
```
```json
{
"answer": "Manuel Romero",
"end": 13,
"score": 0.9939319924374637,
"start": 0
}
```
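In this output, `start` and `end` are character offsets into the context, with `end` exclusive, so slicing the context recovers the answer. A small check using only the values shown above:

```python
context = "Manuel Romero has been working hardly in the repository hugginface/transformers lately"
# Output copied from the pipeline result above
result = {"answer": "Manuel Romero", "start": 0, "end": 13}

span = context[result["start"]:result["end"]]
print(span)  # → Manuel Romero
```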
### Yes! That was easy 🎉 Let's try with another example
```python
qa_pipeline({
    'context': "Manuel Romero has been working remotely in the repository hugginface/transformers lately",
    'question': "How has been working Manuel Romero?"
})
# Output:
```
```json
{ "answer": "remotely", "end": 39, "score": 0.3612058272768017, "start": 31 }
```
### It works!! 🎉 🎉 🎉
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: en
thumbnail:
---
# BERT-Mini fine-tuned on SQuAD v2
[BERT-Mini](https://github.com/google-research/bert/) created by [Google Research](https://github.com/google-research) and fine-tuned on [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) for **Q&A** downstream task.
**Model size** (after training): **42.63 MB**
## Details of BERT-Mini and its 'family' (from their documentation)
Released on March 11th, 2020
This model is part of a set of 24 smaller BERT models (English only, uncased, trained with WordPiece masking) referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962).
The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
## Details of the downstream task (Q&A) - Dataset
[SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/) combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
| Dataset | Split | # samples |
| -------- | ----- | --------- |
| SQuAD2.0 | train | 130k |
| SQuAD2.0 | eval | 12.3k |
## Model training
The model was trained on a Tesla P100 GPU and 25GB of RAM.
The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)
## Results:
| Metric | # Value |
| ------ | --------- |
| **EM** | **56.31** |
| **F1** | **59.65** |
## Comparison:
| Model | EM | F1 score | SIZE (MB) |
| ----------------------------------------------------------------------------------------- | --------- | --------- | --------- |
| [bert-tiny-finetuned-squadv2](https://huggingface.co/mrm8488/bert-tiny-finetuned-squadv2) | 48.60 | 49.73 | **16.74** |
| [bert-tiny-5-finetuned-squadv2](https://huggingface.co/mrm8488/bert-tiny-5-finetuned-squadv2) | 57.12 | 60.86 | 24.34 |
| [bert-mini-finetuned-squadv2](https://huggingface.co/mrm8488/bert-mini-finetuned-squadv2) | 56.31 | 59.65 | 42.63 |
| [bert-mini-5-finetuned-squadv2](https://huggingface.co/mrm8488/bert-mini-5-finetuned-squadv2) | **63.51** | **66.78** | 66.76 |
## Model in action
Fast usage with **pipelines**:
```python
from transformers import pipeline
qa_pipeline = pipeline(
    "question-answering",
    model="mrm8488/bert-mini-finetuned-squadv2",
    tokenizer="mrm8488/bert-mini-finetuned-squadv2"
)
qa_pipeline({
    'context': "Manuel Romero has been working hardly in the repository hugginface/transformers lately",
    'question': "Who has been working hard for hugginface/transformers lately?"
})
# Output:
```
```json
{
"answer": "Manuel Romero",
"end": 13,
"score": 0.9676484207783673,
"start": 0
}
```
### Yes! That was easy 🎉 Let's try with another example
```python
qa_pipeline({
    'context': "Manuel Romero has been working hardly in the repository hugginface/transformers lately",
    'question': "For which company has worked Manuel Romero?"
})
# Output:
```
```json
{
"answer": "hugginface/transformers",
"end": 79,
"score": 0.5301655914731853,
"start": 56
}
```
### It works!! 🎉 🎉 🎉
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: en
license: apache-2.0
datasets:
- cnn_dailymail
tags:
- summarization
---
# Bert-mini2Bert-mini Summarization with 🤗EncoderDecoder Framework
This model is a warm-started *BERT2BERT* ([mini](https://huggingface.co/google/bert_uncased_L-4_H-256_A-4)) model fine-tuned on the *CNN/Dailymail* summarization dataset.
The model achieves a **16.51** ROUGE-2 score on *CNN/Dailymail*'s test dataset.
For more details on how the model was fine-tuned, please refer to
[this](https://colab.research.google.com/drive/1Ekd5pUeCX7VOrMx94_czTkwNtLN32Uyu?usp=sharing) notebook.
## Results on test set 📝
| Metric | # Value |
| ------ | --------- |
| **ROUGE-2** | **16.51** |
## Model in Action 🚀
```python
from transformers import BertTokenizerFast, EncoderDecoderModel
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = BertTokenizerFast.from_pretrained('mrm8488/bert-mini2bert-mini-finetuned-cnn_daily_mail-summarization')
model = EncoderDecoderModel.from_pretrained('mrm8488/bert-mini2bert-mini-finetuned-cnn_daily_mail-summarization').to(device)
def generate_summary(text):
    # cut off at BERT max length 512
    inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)
    output = model.generate(input_ids, attention_mask=attention_mask)
    return tokenizer.decode(output[0], skip_special_tokens=True)
text = "your text to be summarized here..."
generate_summary(text)
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: multilingual
thumbnail:
---
# A fine-tuned model on GoldP task from Tydi QA dataset
This model starts from [bert-multi-cased-finetuned-xquadv1](https://huggingface.co/mrm8488/bert-multi-cased-finetuned-xquadv1) and is fine-tuned on the [TyDi QA](https://github.com/google-research-datasets/tydiqa) dataset for the Gold Passage task [(GoldP)](https://github.com/google-research-datasets/tydiqa#the-tasks).
## Details of the language model
The base language model [(bert-multi-cased-finetuned-xquadv1)](https://huggingface.co/mrm8488/bert-multi-cased-finetuned-xquadv1) is a fine-tuned version of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) for the **Q&A** downstream task.
## Details of the Tydi QA dataset
TyDi QA contains 200k human-annotated question-answer pairs in 11 Typologically Diverse languages, written without seeing the answer and without the use of translation, and is designed for the **training and evaluation** of automatic question answering systems. This repository provides evaluation code and a baseline system for the dataset. https://ai.google.com/research/tydiqa
## Details of the downstream task (Gold Passage or GoldP aka the secondary task)
Given a passage that is guaranteed to contain the answer, predict the single contiguous span of characters that answers the question. The gold passage task differs from the [primary task](https://github.com/google-research-datasets/tydiqa/blob/master/README.md#the-tasks) in several ways:
* only the gold answer passage is provided rather than the entire Wikipedia article;
* unanswerable questions have been discarded, similar to MLQA and XQuAD;
* we evaluate with the SQuAD 1.1 metrics like XQuAD; and
* Thai and Japanese are removed since the lack of whitespace breaks some tools.
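The SQuAD 1.1 metrics mentioned above are Exact Match and token-level F1. The block below is a minimal sketch modelled on the normalization steps of the SQuAD v1.1 evaluation script (lowercasing, stripping punctuation and English articles, collapsing whitespace). Note that the article removal is English-specific, and the official TyDi QA evaluation code may differ in details.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction, gold):
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The West", "west")      # articles and case are ignored
f1 = f1_score("in the West", "the West")  # one extra token lowers precision
```

A dataset-level score is the average of these per-question values, taking the maximum over the gold answers when a question has several.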
## Model training
The model was fine-tuned on a Tesla P100 GPU with 25 GB of RAM.
The command used is the following:
```bash
python run_squad.py \
  --model_type bert \
  --model_name_or_path mrm8488/bert-multi-cased-finetuned-xquadv1 \
  --do_train \
  --do_eval \
  --train_file /content/dataset/train.json \
  --predict_file /content/dataset/dev.json \
  --per_gpu_train_batch_size 24 \
  --per_gpu_eval_batch_size 24 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.5 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /content/model_output \
  --overwrite_output_dir \
  --save_steps 5000 \
  --threads 40
```
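The `--max_seq_length 384` and `--doc_stride 128` flags control how passages longer than the model input are cut into overlapping windows. The block below is a sketch of that windowing, following the loop used in the original BERT SQuAD script. The function name `sliding_windows` and the example token counts are hypothetical, and the exact behaviour of the `run_squad.py` version used here is an assumption.

```python
def sliding_windows(num_doc_tokens, max_doc_tokens, doc_stride):
    """Return (start, length) spans covering a tokenized passage.

    Each new window starts `doc_stride` tokens after the previous one,
    so consecutive windows overlap whenever max_doc_tokens > doc_stride.
    """
    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(max_doc_tokens, num_doc_tokens - start)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # the passage is fully covered
        start += min(length, doc_stride)
    return spans


# Hypothetical numbers: an 11-token question leaves 384 - 11 - 3 = 370
# positions for the passage ([CLS] and two [SEP] tokens take the other 3).
spans = sliding_windows(num_doc_tokens=500, max_doc_tokens=370, doc_stride=128)
# spans == [(0, 370), (128, 370), (256, 244)]
```

Because the windows overlap, an answer that is cut off at the edge of one window usually appears whole in a neighbouring one.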
## Global Results (dev set):
| Metric | # Value |
| --------- | ----------- |
| **Exact** | **71.06** |
| **F1** | **82.16** |
## Specific Results (per language):
| Language | # Samples | # Exact | # F1 |
| --------- | ----------- |--------| ------ |
| Arabic | 1314 | 73.29 | 84.72 |
| Bengali | 180 | 64.60 | 77.84 |
| English | 654 | 72.12 | 82.24 |
| Finnish | 1031 | 70.14 | 80.36 |
| Indonesian| 773 | 77.25 | 86.36 |
| Korean | 414 | 68.92 | 70.95 |
| Russian | 1079 | 62.65 | 78.55 |
| Swahili | 596 | 80.11 | 86.18 |
| Telugu | 874 | 71.00 | 84.24 |
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: multilingual
thumbnail:
---
# BERT (base-multilingual-cased) fine-tuned for multilingual Q&A
This model was created by [Google](https://github.com/google-research/bert/blob/master/multilingual.md) and fine-tuned on [XQuAD](https://github.com/deepmind/xquad)-like data for the multilingual (`11 different languages`) **Q&A** downstream task.
## Details of the language model ('bert-base-multilingual-cased')
[Language model](https://github.com/google-research/bert/blob/master/multilingual.md)
| Languages | Heads | Layers | Hidden | Params |
| --------- | ----- | ------ | ------ | ------ |
| 104 | 12 | 12 | 768 | 100 M |
## Details of the downstream task (multilingual Q&A) - Dataset
Deepmind [XQuAD](https://github.com/deepmind/xquad)
Languages covered:
- Arabic: `ar`
- German: `de`
- Greek: `el`
- English: `en`
- Spanish: `es`
- Hindi: `hi`
- Russian: `ru`
- Thai: `th`
- Turkish: `tr`
- Vietnamese: `vi`
- Chinese: `zh`
As the dataset is based on SQuAD v1.1, there are no unanswerable questions in the data. We chose this
setting so that models can focus on cross-lingual transfer.
We show the average number of tokens per paragraph, question, and answer for each language in the
table below. The statistics were obtained using [Jieba](https://github.com/fxsjy/jieba) for Chinese
and the [Moses tokenizer](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl)
for the other languages.
| | en | es | de | el | ru | tr | ar | vi | th | zh | hi |
| --------- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Paragraph | 142.4 | 160.7 | 139.5 | 149.6 | 133.9 | 126.5 | 128.2 | 191.2 | 158.7 | 147.6 | 232.4 |
| Question | 11.5 | 13.4 | 11.0 | 11.7 | 10.0 | 9.8 | 10.7 | 14.8 | 11.5 | 10.5 | 18.7 |
| Answer | 3.1 | 3.6 | 3.0 | 3.3 | 3.1 | 3.1 | 3.1 | 4.5 | 4.1 | 3.5 | 5.6 |
Citation:
<details>
```bibtex
@article{Artetxe:etal:2019,
author = {Mikel Artetxe and Sebastian Ruder and Dani Yogatama},
title = {On the cross-lingual transferability of monolingual representations},
journal = {CoRR},
volume = {abs/1910.11856},
year = {2019},
archivePrefix = {arXiv},
eprint = {1910.11856}
}
```
</details>
As **XQuAD** is just an evaluation dataset, I used data augmentation techniques (scraping, neural machine translation, etc.) to obtain more samples, then split the data into a train set and a test set. The test set was built so that it contains the same number of samples for each language. The result:
| Dataset | # samples |
| ----------- | --------- |
| XQUAD train | 50 K |
| XQUAD test | 8 K |
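A split with the same number of test samples per language can be sketched as follows. This is only an illustration of the idea, not the script used to build the sets above: the `lang` field, the function name `split_balanced` and the toy sample counts are all assumptions.

```python
import random
from collections import defaultdict


def split_balanced(samples, test_per_language, seed=0):
    """Hold out `test_per_language` samples of every language for testing."""
    by_lang = defaultdict(list)
    for sample in samples:
        by_lang[sample["lang"]].append(sample)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for lang in sorted(by_lang):
        items = list(by_lang[lang])
        rng.shuffle(items)
        test.extend(items[:test_per_language])
        train.extend(items[test_per_language:])
    return train, test


# Toy data with an uneven language distribution.
samples = (
    [{"lang": "en", "id": i} for i in range(10)]
    + [{"lang": "es", "id": i} for i in range(7)]
    + [{"lang": "hi", "id": i} for i in range(5)]
)
train, test = split_balanced(samples, test_per_language=3)
```

With this approach the train set keeps whatever imbalance the augmented data has, while every language is evaluated on the same number of examples.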
## Model training
The model was trained on a Tesla P100 GPU with 25 GB of RAM.
The script for fine-tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/distillation/run_squad_w_distillation.py).
## Model in action
Fast usage with **pipelines**:
```python
from transformers import pipeline
qa_pipeline = pipeline(
"question-answering",
model="mrm8488/bert-multi-cased-finetuned-xquadv1",
tokenizer="mrm8488/bert-multi-cased-finetuned-xquadv1"
)
# context: Coronavirus is seeding panic in the West because it expands so fast.
# question: Where is Coronavirus seeding panic?
qa_pipeline({
'context': "कोरोनावायरस पश्चिम में आतंक बो रहा है क्योंकि यह इतनी तेजी से फैलता है।",
'question': "कोरोनावायरस घबराहट कहां है?"
})
# output: {'answer': 'पश्चिम', 'end': 18, 'score': 0.7037217439689059, 'start': 12}
qa_pipeline({
'context': "Manuel Romero has been working hardly in the repository hugginface/transformers lately",
'question': "Who has been working hard for hugginface/transformers lately?"
})
# output: {'answer': 'Manuel Romero', 'end': 13, 'score': 0.7254485993702389, 'start': 0}
qa_pipeline({
'context': "Manuel Romero a travaillé à peine dans le référentiel hugginface / transformers ces derniers temps",
'question': "Pour quel référentiel a travaillé Manuel Romero récemment?"
})
# output: {'answer': 'hugginface / transformers', 'end': 79, 'score': 0.6482061613915384, 'start': 54}
```
![model in action](https://media.giphy.com/media/MBlire8Wj7ng73VBQ5/giphy.gif)
Try it on a Colab:
<a href="https://colab.research.google.com/github/mrm8488/shared_colab_notebooks/blob/master/Try_mrm8488_xquad_finetuned_model.ipynb" target="_parent"><img src="https://camo.githubusercontent.com/52feade06f2fecbf006889a904d221e6a730c194/68747470733a2f2f636f6c61622e72657365617263682e676f6f676c652e636f6d2f6173736574732f636f6c61622d62616467652e737667" alt="Open In Colab" data-canonical-src="https://colab.research.google.com/assets/colab-badge.svg"></a>
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: multilingual
thumbnail:
---
# BERT (base-multilingual-uncased) fine-tuned for multilingual Q&A
This model was created by [Google](https://github.com/google-research/bert/blob/master/multilingual.md) and fine-tuned on [XQuAD](https://github.com/deepmind/xquad)-like data for the multilingual (`11 different languages`) **Q&A** downstream task.
## Details of the language model ('bert-base-multilingual-uncased')
[Language model](https://github.com/google-research/bert/blob/master/multilingual.md)
| Languages | Heads | Layers | Hidden | Params |
| --------- | ----- | ------ | ------ | ------ |
| 102 | 12 | 12 | 768 | 100 M |
## Details of the downstream task (multilingual Q&A) - Dataset
Deepmind [XQuAD](https://github.com/deepmind/xquad)
Languages covered:
- Arabic: `ar`
- German: `de`
- Greek: `el`
- English: `en`
- Spanish: `es`
- Hindi: `hi`
- Russian: `ru`
- Thai: `th`
- Turkish: `tr`
- Vietnamese: `vi`
- Chinese: `zh`
As the dataset is based on SQuAD v1.1, there are no unanswerable questions in the data. We chose this
setting so that models can focus on cross-lingual transfer.
We show the average number of tokens per paragraph, question, and answer for each language in the
table below. The statistics were obtained using [Jieba](https://github.com/fxsjy/jieba) for Chinese
and the [Moses tokenizer](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl)
for the other languages.
| | en | es | de | el | ru | tr | ar | vi | th | zh | hi |
| --------- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Paragraph | 142.4 | 160.7 | 139.5 | 149.6 | 133.9 | 126.5 | 128.2 | 191.2 | 158.7 | 147.6 | 232.4 |
| Question | 11.5 | 13.4 | 11.0 | 11.7 | 10.0 | 9.8 | 10.7 | 14.8 | 11.5 | 10.5 | 18.7 |
| Answer | 3.1 | 3.6 | 3.0 | 3.3 | 3.1 | 3.1 | 3.1 | 4.5 | 4.1 | 3.5 | 5.6 |
Citation:
<details>
```bibtex
@article{Artetxe:etal:2019,
author = {Mikel Artetxe and Sebastian Ruder and Dani Yogatama},
title = {On the cross-lingual transferability of monolingual representations},
journal = {CoRR},
volume = {abs/1910.11856},
year = {2019},
archivePrefix = {arXiv},
eprint = {1910.11856}
}
```
</details>
As **XQuAD** is just an evaluation dataset, I used data augmentation techniques (scraping, neural machine translation, etc.) to obtain more samples, then split the data into a train set and a test set. The test set was built so that it contains the same number of samples for each language. The result:
| Dataset | # samples |
| ----------- | --------- |
| XQUAD train | 50 K |
| XQUAD test | 8 K |
## Model training
The model was trained on a Tesla P100 GPU with 25 GB of RAM.
The script for fine-tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/distillation/run_squad_w_distillation.py).
## Model in action
Fast usage with **pipelines**:
```python
from transformers import pipeline
qa_pipeline = pipeline(
"question-answering",
model="mrm8488/bert-multi-uncased-finetuned-xquadv1",
tokenizer="mrm8488/bert-multi-uncased-finetuned-xquadv1"
)
# context: Coronavirus is seeding panic in the West because it expands so fast.
# question: Where is Coronavirus seeding panic?
qa_pipeline({
'context': "कोरोनावायरस पश्चिम में आतंक बो रहा है क्योंकि यह इतनी तेजी से फैलता है।",
'question': "कोरोनावायरस घबराहट कहां है?"
})
# output: {'answer': 'पश्चिम', 'end': 18, 'score': 0.7037217439689059, 'start': 12}
qa_pipeline({
'context': "Manuel Romero has been working hardly in the repository hugginface/transformers lately",
'question': "Who has been working hard for hugginface/transformers lately?"
})
# output: {'answer': 'Manuel Romero', 'end': 13, 'score': 0.7254485993702389, 'start': 0}
qa_pipeline({
'context': "Manuel Romero a travaillé à peine dans le référentiel hugginface / transformers ces derniers temps",
'question': "Pour quel référentiel a travaillé Manuel Romero récemment?"
})
# output: {'answer': 'hugginface / transformers', 'end': 79, 'score': 0.6482061613915384, 'start': 54}
```
![model in action](https://media.giphy.com/media/MBlire8Wj7ng73VBQ5/giphy.gif)
Try it on a Colab:
<a href="https://colab.research.google.com/github/mrm8488/shared_colab_notebooks/blob/master/Try_mrm8488_xquad_finetuned_uncased_model.ipynb" target="_parent"><img src="https://camo.githubusercontent.com/52feade06f2fecbf006889a904d221e6a730c194/68747470733a2f2f636f6c61622e72657365617263682e676f6f676c652e636f6d2f6173736574732f636f6c61622d62616467652e737667" alt="Open In Colab" data-canonical-src="https://colab.research.google.com/assets/colab-badge.svg"></a>
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: en
thumbnail:
---
# BERT-Small fine-tuned on SQuAD v2
[BERT-Small](https://github.com/google-research/bert/) created by [Google Research](https://github.com/google-research) and fine-tuned on [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) for **Q&A** downstream task.
**Model size** (after training): **109.74 MB**
## Details of BERT-Small and its 'family' (from their documentation)
Released on March 11th, 2020
This model is one of 24 smaller BERT models (English only, uncased, trained with WordPiece masking) referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962).
The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
## Details of the downstream task (Q&A) - Dataset
[SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/) combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
| Dataset | Split | # samples |
| -------- | ----- | --------- |
| SQuAD2.0 | train | 130k |
| SQuAD2.0 | eval | 12.3k |
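One common way for a BERT-style model to abstain is to compare the score of the "no answer" prediction with the score of the best answer span, and to return an empty string when the gap exceeds a threshold (the idea behind the `--null_score_diff_threshold` option of the SQuAD scripts). The block below is a sketch of that decision rule only. The function name `pick_answer` and the scores are made up for illustration and do not come from this model.

```python
def pick_answer(span_candidates, null_score, null_threshold=0.0):
    """Choose between the best span and abstaining.

    span_candidates: list of (text, score) pairs, where the score is
                     typically start_logit + end_logit of the span.
    null_score:      score of the "no answer" prediction.
    Returns the best span text, or "" to abstain.
    """
    best_text, best_score = max(span_candidates, key=lambda cand: cand[1])
    if null_score - best_score > null_threshold:
        return ""  # no span is supported well enough by the paragraph
    return best_text


# Hypothetical scores for an answerable and an unanswerable question.
answerable = pick_answer([("Manuel Romero", 8.2), ("Romero", 5.1)], null_score=1.0)
unanswerable = pick_answer([("lately", 0.5)], null_score=4.0)
```

Raising `null_threshold` makes the system answer more often; lowering it makes it abstain more often, so the value is usually tuned on the dev set.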
## Model training
The model was trained on a Tesla P100 GPU with 25 GB of RAM.
The script for fine-tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py).
## Results:
| Metric | # Value |
| ------ | --------- |
| **EM** | **60.49** |
| **F1** | **64.21** |
## Comparison:
| Model | EM | F1 score | SIZE (MB) |
| ------------------------------------------------------------------------------------------- | --------- | --------- | --------- |
| [bert-tiny-finetuned-squadv2](https://huggingface.co/mrm8488/bert-tiny-finetuned-squadv2) | 48.60 | 49.73 | **16.74** |
| [bert-mini-finetuned-squadv2](https://huggingface.co/mrm8488/bert-mini-finetuned-squadv2) | 56.31 | 59.65 | 42.63 |
| [bert-small-finetuned-squadv2](https://huggingface.co/mrm8488/bert-small-finetuned-squadv2) | **60.49** | **64.21** | 109.74 |
## Model in action
Fast usage with **pipelines**:
```python
from transformers import pipeline
qa_pipeline = pipeline(
"question-answering",
model="mrm8488/bert-small-finetuned-squadv2",
tokenizer="mrm8488/bert-small-finetuned-squadv2"
)
qa_pipeline({
'context': "Manuel Romero has been working hardly in the repository hugginface/transformers lately",
'question': "Who has been working hard for hugginface/transformers lately?"
})
# Output:
```
```json
{
"answer": "Manuel Romero",
"end": 13,
"score": 0.9939319924374637,
"start": 0
}
```
### Yes! That was easy 🎉 Let's try with another example
```python
qa_pipeline({
'context': "Manuel Romero has been working hardly in the repository hugginface/transformers lately",
'question': "For which company has worked Manuel Romero?"
})
# Output:
```
```json
{
"answer": "hugginface/transformers",
"end": 79,
"score": 0.6024888734447131,
"start": 56
}
```
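The `start` and `end` fields in the outputs above are character offsets into the context, with `end` exclusive. A small sketch showing how to recover the answer text from them (the helper name `extract_span` is illustrative; the dictionary is copied from the output above, without the score):

```python
def extract_span(context, result):
    """Slice the answer out of the context using the returned offsets."""
    return context[result["start"]:result["end"]]


context = "Manuel Romero has been working hardly in the repository hugginface/transformers lately"
result = {"answer": "hugginface/transformers", "start": 56, "end": 79}

span = extract_span(context, result)
# span == "hugginface/transformers", the same text as result["answer"]
```

The offsets are handy for highlighting the answer inside the original passage.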
### It works!! 🎉 🎉 🎉
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: en
thumbnail:
---
# BERT SMALL + Typo Detection ✍❌✍✔
[BERT SMALL](https://huggingface.co/google/bert_uncased_L-4_H-512_A-8) fine-tuned on [GitHub Typo Corpus](https://github.com/mhagiwara/github-typo-corpus) for **typo detection** (using *NER* style)
## Details of the downstream task (Typo detection as NER)
- Dataset: [GitHub Typo Corpus](https://github.com/mhagiwara/github-typo-corpus) 📚
- [Fine-tune script on NER dataset provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner_old.py) 🏋️‍♂️
## Metrics on test set 📋
| Metric | # score |
| :-------: | :-------: |
| F1 | **89.12** |
| Precision | **93.82** |
| Recall | **84.87** |
## Model in action 🔨
Fast usage with **pipelines** 🧪
```python
from transformers import pipeline
typo_checker = pipeline(
"ner",
model="mrm8488/bert-small-finetuned-typo-detection",
tokenizer="mrm8488/bert-small-finetuned-typo-detection"
)
result = typo_checker("here there is an error in coment")
result[1:-1]
# Output:
[{'entity': 'ok', 'score': 0.9021041989326477, 'word': 'here'},
{'entity': 'ok', 'score': 0.7975626587867737, 'word': 'there'},
{'entity': 'ok', 'score': 0.8596242070198059, 'word': 'is'},
{'entity': 'ok', 'score': 0.7071516513824463, 'word': 'an'},
{'entity': 'ok', 'score': 0.943381130695343, 'word': 'error'},
{'entity': 'ok', 'score': 0.8047608733177185, 'word': 'in'},
{'entity': 'ok', 'score': 0.8240702152252197, 'word': 'come'},
{'entity': 'typo', 'score': 0.5004884004592896, 'word': '##nt'}]
```
It works 🎉! We typed `coment` instead of `comment`.
Let's try with another example
```python
result = typo_checker("Adddd validation midelware")
result[1:-1]
# Output:
[{'entity': 'ok', 'score': 0.7128152847290039, 'word': 'add'},
{'entity': 'typo', 'score': 0.5388424396514893, 'word': '##dd'},
{'entity': 'ok', 'score': 0.94792640209198, 'word': 'validation'},
{'entity': 'typo', 'score': 0.5839331746101379, 'word': 'mid'},
{'entity': 'ok', 'score': 0.5195121765136719, 'word': '##el'},
{'entity': 'ok', 'score': 0.7222476601600647, 'word': '##ware'}]
```
Yeah! We misspelled `Add` and `middleware`.
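The pipeline labels WordPiece tokens, so a misspelled word can be split into several pieces (`come` + `##nt`). The block below is a small post-processing sketch, not part of the model or the pipeline: it glues the pieces back together and flags a word when any of its pieces is tagged `typo`. The helper name `merge_wordpieces` is hypothetical; the input is copied from the second output above, without the scores.

```python
def merge_wordpieces(tokens):
    """Merge '##' continuation pieces and flag words containing a typo piece.

    tokens: list of dicts with 'word' and 'entity' keys, as returned by the
            pipeline after dropping the first and last special tokens.
    Returns a list of (word, is_typo) tuples.
    """
    words = []
    for token in tokens:
        piece = token["word"]
        is_typo = token["entity"] == "typo"
        if piece.startswith("##") and words:
            text, flagged = words[-1]
            words[-1] = (text + piece[2:], flagged or is_typo)
        else:
            words.append((piece, is_typo))
    return words


tokens = [
    {"entity": "ok", "word": "add"},
    {"entity": "typo", "word": "##dd"},
    {"entity": "ok", "word": "validation"},
    {"entity": "typo", "word": "mid"},
    {"entity": "ok", "word": "##el"},
    {"entity": "ok", "word": "##ware"},
]
words = merge_wordpieces(tokens)
# words == [("adddd", True), ("validation", False), ("midelware", True)]
```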
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain