Unverified Commit 3552d0e0 authored by Julien Chaumond, committed by GitHub

[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)



* rm all model cards

* Update the .rst

@sgugger it is still not super crystal clear/streamlined so let me know if any ideas to make it simpler

* Add a rootlevel README.md with simple instructions/context

* Update docs/source/model_sharing.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* make style

* rm all model cards
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
parent 29e45979
---
language: en
license: apache-2.0
datasets:
- cnn_dailymail
tags:
- summarization
---
# Bert-small2Bert-small Summarization with 🤗EncoderDecoder Framework
This model is a warm-started *BERT2BERT* ([small](https://huggingface.co/google/bert_uncased_L-4_H-512_A-8)) model fine-tuned on the *CNN/Dailymail* summarization dataset.
The model achieves a **17.37** ROUGE-2 score on *CNN/Dailymail*'s test dataset.
For more details on how the model was fine-tuned, please refer to
[this](https://colab.research.google.com/drive/1Ekd5pUeCX7VOrMx94_czTkwNtLN32Uyu?usp=sharing) notebook.
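For reference, warm-starting such an encoder-decoder pair can be sketched as follows (a minimal sketch assuming the `google/bert_uncased_L-4_H-512_A-8` checkpoint linked above; the exact fine-tuning recipe is in the notebook):
```python
from transformers import BertTokenizerFast, EncoderDecoderModel

# Tie a BERT-small encoder to a BERT-small decoder; the decoder's
# cross-attention weights are randomly initialized and learned during fine-tuning.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/bert_uncased_L-4_H-512_A-8", "google/bert_uncased_L-4_H-512_A-8"
)
tokenizer = BertTokenizerFast.from_pretrained("google/bert_uncased_L-4_H-512_A-8")

# Generation needs these special-token ids set on the shared config.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```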
## Results on test set 📝
| Metric | # Value |
| ------ | --------- |
| **ROUGE-2** | **17.37** |
## Model in Action 🚀
```python
from transformers import BertTokenizerFast, EncoderDecoderModel
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = BertTokenizerFast.from_pretrained('mrm8488/bert-small2bert-small-finetuned-cnn_daily_mail-summarization')
model = EncoderDecoderModel.from_pretrained('mrm8488/bert-small2bert-small-finetuned-cnn_daily_mail-summarization').to(device)
def generate_summary(text):
    # cut off at BERT max length 512
    inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)
    output = model.generate(input_ids, attention_mask=attention_mask)
    return tokenizer.decode(output[0], skip_special_tokens=True)
text = "your text to be summarized here..."
generate_summary(text)
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: es
thumbnail: https://i.imgur.com/jgBdimh.png
---
# Spanish BERT (BETO) + NER
This model is a version of the Spanish BERT cased [(BETO)](https://github.com/dccuchile/beto) fine-tuned on [NER-C](https://www.kaggle.com/nltkdata/conll-corpora) for the **NER** downstream task.
## Details of the downstream task (NER) - Dataset
- [Dataset: CONLL Corpora ES](https://www.kaggle.com/nltkdata/conll-corpora)
I preprocessed the dataset and split it into train / dev (80/20):
| Dataset | # Examples |
| ---------------------- | ----- |
| Train | 8.7 K |
| Dev | 2.2 K |
- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner_old.py)
- Labels covered:
```
B-LOC
B-MISC
B-ORG
B-PER
I-LOC
I-MISC
I-ORG
I-PER
O
```
## Metrics on evaluation set:
| Metric | # score |
| :------------------------------------------------------------------------------------: | :-------: |
| F1 | **90.17** |
| Precision | **89.86** |
| Recall | **90.47** |
## Comparison:
| Model | # F1 score |Size(MB)|
| :--------------------------------------------------------------------------------------------------------------: | :-------: |:------|
| bert-base-spanish-wwm-cased (BETO) | 88.43 | 421 |
| [bert-spanish-cased-finetuned-ner (this one)](https://huggingface.co/mrm8488/bert-spanish-cased-finetuned-ner) | **90.17** | 420 |
| Best Multilingual BERT | 87.38 | 681 |
|[TinyBERT-spanish-uncased-finetuned-ner](https://huggingface.co/mrm8488/TinyBERT-spanish-uncased-finetuned-ner) | 70.00 | **55** |
## Model in action
Fast usage with **pipelines**:
```python
from transformers import pipeline
nlp_ner = pipeline(
    "ner",
    model="mrm8488/bert-spanish-cased-finetuned-ner",
    tokenizer=(
        'mrm8488/bert-spanish-cased-finetuned-ner',
        {"use_fast": False}
    )
)
text = 'Mis amigos están pensando viajar a Londres este verano'
nlp_ner(text)
#Output: [{'entity': 'B-LOC', 'score': 0.9998720288276672, 'word': 'Londres'}]
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: es
thumbnail:
---
# Spanish BERT (BETO) + Syntax POS tagging ✍🏷
This model is a fine-tuned version of the Spanish BERT [(BETO)](https://github.com/dccuchile/beto) on Spanish **syntax** annotations in the [CONLL CORPORA](https://www.kaggle.com/nltkdata/conll-corpora) dataset for the **syntax POS** (Part of Speech tagging) downstream task.
## Details of the downstream task (Syntax POS) - Dataset
- [Dataset: CONLL Corpora ES](https://www.kaggle.com/nltkdata/conll-corpora)
#### [Fine-tune script on NER dataset provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner_old.py)
#### 21 Syntax annotations (Labels) covered:
- \_
- ATR
- ATR.d
- CAG
- CC
- CD
- CD.Q
- CI
- CPRED
- CPRED.CD
- CPRED.SUJ
- CREG
- ET
- IMPERS
- MOD
- NEG
- PASS
- PUNC
- ROOT
- SUJ
- VOC
## Metrics on test set 📋
| Metric | # score |
| :-------: | :-------: |
| F1 | **89.27** |
| Precision | **89.44** |
| Recall | **89.11** |
## Model in action 🔨
Fast usage with **pipelines** 🧪
```python
from transformers import pipeline
nlp_pos_syntax = pipeline(
    "ner",
    model="mrm8488/bert-spanish-cased-finetuned-pos-syntax",
    tokenizer="mrm8488/bert-spanish-cased-finetuned-pos-syntax"
)
text = 'Mis amigos están pensando viajar a Londres este verano.'
results = nlp_pos_syntax(text)
results[1:-1]  # drop the special-token entries at both ends
```
```json
[
{ "entity": "_", "score": 0.9999216794967651, "word": "Mis" },
{ "entity": "SUJ", "score": 0.999882698059082, "word": "amigos" },
{ "entity": "_", "score": 0.9998869299888611, "word": "están" },
{ "entity": "ROOT", "score": 0.9980518221855164, "word": "pensando" },
{ "entity": "_", "score": 0.9998420476913452, "word": "viajar" },
{ "entity": "CD", "score": 0.999351978302002, "word": "a" },
{ "entity": "_", "score": 0.999959409236908, "word": "Londres" },
{ "entity": "_", "score": 0.9998968839645386, "word": "este" },
{ "entity": "CC", "score": 0.99931401014328, "word": "verano" },
{ "entity": "PUNC", "score": 0.9998534917831421, "word": "." }
]
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: es
thumbnail: https://i.imgur.com/jgBdimh.png
---
# Spanish BERT (BETO) + POS
This model is a version of the Spanish BERT cased [(BETO)](https://github.com/dccuchile/beto) fine-tuned on the Spanish [CONLL CORPORA](https://www.kaggle.com/nltkdata/conll-corpora) for the **POS** (Part of Speech tagging) downstream task.
## Details of the downstream task (POS) - Dataset
- [Dataset: CONLL Corpora ES](https://www.kaggle.com/nltkdata/conll-corpora) with data augmentation techniques
I preprocessed the dataset and split it into train / dev (80/20):
| Dataset | # Examples |
| ---------------------- | ----- |
| Train | 340 K |
| Dev | 50 K |
- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner_old.py)
- **60** Labels covered:
```
AO, AQ, CC, CS, DA, DD, DE, DI, DN, DP, DT, Faa, Fat, Fc, Fd, Fe, Fg, Fh, Fia, Fit, Fp, Fpa, Fpt, Fs, Ft, Fx, Fz, I, NC, NP, P0, PD, PI, PN, PP, PR, PT, PX, RG, RN, SP, VAI, VAM, VAN, VAP, VAS, VMG, VMI, VMM, VMN, VMP, VMS, VSG, VSI, VSM, VSN, VSP, VSS, Y and Z
```
## Metrics on evaluation set:
| Metric | # score |
| :------------------------------------------------------------------------------------: | :-------: |
| F1 | **90.06** |
| Precision | **89.46** |
| Recall | **90.67** |
## Model in action
Fast usage with **pipelines**:
```python
from transformers import pipeline
nlp_pos = pipeline(
    "ner",
    model="mrm8488/bert-spanish-cased-finetuned-pos",
    tokenizer=(
        'mrm8488/bert-spanish-cased-finetuned-pos',
        {"use_fast": False}
    )
)
text = 'Mis amigos están pensando en viajar a Londres este verano'
nlp_pos(text)
#Output:
'''
[{'entity': 'NC', 'score': 0.7792173624038696, 'word': '[CLS]'},
{'entity': 'DP', 'score': 0.9996283650398254, 'word': 'Mis'},
{'entity': 'NC', 'score': 0.9999253749847412, 'word': 'amigos'},
{'entity': 'VMI', 'score': 0.9998560547828674, 'word': 'están'},
{'entity': 'VMG', 'score': 0.9992249011993408, 'word': 'pensando'},
{'entity': 'SP', 'score': 0.9999602437019348, 'word': 'en'},
{'entity': 'VMN', 'score': 0.9998666048049927, 'word': 'viajar'},
{'entity': 'SP', 'score': 0.9999545216560364, 'word': 'a'},
{'entity': 'VMN', 'score': 0.8722310662269592, 'word': 'Londres'},
{'entity': 'DD', 'score': 0.9995203614234924, 'word': 'este'},
{'entity': 'NC', 'score': 0.9999248385429382, 'word': 'verano'},
{'entity': 'NC', 'score': 0.8802427649497986, 'word': '[SEP]'}]
'''
```
![model in action](https://media.giphy.com/media/jVC9m1cNrdIWuAAtjy/giphy.gif)
16 POS tags version also available [here](https://huggingface.co/mrm8488/bert-spanish-cased-finetuned-pos-16-tags)
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: en
thumbnail:
---
# BERT-Tiny fine-tuned on SQuAD v2
[BERT-Tiny](https://github.com/google-research/bert/) created by [Google Research](https://github.com/google-research) and fine-tuned on [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) for **Q&A** downstream task.
**Model size** (after training): **16.74 MB**
## Details of BERT-Tiny and its 'family' (from their documentation)
Released on March 11th, 2020
This model is one of the 24 smaller BERT models (English only, uncased, trained with WordPiece masking) referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962).
The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
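For illustration, distillation-style fine-tuning typically mixes the usual cross-entropy on gold labels with a temperature-scaled KL term against the teacher's logits (a generic sketch, not the exact recipe used for this checkpoint):
```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```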
## Details of the downstream task (Q&A) - Dataset
[SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/) combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
| Dataset | Split | # samples |
| -------- | ----- | --------- |
| SQuAD2.0 | train | 130k |
| SQuAD2.0 | eval | 12.3k |
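In practice, abstention is exposed through the QA pipeline's `handle_impossible_answer` flag; a minimal sketch (an empty answer string signals "no answer"):
```python
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="mrm8488/bert-tiny-finetuned-squadv2",
    tokenizer="mrm8488/bert-tiny-finetuned-squadv2"
)
# With handle_impossible_answer=True, the pipeline may return an empty answer
# when it judges the question unanswerable from the given context.
qa(
    question="What is the capital of France?",
    context="Manuel Romero has been working on the transformers repository lately.",
    handle_impossible_answer=True,
)
```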
## Model training
The model was trained on a Tesla P100 GPU and 25GB of RAM.
The script for fine-tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py).
## Results:
| Metric | # Value |
| ------ | --------- |
| **EM** | **48.60** |
| **F1** | **49.73** |
| Model | EM | F1 score | SIZE (MB) |
| ----------------------------------------------------------------------------------------- | --------- | --------- | --------- |
| [bert-tiny-finetuned-squadv2](https://huggingface.co/mrm8488/bert-tiny-finetuned-squadv2) | 48.60 | 49.73 | **16.74** |
| [bert-tiny-5-finetuned-squadv2](https://huggingface.co/mrm8488/bert-tiny-5-finetuned-squadv2) | **57.12** | **60.86** | 24.34 |
## Model in action
Fast usage with **pipelines**:
```python
from transformers import pipeline
qa_pipeline = pipeline(
    "question-answering",
    model="mrm8488/bert-tiny-finetuned-squadv2",
    tokenizer="mrm8488/bert-tiny-finetuned-squadv2"
)
qa_pipeline({
    'context': "Manuel Romero has been working hardly in the repository hugginface/transformers lately",
    'question': "Who has been working hard for hugginface/transformers lately?"
})
# Output:
```
```json
{
"answer": "Manuel Romero",
"end": 13,
"score": 0.05684709993458714,
"start": 0
}
```
### Yes! That was easy 🎉 Let's try with another example
```python
qa_pipeline({
    'context': "Manuel Romero has been working hardly in the repository hugginface/transformers lately",
    'question': "For which company has worked Manuel Romero?"
})
# Output:
```
```json
{
"answer": "hugginface/transformers",
"end": 79,
"score": 0.11613431826808274,
"start": 56
}
```
### It works!! 🎉 🎉 🎉
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: en
thumbnail:
---
# [BERT](https://huggingface.co/deepset/bert-base-cased-squad2) fine-tuned on [QNLI](https://github.com/rhythmcao/QNLI) + compression ([BERT-of-Theseus](https://github.com/JetRunner/BERT-of-Theseus))
I took a [BERT model fine-tuned on **SQuAD v2**](https://huggingface.co/deepset/bert-base-cased-squad2) and then fine-tuned it on **QNLI** using **compression** (with a constant replacing rate) as proposed in **BERT-of-Theseus**.
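Conceptually, BERT-of-Theseus trains a smaller successor by stochastically swapping its modules in for groups of the original (predecessor) layers; with a constant replacing rate, each swap happens with a fixed Bernoulli probability. A toy sketch of that forward pass (my illustration, not the authors' code):
```python
import torch

def theseus_forward(hidden, predecessor_blocks, successor_blocks,
                    replacing_rate=0.7, training=True):
    # predecessor_blocks[i] is a group of original BERT layers;
    # successor_blocks[i] is the single compressed layer that may replace it.
    for pred, succ in zip(predecessor_blocks, successor_blocks):
        if training and torch.rand(1).item() < replacing_rate:
            hidden = succ(hidden)   # successor module swapped in
        else:
            hidden = pred(hidden)   # original module kept
    return hidden
```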
## Details of the downstream task (QNLI):
### Getting the dataset
```bash
wget https://raw.githubusercontent.com/rhythmcao/QNLI/master/data/QNLI/train.tsv
wget https://raw.githubusercontent.com/rhythmcao/QNLI/master/data/QNLI/test.tsv
wget https://raw.githubusercontent.com/rhythmcao/QNLI/master/data/QNLI/dev.tsv
mkdir QNLI_dataset
mv *.tsv QNLI_dataset
```
### Model training
The model was trained on a Tesla P100 GPU and 25GB of RAM with the following command:
```bash
!python /content/BERT-of-Theseus/run_glue.py \
--model_name_or_path deepset/bert-base-cased-squad2 \
--task_name qnli \
--do_train \
--do_eval \
--do_lower_case \
--data_dir /content/QNLI_dataset \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--learning_rate 2e-5 \
--save_steps 2000 \
--num_train_epochs 50 \
--output_dir /content/ouput_dir \
--evaluate_during_training \
--replacing_rate 0.7 \
--steps_for_replacing 2500
```
## Metrics:
| Model | Accuracy |
|-----------------|------|
| BERT-base | 91.2 |
| BERT-of-Theseus | 88.8 |
| [bert-uncased-finetuned-qnli](https://huggingface.co/mrm8488/bert-uncased-finetuned-qnli) | 87.2 |
| DistilBERT | 85.3 |
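The resulting checkpoint can be queried as an ordinary sequence-pair classifier; a minimal sketch (QNLI pairs a question with a candidate answer sentence; the example sentences are mine):
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("mrm8488/bert-uncased-finetuned-qnli")
model = AutoModelForSequenceClassification.from_pretrained("mrm8488/bert-uncased-finetuned-qnli")

question = "Who develops the transformers library?"
sentence = "The transformers library is developed by Hugging Face."
inputs = tokenizer(question, sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
# Label order (entailment vs. not_entailment) depends on the model config.
print(model.config.id2label[logits.argmax(-1).item()])
```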
> [See all my models](https://huggingface.co/models?search=mrm8488)
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: fr
datasets:
- xtreme
widget:
- text: "La première série a été mieux reçue par la critique que la seconde. La seconde série a été bien accueillie par la critique, mieux que la première."
---
# Camembert-base fine-tuned on PAWS-X-fr for Paraphrase Identification
# *De Novo* Drug Design with MLM
## What is it?
An approximation of [Generative Recurrent Networks for De Novo Drug Design](https://onlinelibrary.wiley.com/doi/full/10.1002/minf.201700111), but training an MLM (RoBERTa-like) from scratch.
## Why?
As mentioned in the paper:
Generative artificial intelligence models present a fresh approach to chemogenomics and de novo drug design, as they provide researchers with the ability to narrow down their search of the chemical space and focus on regions of interest.
They used a generative *recurrent neural network (RNN)* containing long short-term memory (LSTM) cells to capture the syntax of molecular representations in terms of SMILES strings.
The learned pattern probabilities can be used for de novo SMILES generation. This molecular design concept **eliminates the need for virtual compound library enumeration** and **enables virtual compound design without requiring secondary or external activity prediction**.
## My Goal 🎯
By training an MLM from scratch on 438,552 (cleaned*) SMILES strings, I wanted to build a model that learns this kind of molecular combination, so that given a partial SMILES it can generate plausible completions that could be proposed as new drugs.
*By cleaned SMILES I mean that I used their [SMILES cleaning script](https://github.com/topazape/LSTM_Chem/blob/master/cleanup_smiles.py) to remove duplicates, salts, and stereochemical information.
You can see the detailed process of gathering the data, preprocessing it, and training the LSTM in their [repo](https://github.com/topazape/LSTM_Chem).
## Fast usage with ```pipelines``` 🧪
```python
from transformers import pipeline
fill_mask = pipeline(
    "fill-mask",
    model="mrm8488/chEMBL_smiles_v1",
    tokenizer="mrm8488/chEMBL_smiles_v1"
)
# CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)cc1 Atazanavir
smile1 = "CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)<mask>"
fill_mask(smile1)
# Output:
'''
[{'score': 0.6040295958518982,
'sequence': '<s> CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)nc</s>',
'token': 265},
{'score': 0.2185731679201126,
'sequence': '<s> CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)N</s>',
'token': 50},
{'score': 0.0642734169960022,
'sequence': '<s> CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)cc</s>',
'token': 261},
{'score': 0.01932266168296337,
'sequence': '<s> CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)CCCl</s>',
'token': 452},
{'score': 0.005068355705589056,
'sequence': '<s> CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)C</s>',
'token': 39}]
'''
```
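Building on the pipeline above, one naive way to extend a partial SMILES is to append a mask token repeatedly and keep the top prediction at each step (my sketch, assuming a transformers version whose fill-mask output includes `token_str`):
```python
def extend_smiles(partial, steps=5):
    # Greedily grow the SMILES one predicted token at a time,
    # reusing the fill_mask pipeline defined above.
    for _ in range(steps):
        preds = fill_mask(partial + fill_mask.tokenizer.mask_token)
        partial += preds[0]["token_str"].strip()
    return partial

print(extend_smiles("CC(C)CN(CC(OP(=O)(O)O)"))
```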
## More
I also created a [second version](https://huggingface.co/mrm8488/chEMBL26_smiles_v2) without applying the cleaning SMILES script mentioned above. You can use it in the same way as this one.
```python
fill_mask = pipeline(
    "fill-mask",
    model="mrm8488/chEMBL26_smiles_v2",
    tokenizer="mrm8488/chEMBL26_smiles_v2"
)
```
[Original paper](https://www.ncbi.nlm.nih.gov/pubmed/29095571) Authors:
<details>
Swiss Federal Institute of Technology (ETH), Department of Chemistry and Applied Biosciences, Vladimir–Prelog–Weg 4, 8093, Zurich, Switzerland,
Stanford University, Department of Computer Science, 450 Sierra Mall, Stanford, CA, 94305, USA,
inSili.com GmbH, 8049, Zurich, Switzerland,
Gisbert Schneider, Email: gisbert@ethz.ch.
</details>
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: code
thumbnail:
---
# CodeBERTaJS
CodeBERTaJS is a RoBERTa-like model trained on the [CodeSearchNet](https://github.blog/2019-09-26-introducing-the-codesearchnet-challenge/) dataset from GitHub for `JavaScript` by [Manuel Romero](https://twitter.com/mrm8488).
The **tokenizer** is a Byte-level BPE tokenizer trained on the corpus using Hugging Face `tokenizers`.
Because it is trained on a corpus of code (vs. natural language), it encodes the corpus efficiently (the sequences are between 33% and 50% shorter compared to the same corpus tokenized by gpt2/roberta).
The (small) **model** is a 6-layer, 84M-parameter RoBERTa-like Transformer model – the same number of layers and heads as DistilBERT – initialized from the default initialization settings and trained from scratch on the full `javascript` corpus (120M after preprocessing) for 2 epochs.
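You can sanity-check that efficiency claim by tokenizing the same snippet with both vocabularies (a quick sketch; the example function is mine):
```python
from transformers import AutoTokenizer

code = "function add(a, b) { return a + b; }"
js_tok = AutoTokenizer.from_pretrained("mrm8488/codeBERTaJS")
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")

# The code-trained BPE should need noticeably fewer tokens on JavaScript.
print(len(js_tok.tokenize(code)), "vs", len(gpt2_tok.tokenize(code)))
```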
## Quick start: masked language modeling prediction
```python
JS_CODE = """
async function createUser(req, <mask>) {
if (!validUser(req.body.user)) {
return res.status(400);
}
user = userService.createUser(req.body.user);
return res.json(user);
}
""".lstrip()
```
### Does the model know how to complete simple JS/express like code?
```python
from transformers import pipeline
fill_mask = pipeline(
    "fill-mask",
    model="mrm8488/codeBERTaJS",
    tokenizer="mrm8488/codeBERTaJS"
)
fill_mask(JS_CODE)
## Top 5 predictions:
#
'res' # prob 0.069489665329
'next'
'req'
'user'
',req'
```
### Yes! That was easy 🎉 Let's try with another example
```python
JS_CODE_ = """
function getKeys(obj) {
  keys = [];
  for (var [key, value] of Object.entries(obj)) {
    keys.push(<mask>);
  }
  return keys
}
""".lstrip()
```
Results:
```python
'obj', 'key', ' value', 'keys', 'i'
```
> Not so bad! The right token was predicted as the second option! 🎉
## This work is heavily inspired by [codeBERTa](https://github.com/huggingface/transformers/blob/master/model_cards/huggingface/CodeBERTa-small-v1/README.md) by the Hugging Face team
<br>
## CodeSearchNet citation
<details>
```bibtex
@article{husain_codesearchnet_2019,
    title = {{CodeSearchNet} {Challenge}: {Evaluating} the {State} of {Semantic} {Code} {Search}},
    shorttitle = {{CodeSearchNet} {Challenge}},
    url = {http://arxiv.org/abs/1909.09436},
    urldate = {2020-03-12},
    journal = {arXiv:1909.09436 [cs, stat]},
    author = {Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
    month = sep,
    year = {2019},
    note = {arXiv: 1909.09436},
}
```
</details>
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: en
datasets:
- codexglue
---
# CodeBERT fine-tuned for Insecure Code Detection 💾⛔
[codebert-base](https://huggingface.co/microsoft/codebert-base) fine-tuned on [CodeXGLUE -- Defect Detection](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Defect-detection) dataset for **Insecure Code Detection** downstream task.
## Details of [CodeBERT](https://arxiv.org/abs/2002.08155)
We present CodeBERT, a bimodal pre-trained model for programming language (PL) and natural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search, code documentation generation, etc. We develop CodeBERT with a Transformer-based neural architecture, and train it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators. This enables us to utilize both bimodal data of NL-PL pairs and unimodal data, where the former provides input tokens for model training while the latter helps to learn better generators. We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters. Results show that CodeBERT achieves state-of-the-art performance on both natural language code search and code documentation generation tasks. Furthermore, to investigate what type of knowledge is learned in CodeBERT, we construct a dataset for NL-PL probing, and evaluate in a zero-shot setting where parameters of pre-trained models are fixed. Results show that CodeBERT performs better than previous pre-trained models on NL-PL probing.
## Details of the downstream task (code classification) - Dataset 📚
Given a source code, the task is to identify whether it is an insecure code that may attack software systems, such as resource leaks, use-after-free vulnerabilities and DoS attack. We treat the task as binary classification (0/1), where 1 stands for insecure code and 0 for secure code.
The [dataset](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Defect-detection) used comes from the paper [*Devign*: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks](http://papers.nips.cc/paper/9209-devign-effective-vulnerability-identification-by-learning-comprehensive-program-semantics-via-graph-neural-networks.pdf). All projects are combined and split 80%/10%/10% for training/dev/test.
Data statistics of the dataset are shown in the table below:
| | #Examples |
| ----- | :-------: |
| Train | 21,854 |
| Dev | 2,732 |
| Test | 2,732 |
## Test set metrics 🧾
| Methods | ACC |
| -------- | :-------: |
| BiLSTM | 59.37 |
| TextCNN | 60.69 |
| [RoBERTa](https://arxiv.org/pdf/1907.11692.pdf) | 61.05 |
| [CodeBERT](https://arxiv.org/pdf/2002.08155.pdf) | 62.08 |
| [Ours](https://huggingface.co/mrm8488/codebert-base-finetuned-detect-insecure-code) | **65.30** |
## Model in Action 🚀
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import numpy as np

tokenizer = AutoTokenizer.from_pretrained('mrm8488/codebert-base-finetuned-detect-insecure-code')
model = AutoModelForSequenceClassification.from_pretrained('mrm8488/codebert-base-finetuned-detect-insecure-code')

inputs = tokenizer("your code here", return_tensors="pt", truncation=True, padding='max_length')
labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
outputs = model(**inputs, labels=labels)
loss = outputs.loss
logits = outputs.logits

print(np.argmax(logits.detach().numpy()))  # 1 = insecure code, 0 = secure code
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: multilingual
thumbnail:
---
# DISTILBERT 🌎 + Typo Detection ✍❌✍✔
[distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased) fine-tuned on [GitHub Typo Corpus](https://github.com/mhagiwara/github-typo-corpus) for **typo detection** (using *NER* style)
## Details of the downstream task (Typo detection as NER)
- Dataset: [GitHub Typo Corpus](https://github.com/mhagiwara/github-typo-corpus) 📚 for 15 languages
- [Fine-tune script on NER dataset provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner_old.py) 🏋️‍♂️
## Metrics on test set 📋
| Metric | # score |
| :-------: | :-------: |
| F1 | **93.51** |
| Precision | **96.08** |
| Recall | **91.06** |
## Model in action 🔨
Fast usage with **pipelines** 🧪
```python
from transformers import pipeline
typo_checker = pipeline(
    "ner",
    model="mrm8488/distilbert-base-multi-cased-finetuned-typo-detection",
    tokenizer="mrm8488/distilbert-base-multi-cased-finetuned-typo-detection"
)
result = typo_checker("Adddd validation midelware")
result[1:-1]
# Output:
'''
[{'entity': 'ok', 'score': 0.7128152847290039, 'word': 'add'},
 {'entity': 'typo', 'score': 0.5388424396514893, 'word': '##dd'},
 {'entity': 'ok', 'score': 0.94792640209198, 'word': 'validation'},
 {'entity': 'typo', 'score': 0.5839331746101379, 'word': 'mid'},
 {'entity': 'ok', 'score': 0.5195121765136719, 'word': '##el'},
 {'entity': 'ok', 'score': 0.7222476601600647, 'word': '##ware'}]
'''
```
It works 🎉! We misspelled `Add` and `middleware` (typed as `Adddd` and `midelware`).
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: multilingual
thumbnail:
---
# DistilBERT multilingual fine-tuned on TydiQA (GoldP task) dataset for multilingual Q&A 😛🌍❓
## Details of the language model
[distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased)
## Details of the Tydi QA dataset
TyDi QA contains 200k human-annotated question-answer pairs in 11 Typologically Diverse languages, written without seeing the answer and without the use of translation, and is designed for the **training and evaluation** of automatic question answering systems. This repository provides evaluation code and a baseline system for the dataset. https://ai.google.com/research/tydiqa
## Details of the downstream task (Gold Passage or GoldP aka the secondary task)
Given a passage that is guaranteed to contain the answer, predict the single contiguous span of characters that answers the question. The gold passage task differs from the [primary task](https://github.com/google-research-datasets/tydiqa/blob/master/README.md#the-tasks) in several ways:
* only the gold answer passage is provided rather than the entire Wikipedia article;
* unanswerable questions have been discarded, similar to MLQA and XQuAD;
* we evaluate with the SQuAD 1.1 metrics like XQuAD; and
* Thai and Japanese are removed since the lack of whitespace breaks some tools.
## Model training 💪🏋️‍
The model was fine-tuned on a Tesla P100 GPU and 25GB of RAM.
The script is the following:
```bash
python transformers/examples/question-answering/run_squad.py \
--model_type distilbert \
--model_name_or_path distilbert-base-multilingual-cased \
--do_train \
--do_eval \
--train_file /path/to/dataset/train.json \
--predict_file /path/to/dataset/dev.json \
--per_gpu_train_batch_size 24 \
--per_gpu_eval_batch_size 24 \
--learning_rate 3e-5 \
--num_train_epochs 5 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /content/model_output \
--overwrite_output_dir \
--save_steps 1000 \
--threads 400
```
## Global Results (dev set) 📝
| Metric | # Value |
| --------- | ----------- |
| **EM** | **63.85** |
| **F1** | **75.70** |
## Specific Results (per language) 🌍📝
| Language | # Samples | # EM | # F1 |
| --------- | ----------- |--------| ------ |
| Arabic | 1314 | 66.66 | 80.02 |
| Bengali | 180 | 53.09 | 63.50 |
| English | 654 | 62.42 | 73.12 |
| Finnish | 1031 | 64.57 | 75.15 |
| Indonesian| 773 | 67.89 | 79.70 |
| Korean | 414 | 51.29 | 61.73 |
| Russian | 1079 | 55.42 | 70.08 |
| Swahili | 596 | 74.51 | 81.15 |
| Telugu | 874 | 66.21 | 79.85 |
## Similar models
You can also try [bert-multi-cased-finedtuned-xquad-tydiqa-goldp](https://huggingface.co/mrm8488/bert-multi-cased-finedtuned-xquad-tydiqa-goldp), which achieves **F1 = 82.16** and **EM = 71.06** (and, of course, better per-language scores).
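For completeness, the fine-tuned weights can be served like any extractive QA checkpoint (a sketch; the card does not state a hub ID, so the path below is the `output_dir` from the training command above, and the Spanish example is mine):
```python
from transformers import pipeline

# "/content/model_output" is the output_dir used in the training command;
# substitute the actual checkpoint location or a hub model ID.
qa = pipeline("question-answering", model="/content/model_output")
qa(
    question="¿Dónde vive Manuel?",
    context="Manuel Romero vive en España y colabora con Hugging Face."
)
```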
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: es
thumbnail: https://i.imgur.com/jgBdimh.png
---
# BETO (Spanish BERT) + Spanish SQuAD2.0 + distillation using 'bert-base-multilingual-cased' as teacher
This model is a version of [BETO](https://github.com/dccuchile/beto) fine-tuned on [SQuAD-es-v2.0](https://github.com/ccasimiro88/TranslateAlignRetrieve) and **distilled** for **Q&A**.
Distillation makes the model **smaller, faster, cheaper and lighter** than [bert-base-spanish-wwm-cased-finetuned-spa-squad2-es](https://github.com/huggingface/transformers/blob/master/model_cards/mrm8488/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es/README.md).
This model was fine-tuned on the same dataset, but using **distillation** during the process as mentioned above (and one more training epoch).
The **teacher model** for the distillation was `bert-base-multilingual-cased`. It is the same teacher used for `distilbert-base-multilingual-cased`, AKA [**DistilmBERT**](https://github.com/huggingface/transformers/tree/master/examples/distillation) (on average, it is twice as fast as **mBERT-base**).
## Details of the downstream task (Q&A) - Dataset
<details>
[SQuAD-es-v2.0](https://github.com/ccasimiro88/TranslateAlignRetrieve)
| Dataset | # Q&A |
| ----------------------- | ----- |
| SQuAD2.0 Train | 130 K |
| SQuAD2.0-es-v2.0 | 111 K |
| SQuAD2.0 Dev | 12 K |
| SQuAD-es-v2.0-small Dev | 69 K |
</details>
## Model training
The model was trained on a Tesla P100 GPU and 25GB of RAM with the following command:
```bash
!export SQUAD_DIR=/path/to/squad-v2_spanish \
&& python transformers/examples/distillation/run_squad_w_distillation.py \
--model_type bert \
--model_name_or_path dccuchile/bert-base-spanish-wwm-cased \
--teacher_type bert \
--teacher_name_or_path bert-base-multilingual-cased \
--do_train \
--do_eval \
--do_lower_case \
--train_file $SQUAD_DIR/train-v2.json \
--predict_file $SQUAD_DIR/dev-v2.json \
--per_gpu_train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 5.0 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /content/model_output \
--save_steps 5000 \
--threads 4 \
--version_2_with_negative
```
## Results:
| Metric | # Value |
| --------- | ----------- |
| **Exact** | **90.7748** |
| **F1** | **94.9471** |
```json
{
"exact": 90.77483309730933,
"f1": 94.94714391266254,
"total": 69202,
"HasAns_exact": 86.60850599781898,
"HasAns_f1": 92.90582885592328,
"HasAns_total": 45850,
"NoAns_exact": 98.95512161699212,
"NoAns_f1": 98.95512161699212,
"NoAns_total": 23352,
"best_exact": 90.77483309730933,
"best_exact_thresh": 0.0,
"best_f1": 94.94714391266305,
"best_f1_thresh": 0.0
}
```
## Comparison:
| Model | f1 score |
| :-------------------------------------------------------------: | :-------: |
| bert-base-spanish-wwm-cased-finetuned-spa-squad2-es | 86.07 |
| **distill**-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es | **94.94** |
So, yes, this version is even more accurate.
### Model in action
Fast usage with **pipelines**:
```python
from transformers import *
# Important: for now the QA pipeline is not compatible with fast tokenizers
# (the team is working on it), so pass {"use_fast": False} to the tokenizer,
# as in the following example:
nlp = pipeline(
    'question-answering',
    model='mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es',
    tokenizer=(
        'mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es',
        {"use_fast": False}
    )
)
nlp(
    {
        'question': '¿Para qué lenguaje está trabajando?',
        'context': 'Manuel Romero está colaborando activamente con huggingface/transformers '
                   'para traer el poder de las últimas técnicas de procesamiento de lenguaje natural al idioma español'
    }
)
# Output: {'answer': 'español', 'end': 169, 'score': 0.67530957344621, 'start': 163}
```
Play with this model and ```pipelines``` in a Colab:
<a href="https://colab.research.google.com/github/mrm8488/shared_colab_notebooks/blob/master/Using_Spanish_BERT_fine_tuned_for_Q%26A_pipelines.ipynb" target="_parent"><img src="https://camo.githubusercontent.com/52feade06f2fecbf006889a904d221e6a730c194/68747470733a2f2f636f6c61622e72657365617263682e676f6f676c652e636f6d2f6173736574732f636f6c61622d62616467652e737667" alt="Open In Colab" data-canonical-src="https://colab.research.google.com/assets/colab-badge.svg"></a>
<details>
1. Set the context and ask some questions:
![Set context and questions](https://media.giphy.com/media/mCIaBpfN0LQcuzkA2F/giphy.gif)
2. Run predictions:
![Run the model](https://media.giphy.com/media/WT453aptcbCP7hxWTZ/giphy.gif)
</details>
More about ```Huggingface pipelines```? Check this Colab out:
<a href="https://colab.research.google.com/github/mrm8488/shared_colab_notebooks/blob/master/Huggingface_pipelines_demo.ipynb" target="_parent"><img src="https://camo.githubusercontent.com/52feade06f2fecbf006889a904d221e6a730c194/68747470733a2f2f636f6c61622e72657365617263682e676f6f676c652e636f6d2f6173736574732f636f6c61622d62616467652e737667" alt="Open In Colab" data-canonical-src="https://colab.research.google.com/assets/colab-badge.svg"></a>
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
# DistilRoBERTa + Sentiment Analysis 😂😢😡😃😯
This is an adapted version of [@omarsar0](https://twitter.com/omarsar0)'s [tutorial](https://t.co/WMnATW0Hwf?amp=1).
He explains everything in great detail and provides the dataset. I just changed some parameters and created the ```config.json``` file to upload it to the [🤗 Transformers HUB](https://huggingface.co/).
In this tutorial, he shows how to fine-tune a language model (LM) for **emotion classification**, with code adapted from this [tutorial](https://zablo.net/blog/post/custom-classifier-on-bert-model-guide-polemo2-sentiment-analysis/) by Marcin Zabłocki.
The emotions covered are:
- sadness 😢
- joy 😃
- love 🥰
- anger 😡
- fear 😱
- surprise 😯
## Details of the language model
The base model used is [DistilRoBERTa](https://huggingface.co/distilroberta-base)
## Details of the downstream task (Sentence classification) - Dataset 📚
| Dataset split | # Size | # Sequences |
| ------------- | ------ | ----------- |
| Train | 1.58 MB | 20000 |
| Validation | 200 KB | |
| Test | 202 KB | |
## Results after training 🏋️‍♀️🧾
|emotion |precision |recall| f1-score| support|
|-------|-------------|------|----------|----------|
|sadness| 0.973868 |0.949066 |0.961307| 589|
|joy |0.970313 |0.901306 |0.934537| 689|
|love |0.743119 |0.925714 |0.824427| 175|
|anger | 0.884615| 0.969349| 0.925046| 261|
|fear |0.951456 |0.875000| 0.911628| 224|
|surprise| 0.750000| 0.919355| 0.826087| 62|
| | | | | |
|**accuracy**| | | 0.924000| 2000|
|**macro avg**| 0.878895| 0.923298| 0.897172| 2000|
|**weighted avg**| 0.931355| 0.924000| 0.925620| 2000|
## Model in action 🔨
Fast usage with **pipelines** 🧪
```python
from transformers import pipeline
nlp_sentiment = pipeline(
    "sentiment-analysis",
    model="mrm8488/distilroberta-base-finetuned-sentiment",
    tokenizer="mrm8488/distilroberta-base-finetuned-sentiment"
)
text = "i feel i should return to the start of the weekend so my loyal readers can get a feeling for things up to this point"
nlp_sentiment(text)
# Output: [{'label': 'love', 'score': 0.2183746}]
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: en
---
# Electra base ⚡ + SQuAD v1 ❓
[Electra-base-discriminator](https://huggingface.co/google/electra-base-discriminator) fine-tuned on [SQUAD v1.1 dataset](https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/) for **Q&A** downstream task.
## Details of the downstream task (Q&A) - Model 🧠
**ELECTRA** is a new method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a [GAN](https://arxiv.org/pdf/1406.2661.pdf). At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) dataset.
## Details of the downstream task (Q&A) - Dataset 📚
**S**tanford **Q**uestion **A**nswering **D**ataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
SQuAD v1.1 contains **100,000+** question-answer pairs on **500+** articles.
## Model training 🏋️‍
The model was trained on a Tesla P100 GPU and 25GB of RAM with the following command:
```bash
python transformers/examples/question-answering/run_squad.py \
--model_type electra \
--model_name_or_path 'google/electra-base-discriminator' \
--do_eval \
--do_train \
--do_lower_case \
--train_file '/content/dataset/train-v1.1.json' \
--predict_file '/content/dataset/dev-v1.1.json' \
--per_gpu_train_batch_size 16 \
--learning_rate 3e-5 \
--num_train_epochs 10 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir '/content/output' \
--overwrite_output_dir \
--save_steps 1000
```
## Test set Results 🧾
| Metric | # Value |
| ------ | --------- |
| **EM** | **83.03** |
| **F1** | **90.77** |
| **Size**| **+ 400 MB** |
Very good metrics for such a "small" model!
```json
{
  "exact": 83.03689687795648,
  "f1": 90.77486052446231,
  "total": 10570,
  "HasAns_exact": 83.03689687795648,
  "HasAns_f1": 90.77486052446231,
  "HasAns_total": 10570,
  "best_exact": 83.03689687795648,
  "best_exact_thresh": 0.0,
  "best_f1": 90.77486052446231,
  "best_f1_thresh": 0.0
}
```
### Model in action 🚀
Fast usage with **pipelines**:
```python
from transformers import pipeline
QnA_pipeline = pipeline('question-answering', model='mrm8488/electra-base-finetuned-squadv1')
QnA_pipeline({
    'context': 'A new strain of flu that has the potential to become a pandemic has been identified in China by scientists.',
    'question': 'What has been discovered by scientists from China ?'
})
# Output:
# {'answer': 'A new strain of flu', 'end': 19, 'score': 0.9995211430099182, 'start': 0}
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: en
---
# Electra small ⚡ + SQuAD v1 ❓
[Electra-small-discriminator](https://huggingface.co/google/electra-small-discriminator) fine-tuned on [SQUAD v1.1 dataset](https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/) for **Q&A** downstream task.
## Details of the downstream task (Q&A) - Model 🧠
**ELECTRA** is a new method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a [GAN](https://arxiv.org/pdf/1406.2661.pdf). At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) dataset.
## Details of the downstream task (Q&A) - Dataset 📚
**S**tanford **Q**uestion **A**nswering **D**ataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
SQuAD v1.1 contains **100,000+** question-answer pairs on **500+** articles.
## Model training 🏋️‍
The model was trained on a Tesla P100 GPU and 25GB of RAM with the following command:
```bash
python transformers/examples/question-answering/run_squad.py \
--model_type electra \
--model_name_or_path 'google/electra-small-discriminator' \
--do_eval \
--do_train \
--do_lower_case \
--train_file '/content/dataset/train-v1.1.json' \
--predict_file '/content/dataset/dev-v1.1.json' \
--per_gpu_train_batch_size 16 \
--learning_rate 3e-5 \
--num_train_epochs 10 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir '/content/output' \
--overwrite_output_dir \
--save_steps 1000
```
## Test set Results 🧾
| Metric | # Value |
| ------ | --------- |
| **EM** | **77.70** |
| **F1** | **85.74** |
| **Size**| **50 MB** |
Very good metrics for such a "small" model!
```json
{
  "exact": 77.70104068117313,
  "f1": 85.73991234187997,
  "total": 10570,
  "HasAns_exact": 77.70104068117313,
  "HasAns_f1": 85.73991234187997,
  "HasAns_total": 10570,
  "best_exact": 77.70104068117313,
  "best_exact_thresh": 0.0,
  "best_f1": 85.73991234187997,
  "best_f1_thresh": 0.0
}
```
### Model in action 🚀
Fast usage with **pipelines**:
```python
from transformers import pipeline
QnA_pipeline = pipeline('question-answering', model='mrm8488/electra-small-finetuned-squadv1')
QnA_pipeline({
    'context': 'A new strain of flu that has the potential to become a pandemic has been identified in China by scientists.',
    'question': 'What has been discovered by scientists from China ?'
})
# Output:
# {'answer': 'A new strain of flu', 'end': 19, 'score': 0.7950334108113424, 'start': 0}
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: en
---
# Electra small ⚡ + SQuAD v2 ❓
[Electra-small-discriminator](https://huggingface.co/google/electra-small-discriminator) fine-tuned on [SQUAD v2.0 dataset](https://rajpurkar.github.io/SQuAD-explorer/explore/v2.0/dev/) for **Q&A** downstream task.
## Details of the downstream task (Q&A) - Model 🧠
**ELECTRA** is a new method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a [GAN](https://arxiv.org/pdf/1406.2661.pdf). At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) dataset.
## Details of the downstream task (Q&A) - Dataset 📚
**SQuAD2.0** combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
## Model training 🏋️‍
The model was trained on a Tesla P100 GPU and 25GB of RAM with the following command:
```bash
python transformers/examples/question-answering/run_squad.py \
--model_type electra \
--model_name_or_path 'google/electra-small-discriminator' \
--do_eval \
--do_train \
--do_lower_case \
--train_file '/content/dataset/train-v2.0.json' \
--predict_file '/content/dataset/dev-v2.0.json' \
--per_gpu_train_batch_size 16 \
--learning_rate 3e-5 \
--num_train_epochs 10 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir '/content/output' \
--overwrite_output_dir \
--save_steps 1000 \
--version_2_with_negative
```
## Test set Results 🧾
| Metric | # Value |
| ------ | --------- |
| **EM** | **69.71** |
| **F1** | **73.44** |
| **Size**| **50 MB** |
```json
{
  "exact": 69.71279373368147,
  "f1": 73.4439546123672,
  "total": 11873,
  "HasAns_exact": 69.92240215924427,
  "HasAns_f1": 77.39542393937836,
  "HasAns_total": 5928,
  "NoAns_exact": 69.50378469301934,
  "NoAns_f1": 69.50378469301934,
  "NoAns_total": 5945,
  "best_exact": 69.71279373368147,
  "best_exact_thresh": 0.0,
  "best_f1": 73.44395461236732,
  "best_f1_thresh": 0.0
}
```
### Model in action 🚀
Fast usage with **pipelines**:
```python
from transformers import pipeline
QnA_pipeline = pipeline('question-answering', model='mrm8488/electra-small-finetuned-squadv2')
QnA_pipeline({
    'context': 'A new strain of flu that has the potential to become a pandemic has been identified in China by scientists.',
    'question': 'What has been discovered by scientists from China ?'
})
# Output:
# {'answer': 'A new strain of flu', 'end': 19, 'score': 0.8650811568752914, 'start': 0}
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: es
thumbnail: https://i.imgur.com/uxAvBfh.png
---
## ELECTRICIDAD: The Spanish Electra
**Electricidad-base-discriminator** (uncased) is a ```base``` Electra-like model (discriminator in this case) trained on 20+ GB of the [OSCAR](https://oscar-corpus.com/) Spanish corpus.
As mentioned in the original [paper](https://openreview.net/pdf?id=r1xMH1BtvB):
**ELECTRA** is a new method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a [GAN](https://arxiv.org/pdf/1406.2661.pdf). At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) dataset.
For a detailed description and experimental results, please refer to the paper [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://openreview.net/pdf?id=r1xMH1BtvB).
## Model details ⚙
|Name| # Value|
|-----|--------|
|Layers| 12 |
|Hidden |768 |
|Params| 110M|
## Evaluation metrics (for discriminator) 🧾
|Metric | # Score |
|-------|---------|
|Accuracy| 0.985|
|Precision| 0.726|
|AUC | 0.922|
## Fast example of usage 🚀
```python
from transformers import ElectraForPreTraining, ElectraTokenizerFast
import torch
discriminator = ElectraForPreTraining.from_pretrained("mrm8488/electricidad-base-discriminator")
tokenizer = ElectraTokenizerFast.from_pretrained("mrm8488/electricidad-base-discriminator")

sentence = "El rápido zorro marrón salta sobre el perro perezoso"
fake_sentence = "El rápido zorro marrón amar sobre el perro perezoso"

fake_tokens = tokenizer.tokenize(fake_sentence)
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")
discriminator_outputs = discriminator(fake_inputs)
# Round the raw logits to 0/1 predictions: 1 means the token is flagged as fake.
predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)

[print("%7s" % token, end="") for token in fake_tokens]
[print("%7s" % prediction, end="") for prediction in predictions.squeeze().tolist()]
# Output:
'''
el rapido zorro marro ##n amar sobre el perro pere ##zoso
0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
'''
```
As you can see there are **1s** in the places where the model detected a fake token. So, it works! 🎉
### Some models fine-tuned on a downstream task 🛠️
[Question Answering](https://huggingface.co/mrm8488/electricidad-base-finetuned-squadv1-es)
[POS](https://huggingface.co/mrm8488/electricidad-base-finetuned-pos)
[NER](https://huggingface.co/mrm8488/electricidad-base-finetuned-ner)
[Paraphrase Identification](https://huggingface.co/mrm8488/RuPERTa-base-finetuned-pawsx-es)
## Acknowledgments
I thank the [🤗/transformers team](https://github.com/huggingface/transformers) for allowing me to train the model (especially [Julien Chaumond](https://twitter.com/julien_c)).
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
---
language: es
datasets:
- xtreme
widget:
- text: "El río Tabaci es una vertiente del río Leurda en Rumania. El río Leurda es un afluente del río Tabaci en Rumania."
---
# Electricidad-base fine-tuned on PAWS-X-es for Paraphrase Identification
---
language: es
thumbnail: https://i.imgur.com/uxAvBfh.png
widget:
- text: "Madrid es una ciudad muy [MASK] en España."
---
## ELECTRICIDAD: The Spanish Electra
**Electricidad-base-generator** (uncased) is a ```base``` Electra-like model (generator in this case) trained on 20+ GB of the [OSCAR](https://oscar-corpus.com/) Spanish corpus.
As mentioned in the original [paper](https://openreview.net/pdf?id=r1xMH1BtvB):
**ELECTRA** is a new method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a [GAN](https://arxiv.org/pdf/1406.2661.pdf). At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) dataset.
For a detailed description and experimental results, please refer to the paper [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://openreview.net/pdf?id=r1xMH1BtvB).
## Fast example of usage 🚀
```python
from transformers import pipeline
fill_mask = pipeline(
    "fill-mask",
    model="mrm8488/electricidad-base-generator",
    tokenizer="mrm8488/electricidad-base-generator"
)
print(
fill_mask(f"HuggingFace está creando {fill_mask.tokenizer.mask_token} que la comunidad usa para resolver tareas de NLP.")
)
# Output: [{'sequence': '[CLS] huggingface esta creando herramientas que la comunidad usa para resolver tareas de nlp. [SEP]', 'score': 0.0896105170249939, 'token': 8760, 'token_str': 'herramientas'}, ...]
```
## Acknowledgments
I thank the [🤗/transformers team](https://github.com/huggingface/transformers) for allowing me to train the model (especially [Julien Chaumond](https://twitter.com/julien_c)).
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain