Unverified Commit 3552d0e0 authored by Julien Chaumond's avatar Julien Chaumond Committed by GitHub
Browse files

[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)



* rm all model cards

* Update the .rst

@sgugger it is still not super crystal clear/streamlined so let me know if any ideas to make it simpler

* Add a rootlevel README.md with simple instructions/context

* Update docs/source/model_sharing.rst
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply suggestions from code review
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: default avatarPatrick von Platen <patrick.v.platen@gmail.com>

* make style

* rm all model cards
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: default avatarPatrick von Platen <patrick.v.platen@gmail.com>
parent 29e45979
---
language: fr
---
# camembert-base-fquad
## Description
A baseline model for question-answering in french ([CamemBERT](https://camembert-model.fr/) model fine-tuned on [FQuAD](https://fquad.illuin.tech/))
## Training hyperparameters
```shell
python3 ./examples/question-answering/run_squad.py \
--model_type camembert \
--model_name_or_path camembert-base \
--do_train \
--do_eval \
--do_lower_case \
--train_file train.json \
--predict_file valid.json \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir output \
--per_gpu_eval_batch_size=3 \
--per_gpu_train_batch_size=3 \
--save_steps 10000
```
## Evaluation results
```shell
{"f1": 77.24515316052342, "exact_match": 52.82308657465496}
```
## Usage
```python
from transformers import pipeline
nlp = pipeline('question-answering', model='fmikaelian/camembert-base-fquad', tokenizer='fmikaelian/camembert-base-fquad')
nlp({
'question': "Qui est Claude Monet?",
'context': "Claude Monet, né le 14 novembre 1840 à Paris et mort le 5 décembre 1926 à Giverny, est un peintre français et l’un des fondateurs de l'impressionnisme."
})
```
\ No newline at end of file
---
language: fr
---
# camembert-base-squad
## Description
A baseline model for question-answering in french ([CamemBERT](https://camembert-model.fr/) model fine-tuned on [french-translated SQuAD 1.1 dataset](https://github.com/Alikabbadj/French-SQuAD))
## Training hyperparameters
```shell
python3 ./examples/question-answering/run_squad.py \
--model_type camembert \
--model_name_or_path camembert-base \
--do_train \
--do_eval \
--do_lower_case \
--train_file SQuAD-v1.1-train_fr_ss999_awstart2_net.json \
--predict_file SQuAD-v1.1-dev_fr_ss999_awstart2_net.json \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir output3 \
--per_gpu_eval_batch_size=3 \
--per_gpu_train_batch_size=3 \
--save_steps 10000
```
## Evaluation results
```shell
{"f1": 79.8570684959745, "exact_match": 59.21327108373895}
```
## Usage
```python
from transformers import pipeline
nlp = pipeline('question-answering', model='fmikaelian/camembert-base-squad', tokenizer='fmikaelian/camembert-base-squad')
nlp({
'question': "Qui est Claude Monet?",
'context': "Claude Monet, né le 14 novembre 1840 à Paris et mort le 5 décembre 1926 à Giverny, est un peintre français et l’un des fondateurs de l'impressionnisme."
})
```
\ No newline at end of file
---
language: fr
---
# flaubert-base-uncased-squad
## Description
A baseline model for question-answering in french ([flaubert](https://github.com/getalp/Flaubert) model fine-tuned on [french-translated SQuAD 1.1 dataset](https://github.com/Alikabbadj/French-SQuAD))
## Training hyperparameters
```shell
python3 ./examples/question-answering/run_squad.py \
--model_type flaubert \
--model_name_or_path flaubert-base-uncased \
--do_train \
--do_eval \
--do_lower_case \
--train_file SQuAD-v1.1-train_fr_ss999_awstart2_net.json \
--predict_file SQuAD-v1.1-dev_fr_ss999_awstart2_net.json \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir output \
--per_gpu_eval_batch_size=3 \
--per_gpu_train_batch_size=3
```
## Evaluation results
```shell
{"f1": 68.66174806561969, "exact_match": 49.299692063176714}
```
## Usage
```python
from transformers import pipeline
nlp = pipeline('question-answering', model='fmikaelian/flaubert-base-uncased-squad', tokenizer='fmikaelian/flaubert-base-uncased-squad')
nlp({
'question': "Qui est Claude Monet?",
'context': "Claude Monet, né le 14 novembre 1840 à Paris et mort le 5 décembre 1926 à Giverny, est un peintre français et l’un des fondateurs de l'impressionnisme."
})
```
\ No newline at end of file
---
language: scientific english
---
# SciBERT finetuned on JNLPA for NER downstream task
## Language Model
[SciBERT](https://arxiv.org/pdf/1903.10676.pdf) is a pretrained language model based on BERT and trained by the
[Allen Institute for AI](https://allenai.org/) on papers from the corpus of
[Semantic Scholar](https://www.semanticscholar.org/).
Corpus size is 1.14M papers, 3.1B tokens. SciBERT has its own vocabulary (scivocab) that's built to best match
the training corpus.
## Downstream task
[`allenai/scibert_scivocab_cased`](https://huggingface.co/allenai/scibert_scivocab_cased#) has been finetuned for Named Entity
Recognition (NER) dowstream task. The code to train the NER can be found [here](https://github.com/fran-martinez/bio_ner_bert).
### Data
The corpus used to fine-tune the NER is [BioNLP / JNLPBA shared task](http://www.geniaproject.org/shared-tasks/bionlp-jnlpba-shared-task-2004).
- Training data consist of 2,000 PubMed abstracts with term/word annotation. This corresponds to 18,546 samples (senteces).
- Evaluation data consist of 404 PubMed abstracts with term/word annotation. This corresponds to 3,856 samples (sentences).
The classes (at word level) and its distribution (number of examples for each class) for training and evaluation datasets are shown below:
| Class Label | # training examples| # evaluation examples|
|:--------------|--------------:|----------------:|
|O | 382,963 | 81,647 |
|B-protein | 30,269 | 5,067 |
|I-protein | 24,848 | 4,774 |
|B-cell_type | 6,718 | 1,921 |
|I-cell_type | 8,748 | 2,991 |
|B-DNA | 9,533 | 1,056 |
|I-DNA | 15,774 | 1,789 |
|B-cell_line | 3,830 | 500 |
|I-cell_line | 7,387 | 9,89 |
|B-RNA | 951 | 118 |
|I-RNA | 1,530 | 187 |
### Model
An exhaustive hyperparameter search was done.
The hyperparameters that provided the best results are:
- Max length sequence: 128
- Number of epochs: 6
- Batch size: 32
- Dropout: 0.3
- Optimizer: Adam
The used learning rate was 5e-5 with a decreasing linear schedule. A warmup was used at the beggining of the training
with a ratio of steps equal to 0.1 from the total training steps.
The model from the epoch with the best F1-score was selected, in this case, the model from epoch 5.
### Evaluation
The following table shows the evaluation metrics calculated at span/entity level:
| | precision| recall| f1-score|
|:---------|-----------:|---------:|---------:|
cell_line | 0.5205 | 0.7100 | 0.6007 |
cell_type | 0.7736 | 0.7422 | 0.7576 |
protein | 0.6953 | 0.8459 | 0.7633 |
DNA | 0.6997 | 0.7894 | 0.7419 |
RNA | 0.6985 | 0.8051 | 0.7480 |
| | | |
**micro avg** | 0.6984 | 0.8076 | 0.7490|
**macro avg** | 0.7032 | 0.8076 | 0.7498 |
The macro F1-score is equal to 0.7498, compared to the value provided by the Allen Institute for AI in their
[paper](https://arxiv.org/pdf/1903.10676.pdf), which is equal to 0.7728. This drop in performance could be due to
several reasons, but one hypothesis could be the fact that the authors used an additional conditional random field,
while this model uses a regular classification layer with softmax activation on top of SciBERT model.
At word level, this model achieves a precision of 0.7742, a recall of 0.8536 and a F1-score of 0.8093.
### Model usage in inference
Use the pipeline:
````python
from transformers import pipeline
text = "Mouse thymus was used as a source of glucocorticoid receptor from normal CS lymphocytes."
nlp_ner = pipeline("ner",
model='fran-martinez/scibert_scivocab_cased_ner_jnlpba',
tokenizer='fran-martinez/scibert_scivocab_cased_ner_jnlpba')
nlp_ner(text)
"""
Output:
---------------------------
[
{'word': 'glucocorticoid',
'score': 0.9894881248474121,
'entity': 'B-protein'},
{'word': 'receptor',
'score': 0.989505410194397,
'entity': 'I-protein'},
{'word': 'normal',
'score': 0.7680378556251526,
'entity': 'B-cell_type'},
{'word': 'cs',
'score': 0.5176806449890137,
'entity': 'I-cell_type'},
{'word': 'lymphocytes',
'score': 0.9898491501808167,
'entity': 'I-cell_type'}
]
"""
````
Or load model and tokenizer as follows:
````python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
# Example
text = "Mouse thymus was used as a source of glucocorticoid receptor from normal CS lymphocytes."
# Load model
tokenizer = AutoTokenizer.from_pretrained("fran-martinez/scibert_scivocab_cased_ner_jnlpba")
model = AutoModelForTokenClassification.from_pretrained("fran-martinez/scibert_scivocab_cased_ner_jnlpba")
# Get input for BERT
input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
# Predict
with torch.no_grad():
outputs = model(input_ids)
# From the output let's take the first element of the tuple.
# Then, let's get rid of [CLS] and [SEP] tokens (first and last)
predictions = outputs[0].argmax(axis=-1)[0][1:-1]
# Map label class indexes to string labels.
for token, pred in zip(tokenizer.tokenize(text), predictions):
print(token, '->', model.config.id2label[pred.numpy().item()])
"""
Output:
---------------------------
mouse -> O
thymus -> O
was -> O
used -> O
as -> O
a -> O
source -> O
of -> O
glucocorticoid -> B-protein
receptor -> I-protein
from -> O
normal -> B-cell_type
cs -> I-cell_type
lymphocytes -> I-cell_type
. -> O
"""
````
---
language: en
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
- gigaword
---
# Funnel Transformer intermediate model (B6-6-6 without decoder)
Pretrained model on English language using a similar objective objective as [ELECTRA](https://huggingface.co/transformers/model_doc/electra.html). It was introduced in
[this paper](https://arxiv.org/pdf/2006.03236.pdf) and first released in
[this repository](https://github.com/laiguokun/Funnel-Transformer). This model is uncased: it does not make a difference
between english and English.
Disclaimer: The team releasing Funnel Transformer did not write a model card for this model so this model card has been
written by the Hugging Face team.
## Model description
Funnel Transformer is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it
was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data) with an automatic process to generate inputs and labels from those texts.
More precisely, a small language model corrupts the input texts and serves as a generator of inputs for this model, and
the pretraining objective is to predict which token is an original and which one has been replaced, a bit like a GAN training.
This way, the model learns an inner representation of the English language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard
classifier using the features produced by the BERT model as inputs.
**Note:** This model does not contain the decoder, so it ouputs hidden states that have a sequence length of one fourth
of the inputs. It's good to use for tasks requiring a summary of the sentence (like sentence classification) but not if
you need one input per initial token. You should use the `intermediate` model in that case.
## Intended uses & limitations
You can use the raw model to extract a vector representation of a given text, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=funnel-transformer) to look for
fine-tuned versions on a task that interests you.
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at model like GPT2.
### How to use
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import FunnelTokenizer, FunnelBaseModel
tokenizer = FunnelTokenizer.from_pretrained("funnel-transformer/intermediate-base")
model = FunnelBaseModel.from_pretrained("funnel-transformer/intermediate-base")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import FunnelTokenizer, TFFunnelBaseModel
tokenizer = FunnelTokenizer.from_pretrained("funnel-transformer/intermediate-base")
model = TFFunnelBaseModel.from_pretrained("funnel-transformer/intermediate-base")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
## Training data
The BERT model was pretrained on:
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books,
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers),
- [Clue Web](https://lemurproject.org/clueweb12/), a dataset of 733,019,372 English web pages,
- [GigaWord](https://catalog.ldc.upenn.edu/LDC2011T07), an archive of newswire text data,
- [Common Crawl](https://commoncrawl.org/), a dataset of raw web pages.
### BibTeX entry and citation info
```bibtex
@misc{dai2020funneltransformer,
title={Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing},
author={Zihang Dai and Guokun Lai and Yiming Yang and Quoc V. Le},
year={2020},
eprint={2006.03236},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
---
language: en
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
- gigaword
---
# Funnel Transformer intermediate model (B6-6-6 with decoder)
Pretrained model on English language using a similar objective objective as [ELECTRA](https://huggingface.co/transformers/model_doc/electra.html). It was introduced in
[this paper](https://arxiv.org/pdf/2006.03236.pdf) and first released in
[this repository](https://github.com/laiguokun/Funnel-Transformer). This model is uncased: it does not make a difference
between english and English.
Disclaimer: The team releasing Funnel Transformer did not write a model card for this model so this model card has been
written by the Hugging Face team.
## Model description
Funnel Transformer is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it
was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data) with an automatic process to generate inputs and labels from those texts.
More precisely, a small language model corrupts the input texts and serves as a generator of inputs for this model, and
the pretraining objective is to predict which token is an original and which one has been replaced, a bit like a GAN training.
This way, the model learns an inner representation of the English language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard
classifier using the features produced by the BERT model as inputs.
## Intended uses & limitations
You can use the raw model to extract a vector representation of a given text, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=funnel-transformer) to look for
fine-tuned versions on a task that interests you.
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at model like GPT2.
### How to use
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import FunnelTokenizer, FunnelModel
tokenizer = FunnelTokenizer.from_pretrained("funnel-transformer/intermediate")
model = FunneModel.from_pretrained("funnel-transformer/intermediate")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import FunnelTokenizer, TFFunnelModel
tokenizer = FunnelTokenizer.from_pretrained("funnel-transformer/intermediate")
model = TFFunnelModel.from_pretrained("funnel-transformer/intermediatesmall")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
## Training data
The BERT model was pretrained on:
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books,
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers),
- [Clue Web](https://lemurproject.org/clueweb12/), a dataset of 733,019,372 English web pages,
- [GigaWord](https://catalog.ldc.upenn.edu/LDC2011T07), an archive of newswire text data,
- [Common Crawl](https://commoncrawl.org/), a dataset of raw web pages.
### BibTeX entry and citation info
```bibtex
@misc{dai2020funneltransformer,
title={Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing},
author={Zihang Dai and Guokun Lai and Yiming Yang and Quoc V. Le},
year={2020},
eprint={2006.03236},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
---
language: en
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
- gigaword
---
# Funnel Transformer large model (B8-8-8 without decoder)
Pretrained model on English language using a similar objective objective as [ELECTRA](https://huggingface.co/transformers/model_doc/electra.html). It was introduced in
[this paper](https://arxiv.org/pdf/2006.03236.pdf) and first released in
[this repository](https://github.com/laiguokun/Funnel-Transformer). This model is uncased: it does not make a difference
between english and English.
Disclaimer: The team releasing Funnel Transformer did not write a model card for this model so this model card has been
written by the Hugging Face team.
## Model description
Funnel Transformer is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it
was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data) with an automatic process to generate inputs and labels from those texts.
More precisely, a small language model corrupts the input texts and serves as a generator of inputs for this model, and
the pretraining objective is to predict which token is an original and which one has been replaced, a bit like a GAN training.
This way, the model learns an inner representation of the English language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard
classifier using the features produced by the BERT model as inputs.
**Note:** This model does not contain the decoder, so it ouputs hidden states that have a sequence length of one fourth
of the inputs. It's good to use for tasks requiring a summary of the sentence (like sentence classification) but not if
you need one input per initial token. You should use the `large` model in that case.
## Intended uses & limitations
You can use the raw model to extract a vector representation of a given text, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=funnel-transformer) to look for
fine-tuned versions on a task that interests you.
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at model like GPT2.
### How to use
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import FunnelTokenizer, FunnelBaseModel
tokenizer = FunnelTokenizer.from_pretrained("funnel-transformer/large-base")
model = FunnelBaseModel.from_pretrained("funnel-transformer/large-base")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import FunnelTokenizer, TFFunnelBaseModel
tokenizer = FunnelTokenizer.from_pretrained("funnel-transformer/large-base")
model = TFFunnelBaseModel.from_pretrained("funnel-transformer/large-base")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
## Training data
The BERT model was pretrained on:
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books,
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers),
- [Clue Web](https://lemurproject.org/clueweb12/), a dataset of 733,019,372 English web pages,
- [GigaWord](https://catalog.ldc.upenn.edu/LDC2011T07), an archive of newswire text data,
- [Common Crawl](https://commoncrawl.org/), a dataset of raw web pages.
### BibTeX entry and citation info
```bibtex
@misc{dai2020funneltransformer,
title={Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing},
author={Zihang Dai and Guokun Lai and Yiming Yang and Quoc V. Le},
year={2020},
eprint={2006.03236},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
---
language: en
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
- gigaword
---
# Funnel Transformer large model (B8-8-8 with decoder)
Pretrained model on English language using a similar objective objective as [ELECTRA](https://huggingface.co/transformers/model_doc/electra.html). It was introduced in
[this paper](https://arxiv.org/pdf/2006.03236.pdf) and first released in
[this repository](https://github.com/laiguokun/Funnel-Transformer). This model is uncased: it does not make a difference
between english and English.
Disclaimer: The team releasing Funnel Transformer did not write a model card for this model so this model card has been
written by the Hugging Face team.
## Model description
Funnel Transformer is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it
was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data) with an automatic process to generate inputs and labels from those texts.
More precisely, a small language model corrupts the input texts and serves as a generator of inputs for this model, and
the pretraining objective is to predict which token is an original and which one has been replaced, a bit like a GAN training.
This way, the model learns an inner representation of the English language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard
classifier using the features produced by the BERT model as inputs.
## Intended uses & limitations
You can use the raw model to extract a vector representation of a given text, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=funnel-transformer) to look for
fine-tuned versions on a task that interests you.
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at model like GPT2.
### How to use
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import FunnelTokenizer, FunnelModel
tokenizer = FunnelTokenizer.from_pretrained("funnel-transformer/large")
model = FunneModel.from_pretrained("funnel-transformer/large")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import FunnelTokenizer, TFFunnelModel
tokenizer = FunnelTokenizer.from_pretrained("funnel-transformer/large")
model = TFFunnelModel.from_pretrained("funnel-transformer/large")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
## Training data
The BERT model was pretrained on:
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books,
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers),
- [Clue Web](https://lemurproject.org/clueweb12/), a dataset of 733,019,372 English web pages,
- [GigaWord](https://catalog.ldc.upenn.edu/LDC2011T07), an archive of newswire text data,
- [Common Crawl](https://commoncrawl.org/), a dataset of raw web pages.
### BibTeX entry and citation info
```bibtex
@misc{dai2020funneltransformer,
title={Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing},
author={Zihang Dai and Guokun Lai and Yiming Yang and Quoc V. Le},
year={2020},
eprint={2006.03236},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
---
language: en
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
- gigaword
---
# Funnel Transformer medium model (B6-3x2-3x2 without decoder)
Pretrained model on English language using a similar objective objective as [ELECTRA](https://huggingface.co/transformers/model_doc/electra.html). It was introduced in
[this paper](https://arxiv.org/pdf/2006.03236.pdf) and first released in
[this repository](https://github.com/laiguokun/Funnel-Transformer). This model is uncased: it does not make a difference
between english and English.
Disclaimer: The team releasing Funnel Transformer did not write a model card for this model so this model card has been
written by the Hugging Face team.
## Model description
Funnel Transformer is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it
was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data) with an automatic process to generate inputs and labels from those texts.
More precisely, a small language model corrupts the input texts and serves as a generator of inputs for this model, and
the pretraining objective is to predict which token is an original and which one has been replaced, a bit like a GAN training.
This way, the model learns an inner representation of the English language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard
classifier using the features produced by the BERT model as inputs.
**Note:** This model does not contain the decoder, so it ouputs hidden states that have a sequence length of one fourth
of the inputs. It's good to use for tasks requiring a summary of the sentence (like sentence classification) but not if
you need one input per initial token. You should use the `medium` model in that case.
## Intended uses & limitations
You can use the raw model to extract a vector representation of a given text, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=funnel-transformer) to look for
fine-tuned versions on a task that interests you.
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at model like GPT2.
### How to use
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import FunnelTokenizer, FunnelBaseModel
tokenizer = FunnelTokenizer.from_pretrained("funnel-transformer/medium-base")
model = FunnelBaseModel.from_pretrained("funnel-transformer/medium-base")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import FunnelTokenizer, TFFunnelBaseModel
tokenizer = FunnelTokenizer.from_pretrained("funnel-transformer/medium-base")
model = TFFunnelBaseModel.from_pretrained("funnel-transformer/medium-base")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
## Training data
The BERT model was pretrained on:
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books,
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers),
- [Clue Web](https://lemurproject.org/clueweb12/), a dataset of 733,019,372 English web pages,
- [GigaWord](https://catalog.ldc.upenn.edu/LDC2011T07), an archive of newswire text data,
- [Common Crawl](https://commoncrawl.org/), a dataset of raw web pages.
### BibTeX entry and citation info
```bibtex
@misc{dai2020funneltransformer,
title={Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing},
author={Zihang Dai and Guokun Lai and Yiming Yang and Quoc V. Le},
year={2020},
eprint={2006.03236},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
---
language: en
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
- gigaword
---
# Funnel Transformer medium model (B6-3x2-3x2 with decoder)
Pretrained model on English language using a similar objective objective as [ELECTRA](https://huggingface.co/transformers/model_doc/electra.html). It was introduced in
[this paper](https://arxiv.org/pdf/2006.03236.pdf) and first released in
[this repository](https://github.com/laiguokun/Funnel-Transformer). This model is uncased: it does not make a difference
between english and English.
Disclaimer: The team releasing Funnel Transformer did not write a model card for this model so this model card has been
written by the Hugging Face team.
## Model description
Funnel Transformer is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it
was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data) with an automatic process to generate inputs and labels from those texts.
More precisely, a small language model corrupts the input texts and serves as a generator of inputs for this model, and
the pretraining objective is to predict which token is an original and which one has been replaced, a bit like a GAN training.
This way, the model learns an inner representation of the English language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard
classifier using the features produced by the BERT model as inputs.
## Intended uses & limitations
You can use the raw model to extract a vector representation of a given text, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=funnel-transformer) to look for
fine-tuned versions on a task that interests you.
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at model like GPT2.
### How to use
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import FunnelTokenizer, FunnelModel
tokenizer = FunnelTokenizer.from_pretrained("funnel-transformer/medium")
model = FunneModel.from_pretrained("funnel-transformer/medium")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import FunnelTokenizer, TFFunnelModel
tokenizer = FunnelTokenizer.from_pretrained("funnel-transformer/medium")
model = TFFunnelModel.from_pretrained("funnel-transformer/medium")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
## Training data
The BERT model was pretrained on:
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books,
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers),
- [Clue Web](https://lemurproject.org/clueweb12/), a dataset of 733,019,372 English web pages,
- [GigaWord](https://catalog.ldc.upenn.edu/LDC2011T07), an archive of newswire text data,
- [Common Crawl](https://commoncrawl.org/), a dataset of raw web pages.
### BibTeX entry and citation info
```bibtex
@misc{dai2020funneltransformer,
title={Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing},
author={Zihang Dai and Guokun Lai and Yiming Yang and Quoc V. Le},
year={2020},
eprint={2006.03236},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
---
language: en
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
- gigaword
---
# Funnel Transformer small model (B4-4-4 without decoder)
Pretrained model on English language using a similar objective objective as [ELECTRA](https://huggingface.co/transformers/model_doc/electra.html). It was introduced in
[this paper](https://arxiv.org/pdf/2006.03236.pdf) and first released in
[this repository](https://github.com/laiguokun/Funnel-Transformer). This model is uncased: it does not make a difference
between english and English.
Disclaimer: The team releasing Funnel Transformer did not write a model card for this model so this model card has been
written by the Hugging Face team.
## Model description
Funnel Transformer is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it
was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data) with an automatic process to generate inputs and labels from those texts.
More precisely, a small language model corrupts the input texts and serves as a generator of inputs for this model, and
the pretraining objective is to predict which token is an original and which one has been replaced, a bit like a GAN training.
This way, the model learns an inner representation of the English language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard
classifier using the features produced by the BERT model as inputs.
**Note:** This model does not contain the decoder, so it ouputs hidden states that have a sequence length of one fourth
of the inputs. It's good to use for tasks requiring a summary of the sentence (like sentence classification) but not if
you need one input per initial token. You should use the `small` model in that case.
## Intended uses & limitations
You can use the raw model to extract a vector representation of a given text, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=funnel-transformer) to look for
fine-tuned versions on a task that interests you.
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at model like GPT2.
### How to use
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import FunnelTokenizer, FunnelBaseModel
tokenizer = FunnelTokenizer.from_pretrained("funnel-transformer/small-base")
model = FunnelBaseModel.from_pretrained("funnel-transformer/small-base")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import FunnelTokenizer, TFFunnelBaseModel
tokenizer = FunnelTokenizer.from_pretrained("funnel-transformer/small-base")
model = TFFunnelBaseModel.from_pretrained("funnel-transformer/small-base")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
## Training data
The BERT model was pretrained on:
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books,
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers),
- [Clue Web](https://lemurproject.org/clueweb12/), a dataset of 733,019,372 English web pages,
- [GigaWord](https://catalog.ldc.upenn.edu/LDC2011T07), an archive of newswire text data,
- [Common Crawl](https://commoncrawl.org/), a dataset of raw web pages.
### BibTeX entry and citation info
```bibtex
@misc{dai2020funneltransformer,
title={Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing},
author={Zihang Dai and Guokun Lai and Yiming Yang and Quoc V. Le},
year={2020},
eprint={2006.03236},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
---
language: en
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
- gigaword
---
# Funnel Transformer small model (B4-4-4 with decoder)
Pretrained model on English language using a similar objective objective as [ELECTRA](https://huggingface.co/transformers/model_doc/electra.html). It was introduced in
[this paper](https://arxiv.org/pdf/2006.03236.pdf) and first released in
[this repository](https://github.com/laiguokun/Funnel-Transformer). This model is uncased: it does not make a difference
between english and English.
Disclaimer: The team releasing Funnel Transformer did not write a model card for this model so this model card has been
written by the Hugging Face team.
## Model description
Funnel Transformer is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it
was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data) with an automatic process to generate inputs and labels from those texts.
More precisely, a small language model corrupts the input texts and serves as a generator of inputs for this model, and
the pretraining objective is to predict which token is an original and which one has been replaced, a bit like a GAN training.
This way, the model learns an inner representation of the English language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard
classifier using the features produced by the BERT model as inputs.
## Intended uses & limitations
You can use the raw model to extract a vector representation of a given text, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=funnel-transformer) to look for
fine-tuned versions on a task that interests you.
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at model like GPT2.
### How to use
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import FunnelTokenizer, FunnelModel
tokenizer = FunnelTokenizer.from_pretrained("funnel-transformer/small")
model = FunneModel.from_pretrained("funnel-transformer/small")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import FunnelTokenizer, TFFunnelModel
tokenizer = FunnelTokenizer.from_pretrained("funnel-transformer/small")
model = TFFunnelModel.from_pretrained("funnel-transformer/small")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
## Training data
The BERT model was pretrained on:
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books,
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers),
- [Clue Web](https://lemurproject.org/clueweb12/), a dataset of 733,019,372 English web pages,
- [GigaWord](https://catalog.ldc.upenn.edu/LDC2011T07), an archive of newswire text data,
- [Common Crawl](https://commoncrawl.org/), a dataset of raw web pages.
### BibTeX entry and citation info
```bibtex
@misc{dai2020funneltransformer,
title={Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing},
author={Zihang Dai and Guokun Lai and Yiming Yang and Quoc V. Le},
year={2020},
eprint={2006.03236},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
---
language: en
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
- gigaword
---
# Funnel Transformer xlarge model (B10-10-10 without decoder)
Pretrained model on English language using a similar objective objective as [ELECTRA](https://huggingface.co/transformers/model_doc/electra.html). It was introduced in
[this paper](https://arxiv.org/pdf/2006.03236.pdf) and first released in
[this repository](https://github.com/laiguokun/Funnel-Transformer). This model is uncased: it does not make a difference
between english and English.
Disclaimer: The team releasing Funnel Transformer did not write a model card for this model so this model card has been
written by the Hugging Face team.
## Model description
Funnel Transformer is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it
was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data) with an automatic process to generate inputs and labels from those texts.
More precisely, a small language model corrupts the input texts and serves as a generator of inputs for this model, and
the pretraining objective is to predict which token is an original and which one has been replaced, a bit like a GAN training.
This way, the model learns an inner representation of the English language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard
classifier using the features produced by the BERT model as inputs.
**Note:** This model does not contain the decoder, so it ouputs hidden states that have a sequence length of one fourth
of the inputs. It's good to use for tasks requiring a summary of the sentence (like sentence classification) but not if
you need one input per initial token. You should use the `xlarge` model in that case.
## Intended uses & limitations
You can use the raw model to extract a vector representation of a given text, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=funnel-transformer) to look for
fine-tuned versions on a task that interests you.
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at model like GPT2.
### How to use
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import FunnelTokenizer, FunnelBaseModel
tokenizer = FunnelTokenizer.from_pretrained("funnel-transformer/xlarge-base")
model = FunnelBaseModel.from_pretrained("funnel-transformer/xlarge-base")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import FunnelTokenizer, TFFunnelBaseModel
tokenizer = FunnelTokenizer.from_pretrained("funnel-transformer/xlarge-base")
model = TFFunnelBaseModel.from_pretrained("funnel-transformer/xlarge-base")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
## Training data
The BERT model was pretrained on:
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books,
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers),
- [Clue Web](https://lemurproject.org/clueweb12/), a dataset of 733,019,372 English web pages,
- [GigaWord](https://catalog.ldc.upenn.edu/LDC2011T07), an archive of newswire text data,
- [Common Crawl](https://commoncrawl.org/), a dataset of raw web pages.
### BibTeX entry and citation info
```bibtex
@misc{dai2020funneltransformer,
title={Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing},
author={Zihang Dai and Guokun Lai and Yiming Yang and Quoc V. Le},
year={2020},
eprint={2006.03236},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
---
language: en
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
- gigaword
---
# Funnel Transformer xlarge model (B10-10-10 with decoder)
Pretrained model on English language using a similar objective objective as [ELECTRA](https://huggingface.co/transformers/model_doc/electra.html). It was introduced in
[this paper](https://arxiv.org/pdf/2006.03236.pdf) and first released in
[this repository](https://github.com/laiguokun/Funnel-Transformer). This model is uncased: it does not make a difference
between english and English.
Disclaimer: The team releasing Funnel Transformer did not write a model card for this model so this model card has been
written by the Hugging Face team.
## Model description
Funnel Transformer is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it
was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data) with an automatic process to generate inputs and labels from those texts.
More precisely, a small language model corrupts the input texts and serves as a generator of inputs for this model, and
the pretraining objective is to predict which token is an original and which one has been replaced, a bit like a GAN training.
This way, the model learns an inner representation of the English language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard
classifier using the features produced by the BERT model as inputs.
## Intended uses & limitations
You can use the raw model to extract a vector representation of a given text, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=funnel-transformer) to look for
fine-tuned versions on a task that interests you.
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at model like GPT2.
### How to use
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import FunnelTokenizer, FunnelModel
tokenizer = FunnelTokenizer.from_pretrained("funnel-transformer/xlarge")
model = FunneModel.from_pretrained("funnel-transformer/xlarge")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import FunnelTokenizer, TFFunnelModel
tokenizer = FunnelTokenizer.from_pretrained("funnel-transformer/xlarge")
model = TFFunnelModel.from_pretrained("funnel-transformer/xlarge")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
## Training data
The BERT model was pretrained on:
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books,
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers),
- [Clue Web](https://lemurproject.org/clueweb12/), a dataset of 733,019,372 English web pages,
- [GigaWord](https://catalog.ldc.upenn.edu/LDC2011T07), an archive of newswire text data,
- [Common Crawl](https://commoncrawl.org/), a dataset of raw web pages.
### BibTeX entry and citation info
```bibtex
@misc{dai2020funneltransformer,
title={Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing},
author={Zihang Dai and Guokun Lai and Yiming Yang and Quoc V. Le},
year={2020},
eprint={2006.03236},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
---
language:
- hi-en
tags:
- sentiment
- multilingual
- hindi codemix
- hinglish
license: apache-2.0
datasets:
- sail
---
# Sentiment Classification for hinglish text: `gk-hinglish-sentiment`
## Model description
Trained small amount of reviews dataset
## Intended uses & limitations
I wanted something to work well with hinglish data as it is being used in India mostly.
The training data was not much as expected
#### How to use
```python
#sample code
from transformers import BertTokenizer, BertForSequenceClassification
tokenizerg = BertTokenizer.from_pretrained("/content/model")
modelg = BertForSequenceClassification.from_pretrained("/content/model")
text = "kuch bhi type karo hinglish mai"
encoded_input = tokenizerg(text, return_tensors='pt')
output = modelg(**encoded_input)
print(output)
#output contains 3 lables LABEL_0 = Negative ,LABEL_1 = Nuetral ,LABEL_2 = Positive
```
#### Limitations and bias
The data contains only hinglish codemixed text it and was very much limited may be I will Update this model if I can get good amount of data
## Training data
Training data contains labeled data for 3 labels
link to the pre-trained model card with description of the pre-training data.
I have Tuned below model
https://huggingface.co/rohanrajpal/bert-base-multilingual-codemixed-cased-sentiment
### BibTeX entry and citation info
```@inproceedings{khanuja-etal-2020-gluecos,
title = "{GLUEC}o{S}: An Evaluation Benchmark for Code-Switched {NLP}",
author = "Khanuja, Simran and
Dandapat, Sandipan and
Srinivasan, Anirudh and
Sitaram, Sunayana and
Choudhury, Monojit",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.329",
pages = "3575--3585"
}
```
## Generating Chinese poetry by topic.
```python
from transformers import *
tokenizer = BertTokenizer.from_pretrained("gaochangkuan/model_dir")
model = AutoModelWithLMHead.from_pretrained("gaochangkuan/model_dir")
prompt= '''<s>田园躬耕'''
length= 84
stop_token='</s>'
temperature = 1.2
repetition_penalty=1.3
k= 30
p= 0.95
device ='cuda'
seed=2020
no_cuda=False
prompt_text = prompt if prompt else input("Model prompt >>> ")
encoded_prompt = tokenizer.encode(
'<s>'+prompt_text+'<sep>',
add_special_tokens=False,
return_tensors="pt"
)
encoded_prompt = encoded_prompt.to(device)
output_sequences = model.generate(
input_ids=encoded_prompt,
max_length=length,
min_length=10,
do_sample=True,
early_stopping=True,
num_beams=10,
temperature=temperature,
top_k=k,
top_p=p,
repetition_penalty=repetition_penalty,
bad_words_ids=None,
bos_token_id=tokenizer.bos_token_id,
pad_token_id=tokenizer.pad_token_id,
eos_token_id=tokenizer.eos_token_id,
length_penalty=1.2,
no_repeat_ngram_size=2,
num_return_sequences=1,
attention_mask=None,
decoder_start_token_id=tokenizer.bos_token_id,)
generated_sequence = output_sequences[0].tolist()
text = tokenizer.decode(generated_sequence)
text = text[: text.find(stop_token) if stop_token else None]
print(''.join(text).replace(' ','').replace('<pad>','').replace('<s>',''))
```
---
language: de
license: mit
thumbnail: "https://raw.githubusercontent.com/German-NLP-Group/german-transformer-training/master/model_cards/german-electra-logo.png"
tags:
- electra
- commoncrawl
- uncased
- umlaute
- umlauts
- german
- deutsch
---
# German Electra Uncased
<img width="300px" src="https://raw.githubusercontent.com/German-NLP-Group/german-transformer-training/master/model_cards/german-electra-logo.png">
[¹]
# Model Info
This Model is suitable for Training on many downstream tasks in German (Q&A, Sentiment Analysis, etc.).
It can be used as a drop-in Replacement for **BERT** in most down-stream tasks (**ELECTRA** is even implemented as an extended **BERT** Class).
At the time of release (August 2020) this Model is the best performing publicly available German NLP Model on various German Evaluation Metrics (CONLL03-DE, GermEval18 Coarse, GermEval18 Fine). For GermEval18 Coarse results see below. More will be published soon.
# Installation
This model has the special feature that it is **uncased** but does **not strip accents**.
This possibility was added by us with [PR #6280](https://github.com/huggingface/transformers/pull/6280).
To use it you have to use Transformers version 3.1.0 or newer.
```bash
pip install transformers -U
```
# Uncase and Umlauts ('Ö', 'Ä', 'Ü')
This model is uncased. This helps especially for domains where colloquial terms with uncorrect capitalization is often used.
The special characters 'ö', 'ü', 'ä' are included through the `strip_accent=False` option, as this leads to an improved precision.
# Creators
This model was trained and open sourced in conjunction with the [**German NLP Group**](https://github.com/German-NLP-Group) in equal parts by:
- [**Philip May**](https://eniak.de) - [T-Systems on site services GmbH](https://www.t-systems-onsite.de/)
- [**Philipp Reißel**](https://www.reissel.eu) - [ambeRoad](https://amberoad.de/)
# Evaluation: GermEval18 Coarse
| Model Name | F1 macro<br/>Mean | F1 macro<br/>Median | F1 macro<br/>Std |
|---|---|---|---|
| dbmdz-bert-base-german-europeana-cased | 0.727 | 0.729 | 0.00674 |
| dbmdz-bert-base-german-europeana-uncased | 0.736 | 0.737 | 0.00476 |
| dbmdz/electra-base-german-europeana-cased-discriminator | 0.745 | 0.745 | 0.00498 |
| distilbert-base-german-cased | 0.752 | 0.752 | 0.00341 |
| bert-base-german-cased | 0.762 | 0.761 | 0.00597 |
| dbmdz/bert-base-german-cased | 0.765 | 0.765 | 0.00523 |
| dbmdz/bert-base-german-uncased | 0.770 | 0.770 | 0.00572 |
| **ELECTRA-base-german-uncased (this model)** | **0.778** | **0.778** | **0.00392** |
- (1): Hyperparameters taken from the [FARM project](https://farm.deepset.ai/) "[germEval18Coarse_config.json](https://github.com/deepset-ai/FARM/blob/master/experiments/german-bert2.0-eval/germEval18Coarse_config.json)"
![GermEval18 Coarse Model Evaluation](https://raw.githubusercontent.com/German-NLP-Group/german-transformer-training/master/model_cards/model_eval.png)
# Checkpoint evaluation
Since it it not guaranteed that the last checkpoint is the best, we evaluated the checkpoints on GermEval18. We found that the last checkpoint is indeed the best. The training was stable and did not overfit the text corpus. Below is a boxplot chart showing the different checkpoints.
![Checkpoint Evaluation on GermEval18](https://raw.githubusercontent.com/German-NLP-Group/german-transformer-training/master/model_cards/checkpoint_eval.png)
# Pre-training details
## Data
- Cleaned Common Crawl Corpus 2019-09 German: [CC_net](https://github.com/facebookresearch/cc_net) (Only head coprus and filtered for language_score > 0.98) - 62 GB
- German Wikipedia Article Pages Dump (20200701) - 5.5 GB
- German Wikipedia Talk Pages Dump (20200620) - 1.1 GB
- Subtitles - 823 MB
- News 2018 - 4.1 GB
The sentences were split with [SojaMo](https://github.com/tsproisl/SoMaJo). We took the German Wikipedia Article Pages Dump 3x to oversample. This approach was also used in a similar way in GPT-3 (Table 2.2).
More Details can be found here [Preperaing Datasets for German Electra Github](https://github.com/German-NLP-Group/german-transformer-training)
## Electra Branch no_strip_accents
Because we do not want to stip accents in our training data we made a change to Electra and used this repo [Electra no_strip_accents](https://github.com/PhilipMay/electra/tree/no_strip_accents) (branch `no_strip_accents`). Then created the tf dataset with:
```bash
python build_pretraining_dataset.py --corpus-dir <corpus_dir> --vocab-file <dir>/vocab.txt --output-dir ./tf_data --max-seq-length 512 --num-processes 8 --do-lower-case --no-strip-accents
```
## The training
The training itself can be performed with the Original Electra Repo (No special case for this needed).
We run it with the following Config:
<details>
<summary>The exact Training Config</summary>
<br/>debug False
<br/>disallow_correct False
<br/>disc_weight 50.0
<br/>do_eval False
<br/>do_lower_case True
<br/>do_train True
<br/>electra_objective True
<br/>embedding_size 768
<br/>eval_batch_size 128
<br/>gcp_project None
<br/>gen_weight 1.0
<br/>generator_hidden_size 0.33333
<br/>generator_layers 1.0
<br/>iterations_per_loop 200
<br/>keep_checkpoint_max 0
<br/>learning_rate 0.0002
<br/>lr_decay_power 1.0
<br/>mask_prob 0.15
<br/>max_predictions_per_seq 79
<br/>max_seq_length 512
<br/>model_dir gs://XXX
<br/>model_hparam_overrides {}
<br/>model_name 02_Electra_Checkpoints_32k_766k_Combined
<br/>model_size base
<br/>num_eval_steps 100
<br/>num_tpu_cores 8
<br/>num_train_steps 766000
<br/>num_warmup_steps 10000
<br/>pretrain_tfrecords gs://XXX
<br/>results_pkl gs://XXX
<br/>results_txt gs://XXX
<br/>save_checkpoints_steps 5000
<br/>temperature 1.0
<br/>tpu_job_name None
<br/>tpu_name electrav5
<br/>tpu_zone None
<br/>train_batch_size 256
<br/>uniform_generator False
<br/>untied_generator True
<br/>untied_generator_embeddings False
<br/>use_tpu True
<br/>vocab_file gs://XXX
<br/>vocab_size 32767
<br/>weight_decay_rate 0.01
</details>
![Training Loss](https://raw.githubusercontent.com/German-NLP-Group/german-transformer-training/master/model_cards/loss.png)
Please Note: *Due to the GAN like strucutre of Electra the loss is not that meaningful*
It took about 7 Days on a preemtible TPU V3-8. In total, the Model went through approximately 10 Epochs. For an automatically recreation of a cancelled TPUs we used [tpunicorn](https://github.com/shawwn/tpunicorn). The total cost of training summed up to about 450 $ for one run. The Data-pre processing and Vocab Creation needed approximately 500-1000 CPU hours. Servers were fully provided by [T-Systems on site services GmbH](https://www.t-systems-onsite.de/), [ambeRoad](https://amberoad.de/).
Special thanks to [Stefan Schweter](https://github.com/stefan-it) for your feedback and providing parts of the text corpus.
[¹]: Source for the picture [Pinterest](https://www.pinterest.cl/pin/371828512984142193/)
# Negative Results
We tried the following approaches which we found had no positive influence:
- **Increased Vocab Size**: Leads to more parameters and thus reduced examples/sec while no visible Performance gains were measured
- **Decreased Batch-Size**: The original Electra was trained with a Batch Size per TPU Core of 16 whereas this Model was trained with 32 BS / TPU Core. We found out that 32 BS leads to better results when you compare metrics over computation time
# StackOBERTflow-comments-small
StackOBERTflow is a RoBERTa model trained on StackOverflow comments.
A Byte-level BPE tokenizer with dropout was used (using the `tokenizers` package).
The model is *small*, i.e. has only 6-layers and the maximum sequence length was restricted to 256 tokens.
The model was trained for 6 epochs on several GBs of comments from the StackOverflow corpus.
## Quick start: masked language modeling prediction
```python
from transformers import pipeline
from pprint import pprint
COMMENT = "You really should not do it this way, I would use <mask> instead."
fill_mask = pipeline(
"fill-mask",
model="giganticode/StackOBERTflow-comments-small-v1",
tokenizer="giganticode/StackOBERTflow-comments-small-v1"
)
pprint(fill_mask(COMMENT))
# [{'score': 0.019997311756014824,
# 'sequence': '<s> You really should not do it this way, I would use jQuery instead.</s>',
# 'token': 1738},
# {'score': 0.01693696901202202,
# 'sequence': '<s> You really should not do it this way, I would use arrays instead.</s>',
# 'token': 2844},
# {'score': 0.013411642983555794,
# 'sequence': '<s> You really should not do it this way, I would use CSS instead.</s>',
# 'token': 2254},
# {'score': 0.013224546797573566,
# 'sequence': '<s> You really should not do it this way, I would use it instead.</s>',
# 'token': 300},
# {'score': 0.011984303593635559,
# 'sequence': '<s> You really should not do it this way, I would use classes instead.</s>',
# 'token': 1779}]
```
---
language: fr
widget:
- text: "Face à un choc inédit, les mesures mises en place par le gouvernement ont permis une protection forte et efficace des ménages"
---
## About
The *french-camembert-postag-model* is a part of speech tagging model for French that was trained on the *free-french-treebank* dataset available on
[github](https://github.com/nicolashernandez/free-french-treebank). The base tokenizer and model used for training is *'camembert-base'*.
## Supported Tags
It uses the following tags:
| Tag | Category | Extra Info |
|----------|:------------------------------:|------------:|
| ADJ | adjectif | |
| ADJWH | adjectif | |
| ADV | adverbe | |
| ADVWH | adverbe | |
| CC | conjonction de coordination | |
| CLO | pronom | obj |
| CLR | pronom | refl |
| CLS | pronom | suj |
| CS | conjonction de subordination | |
| DET | déterminant | |
| DETWH | déterminant | |
| ET | mot étranger | |
| I | interjection | |
| NC | nom commun | |
| NPP | nom propre | |
| P | préposition | |
| P+D | préposition + déterminant | |
| PONCT | signe de ponctuation | |
| PREF | préfixe | |
| PRO | autres pronoms | |
| PROREL | autres pronoms | rel |
| PROWH | autres pronoms | int |
| U | ? | |
| V | verbe | |
| VIMP | verbe imperatif | |
| VINF | verbe infinitif | |
| VPP | participe passé | |
| VPR | participe présent | |
| VS | subjonctif | |
More information on the tags can be found here:
http://alpage.inria.fr/statgram/frdep/Publications/crabbecandi-taln2008-final.pdf
## Usage
The usage of this model follows the common transformers patterns. Here is a short example of its usage:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("gilf/french-camembert-postag-model")
model = AutoModelForTokenClassification.from_pretrained("gilf/french-camembert-postag-model")
from transformers import pipeline
nlp_token_class = pipeline('ner', model=model, tokenizer=tokenizer, grouped_entities=True)
nlp_token_class('Face à un choc inédit, les mesures mises en place par le gouvernement ont permis une protection forte et efficace des ménages')
```
The lines above would display something like this on a Jupyter notebook:
```
[{'entity_group': 'NC', 'score': 0.5760144591331482, 'word': '<s>'},
{'entity_group': 'U', 'score': 0.9946700930595398, 'word': 'Face'},
{'entity_group': 'P', 'score': 0.999615490436554, 'word': 'à'},
{'entity_group': 'DET', 'score': 0.9995906352996826, 'word': 'un'},
{'entity_group': 'NC', 'score': 0.9995531439781189, 'word': 'choc'},
{'entity_group': 'ADJ', 'score': 0.999183714389801, 'word': 'inédit'},
{'entity_group': 'P', 'score': 0.3710663616657257, 'word': ','},
{'entity_group': 'DET', 'score': 0.9995903968811035, 'word': 'les'},
{'entity_group': 'NC', 'score': 0.9995649456977844, 'word': 'mesures'},
{'entity_group': 'VPP', 'score': 0.9988670349121094, 'word': 'mises'},
{'entity_group': 'P', 'score': 0.9996246099472046, 'word': 'en'},
{'entity_group': 'NC', 'score': 0.9995329976081848, 'word': 'place'},
{'entity_group': 'P', 'score': 0.9996233582496643, 'word': 'par'},
{'entity_group': 'DET', 'score': 0.9995935559272766, 'word': 'le'},
{'entity_group': 'NC', 'score': 0.9995369911193848, 'word': 'gouvernement'},
{'entity_group': 'V', 'score': 0.9993771314620972, 'word': 'ont'},
{'entity_group': 'VPP', 'score': 0.9991101026535034, 'word': 'permis'},
{'entity_group': 'DET', 'score': 0.9995885491371155, 'word': 'une'},
{'entity_group': 'NC', 'score': 0.9995636343955994, 'word': 'protection'},
{'entity_group': 'ADJ', 'score': 0.9991781711578369, 'word': 'forte'},
{'entity_group': 'CC', 'score': 0.9991298317909241, 'word': 'et'},
{'entity_group': 'ADJ', 'score': 0.9992275238037109, 'word': 'efficace'},
{'entity_group': 'P+D', 'score': 0.9993300437927246, 'word': 'des'},
{'entity_group': 'NC', 'score': 0.8353511393070221, 'word': 'ménages</s>'}]
```
## About
The *french-postag-model* is a part of speech tagging model for French that was trained on the *free-french-treebank* dataset available on
[github](https://github.com/nicolashernandez/free-french-treebank). The base tokenizer and model used for training is *'bert-base-multilingual-cased'*.
## Supported Tags
It uses the following tags:
| Tag | Category | Extra Info |
|----------|:------------------------------:|------------:|
| ADJ | adjectif | |
| ADJWH | adjectif | |
| ADV | adverbe | |
| ADVWH | adverbe | |
| CC | conjonction de coordination | |
| CLO | pronom | obj |
| CLR | pronom | refl |
| CLS | pronom | suj |
| CS | conjonction de subordination | |
| DET | déterminant | |
| DETWH | déterminant | |
| ET | mot étranger | |
| I | interjection | |
| NC | nom commun | |
| NPP | nom propre | |
| P | préposition | |
| P+D | préposition + déterminant | |
| PONCT | signe de ponctuation | |
| PREF | préfixe | |
| PRO | autres pronoms | |
| PROREL | autres pronoms | rel |
| PROWH | autres pronoms | int |
| U | ? | |
| V | verbe | |
| VIMP | verbe imperatif | |
| VINF | verbe infinitif | |
| VPP | participe passé | |
| VPR | participe présent | |
| VS | subjonctif | |
More information on the tags can be found here:
http://alpage.inria.fr/statgram/frdep/Publications/crabbecandi-taln2008-final.pdf
## Usage
The usage of this model follows the common transformers patterns. Here is a short example of its usage:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("gilf/french-postag-model")
model = AutoModelForTokenClassification.from_pretrained("gilf/french-postag-model")
from transformers import pipeline
nlp_token_class = pipeline('ner', model=model, tokenizer=tokenizer, grouped_entities=True)
nlp_token_class('Face à un choc inédit, les mesures mises en place par le gouvernement ont permis une protection forte et efficace des ménages')
```
The lines above would display something like this on a Jupyter notebook:
```
[{'entity_group': 'PONCT', 'score': 0.0742340236902237, 'word': '[CLS]'},
{'entity_group': 'U', 'score': 0.9995399713516235, 'word': 'Face'},
{'entity_group': 'P', 'score': 0.9999609589576721, 'word': 'à'},
{'entity_group': 'DET', 'score': 0.9999597072601318, 'word': 'un'},
{'entity_group': 'NC', 'score': 0.9998948276042938, 'word': 'choc'},
{'entity_group': 'ADJ', 'score': 0.995318204164505, 'word': 'inédit'},
{'entity_group': 'PONCT', 'score': 0.9999793171882629, 'word': ','},
{'entity_group': 'DET', 'score': 0.999964714050293, 'word': 'les'},
{'entity_group': 'NC', 'score': 0.999936580657959, 'word': 'mesures'},
{'entity_group': 'VPP', 'score': 0.9995776414871216, 'word': 'mises'},
{'entity_group': 'P', 'score': 0.99996417760849, 'word': 'en'},
{'entity_group': 'NC', 'score': 0.999882161617279, 'word': 'place'},
{'entity_group': 'P', 'score': 0.9999671578407288, 'word': 'par'},
{'entity_group': 'DET', 'score': 0.9999637603759766, 'word': 'le'},
{'entity_group': 'NC', 'score': 0.9999350309371948, 'word': 'gouvernement'},
{'entity_group': 'V', 'score': 0.9999298453330994, 'word': 'ont'},
{'entity_group': 'VPP', 'score': 0.9998740553855896, 'word': 'permis'},
{'entity_group': 'DET', 'score': 0.9999625086784363, 'word': 'une'},
{'entity_group': 'NC', 'score': 0.9999420046806335, 'word': 'protection'},
{'entity_group': 'ADJ', 'score': 0.9998913407325745, 'word': 'forte'},
{'entity_group': 'CC', 'score': 0.9998615980148315, 'word': 'et'},
{'entity_group': 'ADJ', 'score': 0.9998483657836914, 'word': 'efficace'},
{'entity_group': 'P+D', 'score': 0.9987645149230957, 'word': 'des'},
{'entity_group': 'NC', 'score': 0.8720395267009735, 'word': 'ménages [SEP]'}]
```
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment