Commit 3552d0e0 authored by Julien Chaumond, committed by GitHub

[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)



* rm all model cards

* Update the .rst

@sgugger it is still not super crystal clear/streamlined so let me know if any ideas to make it simpler

* Add a rootlevel README.md with simple instructions/context

* Update docs/source/model_sharing.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* make style

* rm all model cards
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
parent 29e45979
---
language: de
thumbnail: kfold.png
---
# German BERT for literary texts
This German BERT is based on `bert-base-german-dbmdz-cased`, and has been adapted to the domain of literary texts by fine-tuning the language modeling task on the [Corpus of German-Language Fiction](https://figshare.com/articles/Corpus_of_German-Language_Fiction_txt_/4524680/1). Afterwards the model was fine-tuned for named entity recognition on the [DROC](https://gitlab2.informatik.uni-wuerzburg.de/kallimachos/DROC-Release) corpus, so you can use it to recognize protagonists in German novels.
# Stats
## Language modeling
The [Corpus of German-Language Fiction](https://figshare.com/articles/Corpus_of_German-Language_Fiction_txt_/4524680/1) consists of 3,194 documents with 203,516,988 tokens or 1,520,855 types. The publication year of the texts ranges from the 18th to the 20th century:
![years](prosa-jahre.png)
### Results
After one epoch:
| Model | Perplexity |
| ---------------- | ---------- |
| Vanilla BERT | 6.82 |
| Fine-tuned BERT | 4.98 |
## Named entity recognition
The provided model was also fine-tuned for named entity recognition for two epochs. 10,799 sentences were used for training, 547 for validation and 1,845 for testing, with three labels: `B-PER`, `I-PER` and `O`.
### Results
| Dataset | Precision | Recall | F1 |
| ------- | --------- | ------ | ---- |
| Dev | 96.4 | 87.3 | 91.6 |
| Test | 92.8 | 94.9 | 93.8 |
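The F1 column is the harmonic mean of the precision and recall columns. As a minimal sketch (the function name `f1_score` is chosen here for illustration and is not part of the model's code), the reported values can be recomputed from the table:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall values copied from the table above (in percent).
dev_f1 = f1_score(96.4, 87.3)
test_f1 = f1_score(92.8, 94.9)
print(round(dev_f1, 1), round(test_f1, 1))  # 91.6 93.8
```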
The model has also been evaluated using 10-fold cross validation and compared with a classic Conditional Random Field baseline described in [Jannidis et al.](https://opus.bibliothek.uni-wuerzburg.de/opus4-wuerzburg/frontdoor/deliver/index/docId/14333/file/Jannidis_Figurenerkennung_Roman.pdf) (2015):
![kfold](kfold.png)
# References
Markus Krug, Lukas Weimer, Isabella Reger, Luisa Macharowsky, Stephan Feldhaus, Frank Puppe, Fotis Jannidis, [Description of a Corpus of Character References in German Novels](http://webdoc.sub.gwdg.de/pub/mon/dariah-de/dwp-2018-27.pdf), 2018.
Fotis Jannidis, Isabella Reger, Lukas Weimer, Markus Krug, Martin Toepfer, Frank Puppe, [Automatische Erkennung von Figuren in deutschsprachigen Romanen](https://opus.bibliothek.uni-wuerzburg.de/opus4-wuerzburg/frontdoor/deliver/index/docId/14333/file/Jannidis_Figurenerkennung_Roman.pdf), 2015.
---
tags:
- chemistry
---
# ChemBERTa: Training a BERT-like transformer model for masked language modelling of chemical SMILES strings.
Deep learning for chemistry and materials science remains a novel field with lots of potential. However, transfer-learning-based methods, which are popular in areas such as NLP and computer vision, have not yet been effectively developed for machine learning in computational chemistry. Using HuggingFace's suite of models and the ByteLevel tokenizer, we are able to train on a large corpus of 100k SMILES strings from a commonly known benchmark dataset, ZINC.
Training RoBERTa over 5 epochs, the model achieves a decent loss of 0.398, which would likely continue to decline if trained for a larger number of epochs. The model can predict tokens within a SMILES sequence/molecule, allowing for variants of a molecule within discoverable chemical space to be predicted.
By applying the representations of functional groups and atoms learned by the model, we can try to tackle problems of toxicity, solubility, drug-likeness, and synthesis accessibility on smaller datasets using the learned representations as features for graph convolution and attention models on the graph structure of molecules, as well as fine-tuning of BERT. Finally, we propose the use of attention visualization as a helpful tool for chemistry practitioners and students to quickly identify important substructures in various chemical properties.
Additionally, previous research has shown visualization of the attention mechanism to be incredibly valuable for chemical reaction classification. The applications of open-sourcing large-scale transformer models such as RoBERTa with HuggingFace may allow for the acceleration of these individual research directions.
A link to a repository which includes the training, uploading and evaluation notebook (with sample predictions on compounds such as Remdesivir) can be found [here](https://github.com/seyonechithrananda/bert-loves-chemistry). All of the notebooks can be copied into a new Colab runtime for easy execution.
Thanks for checking this out!
- Seyone
# ALECTRA-small-OWT
This is an extension of the
[ELECTRA](https://openreview.net/forum?id=r1xMH1BtvB) small model, trained on the
[OpenWebText corpus](https://skylion007.github.io/OpenWebTextCorpus/).
The training task (discriminative LM / replaced-token-detection) can be generalized to any transformer type. Here, we train an ALBERT model under the same scheme.
## Pretraining task
![electra task diagram](https://github.com/shoarora/lmtuners/raw/master/assets/electra.png)
(figure from [Clark et al. 2020](https://openreview.net/pdf?id=r1xMH1BtvB))
ELECTRA uses discriminative LM / replaced-token-detection for pretraining.
This involves a generator (a Masked LM model) creating examples for a discriminator
to classify as original or replaced for each token.
The generator generalizes to any `*ForMaskedLM` model and the discriminator could be
any `*ForTokenClassification` model. Therefore, we can extend the task to ALBERT models,
not just BERT as in the original paper.
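The labelling step of replaced-token detection can be sketched in plain Python. This is only an illustration under stated assumptions: a random sampler stands in for the generator (a masked LM in ELECTRA), and the function name `make_rtd_example`, the toy vocabulary and the `mask_prob` value are hypothetical choices, not taken from the linked training code.

```python
import random

def make_rtd_example(tokens, vocab, mask_prob=0.15, seed=0):
    """Corrupt some tokens and return (corrupted_tokens, labels).

    labels[i] is 1 if position i now holds a different token, else 0.
    A random sampler stands in for the generator (assumption for this sketch).
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            sampled = rng.choice(vocab)  # stand-in for the generator's prediction
            corrupted.append(sampled)
            # If the sampled token equals the original one,
            # the position still counts as "original".
            labels.append(int(sampled != tok))
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

tokens = "the chef cooked the meal".split()
vocab = ["the", "chef", "ate", "cooked", "meal", "a"]
corrupted, labels = make_rtd_example(tokens, vocab, mask_prob=0.5)
print(corrupted, labels)
```

The discriminator is then trained to predict `labels` from `corrupted`, which is a per-token binary classification problem and explains why any token-classification model can fill that role.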
## Usage
```python
from transformers import AlbertForSequenceClassification, BertTokenizer
# Both models use the bert-base-uncased tokenizer and vocab.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
alectra = AlbertForSequenceClassification.from_pretrained('shoarora/alectra-small-owt')
```
NOTE: this ALBERT model uses a BERT WordPiece tokenizer.
## Code
The pytorch module that implements this task is available [here](https://github.com/shoarora/lmtuners/blob/master/lmtuners/lightning_modules/discriminative_lm.py).
Further implementation information [here](https://github.com/shoarora/lmtuners/tree/master/experiments/disc_lm_small),
and [here](https://github.com/shoarora/lmtuners/blob/master/experiments/disc_lm_small/train_alectra_small.py) is the script that created this model.
This specific model was trained with the following params:
- `batch_size: 512`
- `training_steps: 5e5`
- `warmup_steps: 4e4`
- `learning_rate: 2e-3`
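To get a feel for how these values interact, here is a small sketch of a learning-rate schedule. The card does not state the shape of the schedule, so linear warmup followed by linear decay to zero is an assumption made for this example, and `learning_rate_at` is a hypothetical helper.

```python
def learning_rate_at(step, peak_lr=2e-3, warmup_steps=40_000, training_steps=500_000):
    """Linear warmup to peak_lr, then linear decay to zero (assumed schedule)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, training_steps - step)
    return peak_lr * remaining / (training_steps - warmup_steps)

# Learning rate at the start, mid-warmup, the peak, mid-decay and the end.
for step in (0, 20_000, 40_000, 270_000, 500_000):
    print(step, learning_rate_at(step))
```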
## Downstream tasks
#### GLUE Dev results
| Model | # Params | CoLA | SST | MRPC | STS | QQP | MNLI | QNLI | RTE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ELECTRA-Small++ | 14M | 57.0 | 91. | 88.0 | 87.5 | 89.0 | 81.3 | 88.4 | 66.7|
| ELECTRA-Small-OWT | 14M | 56.8 | 88.3| 87.4 | 86.8 | 88.3 | 78.9 | 87.9 | 68.5|
| ELECTRA-Small-OWT (ours) | 17M | 56.3 | 88.4| 75.0 | 86.1 | 89.1 | 77.9 | 83.0 | 67.1|
| ALECTRA-Small-OWT (ours) | 4M | 50.6 | 89.1| 86.3 | 87.2 | 89.1 | 78.2 | 85.9 | 69.6|
#### GLUE Test results
| Model | # Params | CoLA | SST | MRPC | STS | QQP | MNLI | QNLI | RTE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT-Base | 110M | 52.1 | 93.5| 84.8 | 85.9 | 89.2 | 84.6 | 90.5 | 66.4|
| GPT | 117M | 45.4 | 91.3| 75.7 | 80.0 | 88.5 | 82.1 | 88.1 | 56.0|
| ELECTRA-Small++ | 14M | 57.0 | 91.2| 88.0 | 87.5 | 89.0 | 81.3 | 88.4 | 66.7|
| ELECTRA-Small-OWT (ours) | 17M | 57.4 | 89.3| 76.2 | 81.9 | 87.5 | 78.1 | 82.4 | 68.1|
| ALECTRA-Small-OWT (ours) | 4M | 43.9 | 87.9| 82.1 | 82.0 | 87.6 | 77.9 | 85.8 | 67.5|
# ELECTRA-small-OWT
This is an unofficial implementation of an
[ELECTRA](https://openreview.net/forum?id=r1xMH1BtvB) small model, trained on the
[OpenWebText corpus](https://skylion007.github.io/OpenWebTextCorpus/).
Differences from official ELECTRA models:
- we use a `BertForMaskedLM` as the generator and `BertForTokenClassification` as the discriminator
- they use an embedding projection layer, but Bert doesn't have one
## Pretraining task
![electra task diagram](https://github.com/shoarora/lmtuners/raw/master/assets/electra.png)
(figure from [Clark et al. 2020](https://openreview.net/pdf?id=r1xMH1BtvB))
ELECTRA uses discriminative LM / replaced-token-detection for pretraining.
This involves a generator (a Masked LM model) creating examples for a discriminator
to classify as original or replaced for each token.
## Usage
```python
from transformers import BertForSequenceClassification, BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
electra = BertForSequenceClassification.from_pretrained('shoarora/electra-small-owt')
```
## Code
The pytorch module that implements this task is available [here](https://github.com/shoarora/lmtuners/blob/master/lmtuners/lightning_modules/discriminative_lm.py).
Further implementation information [here](https://github.com/shoarora/lmtuners/tree/master/experiments/disc_lm_small),
and [here](https://github.com/shoarora/lmtuners/blob/master/experiments/disc_lm_small/train_electra_small.py) is the script that created this model.
This specific model was trained with the following params:
- `batch_size: 512`
- `training_steps: 5e5`
- `warmup_steps: 4e4`
- `learning_rate: 2e-3`
## Downstream tasks
#### GLUE Dev results
| Model | # Params | CoLA | SST | MRPC | STS | QQP | MNLI | QNLI | RTE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ELECTRA-Small++ | 14M | 57.0 | 91. | 88.0 | 87.5 | 89.0 | 81.3 | 88.4 | 66.7|
| ELECTRA-Small-OWT | 14M | 56.8 | 88.3| 87.4 | 86.8 | 88.3 | 78.9 | 87.9 | 68.5|
| ELECTRA-Small-OWT (ours) | 17M | 56.3 | 88.4| 75.0 | 86.1 | 89.1 | 77.9 | 83.0 | 67.1|
| ALECTRA-Small-OWT (ours) | 4M | 50.6 | 89.1| 86.3 | 87.2 | 89.1 | 78.2 | 85.9 | 69.6|
- Table initialized from [ELECTRA github repo](https://github.com/google-research/electra)
#### GLUE Test results
| Model | # Params | CoLA | SST | MRPC | STS | QQP | MNLI | QNLI | RTE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT-Base | 110M | 52.1 | 93.5| 84.8 | 85.9 | 89.2 | 84.6 | 90.5 | 66.4|
| GPT | 117M | 45.4 | 91.3| 75.7 | 80.0 | 88.5 | 82.1 | 88.1 | 56.0|
| ELECTRA-Small++ | 14M | 57.0 | 91.2| 88.0 | 87.5 | 89.0 | 81.3 | 88.4 | 66.7|
| ELECTRA-Small-OWT (ours) | 17M | 57.4 | 89.3| 76.2 | 81.9 | 87.5 | 78.1 | 82.4 | 68.1|
| ALECTRA-Small-OWT (ours) | 4M | 43.9 | 87.9| 82.1 | 82.0 | 87.6 | 77.9 | 85.8 | 67.5|
# shrugging-grace/tweetclassifier
## Model description
This model classifies tweets as either relating to the Covid-19 pandemic or not.
## Intended uses & limitations
It is intended to be used on tweets commenting on UK politics, in particular those trending with the #PMQs hashtag, which refers to the weekly Prime Minister's Questions.
#### How to use
``LABEL_0`` means that the tweet relates to Covid-19
``LABEL_1`` means that the tweet does not relate to Covid-19
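A minimal sketch of turning the generic labels into readable ones. The prediction below is a made-up example in the `{"label", "score"}` shape that text-classification pipelines return; the names `LABEL_NAMES` and `readable` are hypothetical.

```python
# Mapping taken from the label description above.
LABEL_NAMES = {
    "LABEL_0": "covid-related",
    "LABEL_1": "not covid-related",
}

def readable(prediction: dict) -> dict:
    """Replace the generic label in one prediction with a readable name."""
    return {**prediction, "label": LABEL_NAMES[prediction["label"]]}

# Hypothetical prediction; the score is made up for illustration.
example = {"label": "LABEL_0", "score": 0.98}
print(readable(example))
```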
## Training data
The model was trained on 1000 tweets (with the "#PMQs" hashtag), which were manually labeled by the author. The tweets were collected between May and July 2020.
### BibTeX entry and citation info
This was based on a pretrained version of BERT.
@article{devlin2018bert,
title={Bert: Pre-training of deep bidirectional transformers for language understanding},
author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
journal={arXiv preprint arXiv:1810.04805},
year={2018}
}
---
language: de
tags:
- exbert
- German
---
<a href="https://huggingface.co/exbert/?model=smanjil/German-MedBERT">
<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>
# German Medical BERT
This model is based on German BERT and has been fine-tuned on the medical domain for the German language. It has only been trained on the target task (masked language modeling). It can later be used for a downstream task of your choice; I used it for the NTS-ICD-10 text classification task.
## Overview
**Language model:** bert-base-german-cased
**Language:** German
**Fine-tuning:** Medical articles (diseases, symptoms, therapies, etc.)
**Eval data:** NTS-ICD-10 dataset (Classification)
**Infrastructure:** Google Colab
## Details
- We fine-tuned using PyTorch with the Hugging Face library on a Colab GPU.
- We used the standard fine-tuning parameter settings from the original BERT paper.
- However, the classification task required training for up to 25 epochs.
## Performance (Micro precision, recall and f1 score for multilabel code classification)
|Models |P |R |F1 |
|:-------------- |:------|:------|:------|
|German BERT |86.04 |75.82 |80.60 |
|German MedBERT-256 |87.41 |77.97 |82.42 |
|German MedBERT-512 |87.75 |78.26 |82.73 |
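For readers unfamiliar with micro-averaging in a multilabel setting: true positives, false positives and false negatives are pooled over all documents and labels before the scores are computed. The sketch below uses toy label sets (the codes are placeholders, not taken from NTS-ICD-10), and `micro_prf` is a hypothetical helper rather than the evaluation code used for the table.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F1 over per-document label sets."""
    tp = fp = fn = 0
    for gold_labels, pred_labels in zip(gold, predicted):
        tp += len(gold_labels & pred_labels)   # correctly predicted codes
        fp += len(pred_labels - gold_labels)   # predicted but not in gold
        fn += len(gold_labels - pred_labels)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy data: three documents with placeholder codes.
gold = [{"A", "B"}, {"C"}, {"A", "D"}]
predicted = [{"A"}, {"C", "D"}, {"A", "D"}]
precision, recall, f1 = micro_prf(gold, predicted)
print(precision, recall, f1)
```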
## Author
Manjil Shrestha: `shresthamanjil21 [at] gmail.com`
Get in touch:
[LinkedIn](https://www.linkedin.com/in/manjil-shrestha-038527b4/)
# DistilBERT Yelp Review Sentiment
This model is used for sentiment analysis on English Yelp reviews.
It is a DistilBERT model trained on 1 million reviews from the Yelp Open Dataset.
It is a regression model, with outputs in the range of ~-2 to ~2, where -2 corresponds to 1 star and 2 to 5 stars.
It was trained using [ktrain](https://github.com/amaiya/ktrain) because of its ease of use.
Example use:
```
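from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained(
    'distilbert-base-uncased', use_fast=True)
model = TFAutoModelForSequenceClassification.from_pretrained(
    "spentaur/yelp")

review = "This place is great!"
input_ids = tokenizer.encode(review, return_tensors='tf')
# The first output holds the logits; take the single regression value.
pred = model(input_ids)[0][0][0].numpy()
# pred should be approximately 1.9562385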
```
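The regression output can be mapped back to a star rating. The card only gives the two endpoints (-2 is 1 star, 2 is 5 stars), so the linear mapping in between is an assumption of this sketch, as is the helper name `score_to_stars`.

```python
def score_to_stars(score: float) -> int:
    """Map a regression output in roughly [-2, 2] to a 1-5 star rating."""
    stars = round(score + 3)      # -2 -> 1 star, 0 -> 3 stars, 2 -> 5 stars
    return max(1, min(5, stars))  # clamp, since outputs can slightly overshoot

print(score_to_stars(1.9562385))  # 5
print(score_to_stars(-2.3))       # 1
```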
---
language: en
license: bsd
datasets:
- bookcorpus
- wikipedia
---
# SqueezeBERT pretrained model
This model, `squeezebert-mnli-headless`, has been pretrained for the English language using a masked language modeling (MLM) and Sentence Order Prediction (SOP) objective and finetuned on the [Multi-Genre Natural Language Inference (MNLI)](https://cims.nyu.edu/~sbowman/multinli/) dataset. This is a "headless" model with the final classification layer removed, and this will allow Transformers to automatically reinitialize the final classification layer before you begin finetuning on your data.
SqueezeBERT was introduced in [this paper](https://arxiv.org/abs/2006.11316). This model is case-insensitive. The model architecture is similar to BERT-base, but with the pointwise fully-connected layers replaced with [grouped convolutions](https://blog.yani.io/filter-group-tutorial/).
The authors found that SqueezeBERT is 4.3x faster than `bert-base-uncased` on a Google Pixel 3 smartphone.
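To illustrate why replacing pointwise fully-connected layers with grouped convolutions saves computation, the sketch below counts the weights of a kernel-size-1 convolution with and without groups. The hidden size of 768 is that of BERT-base; `groups=4` is an illustrative choice for this example, and `pointwise_params` is a hypothetical helper.

```python
def pointwise_params(channels_in: int, channels_out: int, groups: int = 1) -> int:
    """Weight count of a kernel-size-1 convolution (bias ignored).

    With groups=1 this equals a position-wise fully-connected layer.
    Each group connects channels_in/groups inputs to channels_out/groups outputs.
    """
    assert channels_in % groups == 0 and channels_out % groups == 0
    return groups * (channels_in // groups) * (channels_out // groups)

hidden = 768  # BERT-base hidden size
dense = pointwise_params(hidden, hidden)
grouped = pointwise_params(hidden, hidden, groups=4)  # illustrative group count
print(dense, grouped, dense // grouped)  # 589824 147456 4
```

With `g` groups, the layer needs `g` times fewer weights and multiply-adds than its dense counterpart.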
## Pretraining
### Pretraining data
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of thousands of unpublished books
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia)
### Pretraining procedure
The model is pretrained using the Masked Language Model (MLM) and Sentence Order Prediction (SOP) tasks.
(Author's note: If you decide to pretrain your own model, and you prefer to train with MLM only, that should work too.)
From the SqueezeBERT paper:
> We pretrain SqueezeBERT from scratch (without distillation) using the [LAMB](https://arxiv.org/abs/1904.00962) optimizer, and we employ the hyperparameters recommended by the LAMB authors: a global batch size of 8192, a learning rate of 2.5e-3, and a warmup proportion of 0.28. Following the LAMB paper's recommendations, we pretrain for 56k steps with a maximum sequence length of 128 and then for 6k steps with a maximum sequence length of 512.
## Finetuning
The SqueezeBERT paper presents 2 approaches to finetuning the model:
- "finetuning without bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on each GLUE task
- "finetuning with bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on MNLI with distillation from a teacher model. Then, use the MNLI-finetuned SqueezeBERT model as a student model to finetune on each of the other GLUE tasks (e.g. RTE, MRPC, …) with distillation from a task-specific teacher model.
A detailed discussion of the hyperparameters used for finetuning is provided in the appendix of the [SqueezeBERT paper](https://arxiv.org/abs/2006.11316).
Note that finetuning SqueezeBERT with distillation is not yet implemented in this repo. If the author (Forrest Iandola - forrest.dnn@gmail.com) gets enough encouragement from the user community, he will add example code to Transformers for finetuning SqueezeBERT with distillation.
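For orientation only, here is a sketch of one common distillation objective: the cross-entropy between the student's predictions and the teacher's soft targets. This is an assumption for illustration, not the loss used in the SqueezeBERT paper (see its appendix for the actual setup); the logits are made up and the helper names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student distribution against the teacher's soft targets."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))

teacher_logits = [2.0, 0.5, -1.0]  # made-up logits for a 3-class task such as MNLI
matching = soft_cross_entropy([2.0, 0.5, -1.0], teacher_logits)
mismatched = soft_cross_entropy([-1.0, 0.5, 2.0], teacher_logits)
print(matching, mismatched)  # the mismatched student incurs the larger loss
```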
This model, `squeezebert/squeezebert-mnli-headless`, is the "finetuned with bells and whistles" MNLI-finetuned SqueezeBERT model. In this particular model, we have removed the final classification layer -- in other words, it is "headless." We recommend using this model if you intend to finetune the model on your own data. Using this model means that your final layer will automatically be reinitialized when you start finetuning on your data.
### How to finetune
To try finetuning SqueezeBERT on the [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) text classification task, you can run the following command:
```
./utils/download_glue_data.py
python examples/text-classification/run_glue.py \
--model_name_or_path squeezebert/squeezebert-mnli-headless \
--task_name mrpc \
--data_dir ./glue_data/MRPC \
--output_dir ./models/squeezebert_mrpc \
--overwrite_output_dir \
--do_train \
--do_eval \
--num_train_epochs 10 \
--learning_rate 3e-05 \
--per_device_train_batch_size 16 \
--save_steps 20000
```
## BibTeX entry and citation info
```
@article{2020_SqueezeBERT,
author = {Forrest N. Iandola and Albert E. Shaw and Ravi Krishna and Kurt W. Keutzer},
title = {{SqueezeBERT}: What can computer vision teach NLP about efficient neural networks?},
journal = {arXiv:2006.11316},
year = {2020}
}
```
---
language: en
license: bsd
datasets:
- bookcorpus
- wikipedia
---
# SqueezeBERT pretrained model
This model, `squeezebert-mnli`, has been pretrained for the English language using a masked language modeling (MLM) and Sentence Order Prediction (SOP) objective and finetuned on the [Multi-Genre Natural Language Inference (MNLI)](https://cims.nyu.edu/~sbowman/multinli/) dataset.
SqueezeBERT was introduced in [this paper](https://arxiv.org/abs/2006.11316). This model is case-insensitive. The model architecture is similar to BERT-base, but with the pointwise fully-connected layers replaced with [grouped convolutions](https://blog.yani.io/filter-group-tutorial/).
The authors found that SqueezeBERT is 4.3x faster than `bert-base-uncased` on a Google Pixel 3 smartphone.
## Pretraining
### Pretraining data
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of thousands of unpublished books
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia)
### Pretraining procedure
The model is pretrained using the Masked Language Model (MLM) and Sentence Order Prediction (SOP) tasks.
(Author's note: If you decide to pretrain your own model, and you prefer to train with MLM only, that should work too.)
From the SqueezeBERT paper:
> We pretrain SqueezeBERT from scratch (without distillation) using the [LAMB](https://arxiv.org/abs/1904.00962) optimizer, and we employ the hyperparameters recommended by the LAMB authors: a global batch size of 8192, a learning rate of 2.5e-3, and a warmup proportion of 0.28. Following the LAMB paper's recommendations, we pretrain for 56k steps with a maximum sequence length of 128 and then for 6k steps with a maximum sequence length of 512.
## Finetuning
The SqueezeBERT paper presents 2 approaches to finetuning the model:
- "finetuning without bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on each GLUE task
- "finetuning with bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on MNLI with distillation from a teacher model. Then, use the MNLI-finetuned SqueezeBERT model as a student model to finetune on each of the other GLUE tasks (e.g. RTE, MRPC, …) with distillation from a task-specific teacher model.
A detailed discussion of the hyperparameters used for finetuning is provided in the appendix of the [SqueezeBERT paper](https://arxiv.org/abs/2006.11316).
Note that finetuning SqueezeBERT with distillation is not yet implemented in this repo. If the author (Forrest Iandola - forrest.dnn@gmail.com) gets enough encouragement from the user community, he will add example code to Transformers for finetuning SqueezeBERT with distillation.
This model, `squeezebert/squeezebert-mnli`, is the "finetuned with bells and whistles" MNLI-finetuned SqueezeBERT model.
### How to finetune
To try finetuning SqueezeBERT on the [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) text classification task, you can run the following command:
```
./utils/download_glue_data.py
python examples/text-classification/run_glue.py \
--model_name_or_path squeezebert/squeezebert-mnli-headless \
--task_name mrpc \
--data_dir ./glue_data/MRPC \
--output_dir ./models/squeezebert_mrpc \
--overwrite_output_dir \
--do_train \
--do_eval \
--num_train_epochs 10 \
--learning_rate 3e-05 \
--per_device_train_batch_size 16 \
--save_steps 20000
```
## BibTeX entry and citation info
```
@article{2020_SqueezeBERT,
author = {Forrest N. Iandola and Albert E. Shaw and Ravi Krishna and Kurt W. Keutzer},
title = {{SqueezeBERT}: What can computer vision teach NLP about efficient neural networks?},
journal = {arXiv:2006.11316},
year = {2020}
}
```
---
language: en
license: bsd
datasets:
- bookcorpus
- wikipedia
---
# SqueezeBERT pretrained model
This model, `squeezebert-uncased`, is a pretrained model for the English language using a masked language modeling (MLM) and Sentence Order Prediction (SOP) objective.
SqueezeBERT was introduced in [this paper](https://arxiv.org/abs/2006.11316). This model is case-insensitive. The model architecture is similar to BERT-base, but with the pointwise fully-connected layers replaced with [grouped convolutions](https://blog.yani.io/filter-group-tutorial/).
The authors found that SqueezeBERT is 4.3x faster than `bert-base-uncased` on a Google Pixel 3 smartphone.
## Pretraining
### Pretraining data
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of thousands of unpublished books
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia)
### Pretraining procedure
The model is pretrained using the Masked Language Model (MLM) and Sentence Order Prediction (SOP) tasks.
(Author's note: If you decide to pretrain your own model, and you prefer to train with MLM only, that should work too.)
From the SqueezeBERT paper:
> We pretrain SqueezeBERT from scratch (without distillation) using the [LAMB](https://arxiv.org/abs/1904.00962) optimizer, and we employ the hyperparameters recommended by the LAMB authors: a global batch size of 8192, a learning rate of 2.5e-3, and a warmup proportion of 0.28. Following the LAMB paper's recommendations, we pretrain for 56k steps with a maximum sequence length of 128 and then for 6k steps with a maximum sequence length of 512.
## Finetuning
The SqueezeBERT paper presents 2 approaches to finetuning the model:
- "finetuning without bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on each GLUE task
- "finetuning with bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on MNLI with distillation from a teacher model. Then, use the MNLI-finetuned SqueezeBERT model as a student model to finetune on each of the other GLUE tasks (e.g. RTE, MRPC, …) with distillation from a task-specific teacher model.
A detailed discussion of the hyperparameters used for finetuning is provided in the appendix of the [SqueezeBERT paper](https://arxiv.org/abs/2006.11316).
Note that finetuning SqueezeBERT with distillation is not yet implemented in this repo. If the author (Forrest Iandola - forrest.dnn@gmail.com) gets enough encouragement from the user community, he will add example code to Transformers for finetuning SqueezeBERT with distillation.
This model, `squeezebert/squeezebert-uncased`, has been pretrained but not finetuned. For most text classification tasks, we recommend using squeezebert-mnli-headless as a starting point.
### How to finetune
To try finetuning SqueezeBERT on the [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) text classification task, you can run the following command:
```
./utils/download_glue_data.py
python examples/text-classification/run_glue.py \
--model_name_or_path squeezebert/squeezebert-mnli-headless \
--task_name mrpc \
--data_dir ./glue_data/MRPC \
--output_dir ./models/squeezebert_mrpc \
--overwrite_output_dir \
--do_train \
--do_eval \
--num_train_epochs 10 \
--learning_rate 3e-05 \
--per_device_train_batch_size 16 \
--save_steps 20000
```
## BibTeX entry and citation info
```
@article{2020_SqueezeBERT,
author = {Forrest N. Iandola and Albert E. Shaw and Ravi Krishna and Kurt W. Keutzer},
title = {{SqueezeBERT}: What can computer vision teach NLP about efficient neural networks?},
journal = {arXiv:2006.11316},
year = {2020}
}
```
---
language:
- en
- de
thumbnail:
tags:
- wmt19
- testing
license: apache-2.0
datasets:
- wmt19
metrics:
- bleu
---
# Tiny FSMT
This is a tiny model that is used in the `transformers` test suite. It doesn't do anything useful, other than testing that `FSMT` works.
---
language: "en"
thumbnail: "https://raw.githubusercontent.com/stevhliu/satsuma/master/images/astroGPT-thumbnail.png"
widget:
- text: "Jan 18, 2020"
- text: "Feb 14, 2020"
- text: "Jul 04, 2020"
---
# astroGPT 🪐
## Model description
This is a GPT-2 model fine-tuned on Western zodiac signs. For more information about GPT-2, take a look at 🤗 Hugging Face's GPT-2 [model card](https://huggingface.co/gpt2). You can use astroGPT to generate a daily horoscope by entering the current date.
## How to use
To use this model, simply enter the current date like so `Mon DD, YEAR`:
```python
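import torch
from transformers import AutoTokenizer, AutoModelWithLMHead

# Use the GPU when available; model and inputs must live on the same device.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = AutoTokenizer.from_pretrained("stevhliu/astroGPT")
model = AutoModelWithLMHead.from_pretrained("stevhliu/astroGPT").to(device)

input_ids = tokenizer.encode('Sep 03, 2020', return_tensors='pt').to(device)

# Sample with top-k and nucleus (top-p) filtering.
sample_output = model.generate(input_ids,
                               do_sample=True,
                               max_length=75,
                               top_k=20,
                               top_p=0.97)

# Decode the generated token ids back into text.
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))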
```
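The `Mon DD, YEAR` prompt can be produced with the standard library. A small sketch, where `horoscope_prompt` is a hypothetical helper name:

```python
from datetime import date

def horoscope_prompt(day: date) -> str:
    """Format a date as `Mon DD, YEAR`, e.g. 'Sep 03, 2020'."""
    # %b is the abbreviated month name; it is English under the default C locale.
    return day.strftime("%b %d, %Y")

print(horoscope_prompt(date(2020, 9, 3)))  # Sep 03, 2020
print(horoscope_prompt(date.today()))
```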
## Limitations and bias
astroGPT inherits the same biases that affect GPT-2 as a result of training on a lot of non-neutral content on the internet. The model does not currently support zodiac sign-specific generation and only returns a general horoscope. While the generated text may occasionally mention a specific zodiac sign, this is due to how the horoscopes were originally written by their human authors.
## Data
The data was scraped from [Horoscope.com](https://www.horoscope.com/us/index.aspx), and the model was trained on 4.7MB of text. The text was collected from four categories (daily, love, wellness, career) and spans from 09/01/2019 to 08/01/2020. The archives only store horoscopes dating a year back from the current date.
## Training and results
The text was tokenized using the fast GPT-2 BPE [tokenizer](https://huggingface.co/transformers/model_doc/gpt2.html#gpt2tokenizerfast). It has a vocabulary size of 50,257 and a sequence length of 1024 tokens. The model was trained on one of Google Colaboratory's GPUs for approximately 2.5 hrs with [fastai's](https://docs.fast.ai/) learning rate finder, discriminative learning rates and 1cycle policy. See the table below for a quick summary of the training procedure and results.
| dataset size | epochs | lr | training time | train_loss | valid_loss | perplexity |
|:-------------:|:------:|:-----------------:|:-------------:|:----------:|:----------:|:----------:|
| 5.9MB |32 | slice(1e-7,1e-5) | 2.5 hrs | 2.657170 | 2.642387 | 14.046692 |
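The perplexity column is the exponential of the validation loss (mean cross-entropy in nats), which can be checked directly against the values in the table:

```python
import math

valid_loss = 2.642387              # valid_loss from the table above
perplexity = math.exp(valid_loss)  # perplexity = exp(mean cross-entropy in nats)
print(round(perplexity, 3))        # 14.047
```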
---
language:
- hi
- sa
- gu
tags:
- Indic
license: mit
datasets:
- Wikipedia (Hindi, Sanskrit, Gujarati)
metrics:
- perplexity
---
# RoBERTa-hindi-guj-san
## Model description
Multilingual RoBERTa-like model trained on Wikipedia articles in Hindi, Sanskrit and Gujarati. The tokenizer was trained on the combined text.
However, only Hindi text was used to pre-train the model, which was then fine-tuned on the combined Sanskrit and Gujarati text, in the hope that pre-training with Hindi
will help the model learn similar languages.
### Configuration
| Parameter | Value |
|---|---|
| `hidden_size` | 768 |
| `num_attention_heads` | 12 |
| `num_hidden_layers` | 6 |
| `vocab_size` | 30522 |
|`model_type`|`roberta`|
## Intended uses & limitations
#### How to use
```python
# Example usage
from transformers import AutoTokenizer, AutoModelWithLMHead, pipeline
tokenizer = AutoTokenizer.from_pretrained("surajp/RoBERTa-hindi-guj-san")
model = AutoModelWithLMHead.from_pretrained("surajp/RoBERTa-hindi-guj-san")
fill_mask = pipeline(
"fill-mask",
model=model,
tokenizer=tokenizer
)
# Sanskrit: इयं भाषा न केवलं भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।
# Hindi: अगर आप अब अभ्यास नहीं करते हो तो आप अपने परीक्षा में मूर्खतापूर्ण गलतियाँ करोगे।
# Gujarati: ગુજરાતમાં ૧૯મી માર્ચ સુધી કોઈ સકારાત્મક (પોઝીટીવ) રીપોર્ટ આવ્યો <mask> હતો.
fill_mask("ગુજરાતમાં ૧૯મી માર્ચ સુધી કોઈ સકારાત્મક (પોઝીટીવ) રીપોર્ટ આવ્યો <mask> હતો.")
'''
Output:
--------
[
{'score': 0.07849744707345963, 'sequence': '<s> ગુજરાતમાં ૧૯મી માર્ચ સુધી કોઈ સકારાત્મક (પોઝીટીવ) રીપોર્ટ આવ્યો જ હતો.</s>', 'token': 390},
{'score': 0.06273336708545685, 'sequence': '<s> ગુજરાતમાં ૧૯મી માર્ચ સુધી કોઈ સકારાત્મક (પોઝીટીવ) રીપોર્ટ આવ્યો ન હતો.</s>', 'token': 478},
{'score': 0.05160355195403099, 'sequence': '<s> ગુજરાતમાં ૧૯મી માર્ચ સુધી કોઈ સકારાત્મક (પોઝીટીવ) રીપોર્ટ આવ્યો થઇ હતો.</s>', 'token': 2075},
{'score': 0.04751499369740486, 'sequence': '<s> ગુજરાતમાં ૧૯મી માર્ચ સુધી કોઈ સકારાત્મક (પોઝીટીવ) રીપોર્ટ આવ્યો એક હતો.</s>', 'token': 600},
{'score': 0.03788900747895241, 'sequence': '<s> ગુજરાતમાં ૧૯મી માર્ચ સુધી કોઈ સકારાત્મક (પોઝીટીવ) રીપોર્ટ આવ્યો પણ હતો.</s>', 'token': 840}
]
'''
```
## Training data
Cleaned Wikipedia articles in Hindi, Sanskrit and Gujarati, available on Kaggle. The data contains training as well as evaluation text.
Used in [iNLTK](https://github.com/goru001/inltk)
- [Hindi](https://www.kaggle.com/disisbig/hindi-wikipedia-articles-172k)
- [Gujarati](https://www.kaggle.com/disisbig/gujarati-wikipedia-articles)
- [Sanskrit](https://www.kaggle.com/disisbig/sanskrit-wikipedia-articles)
## Training procedure
- On TPU (using `xla_spawn.py`)
- For language modelling
- Iteratively increasing `--block_size` from 128 to 256 over epochs
- Tokenizer trained on combined text
- Pre-training with Hindi and fine-tuning on Sanskrit and Gujarati texts
```
--model_type distillroberta-base \
--model_name_or_path "/content/SanHiGujBERTa" \
--mlm_probability 0.20 \
--line_by_line \
--save_total_limit 2 \
--per_device_train_batch_size 128 \
--per_device_eval_batch_size 128 \
--num_train_epochs 5 \
--block_size 256 \
--seed 108 \
--overwrite_output_dir \
```
## Eval results
perplexity = 2.920005983224673
> Created by [Suraj Parmar/@parmarsuraj99](https://twitter.com/parmarsuraj99) | [LinkedIn](https://www.linkedin.com/in/parmarsuraj99/)
> Made with <span style="color: #e25555;">&hearts;</span> in India
---
language: sa
---
# RoBERTa trained on Sanskrit (SanBERTa)
**Model size** (after training): **340MB**
### Dataset:
[Wikipedia articles](https://www.kaggle.com/disisbig/sanskrit-wikipedia-articles) (used in [iNLTK](https://github.com/goru001/nlp-for-sanskrit)).
It contains an evaluation set.
[Sanskrit scraps from CLTK](http://cltk.org/)
### Configuration
| Parameter | Value |
|---|---|
| `num_attention_heads` | 12 |
| `num_hidden_layers` | 6 |
| `hidden_size` | 768 |
| `vocab_size` | 29407 |
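
To make the table concrete, here is a small sketch that derives two quantities from it: the per-head dimension and the size of the token-embedding matrix. The `EncoderConfig` dataclass is a hypothetical stand-in for illustration, not a `transformers` class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Hypothetical container mirroring the configuration table above."""
    num_attention_heads: int = 12
    num_hidden_layers: int = 6
    hidden_size: int = 768
    vocab_size: int = 29407

    @property
    def head_dim(self) -> int:
        # Multi-head attention splits the hidden size evenly across heads.
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")
        return self.hidden_size // self.num_attention_heads

    @property
    def token_embedding_params(self) -> int:
        # One hidden_size-dimensional vector per vocabulary entry.
        return self.vocab_size * self.hidden_size


config = EncoderConfig()
print(config.head_dim, config.token_embedding_params)
```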
### Training:
- On TPU
- For language modelling
- Iteratively increasing `--block_size` from 128 to 256 over epochs
### Evaluation
|Metric|Value|
|---|---|
|Perplexity (`block_size=256`)|4.04|
## Example of usage:
### For Embeddings
```
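from transformers import AutoTokenizer, RobertaModel

# Load the tokenizer and the base RoBERTa encoder (no language-modelling head)
tokenizer = AutoTokenizer.from_pretrained("surajp/SanBERTa")
model = RobertaModel.from_pretrained("surajp/SanBERTa")
# Encode one sentence as a batch of size 1
op = tokenizer.encode("इयं भाषा न केवलं भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।", return_tensors="pt")
ps = model(op)
# Last hidden state: (batch_size, sequence_length, hidden_size)
ps[0].shape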
```
```
Output:
--------
torch.Size([1, 47, 768])
```
### For \<mask\> Prediction
```
from transformers import pipeline
fill_mask = pipeline(
"fill-mask",
model="surajp/SanBERTa",
tokenizer="surajp/SanBERTa"
)
# Original sentence: इयं भाषा न केवलं भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।
fill_mask("इयं भाषा न केवल<mask> भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।")
```
```
Output:
--------
[{'score': 0.7516744136810303,
'sequence': '<s> इयं भाषा न केवलं भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।</s>',
'token': 280,
'token_str': 'à¤Ĥ'},
{'score': 0.06230105459690094,
'sequence': '<s> इयं भाषा न केवली भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।</s>',
'token': 289,
'token_str': 'à¥Ģ'},
{'score': 0.055410224944353104,
'sequence': '<s> इयं भाषा न केवला भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।</s>',
'token': 265,
'token_str': 'ा'},
...]
```
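
The pipeline returns a list of candidate dictionaries. A short sketch of post-processing that list, using the scores and token ids printed above (the `min_score` threshold is an arbitrary value chosen for illustration):

```python
# Scores and token ids copied from the output above; sequences omitted for brevity.
predictions = [
    {"score": 0.7516744136810303, "token": 280},
    {"score": 0.06230105459690094, "token": 289},
    {"score": 0.055410224944353104, "token": 265},
]

# Pick the highest-scoring candidate without assuming the list is pre-sorted.
best = max(predictions, key=lambda p: p["score"])

# Keep only candidates above an illustrative confidence threshold.
min_score = 0.5
confident = [p["token"] for p in predictions if p["score"] >= min_score]

print(best["token"], confident)
```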
### It works!! 🎉 🎉 🎉
> Created by [Suraj Parmar/@parmarsuraj99](https://twitter.com/parmarsuraj99) | [LinkedIn](https://www.linkedin.com/in/parmarsuraj99/)
> Made with <span style="color: #e25555;">&hearts;</span> in India
---
language: sa
---
# ALBERT-base-Sanskrit
Explanation notebook (Colab): [SanskritALBERT.ipynb](https://colab.research.google.com/github/parmarsuraj99/suraj-parmar/blob/master/_notebooks/2020-05-02-SanskritALBERT.ipynb)
Size of the model is **46MB**
Example of usage:
```
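import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("surajp/albert-base-sanskrit")
model = AutoModel.from_pretrained("surajp/albert-base-sanskrit")
# Encode to a plain list of token ids, then decode to check the round trip
enc=tokenizer.encode("ॐ सर्वे भवन्तु सुखिनः सर्वे सन्तु निरामयाः । सर्वे भद्राणि पश्यन्तु मा कश्चिद्दुःखभाग्भवेत् । ॐ शान्तिः शान्तिः शान्तिः ॥")
print(tokenizer.decode(enc))
# unsqueeze(1) gives input shape (28, 1), so each token is fed as its own length-1 sequence;
# use unsqueeze(0) instead to encode the text as one sequence of 28 tokens
ps = model(torch.tensor(enc).unsqueeze(1))
print(ps[0].shape)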
```
```
Output:
--------
[CLS] ॐ सर्वे भवन्तु सुखिनः सर्वे सन्तु निरामयाः । सर्वे भद्राणि पश्यन्तु मा कश्चिद्दुःखभाग्भवेत् । ॐ शान्तिः शान्तिः शान्तिः ॥[SEP]
torch.Size([28, 1, 768])
```
> Created by [Suraj Parmar/@parmarsuraj99](https://twitter.com/parmarsuraj99)
> Made with <span style="color: #e25555;">&hearts;</span> in India
---
language: en
datasets:
- c4
tags:
- summarization
- translation
license: apache-2.0
inference: false
---
## Disclaimer
**Before `transformers` v3.5.0**, due to its immense size, `t5-11b` required some special treatment.
If you're using transformers `<= v3.4.0`, `t5-11b` should be loaded with flag `use_cdn` set to `False` as follows:
```python
import transformers
t5 = transformers.T5ForConditionalGeneration.from_pretrained('t5-11b', use_cdn=False)
```
Secondly, a single GPU will most likely not have enough memory to even load the model, as the weights alone amount to over 40 GB.
Model parallelism has to be used to overcome this problem, as explained in this [PR](https://github.com/huggingface/transformers/pull/3578).
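
As a back-of-the-envelope check on the weight size quoted above, assuming roughly 11 billion parameters stored as 32-bit floats (both are approximations; the exact parameter count is not stated here):

```python
# Assumed values for a rough estimate, not exact figures for the checkpoint.
approx_num_parameters = 11_000_000_000
bytes_per_parameter = 4  # float32

weight_bytes = approx_num_parameters * bytes_per_parameter
weight_gb = weight_bytes / 1e9     # decimal gigabytes
weight_gib = weight_bytes / 2**30  # binary gibibytes

print(f"~{weight_gb:.0f} GB (~{weight_gib:.1f} GiB) for the weights alone")
```

Activations, and optimizer state when fine-tuning, come on top of this estimate.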
## [Google's T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html)
Pretraining Dataset: [C4](https://huggingface.co/datasets/c4)
Other Community Checkpoints: [here](https://huggingface.co/models?search=t5)
Paper: [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf)
Authors: *Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu*
## Abstract
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.
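
To illustrate the text-to-text framing, each task is expressed as an input string with a task prefix, and the answer is produced as a target string. A minimal sketch, assuming the prefix conventions described in the paper (`translate English to German:`, `summarize:`); the helper `to_text_to_text` is hypothetical and only builds input strings, it does not call the model:

```python
def to_text_to_text(task, text):
    """Prepend a T5-style task prefix to the input text (hypothetical helper)."""
    prefixes = {
        "translation_en_to_de": "translate English to German: ",
        "summarization": "summarize: ",
    }
    if task not in prefixes:
        raise KeyError(f"no prefix defined for task {task!r}")
    return prefixes[task] + text


print(to_text_to_text("translation_en_to_de", "That is good."))
print(to_text_to_text("summarization", "Transfer learning has emerged as a powerful technique in NLP."))
```

Because both tasks share the same string-in, string-out interface, a single model and training objective can cover them.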
![model image](https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67)
---
language: en
datasets:
- c4
tags:
- summarization
- translation
license: apache-2.0
---
[Google's T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html)
Pretraining Dataset: [C4](https://huggingface.co/datasets/c4)
Other Community Checkpoints: [here](https://huggingface.co/models?search=t5)
Paper: [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf)
Authors: *Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu*
## Abstract
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.
![model image](https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67)
---
language: en
datasets:
- c4
tags:
- summarization
- translation
license: apache-2.0
---
[Google's T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html)
Pretraining Dataset: [C4](https://huggingface.co/datasets/c4)
Other Community Checkpoints: [here](https://huggingface.co/models?search=t5)
Paper: [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf)
Authors: *Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu*
## Abstract
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.
![model image](https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67)