"examples/research_projects/seq2seq-distillation/callbacks.py" did not exist on "e78c1103385f2d2f9cd4980f61a8e71baa655356"
Unverified Commit 3552d0e0 authored by Julien Chaumond's avatar Julien Chaumond Committed by GitHub
Browse files

[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)



* rm all model cards

* Update the .rst

@sgugger it is still not super crystal clear/streamlined, so let me know if you have any ideas to make it simpler

* Add a rootlevel README.md with simple instructions/context

* Update docs/source/model_sharing.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* make style

* rm all model cards
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
---
language: "c++"
tags:
- exbert
- authorship-identification
- fire2020
- pan2020
- ai-soco
- classification
license: "mit"
datasets:
- ai-soco
metrics:
- accuracy
---
# ai-soco-c++-roberta-tiny-96-clas
## Model description
`ai-soco-c++-roberta-tiny-96` model fine-tuned on the [AI-SOCO](https://sites.google.com/view/ai-soco-2020) task.
#### How to use
You can use the model directly after tokenizing the text using the provided tokenizer with the model files.
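For instance, a minimal sketch using the `transformers` auto classes; the Hub ID `aliosm/ai-soco-c++-roberta-tiny-96-clas` is taken from the ExBERT link below, and the same 4-spaces-to-tab preprocessing used during training is applied:
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "aliosm/ai-soco-c++-roberta-tiny-96-clas"  # ID taken from the ExBERT link below
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

code = "#include <iostream>\nint main() {    std::cout << 42;    return 0;\n}\n"
code = code.replace(" " * 4, "\t")  # the card converts each run of 4 spaces to a tab

inputs = tokenizer(code, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())  # predicted author id
```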
#### Limitations and bias
The model is limited to the C++ programming language only.
## Training data
The model was initialized from the [`ai-soco-c++-roberta-tiny-96`](https://github.com/huggingface/transformers/blob/master/model_cards/aliosm/ai-soco-c++-roberta-tiny-96) model and trained on the [AI-SOCO](https://sites.google.com/view/ai-soco-2020) dataset for text classification.
## Training procedure
The model was trained on the Google Colab platform using a V100 GPU for 10 epochs with a batch size of 16 and a maximum sequence length of 512 (longer sequences were truncated). Each run of 4 consecutive spaces was converted to a single tab character (`\t`) before tokenization.
## Eval results
The model achieved 91.12%/91.02% accuracy on the AI-SOCO task and ranked 7th.
### BibTeX entry and citation info
```bibtex
@inproceedings{ai-soco-2020-fire,
title = "Overview of the {PAN@FIRE} 2020 Task on {Authorship Identification of SOurce COde (AI-SOCO)}",
author = "Fadel, Ali and Musleh, Husam and Tuffaha, Ibraheem and Al-Ayyoub, Mahmoud and Jararweh, Yaser and Benkhelifa, Elhadj and Rosso, Paolo",
booktitle = "Proceedings of The 12th meeting of the Forum for Information Retrieval Evaluation (FIRE 2020)",
year = "2020"
}
```
<a href="https://huggingface.co/exbert/?model=aliosm/ai-soco-c++-roberta-tiny-96-clas">
<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>
---
language: "c++"
tags:
- exbert
- authorship-identification
- fire2020
- pan2020
- ai-soco
license: "mit"
datasets:
- ai-soco
metrics:
- perplexity
---
# ai-soco-c++-roberta-tiny-96
## Model description
A RoBERTa model pre-trained from scratch with 1 layer and 96 attention heads on the [AI-SOCO](https://sites.google.com/view/ai-soco-2020) dataset, which consists of C++ source code crawled from the Codeforces website.
## Intended uses & limitations
The model can be used for code classification, authorship identification and other downstream tasks on the C++ programming language.
#### How to use
You can use the model directly after tokenizing the text using the provided tokenizer with the model files.
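For instance, a minimal sketch with the fill-mask pipeline; the Hub ID `aliosm/ai-soco-c++-roberta-tiny-96` is taken from the ExBERT link below:
```python
from transformers import pipeline

# Hub ID taken from the ExBERT link below
fill_mask = pipeline("fill-mask", model="aliosm/ai-soco-c++-roberta-tiny-96")

# the card notes that runs of 4 spaces were converted to tabs before tokenization
masked = "#include <iostream>\nint main() {\n\tstd::" + fill_mask.tokenizer.mask_token + " << 42;\n}\n"
print(fill_mask(masked))
```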
#### Limitations and bias
The model is limited to the C++ programming language only.
## Training data
The model was initialized randomly and trained on the [AI-SOCO](https://sites.google.com/view/ai-soco-2020) dataset, which contains 100K C++ source codes.
## Training procedure
The model was trained on the Google Colab platform with 8 TPU cores for 200 epochs using a 16\*8 batch size, a maximum sequence length of 512 and the MLM objective. Other parameters were left at the defaults of the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script. Each run of 4 consecutive spaces was converted to a single tab character (`\t`) before tokenization.
### BibTeX entry and citation info
```bibtex
@inproceedings{ai-soco-2020-fire,
title = "Overview of the {PAN@FIRE} 2020 Task on {Authorship Identification of SOurce COde (AI-SOCO)}",
author = "Fadel, Ali and Musleh, Husam and Tuffaha, Ibraheem and Al-Ayyoub, Mahmoud and Jararweh, Yaser and Benkhelifa, Elhadj and Rosso, Paolo",
booktitle = "Proceedings of The 12th meeting of the Forum for Information Retrieval Evaluation (FIRE 2020)",
year = "2020"
}
```
<a href="https://huggingface.co/exbert/?model=aliosm/ai-soco-c++-roberta-tiny-96">
<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>
---
language: "c++"
tags:
- exbert
- authorship-identification
- fire2020
- pan2020
- ai-soco
- classification
license: "mit"
datasets:
- ai-soco
metrics:
- accuracy
---
# ai-soco-c++-roberta-tiny-clas
## Model description
`ai-soco-c++-roberta-tiny` model fine-tuned on the [AI-SOCO](https://sites.google.com/view/ai-soco-2020) task.
#### How to use
You can use the model directly after tokenizing the text using the provided tokenizer with the model files.
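The same pattern applies here; a short sketch, with the Hub ID `aliosm/ai-soco-c++-roberta-tiny-clas` taken from the ExBERT link below:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "aliosm/ai-soco-c++-roberta-tiny-clas"  # ID taken from the ExBERT link below
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# truncate to the 512-token limit used during fine-tuning
inputs = tokenizer("int main() { return 0; }", truncation=True, max_length=512, return_tensors="pt")
print(model(**inputs).logits.argmax(dim=-1).item())  # predicted author id
```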
#### Limitations and bias
The model is limited to the C++ programming language only.
## Training data
The model was initialized from the [`ai-soco-c++-roberta-tiny`](https://github.com/huggingface/transformers/blob/master/model_cards/aliosm/ai-soco-c++-roberta-tiny) model and trained on the [AI-SOCO](https://sites.google.com/view/ai-soco-2020) dataset for text classification.
## Training procedure
The model was trained on the Google Colab platform using a V100 GPU for 10 epochs with a batch size of 32 and a maximum sequence length of 512 (longer sequences were truncated). Each run of 4 consecutive spaces was converted to a single tab character (`\t`) before tokenization.
## Eval results
The model achieved 87.66%/87.46% accuracy on the AI-SOCO task and ranked 9th.
### BibTeX entry and citation info
```bibtex
@inproceedings{ai-soco-2020-fire,
title = "Overview of the {PAN@FIRE} 2020 Task on {Authorship Identification of SOurce COde (AI-SOCO)}",
author = "Fadel, Ali and Musleh, Husam and Tuffaha, Ibraheem and Al-Ayyoub, Mahmoud and Jararweh, Yaser and Benkhelifa, Elhadj and Rosso, Paolo",
booktitle = "Proceedings of The 12th meeting of the Forum for Information Retrieval Evaluation (FIRE 2020)",
year = "2020"
}
```
<a href="https://huggingface.co/exbert/?model=aliosm/ai-soco-c++-roberta-tiny-clas">
<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>
---
language: "c++"
tags:
- exbert
- authorship-identification
- fire2020
- pan2020
- ai-soco
license: "mit"
datasets:
- ai-soco
metrics:
- perplexity
---
# ai-soco-c++-roberta-tiny
## Model description
A RoBERTa model pre-trained from scratch with 1 layer and 12 attention heads on the [AI-SOCO](https://sites.google.com/view/ai-soco-2020) dataset, which consists of C++ source code crawled from the Codeforces website.
## Intended uses & limitations
The model can be used for code classification, authorship identification and other downstream tasks on the C++ programming language.
#### How to use
You can use the model directly after tokenizing the text using the provided tokenizer with the model files.
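A short sketch with the fill-mask pipeline, with the Hub ID `aliosm/ai-soco-c++-roberta-tiny` taken from the ExBERT link below:
```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="aliosm/ai-soco-c++-roberta-tiny")  # ID from the ExBERT link below
masked = "for (int i = 0; i < n; ++i) cin " + fill_mask.tokenizer.mask_token + " a[i];"
print(fill_mask(masked))
```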
#### Limitations and bias
The model is limited to the C++ programming language only.
## Training data
The model was initialized randomly and trained on the [AI-SOCO](https://sites.google.com/view/ai-soco-2020) dataset, which contains 100K C++ source codes.
## Training procedure
The model was trained on the Google Colab platform with 8 TPU cores for 200 epochs using a 32\*8 batch size, a maximum sequence length of 512 and the MLM objective. Other parameters were left at the defaults of the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script. Each run of 4 consecutive spaces was converted to a single tab character (`\t`) before tokenization.
### BibTeX entry and citation info
```bibtex
@inproceedings{ai-soco-2020-fire,
title = "Overview of the {PAN@FIRE} 2020 Task on {Authorship Identification of SOurce COde (AI-SOCO)}",
author = "Fadel, Ali and Musleh, Husam and Tuffaha, Ibraheem and Al-Ayyoub, Mahmoud and Jararweh, Yaser and Benkhelifa, Elhadj and Rosso, Paolo",
booktitle = "Proceedings of The 12th meeting of the Forum for Information Retrieval Evaluation (FIRE 2020)",
year = "2020"
}
```
<a href="https://huggingface.co/exbert/?model=aliosm/ai-soco-c++-roberta-tiny">
<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>
---
language: pl
tags:
- herbert
license: cc-by-sa-4.0
---
# HerBERT
**[HerBERT](https://en.wikipedia.org/wiki/Zbigniew_Herbert)** is a BERT-based Language Model trained on Polish Corpora
using MLM and SSO objectives with dynamic masking of whole words.
Model training and experiments were conducted with [transformers](https://github.com/huggingface/transformers) in version 2.9.
## Tokenizer
The training dataset was tokenized into subwords using ``CharBPETokenizer``, a character-level byte-pair encoding with
a vocabulary size of 50k tokens. The tokenizer itself was trained with the [tokenizers](https://github.com/huggingface/tokenizers) library.
We kindly encourage you to use the **Fast** version of the tokenizer, namely ``HerbertTokenizerFast``.
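For instance, a minimal sketch of loading the fast tokenizer directly (the example sentence is arbitrary):
```python
from transformers import HerbertTokenizerFast

# explicitly load the fast tokenizer recommended above
tokenizer = HerbertTokenizerFast.from_pretrained("allegro/herbert-base-cased")
print(tokenizer.tokenize("Komputer szybko przetwarza dane."))
```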
## HerBERT usage
Example code:
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModel.from_pretrained("allegro/herbert-base-cased")
output = model(
    **tokenizer.batch_encode_plus(
        [
            (
                "A potem szedł środkiem drogi w kurzawie, bo zamiatał nogami, ślepy dziad prowadzony przez tłustego kundla na sznurku.",
                "A potem leciał od lasu chłopak z butelką, ale ten ujrzawszy księdza przy drodze okrążył go z dala i biegł na przełaj pól do karczmy."
            )
        ],
        padding='longest',
        add_special_tokens=True,
        return_tensors='pt'
    )
)
```
## License
CC BY-SA 4.0
## Authors
The model was trained by the **Allegro Machine Learning Research** team.
You can contact us at: <a href="mailto:klejbenchmark@allegro.pl">klejbenchmark@allegro.pl</a>
---
language: pl
---
# HerBERT tokenizer
**[HerBERT](https://en.wikipedia.org/wiki/Zbigniew_Herbert)** tokenizer is a character-level byte-pair encoding with
a vocabulary size of 50k tokens. The tokenizer was trained on [Wolne Lektury](https://wolnelektury.pl/) and a publicly available subset of the
[National Corpus of Polish](http://nkjp.pl/index.php?page=14&lang=0) with the [fastBPE](https://github.com/glample/fastBPE) library.
The tokenizer utilizes the `XLMTokenizer` implementation from [transformers](https://github.com/huggingface/transformers).
## Tokenizer usage
The HerBERT tokenizer should be used together with the [HerBERT model](https://huggingface.co/allegro/herbert-klej-cased-v1):
```python
from transformers import XLMTokenizer, RobertaModel
tokenizer = XLMTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
model = RobertaModel.from_pretrained("allegro/herbert-klej-cased-v1")
encoded_input = tokenizer.encode("Kto ma lepszą sztukę, ma lepszy rząd – to jasne.", return_tensors='pt')
outputs = model(encoded_input)
```
## License
CC BY-SA 4.0
## Citation
If you use this tokenizer, please cite the following paper:
```
@misc{rybak2020klej,
title={KLEJ: Comprehensive Benchmark for Polish Language Understanding},
author={Piotr Rybak and Robert Mroczkowski and Janusz Tracz and Ireneusz Gawlik},
year={2020},
eprint={2005.00630},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
The paper has been accepted at ACL 2020; as soon as the proceedings appear, we will update the BibTeX entry.
## Authors
The tokenizer was created by the **Allegro Machine Learning Research** team.
You can contact us at: <a href="mailto:klejbenchmark@allegro.pl">klejbenchmark@allegro.pl</a>
---
language: pl
---
# HerBERT
**[HerBERT](https://en.wikipedia.org/wiki/Zbigniew_Herbert)** is a BERT-based Language Model trained on Polish Corpora
using only the MLM objective with dynamic masking of whole words. For more details, please refer to:
[KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://arxiv.org/abs/2005.00630).
## Dataset
The **HerBERT** training dataset is a combination of several publicly available corpora for the Polish language:
| Corpus | Tokens | Texts |
| :------ | ------: | ------: |
| [OSCAR](https://traces1.inria.fr/oscar/)| 6710M | 145M |
| [Open Subtitles](http://opus.nlpl.eu/OpenSubtitles-v2018.php) | 1084M | 1.1M |
| [Wikipedia](https://dumps.wikimedia.org/) | 260M | 1.5M |
| [Wolne Lektury](https://wolnelektury.pl/) | 41M | 5.5k |
| [Allegro Articles](https://allegro.pl/artykuly) | 18M | 33k |
## Tokenizer
The training dataset was tokenized into subwords using the [HerBERT Tokenizer](https://huggingface.co/allegro/herbert-klej-cased-tokenizer-v1), a character-level byte-pair encoding with
a vocabulary size of 50k tokens. The tokenizer itself was trained on [Wolne Lektury](https://wolnelektury.pl/) and a publicly available subset of the
[National Corpus of Polish](http://nkjp.pl/index.php?page=14&lang=0) with the [fastBPE](https://github.com/glample/fastBPE) library.
The tokenizer utilizes the `XLMTokenizer` implementation; for that reason, it should be loaded as `allegro/herbert-klej-cased-tokenizer-v1`.
## HerBERT models summary
| Model | WWM | Cased | Tokenizer | Vocab Size | Batch Size | Train Steps |
| :------ | ------: | ------: | ------: | ------: | ------: | ------: |
| herbert-klej-cased-v1 | YES | YES | BPE | 50K | 570 | 180k |
## Model evaluation
HerBERT was evaluated on the [KLEJ](https://klejbenchmark.com/) benchmark, a publicly available set of nine evaluation tasks for Polish language understanding.
It had the best average performance and obtained the best results on three of the tasks.
| Model | Average | NKJP-NER | CDSC-E | CDSC-R | CBD | PolEmo2.0-IN |PolEmo2.0-OUT | DYK | PSC | AR |
| :------ | ------: | ------: | ------: | ------: | ------: | ------: | ------: | ------: | ------: | ------: |
| herbert-klej-cased-v1 | **80.5** | 92.7 | 92.5 | 91.9 | **50.3** | **89.2** |**76.3** |52.1 |95.3 | 84.5 |
The full leaderboard is available [online](https://klejbenchmark.com/leaderboard).
## HerBERT usage
Model training and experiments were conducted with [transformers](https://github.com/huggingface/transformers) in version 2.0.
Example code:
```python
from transformers import XLMTokenizer, RobertaModel
tokenizer = XLMTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
model = RobertaModel.from_pretrained("allegro/herbert-klej-cased-v1")
encoded_input = tokenizer.encode("Kto ma lepszą sztukę, ma lepszy rząd – to jasne.", return_tensors='pt')
outputs = model(encoded_input)
```
HerBERT can also be loaded using `AutoTokenizer` and `AutoModel`:
```python
tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
model = AutoModel.from_pretrained("allegro/herbert-klej-cased-v1")
```
## License
CC BY-SA 4.0
## Citation
If you use this model, please cite the following paper:
```
@misc{rybak2020klej,
title={KLEJ: Comprehensive Benchmark for Polish Language Understanding},
author={Piotr Rybak and Robert Mroczkowski and Janusz Tracz and Ireneusz Gawlik},
year={2020},
eprint={2005.00630},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
The paper has been accepted at ACL 2020; as soon as the proceedings appear, we will update the BibTeX entry.
## Authors
The model was trained by the **Allegro Machine Learning Research** team.
You can contact us at: <a href="mailto:klejbenchmark@allegro.pl">klejbenchmark@allegro.pl</a>
---
language: pl
tags:
- herbert
license: cc-by-sa-4.0
---
# HerBERT
**[HerBERT](https://en.wikipedia.org/wiki/Zbigniew_Herbert)** is a BERT-based Language Model trained on Polish Corpora
using MLM and SSO objectives with dynamic masking of whole words.
Model training and experiments were conducted with [transformers](https://github.com/huggingface/transformers) in version 2.9.
## Tokenizer
The training dataset was tokenized into subwords using ``CharBPETokenizer``, a character-level byte-pair encoding with
a vocabulary size of 50k tokens. The tokenizer itself was trained with the [tokenizers](https://github.com/huggingface/tokenizers) library.
We kindly encourage you to use the **Fast** version of the tokenizer, namely ``HerbertTokenizerFast``.
## HerBERT usage
Example code:
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-large-cased")
model = AutoModel.from_pretrained("allegro/herbert-large-cased")
output = model(
    **tokenizer.batch_encode_plus(
        [
            (
                "A potem szedł środkiem drogi w kurzawie, bo zamiatał nogami, ślepy dziad prowadzony przez tłustego kundla na sznurku.",
                "A potem leciał od lasu chłopak z butelką, ale ten ujrzawszy księdza przy drodze okrążył go z dala i biegł na przełaj pól do karczmy."
            )
        ],
        padding='longest',
        add_special_tokens=True,
        return_tensors='pt'
    )
)
```
## License
CC BY-SA 4.0
## Authors
The model was trained by the **Allegro Machine Learning Research** team.
You can contact us at: <a href="mailto:klejbenchmark@allegro.pl">klejbenchmark@allegro.pl</a>
---
thumbnail: https://huggingface.co/front/thumbnails/allenai.png
---
# BioMed-RoBERTa-base
BioMed-RoBERTa-base is a language model based on the RoBERTa-base (Liu et al., 2019) architecture. We adapt RoBERTa-base to 2.68 million scientific papers from the [Semantic Scholar](https://www.semanticscholar.org) corpus via continued pretraining. This amounts to 7.55B tokens and 47GB of data. We use the full text of the papers in training, not just abstracts.
Specific details of the adaptive pretraining procedure can be found in Gururangan et al., 2020.
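A minimal usage sketch for extracting contextual embeddings; the Hub ID `allenai/biomed_roberta_base` is an assumption and may need adjusting to the actual repository name:
```python
from transformers import AutoTokenizer, AutoModel

model_id = "allenai/biomed_roberta_base"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("Aspirin inhibits platelet aggregation.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768) for a RoBERTa-base encoder
```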
## Evaluation
BioMed-RoBERTa achieves performance competitive with state-of-the-art models on a number of NLP tasks in the biomedical domain (numbers are mean (standard deviation) over 3+ random seeds).
| Task | Task Type | RoBERTa-base | BioMed-RoBERTa-base |
|--------------|---------------------|--------------|---------------------|
| RCT-180K | Text Classification | 86.4 (0.3) | 86.9 (0.2) |
| ChemProt | Relation Extraction | 81.1 (1.1) | 83.0 (0.7) |
| JNLPBA | NER | 74.3 (0.2) | 75.2 (0.1) |
| BC5CDR | NER | 85.6 (0.1) | 87.8 (0.1) |
| NCBI-Disease | NER | 86.6 (0.3) | 87.1 (0.8) |
More evaluations TBD.
## Citation
If using this model, please cite the following paper:
```bibtex
@inproceedings{domains,
author = {Suchin Gururangan and Ana Marasović and Swabha Swayamdipta and Kyle Lo and Iz Beltagy and Doug Downey and Noah A. Smith},
title = {Don't Stop Pretraining: Adapt Language Models to Domains and Tasks},
year = {2020},
booktitle = {Proceedings of ACL},
}
```
# longformer-base-4096-extra.pos.embd.only
This model is similar to `longformer-base-4096`, but it was pretrained to preserve the RoBERTa weights by freezing all RoBERTa weights and training only the additional position embeddings.
### Citing
If you use `Longformer` in your research, please cite [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150).
```
@article{Beltagy2020Longformer,
title={Longformer: The Long-Document Transformer},
author={Iz Beltagy and Matthew E. Peters and Arman Cohan},
journal={arXiv:2004.05150},
year={2020},
}
```
`Longformer` is an open-source project developed by [the Allen Institute for Artificial Intelligence (AI2)](http://www.allenai.org).
AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.
# longformer-base-4096
[Longformer](https://arxiv.org/abs/2004.05150) is a transformer model for long documents.
`longformer-base-4096` is a BERT-like model started from the RoBERTa checkpoint and pretrained for MLM on long documents. It supports sequences of length up to 4,096.
Longformer uses a combination of sliding-window (local) attention and global attention. Global attention is user-configured based on the task to allow the model to learn task-specific representations.
Please refer to the examples in `modeling_longformer.py` and the paper for more details on how to set global attention.
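As a hedged illustration of that configuration, the sketch below marks only the first token as global; which tokens should receive global attention is task-specific, and `modeling_longformer.py` remains the authoritative reference:
```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

text = " ".join(["Long documents need sliding-window attention."] * 200)
inputs = tokenizer(text, return_tensors="pt")

# 0 = local (sliding-window) attention, 1 = global attention
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # give the <s> token global attention

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)
```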
### Citing
If you use `Longformer` in your research, please cite [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150).
```
@article{Beltagy2020Longformer,
title={Longformer: The Long-Document Transformer},
author={Iz Beltagy and Matthew E. Peters and Arman Cohan},
journal={arXiv:2004.05150},
year={2020},
}
```
`Longformer` is an open-source project developed by [the Allen Institute for Artificial Intelligence (AI2)](http://www.allenai.org).
AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.
# SciBERT
This is the pretrained model presented in [SciBERT: A Pretrained Language Model for Scientific Text](https://www.aclweb.org/anthology/D19-1371/), which is a BERT model trained on scientific text.
The training corpus was papers taken from [Semantic Scholar](https://www.semanticscholar.org). Corpus size is 1.14M papers, 3.1B tokens. We use the full text of the papers in training, not just abstracts.
SciBERT has its own wordpiece vocabulary (scivocab) that's built to best match the training corpus. We trained cased and uncased versions.
Available models include:
* `scibert_scivocab_cased`
* `scibert_scivocab_uncased`
The original repo can be found [here](https://github.com/allenai/scibert).
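A minimal loading sketch; the Hub IDs are assumed to sit under the `allenai` namespace, matching the model names above:
```python
from transformers import AutoTokenizer, AutoModel

# use "allenai/scibert_scivocab_cased" for the cased variant
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

inputs = tokenizer("The transcription factor p53 regulates apoptosis.", return_tensors="pt")
outputs = model(**inputs)
```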
If using these models, please cite the following paper:
```
@inproceedings{beltagy-etal-2019-scibert,
title = "SciBERT: A Pretrained Language Model for Scientific Text",
author = "Beltagy, Iz and Lo, Kyle and Cohan, Arman",
booktitle = "EMNLP",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/D19-1371"
}
```
---
language:
- en
- de
thumbnail:
tags:
- translation
- wmt16
- allenai
license: apache-2.0
datasets:
- wmt16
metrics:
- bleu
---
# FSMT
## Model description
This is a ported version of fairseq-based [wmt16 transformer](https://github.com/jungokasai/deep-shallow/) for en-de.
For more details, please see [Deep Encoder, Shallow Decoder: Reevaluating the Speed-Quality Tradeoff in Machine Translation](https://arxiv.org/abs/2006.10369).
All 3 models are available:
* [wmt16-en-de-dist-12-1](https://huggingface.co/allenai/wmt16-en-de-dist-12-1)
* [wmt16-en-de-dist-6-1](https://huggingface.co/allenai/wmt16-en-de-dist-6-1)
* [wmt16-en-de-12-1](https://huggingface.co/allenai/wmt16-en-de-12-1)
## Intended uses & limitations
#### How to use
```python
from transformers import FSMTForConditionalGeneration, FSMTTokenizer
mname = "allenai/wmt16-en-de-12-1"
tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)
input = "Machine learning is great, isn't it?"
input_ids = tokenizer.encode(input, return_tensors="pt")
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded) # Maschinelles Lernen ist großartig, nicht wahr?
```
#### Limitations and bias
## Training data
Pretrained weights were left identical to the original model released by allenai. For more details, please see the [paper](https://arxiv.org/abs/2006.10369).
## Eval results
Here are the BLEU scores:
model | fairseq | transformers
-------|---------|----------
wmt16-en-de-12-1 | 26.9 | 25.75
The score is slightly below the one reported in the paper, as the researchers don't use `sacrebleu` and measure the score on tokenized outputs. The `transformers` score was measured using `sacrebleu` on detokenized outputs.
The score was calculated using this code:
```bash
git clone https://github.com/huggingface/transformers
cd transformers
export PAIR=en-de
export DATA_DIR=data/$PAIR
export SAVE_DIR=data/$PAIR
export BS=8
export NUM_BEAMS=5
mkdir -p $DATA_DIR
sacrebleu -t wmt16 -l $PAIR --echo src > $DATA_DIR/val.source
sacrebleu -t wmt16 -l $PAIR --echo ref > $DATA_DIR/val.target
echo $PAIR
PYTHONPATH="src:examples/seq2seq" python examples/seq2seq/run_eval.py allenai/wmt16-en-de-12-1 $DATA_DIR/val.source $SAVE_DIR/test_translations.txt --reference_path $DATA_DIR/val.target --score_path $SAVE_DIR/test_bleu.json --bs $BS --task translation --num_beams $NUM_BEAMS
```
## Data Sources
- [training, etc.](http://www.statmt.org/wmt16/)
- [test set](http://matrix.statmt.org/test_sets/newstest2016.tgz?1504722372)
### BibTeX entry and citation info
```
@misc{kasai2020deep,
title={Deep Encoder, Shallow Decoder: Reevaluating the Speed-Quality Tradeoff in Machine Translation},
author={Jungo Kasai and Nikolaos Pappas and Hao Peng and James Cross and Noah A. Smith},
year={2020},
eprint={2006.10369},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
---
language:
- en
- de
thumbnail:
tags:
- translation
- wmt16
- allenai
license: apache-2.0
datasets:
- wmt16
metrics:
- bleu
---
# FSMT
## Model description
This is a ported version of fairseq-based [wmt16 transformer](https://github.com/jungokasai/deep-shallow/) for en-de.
For more details, please see [Deep Encoder, Shallow Decoder: Reevaluating the Speed-Quality Tradeoff in Machine Translation](https://arxiv.org/abs/2006.10369).
All 3 models are available:
* [wmt16-en-de-dist-12-1](https://huggingface.co/allenai/wmt16-en-de-dist-12-1)
* [wmt16-en-de-dist-6-1](https://huggingface.co/allenai/wmt16-en-de-dist-6-1)
* [wmt16-en-de-12-1](https://huggingface.co/allenai/wmt16-en-de-12-1)
## Intended uses & limitations
#### How to use
```python
from transformers import FSMTForConditionalGeneration, FSMTTokenizer
mname = "allenai/wmt16-en-de-dist-12-1"
tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)
input = "Machine learning is great, isn't it?"
input_ids = tokenizer.encode(input, return_tensors="pt")
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded) # Maschinelles Lernen ist großartig, nicht wahr?
```
#### Limitations and bias
## Training data
Pretrained weights were left identical to the original model released by allenai. For more details, please see the [paper](https://arxiv.org/abs/2006.10369).
## Eval results
Here are the BLEU scores:
model | fairseq | transformers
-------|---------|----------
wmt16-en-de-dist-12-1 | 28.3 | 27.52
The score is slightly below the one reported in the paper, as the researchers don't use `sacrebleu` and measure the score on tokenized outputs. The `transformers` score was measured using `sacrebleu` on detokenized outputs.
The score was calculated using this code:
```bash
git clone https://github.com/huggingface/transformers
cd transformers
export PAIR=en-de
export DATA_DIR=data/$PAIR
export SAVE_DIR=data/$PAIR
export BS=8
export NUM_BEAMS=5
mkdir -p $DATA_DIR
sacrebleu -t wmt16 -l $PAIR --echo src > $DATA_DIR/val.source
sacrebleu -t wmt16 -l $PAIR --echo ref > $DATA_DIR/val.target
echo $PAIR
PYTHONPATH="src:examples/seq2seq" python examples/seq2seq/run_eval.py allenai/wmt16-en-de-dist-12-1 $DATA_DIR/val.source $SAVE_DIR/test_translations.txt --reference_path $DATA_DIR/val.target --score_path $SAVE_DIR/test_bleu.json --bs $BS --task translation --num_beams $NUM_BEAMS
```
## Data Sources
- [training, etc.](http://www.statmt.org/wmt16/)
- [test set](http://matrix.statmt.org/test_sets/newstest2016.tgz?1504722372)
### BibTeX entry and citation info
```
@misc{kasai2020deep,
title={Deep Encoder, Shallow Decoder: Reevaluating the Speed-Quality Tradeoff in Machine Translation},
author={Jungo Kasai and Nikolaos Pappas and Hao Peng and James Cross and Noah A. Smith},
year={2020},
eprint={2006.10369},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
---
language:
- en
- de
thumbnail:
tags:
- translation
- wmt16
- allenai
license: apache-2.0
datasets:
- wmt16
metrics:
- bleu
---
# FSMT
## Model description
This is a ported version of fairseq-based [wmt16 transformer](https://github.com/jungokasai/deep-shallow/) for en-de.
For more details, please see [Deep Encoder, Shallow Decoder: Reevaluating the Speed-Quality Tradeoff in Machine Translation](https://arxiv.org/abs/2006.10369).
All 3 models are available:
* [wmt16-en-de-dist-12-1](https://huggingface.co/allenai/wmt16-en-de-dist-12-1)
* [wmt16-en-de-dist-6-1](https://huggingface.co/allenai/wmt16-en-de-dist-6-1)
* [wmt16-en-de-12-1](https://huggingface.co/allenai/wmt16-en-de-12-1)
## Intended uses & limitations
#### How to use
```python
from transformers import FSMTForConditionalGeneration, FSMTTokenizer
mname = "allenai/wmt16-en-de-dist-6-1"
tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)
input = "Machine learning is great, isn't it?"
input_ids = tokenizer.encode(input, return_tensors="pt")
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded) # Maschinelles Lernen ist großartig, nicht wahr?
```
#### Limitations and bias
## Training data
Pretrained weights were left identical to the original model released by allenai. For more details, please see the [paper](https://arxiv.org/abs/2006.10369).
## Eval results
Here are the BLEU scores:
model | fairseq | transformers
-------|---------|----------
wmt16-en-de-dist-6-1 | 27.4 | 27.11
The score is slightly below the one reported in the paper, as the researchers don't use `sacrebleu` and measure the score on tokenized outputs. The `transformers` score was measured using `sacrebleu` on detokenized outputs.
The score was calculated using this code:
```bash
git clone https://github.com/huggingface/transformers
cd transformers
export PAIR=en-de
export DATA_DIR=data/$PAIR
export SAVE_DIR=data/$PAIR
export BS=8
export NUM_BEAMS=5
mkdir -p $DATA_DIR
sacrebleu -t wmt16 -l $PAIR --echo src > $DATA_DIR/val.source
sacrebleu -t wmt16 -l $PAIR --echo ref > $DATA_DIR/val.target
echo $PAIR
PYTHONPATH="src:examples/seq2seq" python examples/seq2seq/run_eval.py allenai/wmt16-en-de-dist-6-1 $DATA_DIR/val.source $SAVE_DIR/test_translations.txt --reference_path $DATA_DIR/val.target --score_path $SAVE_DIR/test_bleu.json --bs $BS --task translation --num_beams $NUM_BEAMS
```
## Data Sources
- [training, etc.](http://www.statmt.org/wmt16/)
- [test set](http://matrix.statmt.org/test_sets/newstest2016.tgz?1504722372)
### BibTeX entry and citation info
```
@misc{kasai2020deep,
title={Deep Encoder, Shallow Decoder: Reevaluating the Speed-Quality Tradeoff in Machine Translation},
author={Jungo Kasai and Nikolaos Pappas and Hao Peng and James Cross and Noah A. Smith},
year={2020},
eprint={2006.10369},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
---
language:
- de
- en
thumbnail:
tags:
- translation
- wmt19
- allenai
license: apache-2.0
datasets:
- wmt19
metrics:
- bleu
---
# FSMT
## Model description
This is a ported version of fairseq-based [wmt19 transformer](https://github.com/jungokasai/deep-shallow/) for de-en.
For more details, please see [Deep Encoder, Shallow Decoder: Reevaluating the Speed-Quality Tradeoff in Machine Translation](https://arxiv.org/abs/2006.10369).
2 models are available:
* [wmt19-de-en-6-6-big](https://huggingface.co/allenai/wmt19-de-en-6-6-big)
* [wmt19-de-en-6-6-base](https://huggingface.co/allenai/wmt19-de-en-6-6-base)
## Intended uses & limitations
#### How to use
```python
from transformers import FSMTForConditionalGeneration, FSMTTokenizer
mname = "allenai/wmt19-de-en-6-6-base"
tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)
input = "Maschinelles Lernen ist großartig, nicht wahr?"
input_ids = tokenizer.encode(input, return_tensors="pt")
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded) # Machine learning is great, isn't it?
```
#### Limitations and bias
## Training data
Pretrained weights were left identical to the original model released by allenai. For more details, please see the [paper](https://arxiv.org/abs/2006.10369).
## Eval results
Here are the BLEU scores:
model | transformers
-------|---------
wmt19-de-en-6-6-base | 38.37
The score was calculated using this code:
```bash
git clone https://github.com/huggingface/transformers
cd transformers
export PAIR=de-en
export DATA_DIR=data/$PAIR
export SAVE_DIR=data/$PAIR
export BS=8
export NUM_BEAMS=5
mkdir -p $DATA_DIR
sacrebleu -t wmt19 -l $PAIR --echo src > $DATA_DIR/val.source
sacrebleu -t wmt19 -l $PAIR --echo ref > $DATA_DIR/val.target
echo $PAIR
PYTHONPATH="src:examples/seq2seq" python examples/seq2seq/run_eval.py allenai/wmt19-de-en-6-6-base $DATA_DIR/val.source $SAVE_DIR/test_translations.txt --reference_path $DATA_DIR/val.target --score_path $SAVE_DIR/test_bleu.json --bs $BS --task translation --num_beams $NUM_BEAMS
```
## Data Sources
- [training, etc.](http://www.statmt.org/wmt19/)
- [test set](http://matrix.statmt.org/test_sets/newstest2019.tgz?1556572561)
### BibTeX entry and citation info
```
@misc{kasai2020deep,
title={Deep Encoder, Shallow Decoder: Reevaluating the Speed-Quality Tradeoff in Machine Translation},
author={Jungo Kasai and Nikolaos Pappas and Hao Peng and James Cross and Noah A. Smith},
year={2020},
eprint={2006.10369},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
---
language:
- de
- en
thumbnail:
tags:
- translation
- wmt19
- allenai
license: apache-2.0
datasets:
- wmt19
metrics:
- bleu
---
# FSMT
## Model description
This is a ported version of fairseq-based [wmt19 transformer](https://github.com/jungokasai/deep-shallow/) for de-en.
For more details, please see [Deep Encoder, Shallow Decoder: Reevaluating the Speed-Quality Tradeoff in Machine Translation](https://arxiv.org/abs/2006.10369).
2 models are available:
* [wmt19-de-en-6-6-big](https://huggingface.co/allenai/wmt19-de-en-6-6-big)
* [wmt19-de-en-6-6-base](https://huggingface.co/allenai/wmt19-de-en-6-6-base)
## Intended uses & limitations
#### How to use
```python
from transformers import FSMTForConditionalGeneration, FSMTTokenizer
mname = "allenai/wmt19-de-en-6-6-big"
tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)
input = "Maschinelles Lernen ist großartig, nicht wahr?"
input_ids = tokenizer.encode(input, return_tensors="pt")
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded) # Machine learning is great, isn't it?
```
#### Limitations and bias
## Training data
Pretrained weights were left identical to the original model released by allenai. For more details, please see the [paper](https://arxiv.org/abs/2006.10369).
## Eval results
Here are the BLEU scores:
model | transformers
-------|---------
wmt19-de-en-6-6-big | 39.9
The score was calculated using this code:
```bash
git clone https://github.com/huggingface/transformers
cd transformers
export PAIR=de-en
export DATA_DIR=data/$PAIR
export SAVE_DIR=data/$PAIR
export BS=8
export NUM_BEAMS=5
mkdir -p $DATA_DIR
sacrebleu -t wmt19 -l $PAIR --echo src > $DATA_DIR/val.source
sacrebleu -t wmt19 -l $PAIR --echo ref > $DATA_DIR/val.target
echo $PAIR
PYTHONPATH="src:examples/seq2seq" python examples/seq2seq/run_eval.py allenai/wmt19-de-en-6-6-big $DATA_DIR/val.source $SAVE_DIR/test_translations.txt --reference_path $DATA_DIR/val.target --score_path $SAVE_DIR/test_bleu.json --bs $BS --task translation --num_beams $NUM_BEAMS
```
## Data Sources
- [training, etc.](http://www.statmt.org/wmt19/)
- [test set](http://matrix.statmt.org/test_sets/newstest2019.tgz?1556572561)
### BibTeX entry and citation info
```
@misc{kasai2020deep,
title={Deep Encoder, Shallow Decoder: Reevaluating the Speed-Quality Tradeoff in Machine Translation},
author={Jungo Kasai and Nikolaos Pappas and Hao Peng and James Cross and Noah A. Smith},
year={2020},
eprint={2006.10369},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
---
language: zh-tw
---
# Model name
Chinese-bert-wwm-electrical-health-record-ner-sequence-labeling
#### How to use
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("chinese-bert-wwm-ehr-ner-sl")
model = AutoModelForTokenClassification.from_pretrained("chinese-bert-wwm-ehr-ner-sl")
```
---
language: multilingual
thumbnail: "https://amberoad.de/images/logo_text.png"
tags:
- msmarco
- multilingual
- passage reranking
license: apache-2.0
datasets:
- msmarco
metrics:
- MRR
widget:
- query: "What is a corporation?"
passage: "A company is incorporated in a specific nation, often within the bounds of a smaller subset of that nation, such as a state or province. The corporation is then governed by the laws of incorporation in that state. A corporation may issue stock, either private or public, or may be classified as a non-stock corporation. If stock is issued, the corporation will usually be governed by its shareholders, either directly or indirectly."
---
# Passage Reranking Multilingual BERT 🔃 🌍
## Model description
**Input:** Supports over 100 languages. See the [list of supported languages](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages) for all available languages.
**Purpose:** This module takes a search query [1] and a passage [2] and calculates whether the passage matches the query.
It can be used to improve Elasticsearch results and boosts relevancy by up to 100%.
**Architecture:** On top of BERT there is a densely connected NN which takes the 768-dimensional [CLS] token as input and provides the output ([arXiv](https://arxiv.org/abs/1901.04085)).
**Output:** A single value between -10 and 10. Better-matching query/passage pairs tend to have a higher score.
## Intended uses & limitations
Both the query [1] and the passage [2] have to fit into 512 tokens together.
As you normally want to rerank the first few dozen search results, keep in mind the inference time of approximately 300 ms/query.
#### How to use
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("amberoad/bert-multilingual-passage-reranking-msmarco")
model = AutoModelForSequenceClassification.from_pretrained("amberoad/bert-multilingual-passage-reranking-msmarco")
```
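To score a query/passage pair with the tokenizer and model loaded above, something along the following lines can be used; treating index 1 of the two-class output as the "relevant" class is an assumption:
```python
import torch

query = "What is a corporation?"
passage = "A company is incorporated in a specific nation, often within the bounds of a smaller subset of that nation."

# encode the pair as one sequence; query and passage together must fit into 512 tokens
inputs = tokenizer(query, passage, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

relevance = torch.softmax(logits, dim=-1)[0, 1].item()  # assumed: index 1 = "passage is relevant"
print(relevance)
```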
This model can be used as a drop-in replacement in the [Nboost Library](https://github.com/koursaros-ai/nboost).
Through this you can directly improve your Elasticsearch results without any coding.
## Training data
This model is trained using the [**Microsoft MS Marco Dataset**](https://microsoft.github.io/msmarco/ "Microsoft MS Marco"). This training dataset contains approximately 400M tuples of a query together with relevant and non-relevant passages. All datasets used for training and evaluation are listed in this [table](https://github.com/microsoft/MSMARCO-Passage-Ranking#data-information-and-formating). The dataset used for training is called *Train Triples Large*, while evaluation was done on *Top 1000 Dev*. There are 6,900 queries in total in the development dataset, where each query is mapped to the top 1,000 passages retrieved using BM25 from the MS MARCO corpus.
## Training procedure
The training is performed the same way as stated in this [README](https://github.com/nyu-dl/dl4marco-bert "NYU Github"). See their excellent paper on [arXiv](https://arxiv.org/abs/1901.04085).
We changed the BERT model from an English-only model to the default multilingual uncased BERT model from [Google](https://huggingface.co/bert-base-multilingual-uncased).
Training was done for 400,000 steps, which took 12 hours on a TPU v3-8.
## Eval results
We see nearly the same performance as the English-only model on the English [Bing Queries Dataset](http://www.msmarco.org/). Although the training data is English only, internal tests on private data showed far higher accuracy in German than all other available models.
Fine-tuned Models | Dependency | Eval Set | Search Boost<a href='#benchmarks'> | Speed on GPU
----------------------------------------------------------------------------------- | ---------------------------------------------------------------------------- | ------------------------------------------------------------------ | ----------------------------------------------------- | ----------------------------------
**`amberoad/Multilingual-uncased-MSMARCO`** (This Model) | <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-blue"/> | <a href ='http://www.msmarco.org/'>bing queries</a> | **+61%** <sub><sup>(0.29 vs 0.18)</sup></sub> | ~300 ms/query <a href='#footnotes'>
`nboost/pt-tinybert-msmarco` | <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-red"/> | <a href ='http://www.msmarco.org/'>bing queries</a> | **+45%** <sub><sup>(0.26 vs 0.18)</sup></sub> | ~50ms/query <a href='#footnotes'>
`nboost/pt-bert-base-uncased-msmarco` | <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-red"/> | <a href ='http://www.msmarco.org/'>bing queries</a> | **+62%** <sub><sup>(0.29 vs 0.18)</sup></sub> | ~300 ms/query<a href='#footnotes'>
`nboost/pt-bert-large-msmarco` | <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-red"/> | <a href ='http://www.msmarco.org/'>bing queries</a> | **+77%** <sub><sup>(0.32 vs 0.18)</sup></sub> | -
`nboost/pt-biobert-base-msmarco` | <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-red"/> | <a href ='https://github.com/naver/biobert-pretrained'>biomed</a> | **+66%** <sub><sup>(0.17 vs 0.10)</sup></sub> | ~300 ms/query<a href='#footnotes'>
This table is taken from [nboost](https://github.com/koursaros-ai/nboost) and extended by the first line.
## Contact Infos
![](https://amberoad.de/images/logo_text.png)
Amberoad is a company focusing on Search and Business Intelligence.
We provide you:
* Advanced Internal Company Search Engines through NLP
* External Search Engines: Find Competitors, Customers, Suppliers
**Get in Contact now to benefit from our Expertise:**
The training and evaluation was performed by [**Philipp Reissel**](https://reissel.eu/) and [**Igli Manaj**](https://github.com/iglimanaj)
[![Amberoad](https://i.stack.imgur.com/gVE0j.png) Linkedin](https://de.linkedin.com/company/amberoad) | <svg xmlns="http://www.w3.org/2000/svg" x="0px" y="0px"
width="32" height="32"
viewBox="0 0 172 172"
style=" fill:#000000;"><g fill="none" fill-rule="nonzero" stroke="none" stroke-width="1" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="10" stroke-dasharray="" stroke-dashoffset="0" font-family="none" font-weight="none" font-size="none" text-anchor="none" style="mix-blend-mode: normal"><path d="M0,172v-172h172v172z" fill="none"></path><g fill="#e67e22"><path d="M37.625,21.5v86h96.75v-86h-5.375zM48.375,32.25h10.75v10.75h-10.75zM69.875,32.25h10.75v10.75h-10.75zM91.375,32.25h32.25v10.75h-32.25zM48.375,53.75h75.25v43h-75.25zM80.625,112.875v17.61572c-1.61558,0.93921 -2.94506,2.2687 -3.88428,3.88428h-49.86572v10.75h49.86572c1.8612,3.20153 5.28744,5.375 9.25928,5.375c3.97183,0 7.39808,-2.17347 9.25928,-5.375h49.86572v-10.75h-49.86572c-0.93921,-1.61558 -2.2687,-2.94506 -3.88428,-3.88428v-17.61572z"></path></g></g></svg>[Homepage](https://de.linkedin.com/company/amberoad) | [Email](info@amberoad.de)