"vscode:/vscode.git/clone" did not exist on "3170af71e1df8a9f7b7ef33ba8388dc630d65d79"
Unverified commit 3552d0e0, authored by Julien Chaumond, committed via GitHub

[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)



* rm all model cards

* Update the .rst

@sgugger it is still not super crystal clear/streamlined so let me know if any ideas to make it simpler

* Add a rootlevel README.md with simple instructions/context

* Update docs/source/model_sharing.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* make style

* rm all model cards
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
---
language: zu
---
# zuBERTa
zuBERTa is a RoBERTa-style transformer language model trained on Zulu text.
## Intended uses & limitations
The model can be used to obtain embeddings for downstream tasks such as question answering.
#### How to use
```python
>>> from transformers import pipeline
>>> from transformers import AutoTokenizer, AutoModelWithLMHead
>>> tokenizer = AutoTokenizer.from_pretrained("MoseliMotsoehli/zuBERTa")
>>> model = AutoModelWithLMHead.from_pretrained("MoseliMotsoehli/zuBERTa")
>>> unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
>>> unmasker("Abafika eNkandla bafika sebeholwa <mask> uMpongo kaZingelwayo.")
[
{
"sequence": "<s>Abafika eNkandla bafika sebeholwa khona uMpongo kaZingelwayo.</s>",
"score": 0.050459690392017365,
"token": 555,
"token_str": "Ġkhona"
},
{
"sequence": "<s>Abafika eNkandla bafika sebeholwa inkosi uMpongo kaZingelwayo.</s>",
"score": 0.03668094798922539,
"token": 2321,
"token_str": "Ġinkosi"
},
{
"sequence": "<s>Abafika eNkandla bafika sebeholwa ubukhosi uMpongo kaZingelwayo.</s>",
"score": 0.028774697333574295,
"token": 5101,
"token_str": "Ġubukhosi"
}
]
```
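Since the card describes using zuBERTa for embeddings on downstream tasks, here is a minimal sketch (not part of the original card) of extracting a sentence embedding with `AutoModel`; the mean-pooling step is an illustrative choice, not the author's prescribed method.
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("MoseliMotsoehli/zuBERTa")
model = AutoModel.from_pretrained("MoseliMotsoehli/zuBERTa")

inputs = tokenizer("Abafika eNkandla bafika sebeholwa inkosi uMpongo kaZingelwayo.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the last hidden state as a simple sentence embedding
embedding = outputs[0].mean(dim=1)
print(embedding.shape)
```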
## Training data
1. 30k sentences of text from the 2018 Zulu corpus of the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download), collected from news articles and creative writing.
2. ~7,500 articles of human-generated translations scraped from the Zulu [Wikipedia](https://zu.wikipedia.org/wiki/Special:AllPages).
### BibTeX entry and citation info
```bibtex
@inproceedings{motsoehli2020towards,
  author = {Moseli Motsoehli},
  title = {Towards transformation of Southern African language models through transformers.},
  year = {2020}
}
```
---
language: it
---
# UmBERTo Commoncrawl Cased
[UmBERTo](https://github.com/musixmatchresearch/umberto) is a RoBERTa-based language model trained on large Italian corpora, using two innovative approaches: SentencePiece and Whole Word Masking. Now available on the Hugging Face model hub at [Musixmatch/umberto-commoncrawl-cased-v1](https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1).
<p align="center">
<img src="https://user-images.githubusercontent.com/7140210/72913702-d55a8480-3d3d-11ea-99fc-f2ef29af4e72.jpg" width="700"> </br>
Marco Lodola, Monument to Umberto Eco, Alessandria 2019
</p>
## Dataset
UmBERTo-Commoncrawl-Cased uses the Italian subcorpus of [OSCAR](https://traces1.inria.fr/oscar/) as the training set for the language model. We used the deduplicated version of the Italian corpus, consisting of 70 GB of plain text, 210M sentences and 11B words; the sentences were filtered and shuffled at line level for use in NLP research.
## Pre-trained model
| Model | WWM | Cased | Tokenizer | Vocab Size | Train Steps | Download |
| ------ | ------ | ------ | ------ | ------ |------ | ------ |
| `umberto-commoncrawl-cased-v1` | YES | YES | SPM | 32K | 125k | [Link](http://bit.ly/35zO7GH) |
This model was trained with [SentencePiece](https://github.com/google/sentencepiece) and Whole Word Masking.
## Downstream Tasks
These results refer to the umberto-commoncrawl-cased model. All details are available on the official [UmBERTo](https://github.com/musixmatchresearch/umberto) page.
#### Named Entity Recognition (NER)
| Dataset | F1 | Precision | Recall | Accuracy |
| ------ | ------ | ------ | ------ | ------ |
| **ICAB-EvalITA07** | **87.565** | 86.596 | 88.556 | 98.690 |
| **WikiNER-ITA** | **92.531** | 92.509 | 92.553 | 99.136 |
#### Part of Speech (POS)
| Dataset | F1 | Precision | Recall | Accuracy |
| ------ | ------ | ------ | ------ | ------ |
| **UD_Italian-ISDT** | 98.870 | 98.861 | 98.879 | **98.977** |
| **UD_Italian-ParTUT** | 98.786 | 98.812 | 98.760 | **98.903** |
## Usage
##### Load UmBERTo with AutoModel, AutoTokenizer:
```python
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")
umberto = AutoModel.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")
encoded_input = tokenizer.encode("Umberto Eco è stato un grande scrittore")
input_ids = torch.tensor(encoded_input).unsqueeze(0) # Batch size 1
outputs = umberto(input_ids)
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output
```
##### Predict masked token:
```python
from transformers import pipeline
fill_mask = pipeline(
"fill-mask",
model="Musixmatch/umberto-commoncrawl-cased-v1",
tokenizer="Musixmatch/umberto-commoncrawl-cased-v1"
)
result = fill_mask("Umberto Eco è <mask> un grande scrittore")
# {'sequence': '<s> Umberto Eco è considerato un grande scrittore</s>', 'score': 0.18599839508533478, 'token': 5032}
# {'sequence': '<s> Umberto Eco è stato un grande scrittore</s>', 'score': 0.17816807329654694, 'token': 471}
# {'sequence': '<s> Umberto Eco è sicuramente un grande scrittore</s>', 'score': 0.16565583646297455, 'token': 2654}
# {'sequence': '<s> Umberto Eco è indubbiamente un grande scrittore</s>', 'score': 0.0932890921831131, 'token': 17908}
# {'sequence': '<s> Umberto Eco è certamente un grande scrittore</s>', 'score': 0.054701317101716995, 'token': 5269}
```
## Citation
All of the original datasets are publicly available or were released with the owners' grant. The datasets are all released under a CC0 or CCBY license.
* UD Italian-ISDT Dataset [Github](https://github.com/UniversalDependencies/UD_Italian-ISDT)
* UD Italian-ParTUT Dataset [Github](https://github.com/UniversalDependencies/UD_Italian-ParTUT)
* I-CAB (Italian Content Annotation Bank), EvalITA [Page](http://www.evalita.it/)
* WIKINER [Page](https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500) , [Paper](https://www.sciencedirect.com/science/article/pii/S0004370212000276?via%3Dihub)
```
@inproceedings {magnini2006annotazione,
title = {Annotazione di contenuti concettuali in un corpus italiano: I - CAB},
author = {Magnini,Bernardo and Cappelli,Amedeo and Pianta,Emanuele and Speranza,Manuela and Bartalesi Lenzi,V and Sprugnoli,Rachele and Romano,Lorenza and Girardi,Christian and Negri,Matteo},
booktitle = {Proc.of SILFI 2006},
year = {2006}
}
@inproceedings {magnini2006cab,
title = {I - CAB: the Italian Content Annotation Bank.},
author = {Magnini,Bernardo and Pianta,Emanuele and Girardi,Christian and Negri,Matteo and Romano,Lorenza and Speranza,Manuela and Lenzi,Valentina Bartalesi and Sprugnoli,Rachele},
booktitle = {LREC},
pages = {963--968},
year = {2006},
organization = {Citeseer}
}
```
## Authors
**Loreto Parisi**: `loreto at musixmatch dot com`, [loretoparisi](https://github.com/loretoparisi)
**Simone Francia**: `simone.francia at musixmatch dot com`, [simonefrancia](https://github.com/simonefrancia)
**Paolo Magnani**: `paul.magnani95 at gmail dot com`, [paulthemagno](https://github.com/paulthemagno)
## About Musixmatch AI
![Musxmatch Ai mac app icon-128](https://user-images.githubusercontent.com/163333/72244273-396aa380-35ee-11ea-894b-4ea48230c02b.png)
We do Machine Learning and Artificial Intelligence @[musixmatch](https://twitter.com/Musixmatch)
Follow us on [Twitter](https://twitter.com/musixmatchai) [Github](https://github.com/musixmatchresearch)
---
language: it
---
# UmBERTo Wikipedia Uncased
[UmBERTo](https://github.com/musixmatchresearch/umberto) is a RoBERTa-based language model trained on large Italian corpora, using two innovative approaches: SentencePiece and Whole Word Masking. Now available on the Hugging Face model hub at [Musixmatch/umberto-commoncrawl-cased-v1](https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1).
<p align="center">
<img src="https://user-images.githubusercontent.com/7140210/72913702-d55a8480-3d3d-11ea-99fc-f2ef29af4e72.jpg" width="700"> </br>
Marco Lodola, Monument to Umberto Eco, Alessandria 2019
</p>
## Dataset
UmBERTo-Wikipedia-Uncased is trained on a relatively small corpus (~7 GB) extracted from [Wikipedia-ITA](https://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/).
## Pre-trained model
| Model | WWM | Cased | Tokenizer | Vocab Size | Train Steps | Download |
| ------ | ------ | ------ | ------ | ------ |------ | ------ |
| `umberto-wikipedia-uncased-v1` | YES | NO | SPM | 32K | 100k | [Link](http://bit.ly/35wbSj6) |
This model was trained with [SentencePiece](https://github.com/google/sentencepiece) and Whole Word Masking.
## Downstream Tasks
These results refer to the umberto-wikipedia-uncased model. All details are available on the official [UmBERTo](https://github.com/musixmatchresearch/umberto) page.
#### Named Entity Recognition (NER)
| Dataset | F1 | Precision | Recall | Accuracy |
| ------ | ------ | ------ | ------ | ----- |
| **ICAB-EvalITA07** | **86.240** | 85.939 | 86.544 | 98.534 |
| **WikiNER-ITA** | **90.483** | 90.328 | 90.638 | 98.661 |
#### Part of Speech (POS)
| Dataset | F1 | Precision | Recall | Accuracy |
| ------ | ------ | ------ | ------ | ------ |
| **UD_Italian-ISDT** | 98.563 | 98.508 | 98.618 | **98.717** |
| **UD_Italian-ParTUT** | 97.810 | 97.835 | 97.784 | **98.060** |
## Usage
##### Load UmBERTo Wikipedia Uncased with AutoModel, AutoTokenizer:
```python
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")
umberto = AutoModel.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")
encoded_input = tokenizer.encode("Umberto Eco è stato un grande scrittore")
input_ids = torch.tensor(encoded_input).unsqueeze(0) # Batch size 1
outputs = umberto(input_ids)
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output
```
##### Predict masked token:
```python
from transformers import pipeline
fill_mask = pipeline(
"fill-mask",
model="Musixmatch/umberto-wikipedia-uncased-v1",
tokenizer="Musixmatch/umberto-wikipedia-uncased-v1"
)
result = fill_mask("Umberto Eco è <mask> un grande scrittore")
# {'sequence': '<s> umberto eco è stato un grande scrittore</s>', 'score': 0.5784581303596497, 'token': 361}
# {'sequence': '<s> umberto eco è anche un grande scrittore</s>', 'score': 0.33813193440437317, 'token': 269}
# {'sequence': '<s> umberto eco è considerato un grande scrittore</s>', 'score': 0.027196012437343597, 'token': 3236}
# {'sequence': '<s> umberto eco è diventato un grande scrittore</s>', 'score': 0.013716378249228, 'token': 5742}
# {'sequence': '<s> umberto eco è inoltre un grande scrittore</s>', 'score': 0.010662357322871685, 'token': 1030}
```
## Citation
All of the original datasets are publicly available or were released with the owners' grant. The datasets are all released under a CC0 or CCBY license.
* UD Italian-ISDT Dataset [Github](https://github.com/UniversalDependencies/UD_Italian-ISDT)
* UD Italian-ParTUT Dataset [Github](https://github.com/UniversalDependencies/UD_Italian-ParTUT)
* I-CAB (Italian Content Annotation Bank), EvalITA [Page](http://www.evalita.it/)
* WIKINER [Page](https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500) , [Paper](https://www.sciencedirect.com/science/article/pii/S0004370212000276?via%3Dihub)
```
@inproceedings {magnini2006annotazione,
title = {Annotazione di contenuti concettuali in un corpus italiano: I - CAB},
author = {Magnini,Bernardo and Cappelli,Amedeo and Pianta,Emanuele and Speranza,Manuela and Bartalesi Lenzi,V and Sprugnoli,Rachele and Romano,Lorenza and Girardi,Christian and Negri,Matteo},
booktitle = {Proc.of SILFI 2006},
year = {2006}
}
@inproceedings {magnini2006cab,
title = {I - CAB: the Italian Content Annotation Bank.},
author = {Magnini,Bernardo and Pianta,Emanuele and Girardi,Christian and Negri,Matteo and Romano,Lorenza and Speranza,Manuela and Lenzi,Valentina Bartalesi and Sprugnoli,Rachele},
booktitle = {LREC},
pages = {963--968},
year = {2006},
organization = {Citeseer}
}
```
## Authors
**Loreto Parisi**: `loreto at musixmatch dot com`, [loretoparisi](https://github.com/loretoparisi)
**Simone Francia**: `simone.francia at musixmatch dot com`, [simonefrancia](https://github.com/simonefrancia)
**Paolo Magnani**: `paul.magnani95 at gmail dot com`, [paulthemagno](https://github.com/paulthemagno)
## About Musixmatch AI
![Musxmatch Ai mac app icon-128](https://user-images.githubusercontent.com/163333/72244273-396aa380-35ee-11ea-894b-4ea48230c02b.png)
We do Machine Learning and Artificial Intelligence @[musixmatch](https://twitter.com/Musixmatch)
Follow us on [Twitter](https://twitter.com/musixmatchai) [Github](https://github.com/musixmatchresearch)
# MS-BERT
## Introduction
This repository provides codes and models of MS-BERT.
MS-BERT was pre-trained on neurological examination notes of Multiple Sclerosis (MS) patients at St. Michael's Hospital in Toronto, Canada.
## Data
The dataset contained approximately 75,000 clinical notes for about 5,000 patients, totaling over 35.7 million words.
These notes were collected from patients who visited St. Michael's Hospital MS Clinic between 2015 and 2019.
The notes contained a variety of information pertaining to a neurological exam.
For example, a note can contain information on the patient's condition, their progress over time and diagnosis.
The gender split within the dataset was observed to be 72% female and 28% male ([which reflects the natural discrepancy seen in MS][1]).
Further sections describe how MS-BERT was pre-trained through the use of these clinically relevant and rich neurological notes.
## Data pre-processing
The data was pre-processed to remove any identifying information, including patient names, doctor names, hospital names, patient identification numbers, phone numbers, addresses, and times. To de-identify the information, we used a curated database that contained patient and doctor information, paired with regular expressions to find and remove any identifying pieces of information. Each of these identifiers was replaced with a specific token. These tokens were chosen based on three criteria: (1) they belong to the current BERT vocab, (2) they have roughly the same semantic meaning as the word they are replacing, and (3) the token is not found in the original unprocessed dataset. The replacements that met the criteria above were as follows:
- Female first names -> Lucie
- Male first names -> Ezekiel
- Last/family names -> Salamanca
- Dates -> 2010s
- Patient IDs -> 999
- Phone numbers -> 1718
- Addresses -> Silesia
- Time -> 1610
- Locations/Hospital/Clinic names -> Troy
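To make the substitution scheme concrete, here is a small illustrative sketch of regex-based replacement in the spirit described above; the patterns and the example note are hypothetical, and the real pipeline also relied on a curated database of patient and doctor names.
```python
import re

# Hypothetical patterns for a few of the identifier classes listed above
REPLACEMENTS = [
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "1718"),               # phone numbers
    (re.compile(r"\b\d{7}\b"), "999"),                             # patient IDs (example pattern)
    (re.compile(r"\b\d{1,2}:\d{2}\s?(AM|PM)?\b", re.I), "1610"),   # times
]

def deidentify(note: str) -> str:
    for pattern, token in REPLACEMENTS:
        note = pattern.sub(token, note)
    return note

print(deidentify("Seen at 10:30 AM, patient 1234567, callback 416-555-0199."))
```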
## Pre-training
The starting point for our model is the already pre-trained and fine-tuned BLUE-BERT base. We further pre-train it on the masked language modelling task using the huggingface transformers [library](https://github.com/huggingface).
The hyperparameters can be found in the config file in this repository or [here](https://s3.amazonaws.com/models.huggingface.co/bert/NLP4H/ms_bert/config.json).
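As a rough sketch of what continued masked-language-model pretraining with the transformers library can look like (this is not the authors' exact script; the checkpoint path, file name, and hyperparameters are placeholders):
```python
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, LineByLineTextDataset,
                          Trainer, TrainingArguments)

model_name = "path/to/blue-bert-base"  # placeholder for the BLUE-BERT base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="deidentified_notes.txt",  # placeholder path to the de-identified notes
    block_size=128,
)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ms_bert", num_train_epochs=3),  # illustrative values
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```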
## Acknowledgements
We would like to thank the researchers and staff at the Data Science and Advanced Analytics (DSAA) department, St. Michael’s Hospital, for providing consistent support and guidance throughout this project.
We would also like to thank Dr. Marzyeh Ghassemi, Taylor Killan, Nathan Ng and Haoran Zhang for providing us the opportunity to work on this exciting project.
## Disclaimer
MS-BERT shows the results of research conducted at the Data Science and Advanced Analytics (DSAA) department, St. Michael’s Hospital. The results produced by MS-BERT are not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not make decisions about their health solely on the basis of the results produced by MS-BERT. St. Michael’s Hospital does not independently verify the validity or utility of the results produced by MS-BERT. If you have questions about the results produced by MS-BERT please consult a healthcare professional. If you would like more information about the research conducted at DSAA please contact [Zhen Yang](mailto:zhen.yang@unityhealth.to). If you would like more information on neurological examination notes please contact [Dr. Tony Antoniou](mailto:tony.antoniou@unityhealth.to) or [Dr. Jiwon Oh](mailto:jiwon.oh@unityhealth.to) from the MS clinic at St. Michael's Hospital.
[1]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3707353/
---
language: kn
---
# Welcome to KanBERTo (ಕನ್ಬರ್ಟೋ)
## Model Description
> This is a small language model for the [Kannada](https://en.wikipedia.org/wiki/Kannada) language, trained on 1M data samples taken from the
[OSCAR page](https://traces1.inria.fr/oscar/files/compressed-orig/kn.txt.gz)
## Training params
- **Dataset** - 1M data samples from the [OSCAR page](https://traces1.inria.fr/oscar/) were used to train this model. Although the full dataset is 1.7 GB, I picked only 1M samples due to the resource constraints of training; if you are interested in collaborating and have the computational resources to train on the full set, you are most welcome to do so.
- **Preprocessing** - ByteLevelBPETokenizer is used to tokenize the sentences at the byte level, with the vocabulary size set to 52k as per the standard value given by 🤗.
- **Hyperparameters** (a tokenizer-training sketch follows this list):
  - __ByteLevelBPETokenizer__: vocabulary size = 52_000, min_frequency = 2
  - __Trainer__: num_train_epochs=12 (trained for 12 epochs), per_gpu_train_batch_size=64 (batch size of 64), save_steps=10_000 (save the model every 10k steps), save_total_limit=2 (keep at most 2 checkpoints)
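A minimal sketch of the tokenizer-training step with the values listed above, using the 🤗 tokenizers library; the corpus file name and output directory are placeholders.
```python
import os
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["kn_oscar_1M.txt"],  # placeholder path to the 1M-sample Kannada corpus
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

os.makedirs("KanBERTo", exist_ok=True)
tokenizer.save_model("KanBERTo")  # writes vocab.json and merges.txt
```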
**Intended uses & limitations**
This is for anyone who wants to make use of Kannada language models for tasks like language generation, translation, and many other use cases.
**Whatever else is helpful!**
If you are interested in collaborating, feel free to reach out to me: [Naveen](mailto:naveen.maltesh@gmail.com)
# BERT-Small CORD-19 fine-tuned on SQuAD 2.0
[bert-small-cord19 model](https://huggingface.co/NeuML/bert-small-cord19) fine-tuned on SQuAD 2.0
## Building the model
```bash
python run_squad.py \
    --model_type bert \
    --model_name_or_path bert-small-cord19 \
    --do_train \
    --do_eval \
    --do_lower_case \
    --version_2_with_negative \
    --train_file train-v2.0.json \
    --predict_file dev-v2.0.json \
    --per_gpu_train_batch_size 8 \
    --learning_rate 3e-5 \
    --num_train_epochs 3.0 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir bert-small-cord19-squad2 \
    --save_steps 0 \
    --threads 8 \
    --overwrite_cache \
    --overwrite_output_dir
```
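For reference, a short usage sketch (not part of the original card) of the resulting model with the question-answering pipeline, assuming it is published as `NeuML/bert-small-cord19-squad2`; the question and context are illustrative.
```python
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="NeuML/bert-small-cord19-squad2",
    tokenizer="NeuML/bert-small-cord19-squad2",
)
print(qa({
    "question": "What is the median incubation period?",
    "context": "The incubation period is around 5 days (range: 4-7 days).",
}))
```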
# BERT-Small fine-tuned on CORD-19 dataset
[BERT L6_H-512_A-8 model](https://huggingface.co/google/bert_uncased_L-6_H-512_A-8) fine-tuned on the [CORD-19 dataset](https://www.semanticscholar.org/cord19).
## CORD-19 data subset
The training data for this model is stored as a [Kaggle dataset](https://www.kaggle.com/davidmezzetti/cord19-qa?select=cord19.txt). The training
data is a subset of the full corpus, focusing on high-quality articles with detected study designs.
## Building the model
```bash
python run_language_modeling.py \
    --model_type bert \
    --model_name_or_path google/bert_uncased_L-6_H-512_A-8 \
    --do_train \
    --mlm \
    --line_by_line \
    --block_size 512 \
    --train_data_file cord19.txt \
    --per_gpu_train_batch_size 4 \
    --learning_rate 3e-5 \
    --num_train_epochs 3.0 \
    --output_dir bert-small-cord19 \
    --save_steps 0 \
    --overwrite_output_dir
```
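Similarly, a short illustrative sketch (not from the original card) of querying the resulting masked language model, assuming it is published as `NeuML/bert-small-cord19`; the example sentence is made up.
```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="NeuML/bert-small-cord19",
    tokenizer="NeuML/bert-small-cord19",
)
print(fill_mask("The virus spreads primarily through respiratory [MASK]."))
```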
# BERT-Small fine-tuned on CORD-19 QA dataset
[bert-small-cord19-squad model](https://huggingface.co/NeuML/bert-small-cord19-squad2) fine-tuned on the [CORD-19 QA dataset](https://www.kaggle.com/davidmezzetti/cord19-qa?select=cord19-qa.json).
## CORD-19 QA dataset
The CORD-19 QA dataset is a SQuAD 2.0 formatted list of question, context, answer combinations covering the [CORD-19 dataset](https://www.semanticscholar.org/cord19).
## Building the model
```bash
python run_squad.py \
--model_type bert \
--model_name_or_path bert-small-cord19-squad \
--do_train \
--do_lower_case \
--version_2_with_negative \
--train_file cord19-qa.json \
--per_gpu_train_batch_size 8 \
--learning_rate 5e-5 \
--num_train_epochs 10.0 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir bert-small-cord19qa \
--save_steps 0 \
--threads 8 \
--overwrite_cache \
--overwrite_output_dir
```
## Testing the model
Example usage below:
```python
from transformers import pipeline
qa = pipeline(
"question-answering",
model="NeuML/bert-small-cord19qa",
tokenizer="NeuML/bert-small-cord19qa"
)
qa({
"question": "What is the median incubation period?",
"context": "The incubation period is around 5 days (range: 4-7 days) with a maximum of 12-13 day"
})
qa({
"question": "What is the incubation period range?",
"context": "The incubation period is around 5 days (range: 4-7 days) with a maximum of 12-13 day"
})
qa({
"question": "What type of surfaces does it persist?",
"context": "The virus can survive on surfaces for up to 72 hours such as plastic and stainless steel ."
})
```
```json
{"score": 0.5970273583242793, "start": 32, "end": 38, "answer": "5 days"}
{"score": 0.999555868193891, "start": 39, "end": 56, "answer": "(range: 4-7 days)"}
{"score": 0.9992726505196998, "start": 61, "end": 88, "answer": "plastic and stainless steel"}
```
---
language: vi
---
# BERT for Vietnamese trained on more than 20 GB of news data
Applied to the sentiment analysis task using [AIViVN's comments dataset](https://www.aivivn.com/contests/6).
The model achieved 0.90268 on the public leaderboard (the winner's score was 0.90087).
Bert4news is used in ViNLP, a Vietnamese toolkit for word segmentation and Named Entity Recognition (https://github.com/bino282/ViNLP).
*************** New: Mar 11, 2020 ***************
**[BERT](https://github.com/google-research/bert)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805).
We use word-level SentencePiece, basic BERT tokenization, and the same config as BERT base with lowercase = False.
You can download the trained model:
- [tensorflow](https://drive.google.com/file/d/1X-sRDYf7moS_h61J3L79NkMVGHP-P-k5/view?usp=sharing).
- [pytorch](https://drive.google.com/file/d/11aFSTpYIurn-oI2XpAmcCTccB_AonMOu/view?usp=sharing).
Use with huggingface/transformers
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("NlpHUST/vibert4news-base-cased")
bert_model = AutoModel.from_pretrained("NlpHUST/vibert4news-base-cased")

line = "Tôi là sinh viên trường Bách Khoa Hà Nội ."
input_id = tokenizer.encode(line, add_special_tokens=True)
att_mask = [int(token_id > 0) for token_id in input_id]
input_ids = torch.tensor([input_id])
att_masks = torch.tensor([att_mask])
with torch.no_grad():
    features = bert_model(input_ids, att_masks)
print(features)
```
Run training with the base config:
``` bash
python train_pytorch.py \
--model_path=bert4news.pytorch \
--max_len=200 \
--batch_size=16 \
--epochs=6 \
--lr=2e-5
```
### Contact information
For personal communication related to this project, please contact Nha Nguyen Van (nha282@gmail.com).
---
language: he
thumbnail: https://avatars1.githubusercontent.com/u/3617152?norod.jpg
widget:
- text: "<|startoftext|>החוק השני של מועדון קרב הוא"
- text: "<|startoftext|>ראש הממשלה בן גוריון"
- text: "<|startoftext|>למידת מכונה (סרט)"
- text: "<|startoftext|>מנשה פומפרניקל"
- text: "<|startoftext|>אי שוויון "
license: mit
---
# hewiki-articles-distilGPT2py-il
## A tiny GPT2 model for generating Hebrew text
A distilGPT2 sized model. <br>
Training data was hewiki-20200701-pages-articles-multistream.xml.bz2 from https://dumps.wikimedia.org/hewiki/20200701/ <br>
XML has been converted to plain text using Wikipedia Extractor http://medialab.di.unipi.it/wiki/Wikipedia_Extractor <br>
I then added <|startoftext|> and <|endoftext|> markers and deleted empty lines. <br>
#### How to use
```python
import torch
import torch.nn as nn
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("Norod78/hewiki-articles-distilGPT2py-il")
model = GPT2LMHeadModel.from_pretrained("Norod78/hewiki-articles-distilGPT2py-il").eval()

bos_token = tokenizer.bos_token  # Beginning of sentence
eos_token = tokenizer.eos_token  # End of sentence

def generate_word(model, tokens_tensor, temperature=1.0):
    """
    Sample a word given a tensor of tokens of previous words from a model. Given
    the words we have, sample a plausible word. Temperature is used for
    controlling randomness. If using temperature==0 we simply use a greedy arg max.
    Else, we sample from a multinomial distribution using a lower inverse
    temperature to allow for more randomness to escape repetitions.
    """
    with torch.no_grad():
        outputs = model(tokens_tensor)
        predictions = outputs[0]
    if temperature > 0:
        # Make the distribution more or less skewed based on the temperature
        predictions = outputs[0] / temperature
        # Sample from the distribution
        softmax = nn.Softmax(dim=0)
        predicted_index = torch.multinomial(softmax(predictions[0, -1, :]), 1).item()
    else:
        # Simply take the arg-max of the distribution
        predicted_index = torch.argmax(predictions[0, -1, :]).item()
    # Decode the encoding to the corresponding word
    predicted_text = tokenizer.decode([predicted_index])
    return predicted_text

def generate_sentence(model, tokenizer, initial_text, temperature=1.0):
    """Generate a sentence given some initial text using a model and a tokenizer.
    Returns the new sentence."""
    # Encode a text inputs
    text = ""
    sentence = text
    # We avoid an infinite loop by setting a maximum range
    for i in range(0, 84):
        indexed_tokens = tokenizer.encode(initial_text + text)
        # Convert indexed tokens in a PyTorch tensor
        tokens_tensor = torch.tensor([indexed_tokens])
        new_word = generate_word(model, tokens_tensor, temperature=temperature)
        # Here the temperature is slowly decreased with each generated word,
        # this ensures that the sentence (ending) makes more sense.
        # We don't decrease to a temperature of 0.0 to leave some randomness in.
        if temperature < (1 - 0.008):
            temperature += 0.008
        else:
            temperature = 0.996
        text = text + new_word
        # Stop generating new words when we have reached the end of the line or the poem
        if eos_token in new_word:
            # returns new sentence and whether poem is done
            return (text.replace(eos_token, "").strip(), True)
        elif '/' in new_word:
            return (text.strip(), False)
        elif bos_token in new_word:
            return (text.replace(bos_token, "").strip(), False)
    return (text, True)

for output_num in range(1, 5):
    init_text = "בוקר טוב"
    text = bos_token + init_text
    for i in range(0, 84):
        sentence = generate_sentence(model, tokenizer, text, temperature=0.9)
        text = init_text + sentence[0]
        print(text)
        if sentence[1] == True:
            break
```
---
language:
- ach
- en
tags:
- translation
license: cc-by-4.0
datasets:
- JW300
metrics:
- bleu
---
# HEL-ACH-EN
## Model description
MT model translating Acholi to English, initialized with weights from [opus-mt-luo-en](https://huggingface.co/Helsinki-NLP/opus-mt-luo-en) on HuggingFace.
## Intended uses & limitations
Machine Translation experiments. Do not use for sensitive tasks.
#### How to use
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("Ogayo/Hel-ach-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Ogayo/Hel-ach-en")
```
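Beyond loading the model, a minimal translation sketch follows; the input string is a placeholder rather than a real Acholi sentence from the card.
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Ogayo/Hel-ach-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Ogayo/Hel-ach-en")

# Placeholder input; replace with an actual Acholi sentence
batch = tokenizer(["<Acholi sentence here>"], return_tensors="pt")
generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```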
#### Limitations and bias
Trained on Jehovah's Witnesses data, so it contains their views and Christian views.
## Training data
Trained on OPUS JW300 data.
Initialized with weights from [opus-mt-luo-en](https://huggingface.co/Helsinki-NLP/opus-mt-luo-en?text=Bed+gi+nyasi+mar+chieng%27+nyuol+mopong%27+gi+mor%21#model_card)
## Training procedure
Removed duplicates and rows with no alphabetic characters. Trained on a GPU.
## Eval results
| testset | BLEU |
| --- | --- |
| JW300.luo.en | 46.1 |
---
language: "en"
---
# BART-Squad2
## Model description
BART for extractive (span-based) question answering, trained on Squad 2.0.
F1 score of 87.4.
## Intended uses & limitations
Unfortunately, the Huggingface auto-inference API won't run this model, so if you're attempting to try it through the input box above and it complains, don't be discouraged!
#### How to use
Here's a quick way to get question answering running locally:
```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("Primer/bart-squad2")
model = AutoModelForQuestionAnswering.from_pretrained("Primer/bart-squad2")
model.to('cuda'); model.eval()

def answer(question, text):
    seq = '<s>' + question + ' </s> </s> ' + text + ' </s>'
    tokens = tokenizer.encode_plus(seq, return_tensors='pt', padding='max_length', max_length=1024)
    input_ids = tokens['input_ids'].to('cuda')
    attention_mask = tokens['attention_mask'].to('cuda')
    start, end, _ = model(input_ids, attention_mask=attention_mask)
    start_idx = int(start.argmax().int())
    end_idx = int(end.argmax().int())
    print(tokenizer.decode(input_ids[0, start_idx:end_idx]).strip())
    # ^^ it will be an empty string if the model decided "unanswerable"

>>> question = "Where does Tom live?"
>>> context = "Tom is an engineer in San Francisco."
>>> answer(question, context)
San Francisco
```
(Just drop the `.to('cuda')` stuff if running on CPU).
#### Limitations and bias
Unknown; no further evaluation has been performed. In a technical sense, one big limitation is that the model weighs in at 1.6 GB 😬
## Training procedure
`run_squad.py` with:
|param|value|
|---|---|
|batch size|8|
|max_seq_length|1024|
|learning rate|1e-5|
|epochs|2|
Modified to freeze shared parameters and encoder embeddings.
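A rough sketch of what freezing the shared embeddings and encoder embedding layers could look like for a BART model in transformers; this is an illustration of the idea, not the author's actual modification of `run_squad.py`.
```python
from transformers import BartForQuestionAnswering

model = BartForQuestionAnswering.from_pretrained("facebook/bart-large")

# Freeze the shared input/output embeddings and the encoder embedding layers
for param in model.model.shared.parameters():
    param.requires_grad = False
for param in model.model.encoder.embed_tokens.parameters():
    param.requires_grad = False
for param in model.model.encoder.embed_positions.parameters():
    param.requires_grad = False
```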
## 🔥 Model cards now live inside each huggingface.co model repo 🔥
For consistency, ease of use and scalability, `README.md` model cards now live directly inside each model repo on the HuggingFace model hub.
### How to update a model card
You can directly update a model card inside any model repo you have **write access** to, i.e.:
- a model under your username namespace
- a model under any organization you are a part of.
You can either:
- update it, commit and push using your usual git workflow (command line, GUI, etc.)
- or edit it directly from the website's UI.
**What if you want to create or update a model card for a model you don't have write access to?**
In that case, given that we don't have a Pull request system yet on huggingface.co (🤯),
you can open an issue here, post the card's content, and tag the model author(s) and/or the Hugging Face team.
We might implement a more seamless process at some point, so your early feedback is precious!
Please let us know if you have any suggestions.
### What happened to the model cards here?
We migrated every model card from the repo to its corresponding huggingface.co model repo. Individual commits were preserved, and they link back to the original commit on GitHub.
---
language: protein
tags:
- protein language model
datasets:
- Uniref100
---
# ProtBert model
Pretrained model on protein sequences using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://doi.org/10.1101/2020.07.12.199554) and first released in
[this repository](https://github.com/agemagician/ProtTrans). This model is trained on uppercase amino acids: it only works with capital letter amino acids.
## Model description
ProtBert is based on the BERT model and was pretrained on a large corpus of protein sequences in a self-supervised fashion.
This means it was pretrained on the raw protein sequences only, with no humans labelling them in any way (which is why it can use lots of
publicly available data), with an automatic process to generate inputs and labels from those protein sequences.
One important difference between our model and the original BERT version is the treatment of sequences as separate documents.
This means next sentence prediction is not used, as each sequence is treated as a complete document.
The masking follows the original BERT training, randomly masking 15% of the amino acids in the input.
In the end, the features extracted from this model revealed that the LM embeddings from unlabeled data (only protein sequences) capture important biophysical properties governing protein
shape.
This implies the model learned some of the grammar of the language of life as realized in protein sequences.
## Intended uses & limitations
The model can be used for protein feature extraction or fine-tuned on downstream tasks.
We have noticed that for some tasks you can gain more accuracy by fine-tuning the model rather than using it as a feature extractor.
### How to use
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import BertForMaskedLM, BertTokenizer, pipeline
>>> tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False )
>>> model = BertForMaskedLM.from_pretrained("Rostlab/prot_bert")
>>> unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
>>> unmasker('D L I P T S S K L V V [MASK] D T S L Q V K K A F F A L V T')
[{'score': 0.11088453233242035,
'sequence': '[CLS] D L I P T S S K L V V L D T S L Q V K K A F F A L V T [SEP]',
'token': 5,
'token_str': 'L'},
{'score': 0.08402521163225174,
'sequence': '[CLS] D L I P T S S K L V V S D T S L Q V K K A F F A L V T [SEP]',
'token': 10,
'token_str': 'S'},
{'score': 0.07328339666128159,
'sequence': '[CLS] D L I P T S S K L V V V D T S L Q V K K A F F A L V T [SEP]',
'token': 8,
'token_str': 'V'},
{'score': 0.06921856850385666,
'sequence': '[CLS] D L I P T S S K L V V K D T S L Q V K K A F F A L V T [SEP]',
'token': 12,
'token_str': 'K'},
{'score': 0.06382402777671814,
'sequence': '[CLS] D L I P T S S K L V V I D T S L Q V K K A F F A L V T [SEP]',
'token': 11,
'token_str': 'I'}]
```
Here is how to use this model to get the features of a given protein sequence in PyTorch:
```python
from transformers import BertModel, BertTokenizer
import re
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False )
model = BertModel.from_pretrained("Rostlab/prot_bert")
sequence_Example = "A E T C Z A O"
sequence_Example = re.sub(r"[UZOB]", "X", sequence_Example)
encoded_input = tokenizer(sequence_Example, return_tensors='pt')
output = model(**encoded_input)
```
## Training data
The ProtBert model was pretrained on [Uniref100](https://www.uniprot.org/downloads), a dataset consisting of 217 million protein sequences.
## Training procedure
### Preprocessing
The protein sequences are uppercased and tokenized using a single space and a vocabulary size of 21. The rare amino acids "U,Z,O,B" were mapped to "X".
The inputs of the model are then of the form:
```
[CLS] Protein Sequence A [SEP] Protein Sequence B [SEP]
```
Furthermore, each protein sequence was treated as a separate document.
The preprocessing step was performed twice, once for a combined length (2 sequences) of less than 512 amino acids, and another time using a combined length (2 sequences) of less than 2048 amino acids.
The details of the masking procedure for each sequence followed the original BERT model, as follows (a small illustration follows this list):
- 15% of the amino acids are masked.
- In 80% of the cases, the masked amino acids are replaced by `[MASK]`.
- In 10% of the cases, the masked amino acids are replaced by a random amino acid (different) from the one they replace.
- In the 10% remaining cases, the masked amino acids are left as is.
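Small illustration (not the authors' code) of this 80/10/10 masking rule applied to a whitespace-tokenized amino-acid sequence:
```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

def mask_sequence(tokens, mask_prob=0.15):
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok
            r = random.random()
            if r < 0.8:        # 80%: replace with [MASK]
                masked[i] = "[MASK]"
            elif r < 0.9:      # 10%: replace with a different random amino acid
                masked[i] = random.choice([a for a in AMINO_ACIDS if a != tok])
            # remaining 10%: keep the original amino acid
    return masked, labels

print(mask_sequence("D L I P T S S K L V V L D T S L Q".split()))
```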
### Pretraining
The model was trained on a single TPU Pod V3-512 for 400k steps in total.
300K steps using sequence length 512 (batch size 15k), and 100K steps using sequence length 2048 (batch size 2.5k).
The optimizer used is Lamb with a learning rate of 0.002, a weight decay of 0.01, learning rate warmup for 40k steps and linear decay of the learning rate after.
## Evaluation results
When fine-tuned on downstream tasks, this model achieves the following results:
Test results :
| Task/Dataset | secondary structure (3-states) | secondary structure (8-states) | Localization | Membrane |
|:-----:|:-----:|:-----:|:-----:|:-----:|
| CASP12 | 75 | 63 | | |
| TS115 | 83 | 72 | | |
| CB513 | 81 | 66 | | |
| DeepLoc | | | 79 | 91 |
### BibTeX entry and citation info
```bibtex
@article {Elnaggar2020.07.12.199554,
author = {Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Wang, Yu and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and BHOWMIK, DEBSINDHU and Rost, Burkhard},
title = {ProtTrans: Towards Cracking the Language of Life{\textquoteright}s Code Through Self-Supervised Deep Learning and High Performance Computing},
elocation-id = {2020.07.12.199554},
year = {2020},
doi = {10.1101/2020.07.12.199554},
publisher = {Cold Spring Harbor Laboratory},
abstract = {Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive language models (Transformer-XL, XLNet) and two auto-encoder models (Bert, Albert) on data from UniRef and BFD containing up to 393 billion amino acids (words) from 2.1 billion protein sequences (22- and 112 times the entire English Wikipedia). The LMs were trained on the Summit supercomputer at Oak Ridge National Laboratory (ORNL), using 936 nodes (total 5616 GPUs) and one TPU Pod (V3-512 or V3-1024). We validated the advantage of up-scaling LMs to larger models supported by bigger data by predicting secondary structure (3-states: Q3=76-84, 8 states: Q8=65-73), sub-cellular localization for 10 cellular compartments (Q10=74) and whether a protein is membrane-bound or water-soluble (Q2=89). Dimensionality reduction revealed that the LM-embeddings from unlabeled data (only protein sequences) captured important biophysical properties governing protein shape. This implied learning some of the grammar of the language of life realized in protein sequences. The successful up-scaling of protein LMs through HPC to larger data sets slightly reduced the gap between models trained on evolutionary information and LMs. Availability ProtTrans: https://github.com/agemagician/ProtTrans. Competing Interest Statement: The authors have declared no competing interest.},
URL = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554},
eprint = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554.full.pdf},
journal = {bioRxiv}
}
```
> Created by [Ahmed Elnaggar/@Elnaggar_AI](https://twitter.com/Elnaggar_AI) | [LinkedIn](https://www.linkedin.com/in/prof-ahmed-elnaggar/)
---
language: protein
tags:
- protein language model
datasets:
- BFD
---
# ProtBert-BFD model
Pretrained model on protein sequences using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://doi.org/10.1101/2020.07.12.199554) and first released in
[this repository](https://github.com/agemagician/ProtTrans). This model is trained on uppercase amino acids: it only works with capital letter amino acids.
## Model description
ProtBert-BFD is based on the BERT model and was pretrained on a large corpus of protein sequences in a self-supervised fashion.
This means it was pretrained on the raw protein sequences only, with no humans labelling them in any way (which is why it can use lots of
publicly available data), with an automatic process to generate inputs and labels from those protein sequences.
One important difference between our model and the original BERT version is the treatment of sequences as separate documents.
This means next sentence prediction is not used, as each sequence is treated as a complete document.
The masking follows the original BERT training, randomly masking 15% of the amino acids in the input.
In the end, the features extracted from this model revealed that the LM embeddings from unlabeled data (only protein sequences) capture important biophysical properties governing protein
shape.
This implies the model learned some of the grammar of the language of life as realized in protein sequences.
## Intended uses & limitations
The model can be used for protein feature extraction or fine-tuned on downstream tasks.
We have noticed that for some tasks you can gain more accuracy by fine-tuning the model rather than using it as a feature extractor.
### How to use
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import BertForMaskedLM, BertTokenizer, pipeline
>>> tokenizer = BertTokenizer.from_pretrained('Rostlab/prot_bert_bfd', do_lower_case=False )
>>> model = BertForMaskedLM.from_pretrained("Rostlab/prot_bert_bfd")
>>> unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
>>> unmasker('D L I P T S S K L V V [MASK] D T S L Q V K K A F F A L V T')
[{'score': 0.1165614128112793,
'sequence': '[CLS] D L I P T S S K L V V L D T S L Q V K K A F F A L V T [SEP]',
'token': 5,
'token_str': 'L'},
{'score': 0.08976086974143982,
'sequence': '[CLS] D L I P T S S K L V V V D T S L Q V K K A F F A L V T [SEP]',
'token': 8,
'token_str': 'V'},
{'score': 0.08864385634660721,
'sequence': '[CLS] D L I P T S S K L V V S D T S L Q V K K A F F A L V T [SEP]',
'token': 10,
'token_str': 'S'},
{'score': 0.06227643042802811,
'sequence': '[CLS] D L I P T S S K L V V A D T S L Q V K K A F F A L V T [SEP]',
'token': 6,
'token_str': 'A'},
{'score': 0.06194969266653061,
'sequence': '[CLS] D L I P T S S K L V V T D T S L Q V K K A F F A L V T [SEP]',
'token': 15,
'token_str': 'T'}]
```
Here is how to use this model to get the features of a given protein sequence in PyTorch:
```python
from transformers import BertModel, BertTokenizer
import re
tokenizer = BertTokenizer.from_pretrained('Rostlab/prot_bert_bfd', do_lower_case=False )
model = BertModel.from_pretrained("Rostlab/prot_bert_bfd")
sequence_Example = "A E T C Z A O"
sequence_Example = re.sub(r"[UZOB]", "X", sequence_Example)
encoded_input = tokenizer(sequence_Example, return_tensors='pt')
output = model(**encoded_input)
```
## Training data
The ProtBert-BFD model was pretrained on [BFD](https://bfd.mmseqs.com/), a dataset consisting of 2.1 billion protein sequences.
## Training procedure
### Preprocessing
The protein sequences are uppercased and tokenized using a single space and a vocabulary size of 21.
The inputs of the model are then of the form:
```
[CLS] Protein Sequence A [SEP] Protein Sequence B [SEP]
```
Furthermore, each protein sequence was treated as a separate document.
The preprocessing step was performed twice, once for a combined length (2 sequences) of less than 512 amino acids, and another time using a combined length (2 sequences) of less than 2048 amino acids.
The details of the masking procedure for each sequence followed the original BERT model, as follows:
- 15% of the amino acids are masked.
- In 80% of the cases, the masked amino acids are replaced by `[MASK]`.
- In 10% of the cases, the masked amino acids are replaced by a random amino acid (different) from the one they replace.
- In the 10% remaining cases, the masked amino acids are left as is.
### Pretraining
The model was trained on a single TPU Pod V3-1024 for one million steps in total.
800k steps using sequence length 512 (batch size 32k), and 200K steps using sequence length 2048 (batch size 6k).
The optimizer used is Lamb with a learning rate of 0.002, a weight decay of 0.01, learning rate warmup for 140k steps and linear decay of the learning rate after.
## Evaluation results
When fine-tuned on downstream tasks, this model achieves the following results:
Test results :
| Task/Dataset | secondary structure (3-states) | secondary structure (8-states) | Localization | Membrane |
|:-----:|:-----:|:-----:|:-----:|:-----:|
| CASP12 | 76 | 65 | | |
| TS115 | 84 | 73 | | |
| CB513 | 83 | 70 | | |
| DeepLoc | | | 78 | 91 |
### BibTeX entry and citation info
```bibtex
@article {Elnaggar2020.07.12.199554,
author = {Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Wang, Yu and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and BHOWMIK, DEBSINDHU and Rost, Burkhard},
title = {ProtTrans: Towards Cracking the Language of Life{\textquoteright}s Code Through Self-Supervised Deep Learning and High Performance Computing},
elocation-id = {2020.07.12.199554},
year = {2020},
doi = {10.1101/2020.07.12.199554},
publisher = {Cold Spring Harbor Laboratory},
abstract = {Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive language models (Transformer-XL, XLNet) and two auto-encoder models (Bert, Albert) on data from UniRef and BFD containing up to 393 billion amino acids (words) from 2.1 billion protein sequences (22- and 112 times the entire English Wikipedia). The LMs were trained on the Summit supercomputer at Oak Ridge National Laboratory (ORNL), using 936 nodes (total 5616 GPUs) and one TPU Pod (V3-512 or V3-1024). We validated the advantage of up-scaling LMs to larger models supported by bigger data by predicting secondary structure (3-states: Q3=76-84, 8 states: Q8=65-73), sub-cellular localization for 10 cellular compartments (Q10=74) and whether a protein is membrane-bound or water-soluble (Q2=89). Dimensionality reduction revealed that the LM-embeddings from unlabeled data (only protein sequences) captured important biophysical properties governing protein shape. This implied learning some of the grammar of the language of life realized in protein sequences. The successful up-scaling of protein LMs through HPC to larger data sets slightly reduced the gap between models trained on evolutionary information and LMs. Availability ProtTrans: https://github.com/agemagician/ProtTrans. Competing Interest Statement: The authors have declared no competing interest.},
URL = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554},
eprint = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554.full.pdf},
journal = {bioRxiv}
}
```
> Created by [Ahmed Elnaggar/@Elnaggar_AI](https://twitter.com/Elnaggar_AI) | [LinkedIn](https://www.linkedin.com/in/prof-ahmed-elnaggar/)
---
language: protein
tags:
- protein language model
datasets:
- BFD
---
# ProtT5-XL-BFD model
Pretrained model on protein sequences using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://doi.org/10.1101/2020.07.12.199554) and first released in
[this repository](https://github.com/agemagician/ProtTrans). This model is trained on uppercase amino acids: it only works with capital letter amino acids.
## Model description
ProtT5-XL-BFD is based on the `t5-3b` model and was pretrained on a large corpus of protein sequences in a self-supervised fashion.
This means it was pretrained on the raw protein sequences only, with no humans labelling them in any way (which is why it can use lots of
publicly available data) with an automatic process to generate inputs and labels from those protein sequences.
One important difference between this T5 model and the original T5 version is the denoising objective.
The original T5-3B model was pretrained using a span denoising objective, while this model was pre-trained with a BART-like MLM denoising objective.
The masking probability is consistent with the original T5 training, randomly masking 15% of the amino acids in the input.
It has been shown that the features extracted from this self-supervised model (LM embeddings) capture important biophysical properties governing protein shape.
This implies the model learned some of the grammar of the language of life as realized in protein sequences.
## Intended uses & limitations
The model can be used for protein feature extraction or fine-tuned on downstream tasks.
We have noticed that for some tasks one can gain more accuracy by fine-tuning the model rather than using it as a feature extractor.
We have also noticed that for feature extraction it is better to use the features extracted from the encoder rather than from the decoder.
### How to use
Here is how to use this model to extract the features of a given protein sequence in PyTorch:
```python
from transformers import T5Tokenizer, T5Model
import re
import torch
tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_bfd', do_lower_case=False)
model = T5Model.from_pretrained("Rostlab/prot_t5_xl_bfd")
sequences_Example = ["A E T C Z A O","S K T Z P"]
sequences_Example = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences_Example]
ids = tokenizer.batch_encode_plus(sequences_Example, add_special_tokens=True, padding=True)
input_ids = torch.tensor(ids['input_ids'])
attention_mask = torch.tensor(ids['attention_mask'])
with torch.no_grad():
    embedding = model(input_ids=input_ids, attention_mask=attention_mask, decoder_input_ids=None)
# For feature extraction we recommend to use the encoder embedding
encoder_embedding = embedding[2].cpu().numpy()
decoder_embedding = embedding[0].cpu().numpy()
```
## Training data
The ProtT5-XL-BFD model was pretrained on [BFD](https://bfd.mmseqs.com/), a dataset consisting of 2.1 billion protein sequences.
## Training procedure
### Preprocessing
The protein sequences are uppercased and tokenized using a single space and a vocabulary size of 21. The rare amino acids "U,Z,O,B" were mapped to "X".
The inputs of the model are then of the form:
```
Protein Sequence [EOS]
```
The preprocessing step was performed on the fly, by cutting and padding the protein sequences up to 512 tokens.
The details of the masking procedure for each sequence are as follows:
- 15% of the amino acids are masked.
- In 90% of the cases, the masked amino acids are replaced by `[MASK]` token.
- In 10% of the cases, the masked amino acids are replaced by a random amino acid (different) from the one they replace.
### Pretraining
The model was trained on a single TPU Pod V3-1024 for 1.2 million steps in total, using sequence length 512 (batch size 4k).
It has a total of approximately 3B parameters and was trained using the encoder-decoder architecture.
The optimizer used is AdaFactor with inverse square root learning rate schedule for pre-training.
## Evaluation results
When the model is used for feature extraction, it achieves the following results:
Test results :
| Task/Dataset | secondary structure (3-states) | secondary structure (8-states) | Localization | Membrane |
|:-----:|:-----:|:-----:|:-----:|:-----:|
| CASP12 | 77 | 66 | | |
| TS115 | 85 | 74 | | |
| CB513 | 84 | 71 | | |
| DeepLoc | | | 77 | 91 |
### BibTeX entry and citation info
```bibtex
@article {Elnaggar2020.07.12.199554,
author = {Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Wang, Yu and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and BHOWMIK, DEBSINDHU and Rost, Burkhard},
title = {ProtTrans: Towards Cracking the Language of Life{\textquoteright}s Code Through Self-Supervised Deep Learning and High Performance Computing},
elocation-id = {2020.07.12.199554},
year = {2020},
doi = {10.1101/2020.07.12.199554},
publisher = {Cold Spring Harbor Laboratory},
abstract = {Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive language models (Transformer-XL, XLNet) and two auto-encoder models (Bert, Albert) on data from UniRef and BFD containing up to 393 billion amino acids (words) from 2.1 billion protein sequences (22- and 112 times the entire English Wikipedia). The LMs were trained on the Summit supercomputer at Oak Ridge National Laboratory (ORNL), using 936 nodes (total 5616 GPUs) and one TPU Pod (V3-512 or V3-1024). We validated the advantage of up-scaling LMs to larger models supported by bigger data by predicting secondary structure (3-states: Q3=76-84, 8 states: Q8=65-73), sub-cellular localization for 10 cellular compartments (Q10=74) and whether a protein is membrane-bound or water-soluble (Q2=89). Dimensionality reduction revealed that the LM-embeddings from unlabeled data (only protein sequences) captured important biophysical properties governing protein shape. This implied learning some of the grammar of the language of life realized in protein sequences. The successful up-scaling of protein LMs through HPC to larger data sets slightly reduced the gap between models trained on evolutionary information and LMs. Availability ProtTrans: https://github.com/agemagician/ProtTrans. Competing Interest Statement: The authors have declared no competing interest.},
URL = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554},
eprint = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554.full.pdf},
journal = {bioRxiv}
}
```
> Created by [Ahmed Elnaggar/@Elnaggar_AI](https://twitter.com/Elnaggar_AI) | [LinkedIn](https://www.linkedin.com/in/prof-ahmed-elnaggar/)
---
language: hu
license: apache-2.0
datasets:
- common_crawl
- wikipedia
---
# huBERT base model (cased)
## Model description
Cased BERT model for Hungarian, trained on the (filtered, deduplicated) Hungarian subset of the Common Crawl and a snapshot of the Hungarian Wikipedia.
## Intended uses & limitations
The model can be used as any other (cased) BERT model. It has been tested on the chunking and
named entity recognition tasks and set a new state-of-the-art on the former.
## Training
Details of the training data and procedure can be found in the PhD thesis linked below. (With the caveat that it only contains preliminary results
based on the Wikipedia subcorpus. Evaluation of the full model will appear in a future paper.)
## Eval results
When fine-tuned (via `BertForTokenClassification`) on chunking and NER, the model outperforms multilingual BERT, achieves state-of-the-art results on the
former task, and comes within 0.5% F1 of the SotA on the latter. The exact scores are:
| NER | Minimal NP | Maximal NP |
|-----|------------|------------|
| 97.62% | **97.14%** | **96.97%** |
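A hedged sketch of loading the model for token classification as mentioned above; the repo id and the number of labels are assumptions, so substitute the actual values for your setup.
```python
from transformers import BertTokenizerFast, BertForTokenClassification

model_id = "SZTAKI-HLT/hubert-base-cc"  # assumed repo id; substitute if different
tokenizer = BertTokenizerFast.from_pretrained(model_id)
model = BertForTokenClassification.from_pretrained(model_id, num_labels=9)  # e.g. CoNLL-style NER tag set

inputs = tokenizer("Budapest Magyarország fővárosa.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch, sequence length, num_labels)
```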
### BibTeX entry and citation info
The training corpus, parameters and the evaluation methods are discussed in the
[following PhD thesis](https://hlt.bme.hu/en/publ/nemeskey_2020):
```bibtex
@PhDThesis{ Nemeskey:2020,
author = {Nemeskey, Dávid Márk},
title = {Natural Language Processing Methods for Language Modeling},
year = {2020},
school = {E\"otv\"os Lor\'and University}
}
```
# Roberta Large STS-B
This model is a fine-tuned RoBERTa model on STS-B.
It was trained with these parameters:
```bash
!python /content/transformers/examples/text-classification/run_glue.py \
--model_type roberta \
--model_name_or_path roberta-large \
--task_name STS-B \
--do_train \
--do_eval \
--do_lower_case \
--data_dir /content/glue_data/STS-B/ \
--max_seq_length 128 \
--per_gpu_eval_batch_size=8 \
--per_gpu_train_batch_size=8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /content/roberta-sts-b
```
## How to run
```python
import toolz
import torch

# Note: `model` and `tokenizer` are assumed to have been loaded beforehand
# (a sequence-classification RoBERTa fine-tuned on STS-B and its tokenizer).
batch_size = 6

def roberta_similarity_batches(to_predict):
    batches = toolz.partition(batch_size, to_predict)
    similarity_scores = []
    for batch in batches:
        sentences = [(sentence_similarity["sent1"], sentence_similarity["sent2"]) for sentence_similarity in batch]
        batch_scores = similarity_roberta(model, tokenizer, sentences)
        similarity_scores = similarity_scores + batch_scores[0].cpu().squeeze(axis=1).tolist()
    return similarity_scores

def similarity_roberta(model, tokenizer, sent_pairs):
    batch_token = tokenizer(sent_pairs, padding='max_length', truncation=True, max_length=500)
    res = model(torch.tensor(batch_token['input_ids']).cuda(), attention_mask=torch.tensor(batch_token["attention_mask"]).cuda())
    return res

similarity_roberta(model, tokenizer, [('NEW YORK--(BUSINESS WIRE)--Rosen Law Firm, a global investor rights law firm, announces it is investigating potential securities claims on behalf of shareholders of Vale S.A. ( VALE ) resulting from allegations that Vale may have issued materially misleading business information to the investing public',
                                      'EQUITY ALERT: Rosen Law Firm Announces Investigation of Securities Claims Against Vale S.A. – VALE')])
```
---
language: de
license: mit
---
# bert-german-dbmdz-uncased-sentence-stsb
**This model is outdated!**
The new [T-Systems-onsite/cross-en-de-roberta-sentence-transformer](https://huggingface.co/T-Systems-onsite/cross-en-de-roberta-sentence-transformer) model is better for the German language. It is also the current best model for the English language and works cross-lingually. Please consider using that model.
---
language:
- de
- en
license: mit
tags:
- sentence_embedding
- search
- pytorch
- xlm-roberta
- roberta
- xlm-r-distilroberta-base-paraphrase-v1
- paraphrase
datasets:
- STSbenchmark
metrics:
- Spearman’s rank correlation
- cosine similarity
---
# Cross English & German RoBERTa for Sentence Embeddings
This model is intended to [compute sentence (text) embeddings](https://www.sbert.net/docs/usage/computing_sentence_embeddings.html) for English and German text. These embeddings can then be compared with [cosine-similarity](https://en.wikipedia.org/wiki/Cosine_similarity) to find sentences with a similar semantic meaning. For example, this can be useful for [semantic textual similarity](https://www.sbert.net/docs/usage/semantic_textual_similarity.html), [semantic search](https://www.sbert.net/docs/usage/semantic_search.html), or [paraphrase mining](https://www.sbert.net/docs/usage/paraphrase_mining.html). To do this, you have to use the [Sentence Transformers Python framework](https://github.com/UKPLab/sentence-transformers).
The speciality of this model is that it also works cross-lingually. Regardless of the language, sentences are mapped to very similar vectors according to their semantics. This means that you can, for example, enter a search query in German and find results that match its semantics in both German and English. Using an XLM-R model and _multilingual finetuning with language-crossing_, we reach performance that even exceeds the best current dedicated English large model (see the Evaluation section below).
> Sentence-BERT (SBERT) is a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity. This reduces the effort for finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy from BERT.
Source: [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084)
This model was fine-tuned by [Philip May](https://eniak.de/) and open-sourced by [T-Systems-onsite](https://www.t-systems-onsite.de/). Special thanks to [Nils Reimers](https://www.nils-reimers.de/) for his awesome open-source work, the Sentence Transformers framework, the models, and his help on GitHub.
## How to use
**The usage description provided by Hugging Face above is wrong for sentence embeddings! Please use the following instead:**
To use this model, install the `sentence-transformers` package (see here: <https://github.com/UKPLab/sentence-transformers>).
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('T-Systems-onsite/cross-en-de-roberta-sentence-transformer')
```
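A quick way to sanity-check the embeddings is to compare an English and a German sentence directly with cosine similarity (a minimal sketch; the example sentences are arbitrary):
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('T-Systems-onsite/cross-en-de-roberta-sentence-transformer')

# An English sentence and its German translation should map to nearby vectors.
sentences = [
    "This is an example sentence.",
    "Das ist ein Beispielsatz.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)
print(util.pytorch_cos_sim(embeddings[0], embeddings[1]))
```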
For details of usage and examples see here:
- [Computing Sentence Embeddings](https://www.sbert.net/docs/usage/computing_sentence_embeddings.html)
- [Semantic Textual Similarity](https://www.sbert.net/docs/usage/semantic_textual_similarity.html)
- [Paraphrase Mining](https://www.sbert.net/docs/usage/paraphrase_mining.html)
- [Semantic Search](https://www.sbert.net/docs/usage/semantic_search.html)
- [Cross-Encoders](https://www.sbert.net/docs/usage/cross-encoder.html)
- [Examples on GitHub](https://github.com/UKPLab/sentence-transformers/tree/master/examples)
## Training
The base model is [xlm-roberta-base](https://huggingface.co/xlm-roberta-base). This model has been further trained by [Nils Reimers](https://www.nils-reimers.de/) on a large-scale paraphrase dataset for 50+ languages. [Nils Reimers](https://www.nils-reimers.de/) wrote about this [on GitHub](https://github.com/UKPLab/sentence-transformers/issues/509#issuecomment-712243280):
>A paper is upcoming for the paraphrase models.
>
>These models were trained on various datasets with millions of examples for paraphrases, mainly derived from Wikipedia edit logs, paraphrases mined from Wikipedia and SimpleWiki, paraphrases from news reports, AllNLI-entailment pairs with in-batch-negative loss, etc.
>
>In internal tests, they perform much better than the NLI+STSb models as they have seen more and broader types of training data. NLI+STSb has the issue that it is rather narrow in its domain and does not contain any domain-specific words / sentences (like from chemistry, computer science, math, etc.). The paraphrase models have seen plenty of sentences from various domains.
>
>More details with the setup, all the datasets, and a wider evaluation will follow soon.
The resulting model called `xlm-r-distilroberta-base-paraphrase-v1` has been released here: <https://github.com/UKPLab/sentence-transformers/releases/tag/v0.3.8>
Building on this cross-language model, we fine-tuned it for English and German on the [STSbenchmark](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) dataset. For German, we used our [German STSbenchmark dataset](https://github.com/t-systems-on-site-services-gmbh/german-STSbenchmark), which has been translated with [deepl.com](https://www.deepl.com/translator). In addition to the German and English training samples, we generated samples of English and German crossed. We call this _multilingual finetuning with language-crossing_. It doubled the training data size, and tests show that it further improves performance.
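One plausible reading of this crossing step, sketched below with a hypothetical helper (not the actual preprocessing code): each English pair and its German translation are recombined so that the first sentence of one language is paired with the second sentence of the other language, keeping the gold score.
```python
def cross_language_pairs(en_pairs, de_pairs):
    """Illustrative sketch of EN/DE language-crossing.

    en_pairs, de_pairs: aligned lists of (sent1, sent2, score) triples,
    where de_pairs are the translations of en_pairs.
    """
    crossed = []
    for (en1, en2, score), (de1, de2, _) in zip(en_pairs, de_pairs):
        crossed.append((en1, de2, score))  # EN-DE pair
        crossed.append((de1, en2, score))  # DE-EN pair
    return crossed
```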
We ran an automatic hyperparameter search over 33 trials with [Optuna](https://github.com/optuna/optuna). Using 10-fold cross-validation on the deepl.com test and dev datasets, we found the following best hyperparameters:
- batch_size = 8
- num_epochs = 2
- lr = 1.026343323298136e-05
- eps = 4.462251033010287e-06
- weight_decay = 0.04794438776350409
- warmup_steps_proportion = 0.1609010732760181
The final model was trained with these hyperparameters on the combined train and dev datasets of English, German, and their crossings. The test set was reserved for testing.
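A rough sketch of what such a fine-tuning run could look like with the Sentence Transformers API, using the hyperparameters above (the training examples below are placeholders, not the actual data pipeline):
```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder pairs; the real data are the English, German and crossed
# STSbenchmark train+dev pairs described above, with labels scaled to [0, 1].
train_examples = [
    InputExample(texts=["A man is playing a guitar.", "Ein Mann spielt Gitarre."], label=0.95),
    InputExample(texts=["A plane is taking off.", "Ein Mann spielt Trompete."], label=0.05),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)

model = SentenceTransformer("xlm-r-distilroberta-base-paraphrase-v1")
train_loss = losses.CosineSimilarityLoss(model)

num_epochs = 2
warmup_steps = int(0.1609010732760181 * len(train_dataloader) * num_epochs)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    weight_decay=0.04794438776350409,
    optimizer_params={"lr": 1.026343323298136e-05, "eps": 4.462251033010287e-06},
)
```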
## Evaluation
The evaluation has been done on English, German, and both languages crossed, using the STSbenchmark test data. The evaluation code is available on [Colab](https://colab.research.google.com/drive/1gtGnKq_dYU_sDYqMohTYVMVpxMJjyH0M?usp=sharing). As the evaluation metric, we use Spearman’s rank correlation between the cosine similarity of the sentence embeddings and the STSbenchmark labels.
| Model Name | Spearman<br/>German | Spearman<br/>English | Spearman<br/>EN-DE & DE-EN<br/>(cross) |
|---------------------------------------------------------------|-------------------|--------------------|------------------|
| xlm-r-distilroberta-base-paraphrase-v1 | 0.8079 | 0.8350 | 0.7983 |
| [xlm-r-100langs-bert-base-nli-stsb-mean-tokens](https://huggingface.co/sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens) | 0.7877 | 0.8465 | 0.7908 |
| xlm-r-bert-base-nli-stsb-mean-tokens | 0.7877 | 0.8465 | 0.7908 |
| [roberta-large-nli-stsb-mean-tokens](https://huggingface.co/sentence-transformers/roberta-large-nli-stsb-mean-tokens) | 0.6371 | 0.8639 | 0.4109 |
| [T-Systems-onsite/<br/>german-roberta-sentence-transformer-v2](https://huggingface.co/T-Systems-onsite/german-roberta-sentence-transformer-v2) | 0.8529 | 0.8634 | 0.8415 |
| **T-Systems-onsite/<br/>cross-en-de-roberta-sentence-transformer** | **0.8550** | **0.8660** | **0.8525** |
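For reference, the Spearman figures above boil down to the following computation, shown here as a minimal sketch with placeholder pairs and gold labels (the full evaluation uses the STSbenchmark test split, as in the linked Colab):
```python
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

# Placeholder test pairs and gold labels; the real evaluation uses the STSbenchmark test data.
sentences1 = ["A plane is taking off.", "A man is playing a flute."]
sentences2 = ["Ein Flugzeug hebt ab.", "Ein Mann spielt Trompete."]
gold_scores = [5.0, 2.0]

model = SentenceTransformer("T-Systems-onsite/cross-en-de-roberta-sentence-transformer")
emb1 = model.encode(sentences1, convert_to_tensor=True)
emb2 = model.encode(sentences2, convert_to_tensor=True)

# Cosine similarity of corresponding pairs, then Spearman's rank correlation with the labels.
cosine_scores = util.pytorch_cos_sim(emb1, emb2).diagonal().cpu().numpy()
rho, _ = spearmanr(cosine_scores, gold_scores)
print(rho)
```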