Unverified Commit 3552d0e0 authored by Julien Chaumond's avatar Julien Chaumond Committed by GitHub

[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)



* rm all model cards

* Update the .rst

@sgugger it is still not super crystal clear/streamlined, so let me know if you have any ideas to make it simpler

* Add a root-level README.md with simple instructions/context

* Update docs/source/model_sharing.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* make style

* rm all model cards
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
---
language:
- hi
- en
datasets:
- lince
license: mit
tags:
- codeswitching
- hindi-english
- language-identification
---
# codeswitch-hineng-lid-lince
This is a pretrained model for **language identification** of `hindi-english` code-mixed data, trained on the [LinCE](https://ritual.uh.edu/lince/home) dataset.
This model was trained for the repository below:
[https://github.com/sagorbrur/codeswitch](https://github.com/sagorbrur/codeswitch)
To install codeswitch:
```
pip install codeswitch
```
## Identify Language
* **Method-1**
```py
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
tokenizer = AutoTokenizer.from_pretrained("sagorsarker/codeswitch-hineng-lid-lince")
model = AutoModelForTokenClassification.from_pretrained("sagorsarker/codeswitch-hineng-lid-lince")
lid_model = pipeline('ner', model=model, tokenizer=tokenizer)
lid_model("put any hindi english code-mixed sentence")
```
* **Method-2**
```py
from codeswitch.codeswitch import LanguageIdentification
lid = LanguageIdentification('hin-eng')
text = "" # your code-mixed sentence
result = lid.identify(text)
print(result)
```
---
language:
- hi
- en
datasets:
- lince
license: mit
tags:
- codeswitching
- hindi-english
- ner
---
# codeswitch-hineng-ner-lince
This is a pretrained model for **Named Entity Recognition** of `hindi-english` code-mixed data, trained on the [LinCE](https://ritual.uh.edu/lince/home) dataset.
This model was trained for the repository below:
[https://github.com/sagorbrur/codeswitch](https://github.com/sagorbrur/codeswitch)
To install codeswitch:
```
pip install codeswitch
```
## Named Entity Recognition of Code-Mixed Data
* **Method-1**
```py
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
tokenizer = AutoTokenizer.from_pretrained("sagorsarker/codeswitch-hineng-ner-lince")
model = AutoModelForTokenClassification.from_pretrained("sagorsarker/codeswitch-hineng-ner-lince")
ner_model = pipeline('ner', model=model, tokenizer=tokenizer)
ner_model("put any hindi english code-mixed sentence")
```
* **Method-2**
```py
from codeswitch.codeswitch import NER
ner = NER('hin-eng')
text = "" # your mixed sentence
result = ner.tag(text)
print(result)
```
---
language:
- hi
- en
datasets:
- lince
license: mit
tags:
- codeswitching
- hindi-english
- pos
---
# codeswitch-hineng-pos-lince
This is a pretrained model for **Part-of-Speech Tagging** of `hindi-english` code-mixed data, trained on the [LinCE](https://ritual.uh.edu/lince/home) dataset.
This model was trained for the repository below:
[https://github.com/sagorbrur/codeswitch](https://github.com/sagorbrur/codeswitch)
To install codeswitch:
```
pip install codeswitch
```
## Part-of-Speech Tagging of Hindi-English Mixed Data
* **Method-1**
```py
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
tokenizer = AutoTokenizer.from_pretrained("sagorsarker/codeswitch-hineng-pos-lince")
model = AutoModelForTokenClassification.from_pretrained("sagorsarker/codeswitch-hineng-pos-lince")
pos_model = pipeline('ner', model=model, tokenizer=tokenizer)
pos_model("put any hindi english code-mixed sentence")
```
* **Method-2**
```py
from codeswitch.codeswitch import POS
pos = POS('hin-eng')
text = "" # your mixed sentence
result = pos.tag(text)
print(result)
```
---
language:
- ne
- en
datasets:
- lince
license: mit
tags:
- codeswitching
- nepali-english
- language-identification
---
# codeswitch-nepeng-lid-lince
This is a pretrained model for **language identification** of `nepali-english` code-mixed data, trained on the [LinCE](https://ritual.uh.edu/lince/home) dataset.
This model was trained for the repository below:
[https://github.com/sagorbrur/codeswitch](https://github.com/sagorbrur/codeswitch)
To install codeswitch:
```
pip install codeswitch
```
## Identify Language
* **Method-1**
```py
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
tokenizer = AutoTokenizer.from_pretrained("sagorsarker/codeswitch-nepeng-lid-lince")
model = AutoModelForTokenClassification.from_pretrained("sagorsarker/codeswitch-nepeng-lid-lince")
lid_model = pipeline('ner', model=model, tokenizer=tokenizer)
lid_model("put any nepali english code-mixed sentence")
```
* **Method-2**
```py
from codeswitch.codeswitch import LanguageIdentification
lid = LanguageIdentification('nep-eng')
text = "" # your code-mixed sentence
result = lid.identify(text)
print(result)
```
---
language:
- es
- en
datasets:
- lince
license: mit
tags:
- codeswitching
- spanish-english
- language-identification
---
# codeswitch-spaeng-lid-lince
This is a pretrained model for **language identification** of `spanish-english` code-mixed data, trained on the [LinCE](https://ritual.uh.edu/lince/home) dataset.
This model was trained for the repository below:
[https://github.com/sagorbrur/codeswitch](https://github.com/sagorbrur/codeswitch)
To install codeswitch:
```
pip install codeswitch
```
## Identify Language
* **Method-1**
```py
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
tokenizer = AutoTokenizer.from_pretrained("sagorsarker/codeswitch-spaeng-lid-lince")
model = AutoModelForTokenClassification.from_pretrained("sagorsarker/codeswitch-spaeng-lid-lince")
lid_model = pipeline('ner', model=model, tokenizer=tokenizer)
lid_model("put any spanish english code-mixed sentence")
```
* **Method-2**
```py
from codeswitch.codeswitch import LanguageIdentification
lid = LanguageIdentification('spa-eng')
text = "" # your code-mixed sentence
result = lid.identify(text)
print(result)
```
---
language:
- es
- en
datasets:
- lince
license: mit
tags:
- codeswitching
- spanish-english
- ner
---
# codeswitch-spaeng-ner-lince
This is a pretrained model for **Named Entity Recognition** of `spanish-english` code-mixed data, trained on the [LinCE](https://ritual.uh.edu/lince/home) dataset.
This model was trained for the repository below:
[https://github.com/sagorbrur/codeswitch](https://github.com/sagorbrur/codeswitch)
To install codeswitch:
```
pip install codeswitch
```
## Named Entity Recognition of Spanish-English Mixed Data
* **Method-1**
```py
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
tokenizer = AutoTokenizer.from_pretrained("sagorsarker/codeswitch-spaeng-ner-lince")
model = AutoModelForTokenClassification.from_pretrained("sagorsarker/codeswitch-spaeng-ner-lince")
ner_model = pipeline('ner', model=model, tokenizer=tokenizer)
ner_model("put any spanish english code-mixed sentence")
```
* **Method-2**
```py
from codeswitch.codeswitch import NER
ner = NER('spa-eng')
text = "" # your mixed sentence
result = ner.tag(text)
print(result)
```
---
language:
- es
- en
datasets:
- lince
license: mit
tags:
- codeswitching
- spanish-english
- pos
---
# codeswitch-spaeng-pos-lince
This is a pretrained model for **Part-of-Speech Tagging** of `spanish-english` code-mixed data, trained on the [LinCE](https://ritual.uh.edu/lince/home) dataset.
This model was trained for the repository below:
[https://github.com/sagorbrur/codeswitch](https://github.com/sagorbrur/codeswitch)
To install codeswitch:
```
pip install codeswitch
```
## Part-of-Speech Tagging of Spanish-English Mixed Data
* **Method-1**
```py
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
tokenizer = AutoTokenizer.from_pretrained("sagorsarker/codeswitch-spaeng-pos-lince")
model = AutoModelForTokenClassification.from_pretrained("sagorsarker/codeswitch-spaeng-pos-lince")
pos_model = pipeline('ner', model=model, tokenizer=tokenizer)
pos_model("put any spanish english code-mixed sentence")
```
* **Method-2**
```py
from codeswitch.codeswitch import POS
pos = POS('spa-eng')
text = "" # your mixed sentence
result = pos.tag(text)
print(result)
```
---
language:
- es
- en
datasets:
- lince
license: mit
tags:
- codeswitching
- spanish-english
- sentiment-analysis
---
# codeswitch-spaeng-sentiment-analysis-lince
This is a pretrained model for **Sentiment Analysis** of `spanish-english` code-mixed data, trained on the [LinCE](https://ritual.uh.edu/lince/home) dataset.
This model was trained for the repository below:
[https://github.com/sagorbrur/codeswitch](https://github.com/sagorbrur/codeswitch)
To install codeswitch:
```
pip install codeswitch
```
## Sentiment Analysis of Spanish-English Code-Mixed Data
* **Method-1**
```py
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
tokenizer = AutoTokenizer.from_pretrained("sagorsarker/codeswitch-spaeng-sentiment-analysis-lince")
model = AutoModelForSequenceClassification.from_pretrained("sagorsarker/codeswitch-spaeng-sentiment-analysis-lince")
nlp = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
sentence = "El perro le ladraba a La Gatita .. .. lol #teamlagatita en las playas de Key Biscayne este Memorial day"
nlp(sentence)
```
* **Method-2**
```py
from codeswitch.codeswitch import SentimentAnalysis
sa = SentimentAnalysis('spa-eng')
sentence = "El perro le ladraba a La Gatita .. .. lol #teamlagatita en las playas de Key Biscayne este Memorial day"
result = sa.analyze(sentence)
print(result)
```
---
language: id
datasets:
- oscar
---
# IndoBERT (Indonesian BERT Model)
## Model description
IndoBERT is a pre-trained language model based on the BERT architecture for the Indonesian language.
This model is the base-uncased version, which uses the bert-base config.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("sarahlintang/IndoBERT")
model = AutoModel.from_pretrained("sarahlintang/IndoBERT")
tokenizer.encode("hai aku mau makan.")
# Example output: [2, 8078, 1785, 2318, 1946, 18, 4]
```
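If the checkpoint also ships with its masked-language-modeling head (an assumption; the card does not say), a fill-mask sketch:
```python
from transformers import pipeline

# Hedged sketch: assumes sarahlintang/IndoBERT includes the MLM head.
fill_mask = pipeline("fill-mask", model="sarahlintang/IndoBERT")
masked = f"hai aku mau makan {fill_mask.tokenizer.mask_token}."
print(fill_mask(masked))  # top predictions for the masked token
```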
## Training data
This model was pre-trained on 16 GB of raw text (~2 billion words) from the OSCAR corpus (https://oscar-corpus.com/).
It uses the bert-base configuration with a 32,000-token vocabulary.
## Training procedure
The training of the model was performed using Google's original TensorFlow code on an eight-core Google Cloud TPU v2.
We used a Google Cloud Storage bucket, for persistent storage of training data and models.
## Eval results
We evaluated this model on three Indonesian NLP downstream tasks:
- extractive summarization
- sentiment analysis
- part-of-speech tagging

On all three downstream tasks, this model outperformed multilingual BERT.
---
language: da
license: cc-by-4.0
---
# Danish ELECTRA small (cased)
An [ELECTRA](https://arxiv.org/abs/2003.10555) model pretrained on a custom Danish corpus (~17.5 GB).
For details on data sources and the training procedure, along with benchmarks on downstream tasks, see: https://github.com/sarnikowski/danish_transformers/tree/main/electra
## Usage
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("sarnikowski/electra-small-discriminator-da-256-cased")
model = AutoModel.from_pretrained("sarnikowski/electra-small-discriminator-da-256-cased")
```
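Since this is the discriminator checkpoint, it can also score whether tokens look replaced; a minimal sketch (the Danish example sentence is illustrative):
```python
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

tokenizer = AutoTokenizer.from_pretrained("sarnikowski/electra-small-discriminator-da-256-cased")
model = ElectraForPreTraining.from_pretrained("sarnikowski/electra-small-discriminator-da-256-cased")

# Per-token logits: a higher sigmoid value means "more likely a replaced token".
inputs = tokenizer("Det er en dansk sætning.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(torch.sigmoid(logits))
```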
## Questions?
If you have any questions, feel free to open an issue on the [danish_transformers](https://github.com/sarnikowski/danish_transformers) repository, or send an email to p.sarnikowski@gmail.com
---
language: tr
---
# An easy-to-use NER application for Turkish
**A simple Python NER (BERT + transfer learning) (named entity recognition) model for Turkish...**
Thanks to @stefan-it, I applied the following steps for training:
```
cd tr-data

for file in train.txt dev.txt test.txt labels.txt
do
  wget https://schweter.eu/storage/turkish-bert-wikiann/$file
done

cd ..
```
This downloads the pre-processed datasets with training, dev, and test splits and puts them in the `tr-data` folder.
# Run fine-tuning
After downloading the dataset, fine-tuning can be started. Just set the following environment variables:
```
export MAX_LENGTH=128
export BERT_MODEL=dbmdz/bert-base-turkish-cased
export OUTPUT_DIR=tr-new-model
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SAVE_STEPS=625
export SEED=1
```
Then run fine-tuning:
```
python3 run_ner_old.py --data_dir ./tr-data3 \
--model_type bert \
--labels ./tr-data/labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR-$SEED \
--max_seq_length $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--do_train \
--do_eval \
--do_predict \
--fp16
```
# Usage
```python
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("savasy/bert-base-turkish-ner-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-ner-cased")
ner = pipeline("ner", model=model, tokenizer=tokenizer)
ner("Mustafa Kemal Atatürk 19 Mayıs 1919'da Samsun'a ayak bastı.")
```
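The token-level output can also be merged into whole entities. A hedged sketch using the pipeline's `grouped_entities` flag (available in recent `transformers` releases; the flag is the assumption here):
```python
from transformers import pipeline

# grouped_entities merges B-/I- word pieces into single entity spans.
ner = pipeline(
    "ner",
    model="savasy/bert-base-turkish-ner-cased",
    tokenizer="savasy/bert-base-turkish-ner-cased",
    grouped_entities=True,
)
print(ner("Mustafa Kemal Atatürk 19 Mayıs 1919'da Samsun'a ayak bastı."))
```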
# Some results

Data 1: the WikiANN dataset above

Eval results:
* precision = 0.916400580551524
* recall = 0.9342309684101502
* f1 = 0.9252298787412536
* loss = 0.11335893666411284

Test results:
* precision = 0.9192058759362955
* recall = 0.9303010230367262
* f1 = 0.9247201697271198
* loss = 0.11182546521618497

Data 2: https://github.com/stefan-it/turkish-bert/files/4558187/nerdata.txt

The performance on the data provided by @kemalaraz is as follows:

Eval results:
* precision = 0.9461980692049029
* recall = 0.959309358847465
* f1 = 0.9527086063783312
* loss = 0.037054269206847804

Test results:
* precision = 0.9458370635631155
* recall = 0.9588201928530913
* f1 = 0.952284378344882
* loss = 0.035431676572445225
---
language: tr
---
# Bert-base Turkish Sentiment Model
https://huggingface.co/savasy/bert-base-turkish-sentiment-cased
This model is used for sentiment analysis and is based on BERTurk for Turkish (https://huggingface.co/dbmdz/bert-base-turkish-cased)
## Dataset
The dataset is taken from the studies [[2]](#paper-2) and [[3]](#paper-3), and merged.
* The study [[2]](#paper-2) gathered movie and product reviews. The product categories are books, DVDs, electronics, and kitchen.
The movie dataset was taken from a cinema web page ([Beyazperde](https://www.beyazperde.com)) with
5331 positive and 5331 negative sentences. Reviews on the web page are rated on a
scale from 0 to 5 by the users who wrote them. The study considered a review
sentiment positive if the rating is greater than or equal to 4, and negative if it is less
than or equal to 2. They also built a Turkish product review dataset from an online retailer's
web page, constructing a benchmark dataset of reviews for several product
categories (books, DVDs, etc.). Likewise, reviews are rated in the range from 1 to 5,
and the majority of reviews are rated 5. Each category has 700 positive and 700 negative
reviews, with an average rating of 2.27 for negative reviews and 4.5 for positive
reviews. This dataset is also used by the study [[1]](#paper-1).
* The study [[3]](#paper-3) collected a tweet dataset and proposed a new approach for automatically classifying the sentiment of microblog messages, based on robust feature representation and fusion.
*Merged Dataset*
| *size* | *data* |
|--------|----|
| 8000 |dev.tsv|
| 8262 |test.tsv|
| 32000 |train.tsv|
| *48290* |*total*|
### The dataset is used by the following papers
<a id="paper-1">[1]</a> Yildirim, Savaş. (2020). Comparing Deep Neural Networks to Traditional Models for Sentiment Analysis in Turkish Language. 10.1007/978-981-15-1216-2_12.
<a id="paper-2">[2]</a> Demirtas, Erkin and Mykola Pechenizkiy. 2013. Cross-lingual polarity detection with machine translation. In Proceedings of the Second International Workshop on Issues of Sentiment
Discovery and Opinion Mining (WISDOM ’13)
<a id="paper-3">[3]</a> Hayran, A., Sert, M. (2017), "Sentiment Analysis on Microblog Data based on Word Embedding and Fusion Techniques", IEEE 25th Signal Processing and Communications Applications Conference (SIU 2017), Belek, Turkey
## Training
```shell
export GLUE_DIR="./sst-2-newall"
export TASK_NAME=SST-2
python3 run_glue.py \
--model_type bert \
--model_name_or_path dbmdz/bert-base-turkish-uncased \
--task_name "SST-2" \
--do_train \
--do_eval \
--data_dir "./sst-2-newall" \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir "./model"
```
## Results
> 05/10/2020 17:00:43 - INFO - transformers.trainer - \*\*\*\*\* Running Evaluation \*\*\*\*\*
> 05/10/2020 17:00:43 - INFO - transformers.trainer - Num examples = 7999
> 05/10/2020 17:00:43 - INFO - transformers.trainer - Batch size = 8
> Evaluation: 100% 1000/1000 [00:34<00:00, 29.04it/s]
> 05/10/2020 17:01:17 - INFO - \_\_main__ - \*\*\*\*\* Eval results sst-2 \*\*\*\*\*
> 05/10/2020 17:01:17 - INFO - \_\_main__ - acc = 0.9539942492811602
> 05/10/2020 17:01:17 - INFO - \_\_main__ - loss = 0.16348013816401363
Accuracy is about **95.4%**
## Code Usage
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
sa = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)
p = sa("bu telefon modelleri çok kaliteli , her parçası çok özel bence")
print(p)
# [{'label': 'LABEL_1', 'score': 0.9871089}]
print(p[0]['label'] == 'LABEL_1')
# True
p = sa("Film çok kötü ve çok sahteydi")
print(p)
# [{'label': 'LABEL_0', 'score': 0.9975505}]
print(p[0]['label'] == 'LABEL_1')
# False
```
## Test
### Data
Suppose your file has many lines, each with a comment and its label (1 or 0) at the end, tab-separated:
> comment1 ... \t label
> comment2 ... \t label
> ...
### Code
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
sa = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)
input_file = "/path/to/your/file/yourfile.tsv"
i, crr = 0, 0
for line in open(input_file):
    lines = line.strip().split("\t")
    if len(lines) == 2:
        i = i + 1
        if i % 100 == 0:
            print(i)
        pred = sa(lines[0])
        pred = pred[0]["label"].split("_")[1]  # "LABEL_1" -> "1"
        if pred == lines[1]:
            crr = crr + 1

print(crr, i, crr / i)  # correct predictions, total, accuracy
```
---
language: tr
---
# Turkish SQuAD Model: Question Answering
I fine-tuned the Turkish BERT model for the question-answering problem with the Turkish version of SQuAD, TQuAD:
* BERT-base: https://huggingface.co/dbmdz/bert-base-turkish-uncased
* TQuAD dataset: https://github.com/TQuad/turkish-nlp-qa-dataset
# Training Code
```
!python3 run_squad.py \
--model_type bert \
--model_name_or_path dbmdz/bert-base-turkish-uncased \
--do_train \
--do_eval \
--train_file trainQ.json \
--predict_file dev1.json \
--per_gpu_train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 5.0 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir "./model"
```
# Example Usage
> Load Model
```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
import torch

tokenizer = AutoTokenizer.from_pretrained("./model")
model = AutoModelForQuestionAnswering.from_pretrained("./model")
nlp = pipeline("question-answering", model=model, tokenizer=tokenizer)
```
> Apply the model
```python
sait="ABASIYANIK, Sait Faik. Hikayeci (Adapazarı 23 Kasım 1906-İstanbul 11 Mayıs 1954). \
İlk öğrenimine Adapazarı’nda Rehber-i Terakki Mektebi’nde başladı. İki yıl kadar Adapazarı İdadisi’nde okudu.\
İstanbul Erkek Lisesi’nde devam ettiği orta öğrenimini Bursa Lisesi’nde tamamladı (1928). İstanbul Edebiyat \
Fakültesi’ne iki yıl devam ettikten sonra babasının isteği üzerine iktisat öğrenimi için İsviçre’ye gitti. \
Kısa süre sonra iktisat öğrenimini bırakarak Lozan’dan Grenoble’a geçti. Üç yıl başıboş bir edebiyat öğrenimi \
gördükten sonra babası tarafından geri çağrıldı (1933). Bir müddet Halıcıoğlu Ermeni Yetim Mektebi'nde Türkçe \
gurup dersleri öğretmenliği yaptı. Ticarete atıldıysa da tutunamadı. Bir ay Haber gazetesinde adliye muhabirliği\
yaptı (1942). Babasının ölümü üzerine aileden kalan emlakin geliri ile avare bir hayata başladı. Evlenemedi.\
Yazları Burgaz adasındaki köşklerinde, kışları Şişli’deki apartmanlarında annesi ile beraber geçen bu fazla \
içkili bohem hayatı ömrünün sonuna kadar sürdü."
print(nlp(question="Ne zaman avare bir hayata başladı?", context=sait))
print(nlp(question="Sait Faik hangi Lisede orta öğrenimini tamamladı?", context=sait))
```
```python
# Ask yourself! Type your own question
print(nlp(question="...?", context=sait))
```
Check my other models at
https://huggingface.co/savasy
---
language: tr
---
# Turkish Text Classification
This model is a fine-tuned version of https://github.com/stefan-it/turkish-bert, trained on text classification data with the following 7 categories:
```
code_to_label={
'LABEL_0': 'dunya ',
'LABEL_1': 'ekonomi ',
'LABEL_2': 'kultur ',
'LABEL_3': 'saglik ',
'LABEL_4': 'siyaset ',
'LABEL_5': 'spor ',
'LABEL_6': 'teknoloji '}
```
## Data
The following Turkish benchmark dataset is used for fine-tuning
https://www.kaggle.com/savasy/ttc4900
## Quick Start
Begin by installing transformers as follows:
> pip install transformers
```
# Code:
# import libraries
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-turkish-text-classification")
# build and load the model; it takes time depending on your internet connection
model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-turkish-text-classification")
# make pipeline
nlp = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
# apply model
nlp("bla bla")
# [{'label': 'LABEL_2', 'score': 0.4753005802631378}]
code_to_label={
'LABEL_0': 'dunya ',
'LABEL_1': 'ekonomi ',
'LABEL_2': 'kultur ',
'LABEL_3': 'saglik ',
'LABEL_4': 'siyaset ',
'LABEL_5': 'spor ',
'LABEL_6': 'teknoloji '}
code_to_label[nlp("bla bla")[0]['label']]
# > 'kultur '
```
## How the model was trained
```python
## loading data for Turkish text classification
import pandas as pd

# https://www.kaggle.com/savasy/ttc4900
df = pd.read_csv("7allV03.csv")
df.columns = ["labels", "text"]
df.labels = pd.Categorical(df.labels)

train_df = ...
eval_df = ...

# model
from simpletransformers.classification import ClassificationModel
import torch, sklearn

cuda_available = torch.cuda.is_available()

model_args = {
    "use_early_stopping": True,
    "early_stopping_delta": 0.01,
    "early_stopping_metric": "mcc",
    "early_stopping_metric_minimize": False,
    "early_stopping_patience": 5,
    "evaluate_during_training_steps": 1000,
    "fp16": False,
    "num_train_epochs": 3,
}

model = ClassificationModel(
    "bert",
    "dbmdz/bert-base-turkish-cased",
    use_cuda=cuda_available,
    args=model_args,
    num_labels=7,
)
model.train_model(train_df, acc=sklearn.metrics.accuracy_score)
```
For other training models, please check https://simpletransformers.ai/
For detailed usage of Turkish text classification, please check this [python notebook](https://github.com/savasy/TurkishTextClassification/blob/master/Bert_base_Text_Classification_for_Turkish.ipynb)
---
language: en
license: apache-2.0
---
## ELECTRA-small-cased
This is a cased version of `google/electra-small-discriminator`, trained on the
[OpenWebText corpus](https://skylion007.github.io/OpenWebTextCorpus/).
It uses the same tokenizer and vocab as `bert-base-cased`.
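A minimal loading sketch; the repository id below is a placeholder, since this card does not state the hosted model path:
```python
from transformers import AutoTokenizer, AutoModel

# Placeholder repo id -- substitute the actual path of this model on huggingface.co.
model_id = "<namespace>/electra-small-cased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
```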
---
tags:
- exbert
license: apache-2.0
---
# ouBioBERT-Base, Uncased
Bidirectional Encoder Representations from Transformers for Biomedical Text Mining by Osaka University (ouBioBERT) is a language model based on the BERT-Base (Devlin et al., 2019) architecture. We pre-trained ouBioBERT on PubMed abstracts from the PubMed baseline (ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline) using our method.
The details of the pre-training procedure can be found in Wada et al. (2020).
## Evaluation
We evaluated the performance of ouBioBERT on the Biomedical Language Understanding Evaluation (BLUE) benchmark (Peng et al., 2019). The numbers are mean (standard deviation) across five different random seeds.
| Dataset | Task Type | Score |
|:----------------|:-----------------------------|-------------:|
| MedSTS | Sentence similarity | 84.9 (0.6) |
| BIOSSES | Sentence similarity | 92.3 (0.8) |
| BC5CDR-disease | Named-entity recognition | 87.4 (0.1) |
| BC5CDR-chemical | Named-entity recognition | 93.7 (0.2) |
| ShARe/CLEFE | Named-entity recognition | 80.1 (0.4) |
| DDI | Relation extraction | 81.1 (1.5) |
| ChemProt | Relation extraction | 75.0 (0.3) |
| i2b2 2010 | Relation extraction | 74.0 (0.8) |
| HoC | Document classification | 86.4 (0.5) |
| MedNLI | Inference | 83.6 (0.7) |
| **Total** | Macro average of the scores |**83.8 (0.3)**|
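For quick experimentation, a minimal feature-extraction sketch; the repo id `seiya/oubiobert-base-uncased` is taken from the ExBERT link at the bottom of this card:
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("seiya/oubiobert-base-uncased")
model = AutoModel.from_pretrained("seiya/oubiobert-base-uncased")

# Contextual token embeddings for a biomedical sentence (illustrative example).
inputs = tokenizer("Tamoxifen is used to treat breast cancer.", return_tensors="pt")
with torch.no_grad():
    last_hidden_state = model(**inputs).last_hidden_state
```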
## Code for Fine-tuning
We made the source code for fine-tuning freely available at [our repository](https://github.com/sy-wada/blue_benchmark_with_transformers).
## Citation
If you use our work in your research, please kindly cite the following paper:
```bibtex
@misc{2005.07202,
Author = {Shoya Wada and Toshihiro Takeda and Shiro Manabe and Shozo Konishi and Jun Kamohara and Yasushi Matsumura},
Title = {A pre-training technique to localize medical BERT and enhance BioBERT},
Year = {2020},
Eprint = {arXiv:2005.07202},
}
```
<a href="https://huggingface.co/exbert/?model=seiya/oubiobert-base-uncased&sentence=Coronavirus%20disease%20(COVID-19)%20is%20caused%20by%20SARS-COV2%20and%20represents%20the%20causative%20agent%20of%20a%20potentially%20fatal%20disease%20that%20is%20of%20great%20global%20public%20health%20concern.">
<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>
# LaBSE PyTorch Version
This is a PyTorch port of the TensorFlow version of [LaBSE](https://tfhub.dev/google/LaBSE/1).
To get sentence embeddings, you can use the following code:
```python
from transformers import AutoTokenizer, AutoModel
import torch
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/LaBSE")
model = AutoModel.from_pretrained("sentence-transformers/LaBSE")
sentences = ["Hello World", "Hallo Welt"]
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=64, return_tensors='pt')
with torch.no_grad():
model_output = model(**encoded_input)
embeddings = model_output.pooler_output
embeddings = torch.nn.functional.normalize(embeddings)
print(embeddings)
```
When you have [sentence-transformers](https://www.sbert.net/) installed, you can use the model like this:
```python
from sentence_transformers import SentenceTransformer
sentences = ["Hello World", "Hallo Welt"]
model = SentenceTransformer('LaBSE')
embeddings = model.encode(sentences)
print(embeddings)
```
## Reference:
Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, Wei Wang. [Language-agnostic BERT Sentence Embedding](https://arxiv.org/abs/2007.01852). July 2020
License: [https://tfhub.dev/google/LaBSE/1](https://tfhub.dev/google/LaBSE/1)
---
language: en
tags:
- exbert
license: apache-2.0
datasets:
- snli
- multi_nli
---
# BERT base model (uncased) for Sentence Embeddings
This is the `bert-base-nli-cls-token` model from the [sentence-transformers](https://github.com/UKPLab/sentence-transformers) repository. The sentence-transformers repository allows you to train and use Transformer models for generating sentence and text embeddings.
The model is described in the paper [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084).
## Usage (HuggingFace Models Repository)
You can use the model directly from the model repository to compute sentence embeddings. The CLS token of each input represents the sentence embedding:
```python
from transformers import AutoTokenizer, AutoModel
import torch
#Sentences we want sentence embeddings for
sentences = ['This framework generates embeddings for each input sentence',
'Sentences are passed as a list of string.',
'The quick brown fox jumps over the lazy dog.']
#Load AutoModel from huggingface model repository
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/bert-base-nli-cls-token")
model = AutoModel.from_pretrained("sentence-transformers/bert-base-nli-cls-token")
#Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')
#Compute token embeddings
with torch.no_grad():
model_output = model(**encoded_input)
sentence_embeddings = model_output[0][:,0] #Take the first token ([CLS]) from each sentence
print("Sentence embeddings:")
print(sentence_embeddings)
```
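Continuing from the block above (reusing `sentence_embeddings`), the embeddings can be compared with cosine similarity; a minimal sketch:
```python
import torch

# Normalize, then take the cosine similarity of the first sentence against the rest.
norm = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print(norm[0] @ norm[1:].T)  # higher value = more similar
```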
## Usage (Sentence-Transformers)
Using this model becomes more convenient when you have [sentence-transformers](https://github.com/UKPLab/sentence-transformers) installed:
```
pip install -U sentence-transformers
```
Then you can use the model like this:
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-cls-token')
sentences = ['This framework generates embeddings for each input sentence',
'Sentences are passed as a list of string.',
'The quick brown fox jumps over the lazy dog.']
sentence_embeddings = model.encode(sentences)
print("Sentence embeddings:")
print(sentence_embeddings)
```
## Citing & Authors
If you find this model helpful, feel free to cite our publication [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084):
```
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "http://arxiv.org/abs/1908.10084",
}
```
---
language: en
tags:
- exbert
license: apache-2.0
datasets:
- snli
- multi_nli
---
# BERT base model (uncased) for Sentence Embeddings
This is the `bert-base-nli-max-tokens` model from the [sentence-transformers](https://github.com/UKPLab/sentence-transformers) repository. The sentence-transformers repository allows you to train and use Transformer models for generating sentence and text embeddings.
The model is described in the paper [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084).
## Usage (HuggingFace Models Repository)
You can use the model directly from the model repository to compute sentence embeddings. It uses max pooling to generate a fixed-size sentence embedding:
```python
from transformers import AutoTokenizer, AutoModel
import torch
#Max Pooling - Take the max value over time for every dimension
def max_pooling(model_output, attention_mask):
token_embeddings = model_output[0] #First element of model_output contains all token embeddings
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
token_embeddings[input_mask_expanded == 0] = -1e9 # Set padding tokens to large negative value
max_over_time = torch.max(token_embeddings, 1)[0]
return max_over_time
#Sentences we want sentence embeddings for
sentences = ['This framework generates embeddings for each input sentence',
'Sentences are passed as a list of string.',
'The quick brown fox jumps over the lazy dog.']
#Load AutoModel from huggingface model repository
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/bert-base-nli-max-tokens")
model = AutoModel.from_pretrained("sentence-transformers/bert-base-nli-max-tokens")
#Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')
#Compute token embeddings
with torch.no_grad():
model_output = model(**encoded_input)
#Perform pooling. In this case, max pooling
sentence_embeddings = max_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
```
## Usage (Sentence-Transformers)
Using this model becomes more convenient when you have [sentence-transformers](https://github.com/UKPLab/sentence-transformers) installed:
```
pip install -U sentence-transformers
```
Then you can use the model like this:
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-max-tokens')
sentences = ['This framework generates embeddings for each input sentence',
'Sentences are passed as a list of string.',
'The quick brown fox jumps over the lazy dog.']
sentence_embeddings = model.encode(sentences)
print("Sentence embeddings:")
print(sentence_embeddings)
```
## Citing & Authors
If you find this model helpful, feel free to cite our publication [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084):
```
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "http://arxiv.org/abs/1908.10084",
}
```
---
language: en
tags:
- exbert
license: apache-2.0
datasets:
- snli
- multi_nli
---
# BERT base model (uncased) for Sentence Embeddings
This is the `bert-base-nli-mean-tokens` model from the [sentence-transformers](https://github.com/UKPLab/sentence-transformers) repository. The sentence-transformers repository allows you to train and use Transformer models for generating sentence and text embeddings.
The model is described in the paper [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084).
## Usage (HuggingFace Models Repository)
You can use the model directly from the model repository to compute sentence embeddings:
```python
from transformers import AutoTokenizer, AutoModel
import torch
#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
token_embeddings = model_output[0] #First element of model_output contains all token embeddings
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
return sum_embeddings / sum_mask
#Sentences we want sentence embeddings for
sentences = ['This framework generates embeddings for each input sentence',
'Sentences are passed as a list of string.',
'The quick brown fox jumps over the lazy dog.']
#Load AutoModel from huggingface model repository
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")
model = AutoModel.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")
#Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')
#Compute token embeddings
with torch.no_grad():
model_output = model(**encoded_input)
#Perform pooling. In this case, mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
```
## Usage (Sentence-Transformers)
Using this model becomes more convenient when you have [sentence-transformers](https://github.com/UKPLab/sentence-transformers) installed:
```
pip install -U sentence-transformers
```
Then you can use the model like this:
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')
sentences = ['This framework generates embeddings for each input sentence',
'Sentences are passed as a list of string.',
'The quick brown fox jumps over the lazy dog.']
sentence_embeddings = model.encode(sentences)
print("Sentence embeddings:")
print(sentence_embeddings)
```
## Citing & Authors
If you find this model helpful, feel free to cite our publication [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084):
```
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "http://arxiv.org/abs/1908.10084",
}
```