Unverified Commit 3552d0e0 authored by Julien Chaumond, committed by GitHub

[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)



* rm all model cards

* Update the .rst

@sgugger it is still not super crystal clear/streamlined, so let me know if you have any ideas to make it simpler

* Add a root-level README.md with simple instructions/context

* Update docs/source/model_sharing.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* make style

* rm all model cards
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
parent 29e45979
---
language:
- te
tags:
- MaskedLM
- Telugu
- BERT
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Telugu BERT
## Model description
This is a BERT language model pre-trained on a ~1.6 GB monolingual Telugu corpus. The pre-training data was mainly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from the model can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-te-bert')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-te-bert')
text = "మీరు ఎలా ఉన్నారు"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 5, 768]
```
#### Limitations and bias
The original language model was trained with `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The `.h5` file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
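For TensorFlow users, a minimal loading sketch (assuming the converted weights are present; `from_pt=True` falls back to converting the PyTorch weights on the fly):
```python
# Minimal TensorFlow loading sketch; from_pt=True converts the PyTorch
# weights on the fly if the .h5 file is not available.
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-te-bert')
model = TFAutoModel.from_pretrained('neuralspace-reverie/indic-transformers-te-bert', from_pt=True)
```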
---
language:
- te
tags:
- MaskedLM
- Telugu
- DistilBERT
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Telugu DistilBERT
## Model description
This is a DistilBERT language model pre-trained on a ~2 GB monolingual Telugu corpus. The pre-training data was mainly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from the model can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-te-distilbert')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-te-distilbert')
text = "మీరు ఎలా ఉన్నారు"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 5, 768]
```
#### Limitations and bias
The original language model was trained with `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The `.h5` file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
---
language:
- te
tags:
- MaskedLM
- Telugu
- RoBERTa
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Telugu RoBERTa
## Model description
This is a RoBERTa language model pre-trained on a ~2 GB monolingual Telugu corpus. The pre-training data was mainly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from the model can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-te-roberta')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-te-roberta')
text = "మీరు ఎలా ఉన్నారు"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 14, 768]
```
#### Limitations and bias
The original language model was trained with `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The `.h5` file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
---
language:
- te
tags:
- MaskedLM
- Telugu
- XLMRoBERTa
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Telugu XLMRoBERTa
## Model description
This is an XLM-RoBERTa language model pre-trained on a ~1.6 GB monolingual Telugu corpus. The pre-training data was mainly taken from [OSCAR](https://oscar-corpus.com/).
This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from the model can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-te-xlmroberta')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-te-xlmroberta')
text = "మీరు ఎలా ఉన్నారు"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]
print(out.shape)
# out = [1, 5, 768]
```
#### Limitations and bias
The original language model was trained with `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The `.h5` file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
---
language: it
thumbnail: https://neuraly.ai/static/assets/images/huggingface/thumbnail.png
tags:
- sentiment
- Italian
license: mit
widget:
- text: "Huggingface è un team fantastico!"
---
# 🤗 + neuraly - Italian BERT Sentiment model
## Model description
This model performs sentiment analysis on Italian sentences. It was trained starting from an instance of [bert-base-italian-cased](https://huggingface.co/dbmdz/bert-base-italian-cased) and fine-tuned on an Italian dataset of tweets, reaching 82% accuracy on that dataset.
## Intended uses & limitations
#### How to use
```python
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("neuraly/bert-base-italian-cased-sentiment")
# Load the model, use .cuda() to load it on the GPU
model = AutoModelForSequenceClassification.from_pretrained("neuraly/bert-base-italian-cased-sentiment")
sentence = 'Huggingface è un team fantastico!'
input_ids = tokenizer.encode(sentence, add_special_tokens=True)
# Create tensor, use .cuda() to transfer the tensor to GPU
tensor = torch.tensor(input_ids).long()
# Fake batch dimension
tensor = tensor.unsqueeze(0)
# Call the model and get the logits
logits = model(tensor)[0]
# Remove the fake batch dimension
logits = logits.squeeze(0)
# The model was trained with a Log Likelihood + Softmax combined loss,
# so we apply a softmax on top of the logits to obtain probabilities
proba = nn.functional.softmax(logits, dim=0)
# Unpack the tensor to obtain negative, neutral and positive probabilities
negative, neutral, positive = proba
```
#### Limitations and bias
A possible drawback (or bias) of this model is that it was trained on a tweet dataset, with all the limitations that come with it. The domain is strongly related to football players and teams, but the model works surprisingly well on other topics as well.
## Training data
We trained the model by combining the two tweet datasets taken from [Sentipolc EVALITA 2016](http://www.di.unito.it/~tutreeb/sentipolc-evalita16/data.html). Overall the dataset consists of 45K pre-processed tweets.
The model weights come from a pre-trained instance of [bert-base-italian-cased](https://huggingface.co/dbmdz/bert-base-italian-cased). A huge "thank you" goes to that team, brilliant work!
## Training procedure
#### Preprocessing
We tried to preserve as much information as possible, since BERT captures the semantics of complex text sequences extremely well. Overall we removed only **@mentions**, **URLs** and **emails** from every tweet and kept pretty much everything else.
#### Hardware
- **GPU**: Nvidia GTX1080ti
- **CPU**: AMD Ryzen7 3700x 8c/16t
- **RAM**: 64GB DDR4
#### Hyperparameters
- Optimizer: **AdamW** with learning rate of **2e-5**, epsilon of **1e-8**
- Max epochs: **5**
- Batch size: **32**
- Early Stopping: **enabled** with patience = 1
Early stopping was triggered after 3 epochs.
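For concreteness, a minimal sketch of the optimizer set-up described above (the number of labels and training loop are illustrative assumptions, not the original training script):
```python
# Hypothetical sketch of the fine-tuning set-up listed above.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "dbmdz/bert-base-italian-cased", num_labels=3  # negative / neutral / positive (assumed)
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, eps=1e-8)
# Train for up to 5 epochs with batch size 32; stop early (patience = 1)
# when the validation metric stops improving.
```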
## Eval results
The model achieves an overall accuracy of 82% on the test set, which is a 20% split of the whole dataset.
## About us
[Neuraly](https://neuraly.ai) is a young and dynamic startup committed to designing AI-driven solutions and services through the most advanced Machine Learning and Data Science technologies. You can find out more about who we are and what we do on our [website](https://neuraly.ai).
## Acknowledgments
Thanks to the generous support from the [Hugging Face](https://huggingface.co/) team,
it is possible to download the model from their S3 storage and live test it from their inference API 🤗.
---
language: is
datasets:
- Icelandic portion of the OSCAR corpus from INRIA
- oscar
---
# IsRoBERTa a RoBERTa-like masked language model
Probably the first Icelandic transformer language model!
## Overview
* **Language:** Icelandic
* **Downstream-task:** masked-lm
* **Training data:** OSCAR corpus
* **Code:** See [here](https://github.com/neurocode-io/icelandic-language-model)
* **Infrastructure:** 1x Nvidia K80
## Hyperparameters
```
per_device_train_batch_size = 48
n_epochs = 1
vocab_size = 52000
max_position_embeddings = 514
num_attention_heads = 12
num_hidden_layers = 6
type_vocab_size = 1
learning_rate=0.00005
```
## Usage
### In Transformers
```python
>>> from transformers import (
...     pipeline,
...     AutoTokenizer,
...     AutoModelWithLMHead
... )
>>> model_name = "neurocode/IsRoBERTa"
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>> model = AutoModelWithLMHead.from_pretrained(model_name)
>>> fill_mask = pipeline(
... "fill-mask",
... model=model,
... tokenizer=tokenizer
... )
>>> result = fill_mask("Hann fór út að <mask>.")
>>> result
[
{'sequence': '<s>Hann fór út að nýju.</s>', 'score': 0.03395755589008331, 'token': 2219, 'token_str': 'Ġnýju'},
{'sequence': '<s>Hann fór út að undanförnu.</s>', 'score': 0.029087543487548828, 'token': 7590, 'token_str': 'Ġundanförnu'},
{'sequence': '<s>Hann fór út að lokum.</s>', 'score': 0.024420788511633873, 'token': 4384, 'token_str': 'Ġlokum'},
{'sequence': '<s>Hann fór út að þessu.</s>', 'score': 0.021231256425380707, 'token': 921, 'token_str': 'Ġþessu'},
{'sequence': '<s>Hann fór út að honum.</s>', 'score': 0.0205782949924469, 'token': 1136, 'token_str': 'Ġhonum'}
]
```
## Authors
Bobby Donchev: `contact [at] donchev.is`
Elena Cramer: `elena.cramer [at] neurocode.io`
## About us
We bring AI software to production for our customers.
Our focus: AI software development.
Get in touch:
[LinkedIn](https://de.linkedin.com/company/neurocodeio) | [Website](https://neurocode.io)
---
language: zh
---
# ERNIE-1.0
## Introduction
ERNIE (Enhanced Representation through kNowledge IntEgration) was proposed by Baidu in 2019.
It is designed to learn language representations enhanced by knowledge-masking strategies, i.e. entity-level masking and phrase-level masking.
Experimental results show that ERNIE achieves state-of-the-art results on five Chinese natural language processing tasks, including natural language inference,
semantic similarity, named entity recognition, sentiment analysis and question answering.
More details: https://arxiv.org/abs/1904.09223
## Released Model Info
|Model Name|Language|Model Structure|
|:---:|:---:|:---:|
|ernie-1.0| Chinese |Layer:12, Hidden:768, Heads:12|
This released PyTorch model was converted from the officially released PaddlePaddle ERNIE model, and
a series of experiments were conducted to verify the accuracy of the conversion.
- Official PaddlePaddle ERNIE repo: https://github.com/PaddlePaddle/ERNIE
- Pytorch Conversion repo: https://github.com/nghuyong/ERNIE-Pytorch
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("nghuyong/ernie-1.0")
model = AutoModel.from_pretrained("nghuyong/ernie-1.0")
```
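A minimal usage sketch (the example sentence is only illustrative):
```python
# Encode a Chinese sentence and inspect the last hidden states.
import torch

inputs = tokenizer("百度是一家高科技公司", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs[0].shape)  # (batch_size, sequence_length, 768)
```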
## Citation
```bibtex
@article{sun2019ernie,
title={Ernie: Enhanced representation through knowledge integration},
author={Sun, Yu and Wang, Shuohuan and Li, Yukun and Feng, Shikun and Chen, Xuyi and Zhang, Han and Tian, Xin and Zhu, Danxiang and Tian, Hao and Wu, Hua},
journal={arXiv preprint arXiv:1904.09223},
year={2019}
}
```
---
language: en
---
# ERNIE-2.0
## Introduction
ERNIE 2.0 is a continual pre-training framework proposed by Baidu in 2019,
which incrementally builds and learns pre-training tasks through constant multi-task learning.
Experimental results demonstrate that ERNIE 2.0 outperforms BERT and XLNet on 16 tasks, including English tasks on the GLUE benchmark and several common Chinese tasks.
More details: https://arxiv.org/abs/1907.12412
## Released Model Info
|Model Name|Language|Model Structure|
|:---:|:---:|:---:|
|ernie-2.0-en| English |Layer:12, Hidden:768, Heads:12|
This released PyTorch model was converted from the officially released PaddlePaddle ERNIE model, and
a series of experiments were conducted to verify the accuracy of the conversion.
- Official PaddlePaddle ERNIE repo: https://github.com/PaddlePaddle/ERNIE
- Pytorch Conversion repo: https://github.com/nghuyong/ERNIE-Pytorch
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("nghuyong/ernie-2.0-en")
model = AutoModel.from_pretrained("nghuyong/ernie-2.0-en")
```
## Citation
```bibtex
@article{sun2019ernie20,
title={ERNIE 2.0: A Continual Pre-training Framework for Language Understanding},
author={Sun, Yu and Wang, Shuohuan and Li, Yukun and Feng, Shikun and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:1907.12412},
year={2019}
}
```
# ERNIE-2.0-large
## Introduction
ERNIE 2.0 is a continual pre-training framework proposed by Baidu in 2019,
which incrementally builds and learns pre-training tasks through constant multi-task learning.
Experimental results demonstrate that ERNIE 2.0 outperforms BERT and XLNet on 16 tasks, including English tasks on the GLUE benchmark and several common Chinese tasks.
More details: https://arxiv.org/abs/1907.12412
## Released Model Info
|Model Name|Language|Model Structure|
|:---:|:---:|:---:|
|ernie-2.0-large-en| English |Layer:24, Hidden:1024, Heads:16|
This released PyTorch model was converted from the officially released PaddlePaddle ERNIE model, and
a series of experiments were conducted to verify the accuracy of the conversion.
- Official PaddlePaddle ERNIE repo: https://github.com/PaddlePaddle/ERNIE
- Pytorch Conversion repo: https://github.com/nghuyong/ERNIE-Pytorch
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("nghuyong/ernie-2.0-large-en")
model = AutoModel.from_pretrained("nghuyong/ernie-2.0-large-en")
```
## Citation
```bibtex
@article{sun2019ernie20,
title={ERNIE 2.0: A Continual Pre-training Framework for Language Understanding},
author={Sun, Yu and Wang, Shuohuan and Li, Yukun and Feng, Shikun and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:1907.12412},
year={2019}
}
```
---
language: en
---
# ERNIE-tiny
## Introduction
ERNIE-tiny is a compressed model derived from the [ERNIE 2.0](../ernie-2.0-en) base model through model-structure compression and model distillation.
Through compression, the performance of ERNIE-tiny decreases by an average of only 2.37% compared to ERNIE 2.0 base,
while it outperforms Google BERT by 8.35% and runs 4.3 times faster.
More details: https://github.com/PaddlePaddle/ERNIE/blob/develop/distill/README.md
## Released Model Info
|Model Name|Language|Model Structure|
|:---:|:---:|:---:|
|ernie-tiny| English |Layer:3, Hidden:1024, Heads:16|
This released PyTorch model was converted from the officially released PaddlePaddle ERNIE model, and
a series of experiments were conducted to verify the accuracy of the conversion.
- Official PaddlePaddle ERNIE repo: https://github.com/PaddlePaddle/ERNIE
- Pytorch Conversion repo: https://github.com/nghuyong/ERNIE-Pytorch
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("nghuyong/ernie-tiny")
model = AutoModel.from_pretrained("nghuyong/ernie-tiny")
```
## Citation
```bibtex
@article{sun2019ernie20,
title={ERNIE 2.0: A Continual Pre-training Framework for Language Understanding},
author={Sun, Yu and Wang, Shuohuan and Li, Yukun and Feng, Shikun and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:1907.12412},
year={2019}
}
```
---
language: el
---
## gpt2-greek
---
language: el
thumbnail: https://github.com/nlpaueb/GreekBERT/raw/master/greek-bert-logo.png
---
# GreekBERT
A Greek version of the BERT pre-trained language model.
<img src="https://github.com/nlpaueb/GreekBERT/raw/master/greek-bert-logo.png" width="600"/>
## Pre-training corpora
The pre-training corpora of `bert-base-greek-uncased-v1` include:
* The Greek part of [Wikipedia](https://el.wikipedia.org/wiki/Βικιπαίδεια:Αντίγραφα_της_βάσης_δεδομένων),
* The Greek part of [European Parliament Proceedings Parallel Corpus](https://www.statmt.org/europarl/), and
* The Greek part of [OSCAR](https://traces1.inria.fr/oscar/), a cleansed version of [Common Crawl](https://commoncrawl.org).
Future releases will also include:
* The entire corpus of Greek legislation, as published by the [National Publication Office](http://www.et.gr),
* The entire corpus of EU legislation (Greek translation), as published in [Eur-Lex](https://eur-lex.europa.eu/homepage.html?locale=en).
## Pre-training details
* We trained BERT using the official code provided in Google BERT's GitHub repository (https://github.com/google-research/bert). We then used [Hugging Face](https://huggingface.co)'s [Transformers](https://github.com/huggingface/transformers) conversion script to convert the TF checkpoint and vocabulary into the desired format, so that the model can be loaded in two lines of code by both PyTorch and TF2 users.
* We released a model similar to the English `bert-base-uncased` model (12-layer, 768-hidden, 12-heads, 110M parameters).
* We chose to follow the same training set-up: 1 million training steps with batches of 256 sequences of length 512 with an initial learning rate 1e-4.
* We were able to use a single Google Cloud TPU v3-8 provided for free from [TensorFlow Research Cloud (TFRC)](https://www.tensorflow.org/tfrc), while also utilizing [GCP research credits](https://edu.google.com/programs/credits/research). Huge thanks to both Google programs for supporting us!
## Requirements
We published `bert-base-greek-uncased-v1` as part of [Hugging Face](https://huggingface.co)'s [Transformers](https://github.com/huggingface/transformers) repository. So, you need to install the transformers library through pip along with PyTorch or TensorFlow 2.
```
pip install transformers
pip install (torch|tensorflow)
```
## Pre-process text (Deaccent - Lower)
In order to use `bert-base-greek-uncased-v1`, you have to pre-process texts to lowercase letters and remove all Greek diacritics.
```python
import unicodedata
def strip_accents_and_lowercase(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn').lower()
accented_string = "Αυτή είναι η Ελληνική έκδοση του BERT."
unaccented_string = strip_accents_and_lowercase(accented_string)
print(unaccented_string) # αυτη ειναι η ελληνικη εκδοση του bert.
```
## Load Pretrained Model
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
model = AutoModel.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
```
## Use Pretrained Model as a Language Model
```python
import torch
from transformers import AutoTokenizer, AutoModelWithLMHead
# Load model and tokenizer
tokenizer_greek = AutoTokenizer.from_pretrained('nlpaueb/bert-base-greek-uncased-v1')
lm_model_greek = AutoModelWithLMHead.from_pretrained('nlpaueb/bert-base-greek-uncased-v1')
# ================ EXAMPLE 1 ================
text_1 = 'O ποιητής έγραψε ένα [MASK] .'
# EN: 'The poet wrote a [MASK].'
input_ids = tokenizer_greek.encode(text_1)
print(tokenizer_greek.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'o', 'ποιητης', 'εγραψε', 'ενα', '[MASK]', '.', '[SEP]']
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 5].max(0)[1].item()))
# the most plausible prediction for [MASK] is "song"
# ================ EXAMPLE 2 ================
text_2 = 'Είναι ένας [MASK] άνθρωπος.'
# EN: 'He is a [MASK] person.'
input_ids = tokenizer_greek.encode(text_2)
print(tokenizer_greek.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'ειναι', 'ενας', '[MASK]', 'ανθρωπος', '.', '[SEP]']
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 3].max(0)[1].item()))
# the most plausible prediction for [MASK] is "good"
# ================ EXAMPLE 3 ================
text_3 = 'Είναι ένας [MASK] άνθρωπος και κάνει συχνά [MASK].'
# EN: 'He is a [MASK] person and he frequently does [MASK].'
input_ids = tokenizer_greek.encode(text_3)
print(tokenizer_greek.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'ειναι', 'ενας', '[MASK]', 'ανθρωπος', 'και', 'κανει', 'συχνα', '[MASK]', '.', '[SEP]']
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 8].max(0)[1].item()))
# the most plausible prediction for the second [MASK] is "trips"
```
## Evaluation on downstream tasks
TBA
## Author
Ilias Chalkidis on behalf of [AUEB's Natural Language Processing Group](http://nlp.cs.aueb.gr)
| Github: [@ilias.chalkidis](https://github.com/seolhokim) | Twitter: [@KiddoThe2B](https://twitter.com/KiddoThe2B) |
## About Us
[AUEB's Natural Language Processing Group](http://nlp.cs.aueb.gr) develops algorithms, models, and systems that allow computers to process and generate natural language texts.
The group's current research interests include:
* question answering systems for databases, ontologies, document collections, and the Web, especially biomedical question answering,
* natural language generation from databases and ontologies, especially Semantic Web ontologies,
* text classification, including filtering spam and abusive content,
* information extraction and opinion mining, including legal text analytics and sentiment analysis,
* natural language processing tools for Greek, for example parsers and named-entity recognizers,
* machine learning in natural language processing, especially deep learning.
The group is part of the Information Processing Laboratory of the Department of Informatics of the Athens University of Economics and Business.
---
language: en
tags:
- legal
---
# LEGAL-BERT: The Muppets straight out of Law School
<img align="left" src="https://i.ibb.co/p3kQ7Rw/Screenshot-2020-10-06-at-12-16-36-PM.png" width="100"/>
LEGAL-BERT is a family of BERT models for the legal domain, intended to assist legal NLP research, computational law, and legal technology applications. To pre-train the different variations of LEGAL-BERT, we collected 12 GB of diverse English legal text from several fields (e.g., legislation, court cases, contracts) scraped from publicly available resources. The sub-domain variants (CONTRACTS-, EURLEX-, ECHR-) and/or general LEGAL-BERT perform better than using BERT out of the box for domain-specific tasks. A light-weight model (33% the size of BERT-BASE) pre-trained from scratch on legal data with competitive performance is also available.
<br/><br/><br/><br/>
---
I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras and I. Androutsopoulos. "LEGAL-BERT: The Muppets straight out of Law School". In Findings of Empirical Methods in Natural Language Processing (EMNLP 2020) (Short Papers), to be held online, 2020. (https://arxiv.org/abs/2010.02559)
---
## Pre-training corpora
The pre-training corpora of LEGAL-BERT include:
* 116,062 documents of EU legislation, publicly available from EURLEX (http://eur-lex.europa.eu), the repository of EU Law running under the EU Publication Office.
* 61,826 documents of UK legislation, publicly available from the UK legislation portal (http://www.legislation.gov.uk).
* 19,867 cases from European Court of Justice (ECJ), also available from EURLEX.
* 12,554 cases from HUDOC, the repository of the European Court of Human Rights (ECHR) (http://hudoc.echr.coe.int/eng).
* 164,141 cases from various courts across the USA, hosted in the Case Law Access Project portal (https://case.law).
* 76,366 US contracts from EDGAR, the database of US Securities and Exchange Commission (SECOM) (https://www.sec.gov/edgar.shtml).
## Pre-training details
* We trained BERT using the official code provided in Google BERT's github repository (https://github.com/google-research/bert).
* We released a model similar to the English BERT-BASE model (12-layer, 768-hidden, 12-heads, 110M parameters).
* We chose to follow the same training set-up: 1 million training steps with batches of 256 sequences of length 512 with an initial learning rate 1e-4.
* We were able to use a single Google Cloud TPU v3-8 provided for free from [TensorFlow Research Cloud (TFRC)](https://www.tensorflow.org/tfrc), while also utilizing [GCP research credits](https://edu.google.com/programs/credits/research). Huge thanks to both Google programs for supporting us!
* Part of LEGAL-BERT is a light-weight model pre-trained from scratch on legal data, which achieves comparable performance to larger models, while being much more efficient (approximately 4 times faster) with a smaller environmental footprint.
## Models list
| Model name | Model Path | Training corpora |
| ------------------- | ------------------------------------ | ------------------- |
| CONTRACTS-BERT-BASE | `nlpaueb/bert-base-uncased-contracts` | US contracts |
| EURLEX-BERT-BASE | `nlpaueb/bert-base-uncased-eurlex` | EU legislation |
| ECHR-BERT-BASE | `nlpaueb/bert-base-uncased-echr` | ECHR cases |
| LEGAL-BERT-BASE | `nlpaueb/legal-bert-base-uncased` | All |
| LEGAL-BERT-SMALL | `nlpaueb/legal-bert-small-uncased` | All |
## Load Pretrained Model
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")
model = AutoModel.from_pretrained("nlpaueb/legal-bert-base-uncased")
```
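A quick fill-mask sketch in the spirit of the examples in the table below (the sentence comes from that table; exact scores will vary):
```python
# Minimal fill-mask sketch using the pipeline API.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="nlpaueb/legal-bert-base-uncased")
for pred in fill_mask("This [MASK] Agreement is between General Motors and John Murray ."):
    print(pred["token_str"], round(pred["score"], 2))
```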
## Use LEGAL-BERT variants as Language Models
| Corpus | Model / Input text | Masked token | Predictions |
| --------------------------------- | ---------------------------------- | ------------ | ------------ |
| | **BERT-BASE-UNCASED** |
| (Contracts) | This [MASK] Agreement is between General Motors and John Murray . | employment | ('new', '0.09'), ('current', '0.04'), ('proposed', '0.03'), ('marketing', '0.03'), ('joint', '0.02')
| (ECHR) | The applicant submitted that her husband was subjected to treatment amounting to [MASK] whilst in the custody of Adana Security Directorate | torture | ('torture', '0.32'), ('rape', '0.22'), ('abuse', '0.14'), ('death', '0.04'), ('violence', '0.03')
| (EURLEX) | Establishing a system for the identification and registration of [MASK] animals and regarding the labelling of beef and beef products . | bovine | ('farm', '0.25'), ('livestock', '0.08'), ('draft', '0.06'), ('domestic', '0.05'), ('wild', '0.05')
| | **CONTRACTS-BERT-BASE** |
| (Contracts) | This [MASK] Agreement is between General Motors and John Murray . | employment | ('letter', '0.38'), ('dealer', '0.04'), ('employment', '0.03'), ('award', '0.03'), ('contribution', '0.02')
| (ECHR) | The applicant submitted that her husband was subjected to treatment amounting to [MASK] whilst in the custody of Adana Security Directorate | torture | ('death', '0.39'), ('imprisonment', '0.07'), ('contempt', '0.05'), ('being', '0.03'), ('crime', '0.02')
| (EURLEX) | Establishing a system for the identification and registration of [MASK] animals and regarding the labelling of beef and beef products . | bovine | ('domestic', '0.18'), ('laboratory', '0.07'), ('household', '0.06'), ('personal', '0.06'), ('the', '0.04')
| | **EURLEX-BERT-BASE** |
| (Contracts) | This [MASK] Agreement is between General Motors and John Murray . | employment | ('supply', '0.11'), ('cooperation', '0.08'), ('service', '0.07'), ('licence', '0.07'), ('distribution', '0.05')
| (ECHR) | The applicant submitted that her husband was subjected to treatment amounting to [MASK] whilst in the custody of Adana Security Directorate | torture | ('torture', '0.66'), ('death', '0.07'), ('imprisonment', '0.07'), ('murder', '0.04'), ('rape', '0.02')
| (EURLEX) | Establishing a system for the identification and registration of [MASK] animals and regarding the labelling of beef and beef products . | bovine | ('live', '0.43'), ('pet', '0.28'), ('certain', '0.05'), ('fur', '0.03'), ('the', '0.02')
| | **ECHR-BERT-BASE** |
| (Contracts) | This [MASK] Agreement is between General Motors and John Murray . | employment | ('second', '0.24'), ('latter', '0.10'), ('draft', '0.05'), ('bilateral', '0.05'), ('arbitration', '0.04')
| (ECHR) | The applicant submitted that her husband was subjected to treatment amounting to [MASK] whilst in the custody of Adana Security Directorate | torture | ('torture', '0.99'), ('death', '0.01'), ('inhuman', '0.00'), ('beating', '0.00'), ('rape', '0.00')
| (EURLEX) | Establishing a system for the identification and registration of [MASK] animals and regarding the labelling of beef and beef products . | bovine | ('pet', '0.17'), ('all', '0.12'), ('slaughtered', '0.10'), ('domestic', '0.07'), ('individual', '0.05')
| | **LEGAL-BERT-BASE** |
| (Contracts) | This [MASK] Agreement is between General Motors and John Murray . | employment | ('settlement', '0.26'), ('letter', '0.23'), ('dealer', '0.04'), ('master', '0.02'), ('supplemental', '0.02')
| (ECHR) | The applicant submitted that her husband was subjected to treatment amounting to [MASK] whilst in the custody of Adana Security Directorate | torture | ('torture', '1.00'), ('detention', '0.00'), ('arrest', '0.00'), ('rape', '0.00'), ('death', '0.00')
| (EURLEX) | Establishing a system for the identification and registration of [MASK] animals and regarding the labelling of beef and beef products . | bovine | ('live', '0.67'), ('beef', '0.17'), ('farm', '0.03'), ('pet', '0.02'), ('dairy', '0.01')
| | **LEGAL-BERT-SMALL** |
| (Contracts) | This [MASK] Agreement is between General Motors and John Murray . | employment | ('license', '0.09'), ('transition', '0.08'), ('settlement', '0.04'), ('consent', '0.03'), ('letter', '0.03')
| (ECHR) | The applicant submitted that her husband was subjected to treatment amounting to [MASK] whilst in the custody of Adana Security Directorate | torture | ('torture', '0.59'), ('pain', '0.05'), ('ptsd', '0.05'), ('death', '0.02'), ('tuberculosis', '0.02')
| (EURLEX) | Establishing a system for the identification and registration of [MASK] animals and regarding the labelling of beef and beef products . | bovine | ('all', '0.08'), ('live', '0.07'), ('certain', '0.07'), ('the', '0.07'), ('farm', '0.05')
## Evaluation on downstream tasks
See the experiments in the article "LEGAL-BERT: The Muppets straight out of Law School", Chalkidis et al., 2020 (https://arxiv.org/abs/2010.02559).
## Author
Ilias Chalkidis on behalf of [AUEB's Natural Language Processing Group](http://nlp.cs.aueb.gr)
| Github: [@ilias.chalkidis](https://github.com/seolhokim) | Twitter: [@KiddoThe2B](https://twitter.com/KiddoThe2B) |
---
language:
- en
- nl
- de
- fr
- it
- es
license: mit
---
# bert-base-multilingual-uncased-sentiment
This is a bert-base-multilingual-uncased model fine-tuned for sentiment analysis on product reviews in six languages: English, Dutch, German, French, Spanish and Italian. It predicts the sentiment of a review as a number of stars (between 1 and 5).
This model is intended for direct use as a sentiment analysis model for product reviews in any of the six languages above, or for further finetuning on related sentiment analysis tasks.
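A minimal usage sketch (the repository id `nlptown/bert-base-multilingual-uncased-sentiment` below is an assumption for illustration):
```python
# Minimal sentiment-analysis sketch; the model id is an assumption.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
)
print(classifier("I love this product, it works perfectly!"))
# e.g. [{'label': '5 stars', 'score': 0.9...}]
```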
## Training data
Here is the number of product reviews we used for finetuning the model:
| Language | Number of reviews |
| -------- | ----------------- |
| English | 150k |
| Dutch | 80k |
| German | 137k |
| French | 140k |
| Italian | 72k |
| Spanish | 50k |
## Accuracy
The finetuned model obtained the following accuracy on 5,000 held-out product reviews in each of the languages:
- Accuracy (exact) is the exact match on the number of stars.
- Accuracy (off-by-1) is the percentage of reviews where the number of stars the model predicts differs by a maximum of 1 from the number given by the human reviewer.
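For clarity, a tiny sketch of how the two metrics are computed (the star ratings are illustrative):
```python
# Illustrative computation of exact and off-by-1 accuracy.
predicted = [5, 4, 2, 1, 3]
gold      = [5, 3, 2, 3, 3]

exact    = sum(p == g for p, g in zip(predicted, gold)) / len(gold)
off_by_1 = sum(abs(p - g) <= 1 for p, g in zip(predicted, gold)) / len(gold)
print(exact, off_by_1)  # 0.6 0.8
```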
| Language | Accuracy (exact) | Accuracy (off-by-1) |
| -------- | ---------------------- | ------------------- |
| English | 67% | 95%
| Dutch | 57% | 93%
| German | 61% | 94%
| French | 59% | 94%
| Italian | 59% | 95%
| Spanish | 58% | 95%
## Contact
Contact [NLP Town](https://www.nlp.town) for questions, feedback and/or requests for similar models.