Create README.md

Commit 1eec69a9 authored Feb 14, 2020 by Ilias Chalkidis Committed by Julien Chaumond Feb 13, 2020

Showing with 76 additions and 0 deletions

model_cards/nlpaueb/bert-base-greek-uncased-v1/README.md +76 -0

--- a/model_cards/nlpaueb/bert-base-greek-uncased-v1/README.md
+++ b/model_cards/nlpaueb/bert-base-greek-uncased-v1/README.md
+# GreekBERT
+
+A Greek version of the BERT pre-trained language model.
+
+<img src="https://github.com/nlpaueb/GreekBERT/raw/master/greek-bert-logo.png" width="600"/> 
+
+
+## Pre-training corpora
+
+The pre-training corpora of `bert-base-greek-uncased-v1` include:
+
+* The Greek part of [Wikipedia](https://el.wikipedia.org/wiki/Βικιπαίδεια:Αντίγραφα_της_βάσης_δεδομένων),
+* The Greek part of [European Parliament Proceedings Parallel Corpus](https://www.statmt.org/europarl/), and
+* The Greek part of [OSCAR](https://traces1.inria.fr/oscar/), a cleansed version of [Common Crawl](https://commoncrawl.org).
+
+Future releases will also include:
+
+* The entire corpus of Greek legislation, as published by the [National Publication Office](http://www.et.gr),  
+* The entire corpus of EU legislation (Greek translation), as published in [Eur-Lex](https://eur-lex.europa.eu/homepage.html?locale=en).
+
+## Requirements
+
+We published `bert-base-greek-uncased-v1` as part of [Hugging Face](https://huggingface.co)'s [Transformers](https://github.com/huggingface/transformers) repository, so you need to install the `transformers` library through pip, along with PyTorch or TensorFlow 2.
+
+```
+pip install transformers
+pip install torch  # or: pip install tensorflow
+```
+
+## Pre-process text (Deaccent - Lower)
+
+To use `bert-base-greek-uncased-v1`, you must first pre-process the text: lowercase all letters and remove all Greek diacritics.
+
+```python
+
+import unicodedata
+
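+def strip_accents_and_lowercase(s):
+    # NFD splits each accented letter into a base letter plus combining marks
+    # (category 'Mn'); dropping those marks removes the diacritics.
+    return ''.join(c for c in unicodedata.normalize('NFD', s)
+                   if unicodedata.category(c) != 'Mn').lower()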
+
+accented_string = "Αυτή είναι η Ελληνική έκδοση του BERT."
+unaccented_string = strip_accents_and_lowercase(accented_string)
+
+print(unaccented_string) # αυτη ειναι η ελληνικη εκδοση του bert.
+
+```
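
To see why filtering on the `Mn` category removes the accents, it may help to inspect what NFD normalization produces for a single accented letter. The sketch below is illustrative only; the helper name `decompose` is hypothetical and not part of the model card.

```python
import unicodedata

def decompose(ch):
    """Return (code point, general category) pairs for the NFD form of `ch`."""
    return [(f"U+{ord(c):04X}", unicodedata.category(c))
            for c in unicodedata.normalize("NFD", ch)]

# U+03AE is GREEK SMALL LETTER ETA WITH TONOS.
# NFD yields the base letter (category Ll) followed by a combining accent (Mn).
print(decompose("\u03ae"))  # [('U+03B7', 'Ll'), ('U+0301', 'Mn')]
```

Only the `Ll` code point survives the filter, so the accented letter becomes the plain base letter.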
+
+## Load Pretrained Model 
+
+```python
+from transformers import AutoTokenizer, AutoModel
+
+tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
+model = AutoModel.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
+```
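
The two steps above can be combined into one helper. This is a minimal sketch, assuming PyTorch and a recent `transformers` release in which the tokenizer is callable; the function name `embed` and the constant `MODEL_NAME` are hypothetical and do not come from the model card.

```python
import unicodedata

MODEL_NAME = "nlpaueb/bert-base-greek-uncased-v1"

def strip_accents_and_lowercase(s):
    # Same pre-processing as in the model card: drop combining marks, then lowercase.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn').lower()

def embed(text):
    """Return the final-layer hidden states for `text`.

    Downloads the tokenizer and model on first use, so it needs network access.
    """
    # Imported here so the pre-processing helper works without these packages.
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModel.from_pretrained(MODEL_NAME)
    model.eval()

    # Pre-process first: the model expects lowercase, unaccented input.
    inputs = tokenizer(strip_accents_and_lowercase(text), return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Index 0 holds the last hidden states: (batch, sequence_length, hidden_size).
    return outputs[0]
```

Calling `embed("Αυτή είναι η Ελληνική έκδοση του BERT.")` should return a tensor with one row per token. For repeated calls, loading the tokenizer and model once outside the function would be preferable.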
+
+## Author
+
+Ilias Chalkidis on behalf of [AUEB's Natural Language Processing Group](http://nlp.cs.aueb.gr)
+
+| GitHub: [@ilias.chalkidis](https://github.com/seolhokim) | Twitter: [@KiddoThe2B](https://twitter.com/KiddoThe2B) |
+
+## About Us
+
+[AUEB's Natural Language Processing Group](http://nlp.cs.aueb.gr) develops algorithms, models, and systems that allow computers to process and generate natural language texts.
+
+The group's current research interests include:
+* question answering systems for databases, ontologies, document collections, and the Web, especially biomedical question answering,
+* natural language generation from databases and ontologies, especially Semantic Web ontologies,
+* text classification, including filtering spam and abusive content,
+* information extraction and opinion mining, including legal text analytics and sentiment analysis,
+* natural language processing tools for Greek, for example parsers and named-entity recognizers,
+* machine learning in natural language processing, especially deep learning.
+
+The group is part of the Information Processing Laboratory of the Department of Informatics of the Athens University of Economics and Business.