Unverified commit 8a2d9bc9 authored by Julien Chaumond, committed by GitHub

Add model cards for DeepPavlov models (#3138)

* add empty model cards for every current DeepPavlov model

* fix: replace cyrillic `с` with `c`

* docs: add model cards for current DeepPavlov BERT models

* docs: add links for arXiv preprints
---
language:
- bulgarian
- czech
- polish
- russian
---
# bert-base-bg-cs-pl-ru-cased
SlavicBERT\[1\] \(Slavic \(bg, cs, pl, ru\), cased, 12-layer, 768-hidden, 12-heads, 180M parameters\) was trained
on Russian News and four Wikipedias: Bulgarian, Czech, Polish, and Russian.
The subtoken vocabulary was built on this data. Multilingual BERT was used as the initialization for SlavicBERT.
\[1\]: Arkhipov M., Trofimova M., Kuratov Y., Sorokin A. \(2019\).
[Tuning Multilingual Transformers for Language-Specific Named Entity Recognition](https://www.aclweb.org/anthology/W19-3712/).
ACL Anthology W19-3712.
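
A minimal usage sketch with the `transformers` library is shown below. It assumes the checkpoint is published on the model hub as `DeepPavlov/bert-base-bg-cs-pl-ru-cased` and that a recent `transformers` version with the `AutoModel`/`AutoTokenizer` API is installed.

```python
# Sketch: extract contextual token features with SlavicBERT.
# The hub id "DeepPavlov/bert-base-bg-cs-pl-ru-cased" is an assumption.
import torch
from transformers import AutoModel, AutoTokenizer

name = "DeepPavlov/bert-base-bg-cs-pl-ru-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

# One example sentence per training language (bg, cs, pl, ru).
sentences = [
    "Това е примерно изречение.",
    "Toto je ukázková věta.",
    "To jest przykładowe zdanie.",
    "Это пример предложения.",
]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (4, seq_len, 768)
print(hidden.shape)
```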
---
language:
- english
---
# bert-base-cased-conversational
Conversational BERT \(English, cased, 12-layer, 768-hidden, 12-heads, 110M parameters\) was trained
on the English part of Twitter, Reddit, DailyDialogues\[1\], OpenSubtitles\[2\], Debates\[3\], Blogs\[4\],
and Facebook News Comments. We used this training data to build the vocabulary of English subtokens and took
the English cased version of BERT-base as the initialization for English Conversational BERT.
\[1\]: Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. DailyDialog: A Manually Labelled
Multi-turn Dialogue Dataset. IJCNLP 2017.
\[2\]: P. Lison and J. Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles.
In Proceedings of the 10th International Conference on Language Resources and Evaluation \(LREC 2016\)
\[3\]: Justine Zhang, Ravi Kumar, Sujith Ravi, Cristian Danescu-Niculescu-Mizil. Conversational Flow in Oxford-style Debates. Proceedings of NAACL, 2016.
\[4\]: J. Schler, M. Koppel, S. Argamon and J. Pennebaker \(2006\). Effects of Age and Gender on Blogging.
In Proceedings of the 2006 AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs.
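
As a rough illustration of the conversational subtoken vocabulary, the tokenizer can be loaded through `transformers`; the hub id `DeepPavlov/bert-base-cased-conversational` is an assumption here.

```python
# Sketch: inspect how the conversational vocabulary splits informal English.
# The hub id "DeepPavlov/bert-base-cased-conversational" is assumed.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/bert-base-cased-conversational")
print(tokenizer.tokenize("omg that movie was soooo good lol"))
```

Contextual features can then be extracted with `AutoModel.from_pretrained` as with any BERT-base checkpoint.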
---
language:
- multilingual
---
# bert-base-multilingual-cased-sentence
Sentence Multilingual BERT \(101 languages, cased, 12-layer, 768-hidden, 12-heads, 180M parameters\)
is a representation-based sentence encoder for 101 languages of Multilingual BERT.
It is initialized with Multilingual BERT and then fine-tuned on English MultiNLI\[1\] and on the dev set
of the multilingual XNLI\[2\].
Sentence representations are mean pooled token embeddings in the same manner as in Sentence-BERT\[3\].
\[1\]: Williams A., Nangia N. & Bowman S. \(2017\) A Broad-Coverage Challenge Corpus for Sentence Understanding
through Inference. arXiv preprint [arXiv:1704.05426](https://arxiv.org/abs/1704.05426)
\[2\]: Conneau A., Rinott R., Lample G., Williams A., Bowman S., Schwenk H., Stoyanov V. \(2018\)
XNLI: Evaluating Cross-lingual Sentence Representations.
arXiv preprint [arXiv:1809.05053](https://arxiv.org/abs/1809.05053)
\[3\]: N. Reimers, I. Gurevych \(2019\) Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
arXiv preprint [arXiv:1908.10084](https://arxiv.org/abs/1908.10084)
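
The mean pooling described above can be reproduced roughly as follows. The hub id `DeepPavlov/bert-base-multilingual-cased-sentence` and a recent `transformers` version are assumptions; this is a sketch, not the reference implementation.

```python
# Sketch: Sentence-BERT-style mean pooling of token embeddings.
# The hub id "DeepPavlov/bert-base-multilingual-cased-sentence" is assumed.
import torch
from transformers import AutoModel, AutoTokenizer

name = "DeepPavlov/bert-base-multilingual-cased-sentence"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

sentences = ["A dog is running in the park.", "Собака бегает по парку."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state  # (2, seq_len, 768)

# Average over real tokens only, masking out padding positions.
mask = inputs["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(1) / mask.sum(1)

cos = torch.nn.functional.cosine_similarity(
    sentence_embeddings[0], sentence_embeddings[1], dim=0
)
print(cos.item())
```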
---
language:
- russian
---
# rubert-base-cased-conversational
Conversational RuBERT \(Russian, cased, 12-layer, 768-hidden, 12-heads, 180M parameters\) was trained
on OpenSubtitles\[1\], [Dirty](https://d3.ru/), [Pikabu](https://pikabu.ru/),
and a Social Media segment of the Taiga corpus\[2\]. We assembled a new vocabulary for the Conversational RuBERT model
on this data and initialized the model with [RuBERT](../rubert-base-cased).
\[1\]: P. Lison and J. Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles.
In Proceedings of the 10th International Conference on Language Resources and Evaluation \(LREC 2016\)
\[2\]: Shavrina T., Shapovalova O. \(2017\) To the Methodology of Corpus Construction for Machine Learning:
«Taiga» Syntax Tree Corpus and Parser. In Proceedings of the "CORPORA2017" International Conference, Saint Petersburg, 2017.
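
A minimal loading sketch follows; the hub id `DeepPavlov/rubert-base-cased-conversational` is an assumption.

```python
# Sketch: contextual features for an informal Russian utterance.
# The hub id "DeepPavlov/rubert-base-cased-conversational" is assumed.
import torch
from transformers import AutoModel, AutoTokenizer

name = "DeepPavlov/rubert-base-cased-conversational"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("привет, как дела?", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, num_subtokens, 768)
print(hidden.shape)
```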
---
language:
- russian
---
# rubert-base-cased-sentence
Sentence RuBERT \(Russian, cased, 12-layer, 768-hidden, 12-heads, 180M parameters\)
is a representation-based sentence encoder for Russian. It is initialized with RuBERT and fine-tuned on SNLI\[1\]
Google-translated into Russian and on the Russian part of the XNLI dev set\[2\]. Sentence representations are mean pooled
token embeddings in the same manner as in Sentence-BERT\[3\].
\[1\]: S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning. \(2015\) A large annotated corpus for learning
natural language inference. arXiv preprint [arXiv:1508.05326](https://arxiv.org/abs/1508.05326)
\[2\]: Conneau A., Rinott R., Lample G., Williams A., Bowman S., Schwenk H., Stoyanov V. \(2018\)
XNLI: Evaluating Cross-lingual Sentence Representations.
arXiv preprint [arXiv:1809.05053](https://arxiv.org/abs/1809.05053)
\[3\]: N. Reimers, I. Gurevych \(2019\) Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
arXiv preprint [arXiv:1908.10084](https://arxiv.org/abs/1908.10084)
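
A sketch of the mean pooling described above, assuming the hub id `DeepPavlov/rubert-base-cased-sentence` and a recent `transformers` version:

```python
# Sketch: mean-pooled sentence embeddings for two Russian paraphrases.
# The hub id "DeepPavlov/rubert-base-cased-sentence" is assumed.
import torch
from transformers import AutoModel, AutoTokenizer

name = "DeepPavlov/rubert-base-cased-sentence"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

sentences = ["Погода сегодня хорошая.", "Сегодня отличная погода."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    tokens = model(**inputs).last_hidden_state

# Mask out padding before averaging token embeddings.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (tokens * mask).sum(1) / mask.sum(1)  # (2, 768)
print(torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0))
```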
---
language:
- russian
---
# rubert-base-cased
RuBERT \(Russian, cased, 12-layer, 768-hidden, 12-heads, 180M parameters\) was trained on the Russian part of Wikipedia
and news data. We used this training data to build a vocabulary of Russian subtokens and took a multilingual version
of BERT-base as an initialization for RuBERT\[1\].
\[1\]: Kuratov, Y., Arkhipov, M. \(2019\). Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language.
arXiv preprint [arXiv:1905.07213](https://arxiv.org/abs/1905.07213).
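
A minimal loading sketch, assuming the hub id `DeepPavlov/rubert-base-cased`:

```python
# Sketch: load RuBERT and take the [CLS] representation of a Russian sentence.
# The hub id "DeepPavlov/rubert-base-cased" is assumed.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")
model = AutoModel.from_pretrained("DeepPavlov/rubert-base-cased")

inputs = tokenizer("Москва является столицей России.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
cls_vector = outputs.last_hidden_state[:, 0]  # (1, 768)
print(cls_vector.shape)
```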