---
language: "ca"
tags:
  - masked-lm
  - catalan
  - exbert
license: mit
---

# Calbert: a Catalan Language Model

## Introduction

CALBERT is an open-source language model for Catalan based on the ALBERT architecture.

It is now available on Hugging Face in both its `tiny-uncased` and `base-uncased` (the one you are looking at) versions, and was pretrained on the [OSCAR dataset](https://traces1.inria.fr/oscar/).

For further information or requests, please visit the [GitHub repository](https://github.com/codegram/calbert).

## Pre-trained models

| Model                               | Arch.          | Training data          |
| ----------------------------------- | -------------- | ---------------------- |
| `codegram` / `calbert-tiny-uncased` | Tiny (uncased) | OSCAR (4.3 GB of text) |
| `codegram` / `calbert-base-uncased` | Base (uncased) | OSCAR (4.3 GB of text) |
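
Either checkpoint can be loaded directly by its Hugging Face identifier. A minimal sketch for the tiny variant (the base variant is covered in detail below):

```python
from transformers import AutoModel, AutoTokenizer

# Hypothetical quick-start for the tiny checkpoint; it loads the same way
# as calbert-base-uncased, shown in the next section.
tiny_tokenizer = AutoTokenizer.from_pretrained("codegram/calbert-tiny-uncased")
tiny_model = AutoModel.from_pretrained("codegram/calbert-tiny-uncased")
```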

## How to use Calbert with HuggingFace

#### Load Calbert and its tokenizer:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codegram/calbert-base-uncased")
model = AutoModel.from_pretrained("codegram/calbert-base-uncased")

model.eval() # disable dropout (or leave in train mode to finetune)
```
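
To check that the weights loaded correctly, one quick sanity check (a minimal sketch; the printed count is whatever your environment reports, not a figure from this card) is to count the model's parameters:

```python
# Hypothetical sanity check: sum the number of elements in every weight tensor.
n_params = sum(p.numel() for p in model.parameters())
print(f"codegram/calbert-base-uncased loaded with {n_params:,} parameters")
```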

#### Filling masks using pipeline

```python
from transformers import pipeline

calbert_fill_mask = pipeline("fill-mask", model="codegram/calbert-base-uncased", tokenizer="codegram/calbert-base-uncased")
results = calbert_fill_mask("M'agrada [MASK] això")  # "I like [MASK] this"
# results
# [{'sequence': "[CLS] m'agrada molt aixo[SEP]", 'score': 0.614592969417572, 'token': 61},
#  {'sequence': "[CLS] m'agrada molt铆ssim aixo[SEP]", 'score': 0.06058056280016899, 'token': 4867},
#  {'sequence': "[CLS] m'agrada m茅s aixo[SEP]", 'score': 0.017195818945765495, 'token': 43},
#  {'sequence': "[CLS] m'agrada llegir aixo[SEP]", 'score': 0.016321714967489243, 'token': 684},
#  {'sequence': "[CLS] m'agrada escriure aixo[SEP]", 'score': 0.012185849249362946, 'token': 1306}]

```
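
By default the pipeline returns the five highest-scoring completions. A minimal sketch for requesting a different number, assuming a `transformers` version where the keyword argument is `top_k` (older releases spelled it `topk`):

```python
# Hypothetical: keep only the three most likely completions.
top3 = calbert_fill_mask("M'agrada [MASK] això", top_k=3)
```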

#### Extract contextual embedding features from Calbert output

```python
import torch
# Tokenize in sub-words with SentencePiece
tokenized_sentence = tokenizer.tokenize("M'és una mica igual")  # "I don't really mind"
# ['▁m', "'", 'es', '▁una', '▁mica', '▁igual']

# Encode to vocabulary IDs and add the special start and end tokens
encoded_sentence = tokenizer.encode(tokenized_sentence)
# [2, 109, 7, 71, 36, 371, 1103, 3]
# NB: Can be done in one step: tokenizer.encode("M'és una mica igual")

# Feed tokens to Calbert as a torch tensor (batch dim 1)
encoded_sentence = torch.tensor(encoded_sentence).unsqueeze(0)
embeddings, _ = model(encoded_sentence)
embeddings.size()
# torch.Size([1, 8, 768])
embeddings.detach()
# tensor([[[-0.0261,  0.1166, -0.1075,  ..., -0.0368,  0.0193,  0.0017],
#          [ 0.1289, -0.2252,  0.9881,  ..., -0.1353,  0.3534,  0.0734],
#          [-0.0328, -1.2364,  0.9466,  ...,  0.3455,  0.7010, -0.2085],
#          ...,
#          [ 0.0397, -1.0228, -0.2239,  ...,  0.2932,  0.1248,  0.0813],
#          [-0.0261,  0.1165, -0.1074,  ..., -0.0368,  0.0193,  0.0017],
#          [-0.1934, -0.2357, -0.2554,  ...,  0.1831,  0.6085,  0.1421]]])
```
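
If a single fixed-size vector per sentence is needed rather than per-token features, a common approach (a minimal sketch reusing the `embeddings` tensor from above; mean pooling is one choice among several, not part of the released model) is to average the token embeddings:

```python
# Hypothetical pooling step: average the 8 per-token vectors into one
# 768-dimensional sentence embedding (the batch dimension is kept).
sentence_embedding = embeddings.mean(dim=1)
sentence_embedding.size()
# torch.Size([1, 768])
```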

## Authors

CALBERT was trained and evaluated by [Txus Bach](https://twitter.com/txustice), as part of [Codegram](https://www.codegram.com)'s applied research.

<a href="https://huggingface.co/exbert/?model=codegram/calbert-base-uncased&modelKind=bidirectional&sentence=M%27agradaria%20força%20saber-ne%20més">
	<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>