---
language: pt
tags:
- portuguese
- brazil
- pt_BR
---

# BR_BERTo

A Brazilian Portuguese (pt_BR) language model for masked-token inference (fill-mask).

## Params

Trained on a corpus of 5_258_624 sentences, with 132_807_374 non-unique tokens (992_418 unique tokens).

Since my machine could not handle a larger model, the final vocabulary size was capped at 54_000 tokens. The remaining parameters are the defaults used in the HuggingFace tutorial:

[How to train a new language model from scratch using Transformers and Tokenizers](https://huggingface.co/blog/how-to-train)
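
As a rough sketch of that setup (following the byte-level BPE tokenizer used in the tutorial; the corpus file path and output directory below are placeholders, not the actual file names), the tokenizer training step with the 54_000-token vocabulary looks like:

```python
# Sketch of the tokenizer training step described above (placeholder paths).
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus_pt_br.txt"],   # placeholder: plain-text training corpus
    vocab_size=54_000,            # vocab size reported above
    min_frequency=2,              # tutorial default
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("BR_BERTo")  # writes vocab.json and merges.txt
```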

## Results
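
The `fill_mask` call below assumes a `fill-mask` pipeline built on top of this model. A minimal sketch (the repo id `rdenadai/BR_BERTo` is an assumption; point it at wherever the model actually lives, or at a local path):

```python
# Build the fill-mask pipeline used in the example below.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="rdenadai/BR_BERTo",      # assumed Hub repo id; adjust if different
    tokenizer="rdenadai/BR_BERTo",
)
```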

```python
fill_mask("gostei muito dessa <mask>")

#[{'sequence': '<s>gostei muito dessa experiência</s>',
#  'score': 0.0719294399023056,
#  'token': 2322,
#  'token_str': 'ĠexperiÃªncia'},
# {'sequence': '<s>gostei muito dessa diferença</s>',
#  'score': 0.05286405608057976,
#  'token': 3472,
#  'token_str': 'ĠdiferenÃ§a'},
# {'sequence': '<s>gostei muito dessa atenção</s>',
#  'score': 0.027575725689530373,
#  'token': 2557,
#  'token_str': 'ĠatenÃ§Ã£o'},
# {'sequence': '<s>gostei muito dessa história</s>',
#  'score': 0.026764703914523125,
#  'token': 1329,
#  'token_str': 'ĠhistÃ³ria'},
# {'sequence': '<s>gostei muito dessa razão</s>',
#  'score': 0.0250675268471241,
#  'token': 3323,
#  'token_str': 'ĠrazÃ£o'},
# {'sequence': '<s>gostei muito dessa resposta</s>',
#  'score': 0.024784332141280174,
#  'token': 2403,
#  'token_str': 'Ġresposta'},
# {'sequence': '<s>gostei muito dessa dose</s>',
#  'score': 0.01720510423183441,
#  'token': 1042,
#  'token_str': 'Ġdose'}]
```