# RoBERTa Masked Language Model Trained on a French Corpus :robot:


This is a masked language model trained with [RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html) on a small French news corpus (the Leipzig corpora).
The model is built using Huggingface transformers.
The model can be found at [French-Roberta](https://huggingface.co/abhilash1910/french-roberta).


## Specifications


The training corpus is taken from the Leipzig Corpora (French news); the model is trained on a small subset of the corpus (300K).


## Model Specification


The model chosen for training is [Roberta](https://arxiv.org/abs/1907.11692) with the following specifications:
 1. vocab_size=32000
 2. max_position_embeddings=514
 3. num_attention_heads=12
 4. num_hidden_layers=6
 5. type_vocab_size=1


The model is built using `RobertaConfig` from the transformers package, giving a total of 68,124,416 training parameters.
The model is trained for 100 epochs with a GPU batch size of 64.
More details on building custom models can be found in the [HuggingFace Blog](https://huggingface.co/blog/how-to-train).
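As a minimal sketch, the specification above can be reproduced with `RobertaConfig`; this assumes the unlisted settings (e.g. `hidden_size`, `intermediate_size`) keep the RoBERTa defaults of 768 and 3072:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Configuration matching the specification listed above;
# hidden_size / intermediate_size keep the RoBERTa defaults (768 / 3072).
config = RobertaConfig(
    vocab_size=32000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

# A fresh (untrained) masked-language model built from the config
model = RobertaForMaskedLM(config=config)
print(model.num_parameters())  # 68124416 with these settings
```

With these settings the parameter count matches the 68,124,416 reported above; training itself would then follow the custom-model recipe in the linked HuggingFace blog post.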



## Usage Specifications


To use this model, first import the `AutoTokenizer` and `AutoModelWithLMHead` classes from transformers.
Then specify the pre-trained model, which in this case is `abhilash1910/french-roberta`, for both the tokenizer and the model.


```python
from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("abhilash1910/french-roberta")

model = AutoModelWithLMHead.from_pretrained("abhilash1910/french-roberta")
```


After this the model weights will be downloaded, which may take some time.
To test the model, import the `pipeline` module from transformers and create a fill-mask pipeline for inference as follows:


```python
from transformers import pipeline
model_mask = pipeline('fill-mask', model='abhilash1910/french-roberta')
model_mask("Le tweet <mask>.")
```


Some examples with generic French sentences are also provided:

Example 1:


```python
model_mask("À ce jour, <mask> projet a entraîné")
```


Output:


```bash
[{'sequence': '<s>À ce jour, belles projet a entraîné</s>',
  'score': 0.18685665726661682,
  'token': 6504,
  'token_str': 'Ġbelles'},
 {'sequence': '<s>À ce jour,- projet a entraîné</s>',
  'score': 0.0005200508167035878,
  'token': 17,
  'token_str': '-'},
 {'sequence': '<s>À ce jour, de projet a entraîné</s>',
  'score': 0.00045729897101409733,
  'token': 268,
  'token_str': 'Ġde'},
 {'sequence': '<s>À ce jour, du projet a entraîné</s>',
  'score': 0.0004307595663703978,
  'token': 326,
  'token_str': 'Ġdu'},
 {'sequence': '<s>À ce jour," projet a entraîné</s>',
  'score': 0.0004219160182401538,
  'token': 6,
  'token_str': '"'}]
```
 
Example 2:

```python
model_mask("C'est un <mask>")
```

Output:

```bash
[{'sequence': "<s>C'est un belles</s>",
  'score': 0.16440927982330322,
  'token': 6504,
  'token_str': 'Ġbelles'},
 {'sequence': "<s>C'est un de</s>",
  'score': 0.0005495127406902611,
  'token': 268,
  'token_str': 'Ġde'},
 {'sequence': "<s>C'est un du</s>",
  'score': 0.00044988933950662613,
  'token': 326,
  'token_str': 'Ġdu'},
 {'sequence': "<s>C'est un-</s>",
  'score': 0.00044542422983795404,
  'token': 17,
  'token_str': '-'},
 {'sequence': "<s>C'est un\t</s>",
  'score': 0.00037563967634923756,
  'token': 202,
  'token_str': 'ĉ'}]
```
  

## Resources

For all resources, please look at the [HuggingFace](https://huggingface.co/) site and the [Repositories](https://github.com/huggingface).