---
language: tr
---
# Bert-base Turkish Sentiment Model

https://huggingface.co/savasy/bert-base-turkish-sentiment-cased

This model performs sentiment analysis on Turkish text. It is based on BERTurk, a BERT model for the Turkish language: https://huggingface.co/dbmdz/bert-base-turkish-cased


## Dataset

The dataset was built by merging the data from the studies [[2]](#paper-2) and [[3]](#paper-3).

* The study [[2]](#paper-2) gathered movie and product reviews. The product categories are book, DVD, electronics, and kitchen.
  The movie dataset is taken from a cinema web page ([Beyazperde](https://www.beyazperde.com)) and contains
  5331 positive and 5331 negative sentences. Each review is rated on a scale from 0 to 5 by the user
  who wrote it. The study labeled a review positive if its rating is 4 or higher, and negative if it is
  2 or lower. The study also built a Turkish product review benchmark dataset from an online retailer's
  web page, covering several product categories (book, DVD, etc.). These reviews are rated from 1 to 5,
  and most of them have a rating of 5. Each category has 700 positive and 700 negative reviews;
  the average rating is 2.27 for negative reviews and 4.5 for positive reviews.
  This dataset is also used by the study [[1]](#paper-1).

* The study [[3]](#paper-3) collected a tweet dataset. It proposed a new approach for automatically classifying the sentiment of microblog messages, based on robust feature representation and fusion.
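
The rating-to-label rule described for the movie reviews of [[2]](#paper-2) can be sketched as follows. This is only an illustration: the function name `rating_to_label` is hypothetical, the 1/0 encoding follows the label convention used elsewhere in this README, and treating ratings strictly between 2 and 4 as unlabeled is an assumption, since the description above does not say what happens to them.

```python
from typing import Optional


def rating_to_label(rating: float) -> Optional[int]:
    """Map a 0-5 user rating to a sentiment label (1 = positive, 0 = negative)."""
    if rating >= 4:
        return 1  # rating of 4 or higher counts as positive
    if rating <= 2:
        return 0  # rating of 2 or lower counts as negative
    return None  # assumption: in-between ratings get no label


print([rating_to_label(r) for r in (0, 2, 3, 4.5)])
# [0, 0, None, 1]
```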

*Merged Dataset* 

| *size*   | *data* |
|--------|----|
|   8000 |dev.tsv|
|   8262 |test.tsv|
|  32000 |train.tsv|
|  *48290* |*total*|

### The dataset is used by the following papers

<a id="paper-1">[1]</a> Yildirim, Savaş. (2020). Comparing Deep Neural Networks to Traditional Models for Sentiment Analysis in Turkish Language. doi:10.1007/978-981-15-1216-2_12.

<a id="paper-2">[2]</a> Demirtas, Erkin and Mykola Pechenizkiy. (2013). Cross-lingual polarity detection with machine translation. In Proceedings of the Second International Workshop on Issues of Sentiment Discovery and Opinion Mining (WISDOM '13).

<a id="paper-3">[3]</a> Hayran, A., Sert, M. (2017). Sentiment Analysis on Microblog Data based on Word Embedding and Fusion Techniques. IEEE 25th Signal Processing and Communications Applications Conference (SIU 2017), Belek, Turkey.


## Training

```shell
# Data directory and task name passed to run_glue.py below
export GLUE_DIR="./sst-2-newall"
export TASK_NAME=SST-2

python3 run_glue.py \
  --model_type bert \
  --model_name_or_path dbmdz/bert-base-turkish-uncased \
  --task_name "$TASK_NAME" \
  --do_train \
  --do_eval \
  --data_dir "$GLUE_DIR" \
  --max_seq_length 128 \
  --per_gpu_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir "./model"
```
```
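
Since the training command uses the `SST-2` task, the files in the data directory presumably follow the GLUE SST-2 layout: a header row, then one `sentence<TAB>label` pair per line. The README does not show the file contents, so treat that layout as an assumption. Under it, a minimal sketch for writing your own examples could look like this (the helper name `write_sst2_tsv` and the sample rows are hypothetical):

```python
import tempfile
from pathlib import Path


def write_sst2_tsv(rows, path):
    """Write (text, label) pairs as a tab-separated file with a header row."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("sentence\tlabel\n")  # assumed SST-2 style header
        for text, label in rows:
            # Collapse tabs and newlines inside the text so each example stays on one line
            clean = " ".join(text.split())
            f.write(f"{clean}\t{label}\n")


rows = [
    ("bu telefon modelleri çok kaliteli", 1),
    ("Film çok kötü\tve çok sahteydi", 0),  # embedded tab is replaced by a space
]

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "train.tsv"
    write_sst2_tsv(rows, out)
    written = out.read_text(encoding="utf-8")

print(written)
```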


## Results

> 05/10/2020 17:00:43 - INFO - transformers.trainer -   \*\*\*\*\* Running Evaluation \*\*\*\*\*  
> 05/10/2020 17:00:43 - INFO - transformers.trainer -     Num examples = 7999  
> 05/10/2020 17:00:43 - INFO - transformers.trainer -     Batch size = 8  
> Evaluation: 100% 1000/1000 [00:34<00:00, 29.04it/s]  
> 05/10/2020 17:01:17 - INFO - \_\_main__ -   \*\*\*\*\* Eval results sst-2 \*\*\*\*\*  
> 05/10/2020 17:01:17 - INFO - \_\_main__ -     acc = 0.9539942492811602  
> 05/10/2020 17:01:17 - INFO - \_\_main__ -     loss = 0.16348013816401363

Accuracy on the evaluation set is about **95.4%**.
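
As a quick sanity check, the numbers in the log are consistent with each other. The snippet below only re-derives them from the logged values. The variable names are illustrative, and the number of correct predictions is inferred from the logged accuracy, not reported by the log itself.

```python
import math

num_examples = 7999        # "Num examples" from the log
batch_size = 8             # "Batch size" from the log
acc = 0.9539942492811602   # "acc" from the log

# Number of evaluation steps: the last batch is only partially filled
num_batches = math.ceil(num_examples / batch_size)

# Correct predictions implied by the logged accuracy
num_correct = round(acc * num_examples)

print(num_batches)  # 1000, matching "1000/1000" in the progress bar
print(num_correct)  # 7631
```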


## Code Usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
sa = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)

# English: "these phone models are very high quality, I think every part is very special"
p = sa("bu telefon modelleri çok kaliteli , her parçası çok özel bence")
print(p)
# [{'label': 'LABEL_1', 'score': 0.9871089}]
print(p[0]['label'] == 'LABEL_1')
# True

# English: "The film was very bad and very fake"
p = sa("Film çok kötü ve çok sahteydi")
print(p)
# [{'label': 'LABEL_0', 'score': 0.9975505}]
print(p[0]['label'] == 'LABEL_1')
# False
```
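
The pipeline returns generic label names. Judging from the two examples above (a favorable phone review gets `LABEL_1`, a negative film review gets `LABEL_0`), `LABEL_1` appears to mean positive and `LABEL_0` negative. Assuming that mapping holds, a small hypothetical helper such as `readable` can make the output easier to read. It is applied here to the output quoted above, so it runs without downloading the model:

```python
# Assumed mapping, inferred from the example outputs in this README
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "positive"}


def readable(predictions):
    """Replace generic pipeline labels with human-readable sentiment names."""
    return [
        {"label": LABEL_NAMES[p["label"]], "score": p["score"]}
        for p in predictions
    ]


sample = [{"label": "LABEL_1", "score": 0.9871089}]  # output quoted from the example above
print(readable(sample))
# [{'label': 'positive', 'score': 0.9871089}]
```

With a loaded pipeline, the same helper could be called as `readable(sa("..."))`.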


## Test
### Data

Suppose your file has many lines, each holding a comment followed by its label (1 or 0), separated by a tab:

> comment1 ... \t label  
> comment2 ... \t label  
> ...

### Code

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
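sa = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)

input_file = "/path/to/your/file/yourfile.tsv"

# i counts evaluated lines, crr counts correct predictions
i, crr = 0, 0
with open(input_file, encoding="utf-8") as f:
    for line in f:
        fields = line.strip().split("\t")
        # Skip lines that are not exactly "comment<TAB>label"
        if len(fields) == 2:
            i = i + 1
            if i % 100 == 0:
                print(i)  # progress indicator

            # The pipeline returns e.g. "LABEL_1"; keep only the trailing digit
            pred = sa(fields[0])
            pred = pred[0]["label"].split("_")[1]

            if pred == fields[1]:
                crr = crr + 1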

print(crr, i, crr/i)
```
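
The same evaluation can also be structured so that parsing and scoring are separate from the model, which makes the logic testable without loading BERT. The sketch below is one possible arrangement, not the author's method: `parse_tsv_lines`, `accuracy`, and the stand-in `fake_classify` are hypothetical names. `accuracy` accepts any callable that takes a list of texts and returns pipeline-style dictionaries, which the `sa` pipeline does when given a list.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def parse_tsv_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Keep only well-formed "comment<TAB>label" lines as (text, label) pairs."""
    pairs = []
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) == 2:
            pairs.append((fields[0], fields[1]))
    return pairs


def accuracy(pairs: List[Tuple[str, str]],
             classify: Callable[[List[str]], List[Dict]]) -> float:
    """Fraction of pairs whose predicted label digit equals the gold label."""
    if not pairs:
        return 0.0  # avoid dividing by zero on an empty file
    preds = classify([text for text, _ in pairs])
    correct = sum(
        pred["label"].split("_")[1] == gold
        for pred, (_, gold) in zip(preds, pairs)
    )
    return correct / len(pairs)


def fake_classify(texts: List[str]) -> List[Dict]:
    """Stand-in for the real pipeline, for demonstration only."""
    return [
        {"label": "LABEL_1" if "iyi" in t else "LABEL_0", "score": 1.0}
        for t in texts
    ]


demo_lines = ["çok iyi\t1\n", "çok kötü\t0\n", "malformed line\n", "iyi değil\t0\n"]
pairs = parse_tsv_lines(demo_lines)
print(len(pairs), accuracy(pairs, fake_classify))
```

To score a real file, pass the opened file object to `parse_tsv_lines` and the `sa` pipeline as `classify`.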