# An easy-to-use NER application for Turkish

An easy-to-use Python NER (BERT + transfer learning) model for Turkish (named entity recognition).
Thanks to @stefan-it, I applied the following steps for training:

```
cd tr-data

for file in train.txt dev.txt test.txt labels.txt
do
  wget https://schweter.eu/storage/turkish-bert-wikiann/$file
done

cd ..
```
This downloads the pre-processed dataset, with training, dev, and test splits, and puts the files in the `tr-data` folder.

# Fine-tuning

After downloading the dataset, fine-tuning can be started. Just set the following environment variables:
```
export MAX_LENGTH=128
export BERT_MODEL=dbmdz/bert-base-turkish-cased
export OUTPUT_DIR=tr-new-model
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SAVE_STEPS=625
export SEED=1
```
Then run fine-tuning:
```
python3 run_ner.py --data_dir ./tr-data \
--model_type bert \
--labels ./tr-data/labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR-$SEED \
--max_seq_length $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--do_train \
--do_eval \
--do_predict \
--fp16
```
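Once training finishes, the fine-tuned model can be loaded directly from the output directory. A minimal sketch, assuming the environment variables above, so that `$OUTPUT_DIR-$SEED` resolves to `tr-new-model-1`:

```
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# tr-new-model-1 is $OUTPUT_DIR-$SEED from the training run above.
model = AutoModelForTokenClassification.from_pretrained("tr-new-model-1")
tokenizer = AutoTokenizer.from_pretrained("tr-new-model-1")
ner = pipeline('ner', model=model, tokenizer=tokenizer)
```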
# Usage
```
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("savasy/bert-base-turkish-ner-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-ner-cased")

ner = pipeline('ner', model=model, tokenizer=tokenizer)
ner("Mustafa Kemal Atatürk 19 Mayıs 1919'da Samsun'a ayak bastı.")
```
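The pipeline returns a list of dictionaries, one per recognized token, with the standard `transformers` NER pipeline fields such as `word`, `entity`, and `score`. A minimal sketch of printing them, continuing from the snippet above:

```
# Each result is a dict: 'word' is the (sub)token, 'entity' the predicted
# tag, 'score' the confidence. Sub-word pieces may appear as separate entries.
for r in ner("Mustafa Kemal Atatürk 19 Mayıs 1919'da Samsun'a ayak bastı."):
    print(f"{r['word']:>12}  {r['entity']:<7}  {r['score']:.3f}")
```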
# Some results
Data 1: results on the dataset downloaded above.
Eval Results:
* precision = 0.916400580551524
* recall = 0.9342309684101502
* f1 = 0.9252298787412536
* loss = 0.11335893666411284
Test Results:
* precision = 0.9192058759362955
* recall = 0.9303010230367262
* f1 = 0.9247201697271198
* loss = 0.11182546521618497
Data 2: https://github.com/stefan-it/turkish-bert/files/4558187/nerdata.txt

The performance on the data provided by @kemalaraz is as follows:
savas@savas-lenova:~/Desktop/trans/tr-new-model-1$ cat eval_results.txt
* precision = 0.9461980692049029
* recall = 0.959309358847465
* f1 = 0.9527086063783312
* loss = 0.037054269206847804
savas@savas-lenova:~/Desktop/trans/tr-new-model-1$ cat test_results.txt
* precision = 0.9458370635631155
* recall = 0.9588201928530913
* f1 = 0.952284378344882
* loss = 0.035431676572445225
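As a quick sanity check on these numbers, f1 is the harmonic mean of precision and recall. A minimal check against the eval results for Data 1:

```
# f1 = 2PR / (P + R); reproduces the reported eval f1 for Data 1.
precision = 0.916400580551524
recall = 0.9342309684101502
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # ~0.92523, matching the reported f1
```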