gaoqiong / lm-evaluation-harness / Commits

Commit efb46937
authored Mar 03, 2025 by Baber

Merge branch 'main' into convert_gen

# Conflicts:
#   lm_eval/__main__.py
#   lm_eval/evaluator.py

Parents: 7fbf899c, ade01428
Changes: 177

Showing 20 changed files with 941 additions and 2 deletions (+941 -2)
lm_eval/tasks/evalita_llm/_evalita-mp_wic_p4.yaml         +6    -0
lm_eval/tasks/evalita_llm/_evalita-mp_wic_p5.yaml         +6    -0
lm_eval/tasks/evalita_llm/_evalita-mp_wic_p6.yaml         +6    -0
lm_eval/tasks/evalita_llm/_evalita-mp_wic_tasks.yaml      +9    -0
lm_eval/tasks/evalita_llm/_faq_template_yaml              +8    -0
lm_eval/tasks/evalita_llm/_hs_template_yaml               +9    -0
lm_eval/tasks/evalita_llm/_ls_template_yaml               +16   -0
lm_eval/tasks/evalita_llm/_ner_template_yaml              +14   -0
lm_eval/tasks/evalita_llm/_re_template_yaml               +14   -0
lm_eval/tasks/evalita_llm/_sa_template_v2_yaml            +9    -0
lm_eval/tasks/evalita_llm/_sa_template_yaml               +9    -0
lm_eval/tasks/evalita_llm/_sum_template_fp-small_yaml     +10   -0
lm_eval/tasks/evalita_llm/_sum_template_fp_yaml           +9    -0
lm_eval/tasks/evalita_llm/_sum_template_yaml              +11   -0
lm_eval/tasks/evalita_llm/_te_template_yaml               +13   -0
lm_eval/tasks/evalita_llm/_wic_template_yaml              +14   -0
lm_eval/tasks/evalita_llm/metrics.py                      +165  -0
lm_eval/tasks/evalita_llm/utils.py                        +576  -0
lm_eval/tasks/fda/task.py                                 +3    -1
lm_eval/tasks/galician_bench/README.md                    +34   -1
lm_eval/tasks/evalita_llm/_evalita-mp_wic_p4.yaml  0 → 100644

tag: evalita-mp_wic_tasks
task: evalita-mp_wic_prompt-4
task_alias: prompt-4
include: _wic_template_yaml
doc_to_text: "Devi determinare se una stessa parola usata in due frasi differenti ha lo stesso significato in entrambi i contesti. La parola '{{sentence1[start1:end1]}}' nella frase '{{sentence1}}' ha lo stesso significato della parola '{{sentence2[start2:end2]}}' nella frase '{{sentence2}}'?\nA: Sì\nB: No\nRisposta:"
doc_to_choice: ["B", "A"]
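A note on the doc_to_text field above: it is a Jinja template, and the {{sentence1[start1:end1]}} subscripts slice the sentence by character offsets to recover the target word. Below is a small, self-contained sketch of how such a template renders; the document values are invented, and in practice the harness performs this rendering internally.

# Illustrative sketch only (not part of this commit): rendering a doc_to_text
# fragment with jinja2. The field names mirror the word_in_context dataset;
# the sentence and offsets below are made up.
from jinja2 import Template

doc = {
    "sentence1": "La banca è vicino al fiume.",
    "start1": 3,
    "end1": 8,  # characters 3-8 cover "banca"
}
tmpl = Template("La parola '{{sentence1[start1:end1]}}' nella frase '{{sentence1}}'")
print(tmpl.render(**doc))
# La parola 'banca' nella frase 'La banca è vicino al fiume.'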
lm_eval/tasks/evalita_llm/_evalita-mp_wic_p5.yaml  0 → 100644

tag: evalita-mp_wic_tasks
task: evalita-mp_wic_prompt-5
task_alias: prompt-5
include: _wic_template_yaml
doc_to_text: "La parola: '{{sentence1[start1:end1]}}' nella frase: '{{sentence1}}' e la parola: '{{sentence2[start2:end2]}}' nella frase: '{{sentence2}}'"
doc_to_choice: ["non hanno lo stesso significato", "hanno lo stesso significato"]
lm_eval/tasks/evalita_llm/_evalita-mp_wic_p6.yaml  0 → 100644

tag: evalita-mp_wic_tasks
task: evalita-mp_wic_prompt-6
task_alias: prompt-6
include: _wic_template_yaml
doc_to_text: "Devi determinare se una stessa parola usata in due frasi differenti ha lo stesso significato in entrambi i contesti. La parola: '{{sentence1[start1:end1]}}' nella frase: '{{sentence1}}' e la parola: '{{sentence2[start2:end2]}}' nella frase: '{{sentence2}}'"
doc_to_choice: ["non hanno lo stesso significato", "hanno lo stesso significato"]
lm_eval/tasks/evalita_llm/_evalita-mp_wic_tasks.yaml  0 → 100644

group: evalita-mp_wic
group_alias: word-in-context
task:
  - evalita-mp_wic_tasks # this has to match the tag in the task yaml file
aggregate_metric_list:
  - metric: f1
    weight_by_size: True
metadata:
  version: 1
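The group file above ties the three prompt variants together: each prompt YAML carries tag: evalita-mp_wic_tasks, and the group aggregates their f1 scores weighted by subtask size. A minimal sketch of how such a group would typically be invoked through the harness Python API follows; the model name and the limit value are illustrative placeholders, not part of this commit.

# Hedged usage sketch, assuming the evalita_llm task folder above is installed
# with the harness. "EleutherAI/pythia-160m" and limit=10 are placeholders for
# a quick smoke test, not recommended settings.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["evalita-mp_wic"],  # the group: name defined above
    limit=10,
)
print(results["results"])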
lm_eval/tasks/evalita_llm/_faq_template_yaml  0 → 100644

dataset_path: evalitahf/faq
test_split: test_1
fewshot_split: dev_1
doc_to_target: !function utils.faq_doc_to_target
doc_to_choice: ["A", "B", "C", "D"]
output_type: multiple_choice
metadata:
  version: 1
lm_eval/tasks/evalita_llm/_hs_template_yaml  0 → 100644

dataset_path: evalitahf/hatespeech_detection
output_type: multiple_choice
test_split: test_all
fewshot_split: dev
validation_split: dev
doc_to_target: hs # 0 = Falso, 1 = Vero
doc_to_choice: ["Falso", "Vero"]
metadata:
  version: 1
lm_eval/tasks/evalita_llm/_ls_template_yaml  0 → 100644

dataset_path: evalitahf/lexical_substitution
test_split: test
validation_split: dev
fewshot_split: dev
output_type: generate_until
generation_kwargs:
  until:
    - "</s>"
doc_to_target: !function utils.ls_doc_to_target
process_results: !function utils.ls_process_results
metric_list:
  - metric: f1
    higher_is_better: True
    aggregation: !function metrics._aggreg_ls
metadata:
  version: 1
lm_eval/tasks/evalita_llm/_ner_template_yaml  0 → 100644

dataset_path: evalitahf/entity_recognition
output_type: generate_until
generation_kwargs:
  until:
    - "</s>"
    - "\n"
doc_to_target: !function utils.ner_doc_to_target
process_results: !function utils.ner_process_results
metric_list:
  - metric: f1
    higher_is_better: True
    aggregation: !function metrics._aggreg_ner
metadata:
  version: 1
lm_eval/tasks/evalita_llm/_re_template_yaml  0 → 100644

dataset_path: evalitahf/relation_extraction
test_split: test
output_type: generate_until
generation_kwargs:
  until:
    - "</s>"
doc_to_target: !function utils.re_doc_to_target
process_results: !function utils.rel_process_results_v3
metric_list:
  - metric: f1
    higher_is_better: True
    aggregation: !function metrics._aggreg_rel
metadata:
  version: 1
lm_eval/tasks/evalita_llm/_sa_template_v2_yaml  0 → 100644

dataset_path: evalitahf/sentiment_analysis
output_type: multiple_choice
test_split: test
fewshot_split: train
validation_split: test
doc_to_target: !function utils.sa_doc_to_target_v2
doc_to_choice: ["positivo", "negativo", "neutrale", "misto"]
metadata:
  version: 1
lm_eval/tasks/evalita_llm/_sa_template_yaml  0 → 100644

dataset_path: evalitahf/sentiment_analysis
output_type: multiple_choice
test_split: test
fewshot_split: train
validation_split: test
doc_to_target: !function utils.sa_doc_to_target
doc_to_choice: !function utils.sa_doc_to_choice
metadata:
  version: 1
lm_eval/tasks/evalita_llm/_sum_template_fp-small_yaml  0 → 100644

dataset_path: evalitahf/summarization-fp
output_type: generate_until
generation_kwargs:
  until:
    - "</s>"
test_split: test_100
fewshot_split: dev
doc_to_target: "{{target}}"
metadata:
  version: 1
lm_eval/tasks/evalita_llm/_sum_template_fp_yaml  0 → 100644

dataset_path: ARTeLab/fanpage
output_type: generate_until
generation_kwargs:
  until:
    - "</s>"
test_split: test
doc_to_target: "{{target}}"
metadata:
  version: 1.0
lm_eval/tasks/evalita_llm/_sum_template_yaml  0 → 100644

dataset_path: silvia-casola/WITS
output_type: generate_until
generation_kwargs:
  until:
    - "</s>"
test_split: test_100
fewshot_split: dev
#test_split: train
doc_to_target: "{{summary}}"
metadata:
  version: 1
lm_eval/tasks/evalita_llm/_te_template_yaml  0 → 100644

dataset_path: evalitahf/textual_entailment
output_type: multiple_choice
test_split: test
fewshot_split: dev
validation_split: dev
doc_to_target: "{{ 0 if entailment == 'SI' else 1 }}"
doc_to_choice: ["Sì", "No"]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1
lm_eval/tasks/evalita_llm/_wic_template_yaml  0 → 100644

dataset_path: evalitahf/word_in_context
dataset_name: default
output_type: multiple_choice
test_split: test
fewshot_split: dev
validation_split: dev
doc_to_target: label # 0: No, 1: Si
doc_to_choice: ["No", "Sì"]
metric_list:
  - metric: f1
    higher_is_better: true
    aggregation: f1
metadata:
  version: 1.0
lm_eval/tasks/evalita_llm/metrics.py  0 → 100644

import torch
from sklearn.metrics import f1_score, precision_score, recall_score


inference_decorator = (
    torch.inference_mode if torch.__version__ >= "2.0.0" else torch.no_grad
)


def _aggreg_ls(predictions):
    """
    Custom aggregation to compute corpus-level metrics for the lexical substitution task.
    predictions is a list of tuples (prec, has_answ, has_annotation):
    prec is the precision before dividing by |A|,
    has_answ is 0 if the model did not produce any answer,
    has_annotation is 0 if the gold answer is empty (no synonyms from annotators).
    """
    # get |A| and |T| to compute the final precision and recall
    A = sum([p[1] for p in predictions])
    T = sum([p[2] for p in predictions])
    # compute the final precision and recall
    if A == 0:
        prec = sum([p[0] for p in predictions]) / 1
    else:
        prec = sum([p[0] for p in predictions]) / A
    if T == 0:
        rec = sum([p[0] for p in predictions]) / 1
    else:
        rec = sum([p[0] for p in predictions]) / T
    # compute the final F1 score
    f1 = 0
    if prec + rec != 0:
        f1 = (2 * prec * rec) / (prec + rec)
    return f1


def _aggreg_sa_v2(predictions):
    """
    This aggregation treats the sentiment analysis task as a multiple-choice one with four classes;
    the F1 score is the average of the per-class F1 scores weighted by the number of samples.
    See sklearn.metrics.f1_score for more details.
    """
    predictions, references = zip(*predictions)
    f1 = f1_score(references, predictions, average="weighted")
    return f1


def _aggreg_sa(predictions):
    """
    Custom aggregation function for the sentiment analysis task.
    The original task computes the F1 score for each class and then averages them.
    Since the prompt casts the task to a multiple-choice one, we need to aggregate the results differently.
    """
    # split the predictions and references in two lists (pred is a tuple)
    predictions, references = zip(*predictions)

    """
    Class 0: positivo -> 'opos': 1, 'oneg': 0
    Class 1: negativo -> 'opos': 0, 'oneg': 1
    etc.
    """

    def _map_to_original_labels(x):
        """
        Return two separate lists of labels for opos and oneg.
        x is a list of integers.
        """
        opos = []
        oneg = []
        for i in x:
            if i == 0:
                # positive
                opos.append(1)
                oneg.append(0)
            elif i == 1:
                # negative
                opos.append(0)
                oneg.append(1)
            elif i == 2:
                # neutral
                opos.append(0)
                oneg.append(0)
            elif i == 3:
                # mixed
                opos.append(1)
                oneg.append(1)
            else:
                pass
        return opos, oneg

    pred_opos, pred_oneg = _map_to_original_labels(predictions)
    ref_opos, ref_oneg = _map_to_original_labels(references)

    opos_f1 = f1_score(ref_opos, pred_opos, average=None)
    opos_f1_c0 = f1_score(ref_opos, pred_opos, average=None)[0]
    if len(opos_f1) > 1:
        opos_f1_c1 = opos_f1[1]
    else:
        opos_f1_c1 = 0

    # oneg class
    oneg_prec_c0, oneg_prec_c1 = precision_score(
        ref_oneg, pred_oneg, labels=[0, 1], average=None
    )
    oneg_rec_c0, oneg_rec_c1 = recall_score(
        ref_oneg, pred_oneg, labels=[0, 1], average=None
    )
    oneg_f1 = f1_score(ref_oneg, pred_oneg, average=None)
    oneg_f1_c0 = f1_score(ref_oneg, pred_oneg, average=None)[0]
    if len(oneg_f1) > 1:
        oneg_f1_c1 = f1_score(ref_oneg, pred_oneg, average=None)[1]
    else:
        oneg_f1_c1 = 0

    # average F1 score for each class (opos and oneg)
    f1_score_opos = (opos_f1_c0 + opos_f1_c1) / 2
    f1_score_oneg = (oneg_f1_c0 + oneg_f1_c1) / 2
    # average F1 score over the two classes
    f1_final = (f1_score_opos + f1_score_oneg) / 2
    return f1_final


def _aggreg_ner(predictions):
    pred, ref = zip(*predictions)
    # concatenate all the predictions and references
    all_pred = []
    for p in pred:
        all_pred.extend(p)
    all_ref = []
    for r in ref:
        all_ref.extend(r)
    # compute the F1 score
    f1 = f1_score(all_ref, all_pred, average=None)
    if len(f1) > 1:
        f1_sum = sum(f1[:-1]) / (len(f1) - 1)
    else:
        f1_sum = f1[0]
    return f1_sum


def _aggreg_rel(predictions):
    pred, ref = zip(*predictions)
    # concatenate all the predictions and references
    all_pred = []
    for p in pred:
        all_pred.extend(p)
    all_ref = []
    for r in ref:
        all_ref.extend(r)
    # compute the F1 score
    f1 = f1_score(all_ref, all_pred, average="macro")
    return f1


# ------------------------ DOCUMENT DATING ---------------------------
def _aggreg_dd(items):
    unzipped_list = list(zip(*items))
    golds = unzipped_list[0]
    preds = unzipped_list[1]
    fscore = f1_score(golds, preds, average="macro")
    return fscore
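To make the corpus-level aggregation in _aggreg_ls concrete: each per-document tuple contributes its precision mass, while the second and third elements count the documents that received an answer (|A|) and the documents that have gold annotations (|T|). A tiny, hypothetical worked example:

# Hypothetical inputs in the (prec, has_answ, has_annotation) format produced
# by utils.ls_process_results; the numbers are invented for illustration.
sample = [
    (1.0, 1, 1),  # every predicted synonym matched the annotators
    (0.5, 1, 1),  # half of the annotator frequency mass was recovered
    (0.0, 0, 1),  # no answer produced, but gold annotations exist
]
A = sum(p[1] for p in sample)         # 2 documents with an answer
T = sum(p[2] for p in sample)         # 3 documents with annotations
prec = sum(p[0] for p in sample) / A  # 1.5 / 2 = 0.75
rec = sum(p[0] for p in sample) / T   # 1.5 / 3 = 0.5
f1 = 2 * prec * rec / (prec + rec)    # 0.6, which is what _aggreg_ls(sample) returns
print(f1)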
lm_eval/tasks/evalita_llm/utils.py  0 → 100644

import logging

from evaluate import load
from sklearn.metrics import f1_score


eval_logger = logging.getLogger("lm-eval")


# ---------------------- SENTIMENT ANALYSIS ----------------------
def sa_doc_to_target(x):
    """
    Extract the target from the dataset for sentiment analysis.
    """
    opos = x["opos"]
    oneg = x["oneg"]
    # the returned indexes match the choices in sa_doc_to_choice
    if opos == "1" and oneg == "0":
        return 0
    elif opos == "0" and oneg == "1":
        return 1
    elif opos == "0" and oneg == "0":
        return 2
    elif opos == "1" and oneg == "1":
        return 3
    else:
        pass


def sa_doc_to_target_v2(x):
    """
    Extract the target from the dataset for sentiment analysis.
    """
    opos = x["opos"]
    oneg = x["oneg"]
    # the returned indexes match the choices in sa_doc_to_choice
    if opos == "1" and oneg == "0":
        return 0
    elif opos == "0" and oneg == "1":
        return 1
    elif opos == "0" and oneg == "0":
        return 2
    elif opos == "1" and oneg == "1":
        return 3
    else:
        pass


def sa_doc_to_choice(x):
    """
    Return the choices from the dataset for sentiment analysis.
    """
    return ["Positivo", "Negativo", "Neutrale", "Misto"]


# ---------------------- LEXICAL SUBSTITUTION ----------------------
NO_SYN_STRING = "&&NOSYN&&"


def _ls_gold_to_target(x):
    """
    Generate the gold target string for the lexical substitution task.
    """
    # all_answers = [(i["word"], i["count"]) for i in x["answers"]]
    if len(x["answers"]) == 0:
        return NO_SYN_STRING
    ans_str = ""
    for i in x["answers"]:
        ans_str += i["word"] + "$$" + str(i["count"]) + "::"
    if len(ans_str) != 0 and ans_str[-2] == ":":
        ans_str = ans_str[:-2]
    # print(ans_str)
    return ans_str


def ls_doc_to_target(x):
    """
    Generate the target for the lexical substitution task.
    """
    if len(x["answers"]) == 0:
        return NO_SYN_STRING
    ans_str = ""
    for i in x["answers"]:
        ans_str += i["word"] + ", "
    if len(ans_str) != 0 and ans_str[-2] == ",":
        ans_str = ans_str[:-2]
    return ans_str


def _ls_split_gold(x):
    """
    Split the gold string into parallel lists of words and frequencies.
    """
    if x == NO_SYN_STRING:
        return [], []
    answers = x.split("::")
    words = []
    freqs = []
    if len(answers) != 0:
        for a in answers:
            if "$$" in a:
                word, count = a.split("$$")
                words.append(word)
                try:
                    freqs.append(int(count))
                except ValueError:
                    freqs.append(0)
    return words, freqs


def ls_process_results(doc, results):
    """
    Process the results of the evaluation for the lexical substitution task
    (look at coqa for another example).
    """
    gold_to_target = _ls_gold_to_target(doc)
    words, freqs = _ls_split_gold(gold_to_target)
    prec = 0
    # Consider at most the first 10 synonyms
    results = split_text_with_regex(results[0], LS_SPLIT_REGEX)
    results = results[: min(10, len(results))]
    # Remove non-alphabetic characters from the word at the end of the list
    if results:  # Check if results is not empty
        results[-1] = "".join(char for char in results[-1] if char.isalpha())
    has_answ = 0 if len(results) == 0 else 1  # so we can compute |A|
    has_annotation = 0 if len(words) == 0 else 1  # so we can compute |T|
    matching_res = []  # for debugging
    for r in results:
        if r in words:
            # get the frequency of the synonym from the annotators
            idx = words.index(r.strip())
            prec += freqs[idx]
            matching_res.append(r)
    # In the case of the OOT (out of ten) subtask, this normalization should not be applied
    # ai = len(results) if len(results) != 0 else 1
    # prec = prec / ai
    Hi = sum(freqs)
    if Hi != 0:
        prec = prec / Hi
    else:
        eval_logger.debug("H_i is 0")
    return {"f1": (prec, has_answ, has_annotation)}


# ---------------------- NER ----------------------
NO_ENT_STRING = "&&NOENT&&"
NER_ENTITY_SEPARATOR = ","
NER_TYPE_SEPARATOR = "$"
NER_MAPPING_V2 = {"PER": 0, "LOC": 1, "ORG": 2, NO_ENT_STRING: 3, "O": 4}
NER_MAPPING = {"PER": 0, "LOC": 1, "ORG": 2, "O": 3}


def _ner_gold_to_target(x: list) -> list:
    """
    Convert the gold entities to the target format according to NER_MAPPING.
    """
    res = [NER_MAPPING[e["type"]] for e in x]
    return res


def _ner_gold_to_target_v2(x: list) -> list:
    """
    Convert the gold entities to the target format according to NER_MAPPING.
    """
    res = [NER_MAPPING[e["type"]] for e in x]
    return res


def ner_doc_to_target(doc):
    ents = doc["entities"]
    targ_str = ""
    # format: Entità$Tipo,Entità$Tipo
    if ents == []:
        return NO_ENT_STRING
    else:
        for e in ents:
            targ_str += (
                e["entity_text"] + NER_TYPE_SEPARATOR + e["type"] + NER_ENTITY_SEPARATOR
            )
        return targ_str[:-1]


def ner_process_results(doc, results):
    """
    Process the results of the Named Entity Recognition task.
    """
    # each document has a list of entities with the following format:
    # [{"entity_text": "string", "type": "string"}]
    gold = doc["entities"]
    raw_results = results[0]
    results = _ner_process_raw_output(raw_results)
    gold_labels = _ner_gold_to_target(gold)
    res_labels = [0] * len(gold_labels)
    matched_gold_idx = []
    if len(results) > len(gold):
        for r in results:
            r_text = r[0]
            r_type = r[1]
            for i in range(len(gold)):
                if r_text == gold[i]["entity_text"] and r_type == gold[i]["type"]:
                    res_labels[i] = NER_MAPPING[r_type]
                    matched_gold_idx.append(i)
        # Since we have more results than gold, we artificially set to false positive the remaining labels
        # extend gold label list
        for i in range(len(results) - len(gold)):
            gold_labels.append(3)
            res_labels.append(2)
    elif len(results) == 0 and len(gold) == 0:
        res_labels = [3]
        gold_labels = res_labels
    else:
        # len(results) <= len(gold)
        for r in results:
            r_text = r[0]
            r_type = r[1]
            for i in range(len(gold)):
                if r_text == gold[i]["entity_text"] and r_type == gold[i]["type"]:
                    res_labels[i] = NER_MAPPING[r_type]
                    matched_gold_idx.append(i)
        # we map all wrong predictions to the "O" class
        for i in range(len(gold_labels)):
            if i in matched_gold_idx:
                continue
            if gold_labels[i] == 1:
                res_labels[i] = 3
            elif gold_labels[i] == 0:
                res_labels[i] = 3
            else:
                res_labels[i] = 3
    assert len(gold_labels) == len(res_labels)
    return {"f1": (res_labels, gold_labels)}


def ner_process_results_v2(doc, results):
    """
    Process the results of the Named Entity Recognition task.
    This version explicitly considers and scores the case where the model responds that there are no entities.
    """
    # each document has a list of entities with the following format:
    # [{"entity_text": "string", "type": "string"}]
    gold = doc["entities"]
    raw_results = results[0]
    results = _ner_process_raw_output_v2(raw_results)
    # eval_logger.debug(f"results {results}")
    # eval_logger.debug(f"gold {gold}")
    gold_labels = _ner_gold_to_target_v2(gold)
    res_labels = [0] * len(gold_labels)
    matched_gold_idx = []
    if len(results) > len(gold):
        for r in results:
            # print(r)
            r_text = r[0]
            r_type = r[1]
            for i in range(len(gold)):
                if r_text == gold[i]["entity_text"] and r_type == gold[i]["type"]:
                    res_labels[i] = NER_MAPPING[r_type]
                    matched_gold_idx.append(i)
        # Since we have more results than gold, we artificially set to false positive the remaining labels
        # extend gold label list
        for i in range(len(results) - len(gold)):
            # gold_labels.append(3)
            # res_labels.append(2)
            gold_labels.append(4)
            res_labels.append(3)
    elif len(results) == 0 and len(gold) == 0:
        # res_labels = [random.choice([0, 1, 2, 3])]
        res_labels = [3]
        gold_labels = res_labels
    elif len(results) == 1 and results[0] == NO_ENT_STRING:
        # res_labels = [3]
        res_labels = [4]
        gold_labels = res_labels
    else:
        # len(results) <= len(gold)
        for r in results:
            r_text = r[0]
            r_type = r[1]
            for i in range(len(gold)):
                if r_text == gold[i]["entity_text"] and r_type == gold[i]["type"]:
                    res_labels[i] = NER_MAPPING[r_type]
                    matched_gold_idx.append(i)
        # we map all wrong predictions to the "O" class
        for i in range(len(gold_labels)):
            if i in matched_gold_idx:
                continue
            if gold_labels[i] == 1:
                # res_labels[i] = 2
                res_labels[i] = 4
            elif gold_labels[i] == 0:
                # res_labels[i] = 1
                res_labels[i] = 4
            else:
                res_labels[i] = 4
    assert len(gold_labels) == len(res_labels)
    return {"f1": (res_labels, gold_labels)}


def _ner_process_raw_output(llm_result: str) -> list[tuple]:
    if NO_ENT_STRING in llm_result:
        return []
    if llm_result == "":
        return ["WRONG"]
    tmp_results = llm_result.split(NER_ENTITY_SEPARATOR)
    results = []
    for res in tmp_results:
        r = res.strip()
        # split on type separator
        r_text = ""
        r_type = ""
        r_splitted = r.split(NER_TYPE_SEPARATOR)
        if len(r_splitted) < 2:
            r_text = r_splitted[0]
            r_type = ""
        else:
            r_text = r_splitted[0]
            r_type = r_splitted[1]
        if r_text != "":
            results.append((r_text, r_type.upper()))
    return results


def _ner_process_raw_output_v2(llm_result: str) -> list[tuple]:
    if NO_ENT_STRING in llm_result:
        return [NO_ENT_STRING]
    if llm_result == "":
        return ["WRONG"]
    tmp_results = llm_result.split(NER_ENTITY_SEPARATOR)
    results = []
    for res in tmp_results:
        r = res.strip()
        # split on type separator
        r_text = ""
        r_type = ""
        r_splitted = r.split(NER_TYPE_SEPARATOR)
        if len(r_splitted) < 2:
            r_text = r_splitted[0]
            r_type = ""
        else:
            r_text = r_splitted[0]
            r_type = r_splitted[1]
        if r_text != "":
            results.append((r_text, r_type.upper()))
    return results


# ---------------------- RELATION EXTRACTION ----------------------
def _rel_process_raw_output(llm_result: str) -> list[str]:
    if NO_REL_STRING in llm_result:
        return []
    if llm_result == "":
        return ["WRONG"]
    tmp_results = llm_result.split(INTER_REL_SEPARATOR)
    relations = []
    for res in tmp_results:
        r_text1 = ""
        r_text2 = ""
        r_splitted = res.split(INTRA_REL_SEPARATOR)
        if len(r_splitted) < 2:
            r_text1 = r_splitted[0].strip()
            r_text2 = ""
        else:
            r_text1 = r_splitted[0].strip()
            r_text2 = r_splitted[1].strip()
        relations.append((r_text1, r_text2))
    assert len(relations) == len(tmp_results)
    return relations


INTER_REL_SEPARATOR = "%"
INTRA_REL_SEPARATOR = "$"
NO_REL_STRING = "&&NOREL&&"


def re_doc_to_target(doc):
    ents = doc["relations"]
    targ_str = ""
    # format: Entità$Tipo%Entità$Tipo
    if ents == []:
        return NO_ENT_STRING
    else:
        for e in ents:
            targ_str += e[0] + INTRA_REL_SEPARATOR + e[1] + INTER_REL_SEPARATOR
        return targ_str[:-1]


def _rel_gold_to_target(x: list) -> list:
    if x == []:
        return [0]
    else:
        return [1] * len(x)


def rel_doc_to_target(doc):
    rel = doc["relations"]
    targ_str = ""
    # format: misura1$result1%misura2$result2
    if rel == []:
        return NO_REL_STRING
    else:
        for r in rel:
            targ_str += r[0] + "$" + r[1] + "%"
        return targ_str[:-1]


def _extract_relations(results):
    relations = []
    for r in results:
        r_text1 = ""
        r_text2 = ""
        r_splitted = r.split(INTRA_REL_SEPARATOR)
        if len(r_splitted) < 2:
            r_text1 = r_splitted[0]
            r_text2 = ""
        else:
            r_text1 = r_splitted[0]
            r_text2 = r_splitted[1]
        relations.append((r_text1, r_text2))
    assert len(relations) == len(results)
    return relations


def rel_process_results_v3(doc, results):
    """
    Process the results of the Relation Extraction task without considering the order of the extracted relations.
    """
    # each document has a list of relations with the following format:
    # [[text1, text2], [text3, text4]]
    gold = doc["relations"]
    raw_results = results[0]
    has_results = 0 if NO_REL_STRING in raw_results else 1
    has_gold = 1 if gold != [] else 0
    res_labels = []
    gold_labels = []
    if has_results == 0 and has_gold:
        # False negative
        gold_labels = _rel_gold_to_target(gold)
        res_labels = [0] * len(gold_labels)
    elif has_results == 0 and has_gold == 0:
        # True negative
        gold_labels = _rel_gold_to_target(gold)
        res_labels = gold_labels
    elif has_results and has_gold == 0:
        # False positive
        gold_labels = _rel_gold_to_target(gold)
        res_labels = [1] * len(gold_labels)
    else:
        results = _rel_process_raw_output(raw_results)
        # results = raw_results.split(INTER_REL_SEPARATOR)
        gold_labels = _rel_gold_to_target(gold)
        res_labels = [0] * len(gold_labels)
        assert len(gold) > 0
        for i in range(len(gold)):
            for j in range(len(results)):
                r_text1 = results[j][0]
                r_text2 = results[j][1]
                if r_text1 == gold[i][0] and r_text2 == gold[i][1]:
                    # list of lists
                    res_labels[i] = 1
                    results[j] = ("DELETED", "DELETED")
                elif r_text1 == "DELETED" and r_text2 == "DELETED":
                    continue
                else:
                    pass
        # if there are more predictions than gold, we set the remaining predictions to false positive
        if len(results) - len(gold) > 0:
            for i in range(len(results) - len(gold)):
                if results[i] == ("DELETED", "DELETED"):
                    continue
                res_labels.append(1)
                gold_labels.append(0)
    assert len(gold_labels) == len(res_labels)
    return {"f1": (res_labels, gold_labels)}


LS_SPLIT_REGEX = r"[^,]+"


def split_text_with_regex(text, pattern):
    """
    pattern: str - a regex pattern to match the text
    text: str - the text to split
    """
    import re

    # Get text with model-generated words for comparison with the gold standard
    text = text.split("\n")[0]
    # Find all matches for the pattern
    matches = re.findall(pattern, text)
    # Split each matched segment further if it contains a comma and is quoted
    result = []
    for match in matches:
        if match.startswith('"') and match.endswith('"'):
            # Remove the quotes and split inside the quoted string
            inner_matches = re.findall(r"[^,]+", match[1:-1])
            result.extend(inner_matches)
        else:
            result.append(match)
    # Strip leading and trailing whitespaces from each element
    result = [element.strip().replace('"', "") for element in result]
    return result


# ---------------------- SUMMARIZATION ----------------------
def rouge1_score(references, predictions, **kwargs):
    """
    Suboptimal way to compute ROUGE because of the following issue:
    https://github.com/EleutherAI/lm-evaluation-harness/issues/1302
    """
    rouge = load("rouge")
    return rouge.compute(predictions=predictions, references=references, **kwargs)[
        "rouge1"
    ]


def process_results_sum(doc, results):
    """
    Process the results of the Evalita summarization task.
    """
    ref = doc["summary"] if "summary" in doc.keys() else doc["target"]
    rouge_scorer = load("rouge", keep_in_memory=True)
    r1score = rouge_scorer.compute(predictions=results, references=[ref])["rouge1"]
    return {
        "rouge1": r1score,
    }


def faq_doc_to_target(x):
    if x["correct_answer"] == "A":
        return 0
    elif x["correct_answer"] == "B":
        return 1
    elif x["correct_answer"] == "C":
        return 2
    elif x["correct_answer"] == "D":
        return 3
    else:
        eval_logger.warning(
            'WARNING: correct answer not found or not in ["A", "B", "C", "D"]'
        )


def ht_doc_to_target(x):
    if x["source"] == "ilgiornale":
        return 0
    elif x["source"] == "repubblica":
        return 1
    else:
        eval_logger.warning(
            'WARNING: source not found or not in ["ilgiornale", "repubblica"]'
        )
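To illustrate the serialization handled above: ner_doc_to_target joins gold entities as text$TYPE pairs separated by commas, and _ner_process_raw_output parses a model completion in the same format back into (text, TYPE) tuples. A small round-trip sketch with invented data, importing from the module added in this commit:

# Illustrative only; the document below is made up, and the field names follow
# the evalitahf/entity_recognition dataset used by the NER template above.
from lm_eval.tasks.evalita_llm.utils import _ner_process_raw_output, ner_doc_to_target

doc = {
    "entities": [
        {"entity_text": "Mario Rossi", "type": "PER"},
        {"entity_text": "Torino", "type": "LOC"},
    ]
}
print(ner_doc_to_target(doc))
# Mario Rossi$PER,Torino$LOC

print(_ner_process_raw_output("Mario Rossi$PER, Torino$LOC"))
# [('Mario Rossi', 'PER'), ('Torino', 'LOC')]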
lm_eval/tasks/fda/task.py
...
@@ -33,7 +33,9 @@ class FDA(ConfigurableTask):
     def doc_to_target(self, doc):
         return doc["value"]

-    def construct_requests(self, doc, ctx, **kwargs):
+    def construct_requests(
+        self, doc, ctx, chat_template=None, apply_chat_template=False, **kwargs
+    ):
         """Uses RequestFactory to construct Requests and returns an iterable of
         Requests which will be sent to the LM.
...
lm_eval/tasks/galician_bench/README.md
...
@@ -26,7 +26,40 @@ The datasets included in GalicianBench that have been made public in previous pu
### Citation
Paper for GalicianBench coming soon.
```
@inproceedings{baucells-etal-2025-iberobench,
title = "{I}bero{B}ench: A Benchmark for {LLM} Evaluation in {I}berian Languages",
author = "Baucells, Irene and
Aula-Blasco, Javier and
de-Dios-Flores, Iria and
Paniagua Su{\'a}rez, Silvia and
Perez, Naiara and
Salles, Anna and
Sotelo Docio, Susana and
Falc{\~a}o, J{\'u}lia and
Saiz, Jose Javier and
Sepulveda Torres, Robiert and
Barnes, Jeremy and
Gamallo, Pablo and
Gonzalez-Agirre, Aitor and
Rigau, German and
Villegas, Marta",
editor = "Rambow, Owen and
Wanner, Leo and
Apidianaki, Marianna and
Al-Khalifa, Hend and
Eugenio, Barbara Di and
Schockaert, Steven",
booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
month = jan,
year = "2025",
address = "Abu Dhabi, UAE",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.coling-main.699/",
pages = "10491--10519",
}
```
### Groups and Tasks
...
...