gaoqiong / lm-evaluation-harness · Commits

Commit 89b6bdb3
Authored Feb 06, 2025 by Baber
Merge branch 'main' into ai2d
Parents: 59053d58, 144a1e58
Changes: showing 20 changed files with 456 additions and 0 deletions (+456, -0).
lm_eval/tasks/aradice/boolq/ENG/utils.py                  +18 -0
lm_eval/tasks/aradice/boolq/LEV/boolq_lev.yaml            +25 -0
lm_eval/tasks/aradice/boolq/LEV/metrics.py                +25 -0
lm_eval/tasks/aradice/boolq/LEV/utils.py                  +18 -0
lm_eval/tasks/aradice/boolq/MSA/boolq_msa.yaml            +25 -0
lm_eval/tasks/aradice/boolq/MSA/metrics.py                +25 -0
lm_eval/tasks/aradice/boolq/MSA/utils.py                  +18 -0
lm_eval/tasks/aradice/cultural-benchmark/egypt.yaml       +25 -0
lm_eval/tasks/aradice/cultural-benchmark/jordan.yaml      +25 -0
lm_eval/tasks/aradice/cultural-benchmark/lebanon.yaml     +25 -0
lm_eval/tasks/aradice/cultural-benchmark/metrics.py       +25 -0
lm_eval/tasks/aradice/cultural-benchmark/palestine.yaml   +25 -0
lm_eval/tasks/aradice/cultural-benchmark/qatar.yaml       +25 -0
lm_eval/tasks/aradice/cultural-benchmark/syria.yaml       +25 -0
lm_eval/tasks/aradice/cultural-benchmark/utils.py         +6 -0
lm_eval/tasks/aradice/openbookqa/metrics.py               +25 -0
lm_eval/tasks/aradice/openbookqa/openbookqa_egy.yaml      +24 -0
lm_eval/tasks/aradice/openbookqa/openbookqa_eng.yaml      +24 -0
lm_eval/tasks/aradice/openbookqa/openbookqa_lev.yaml      +24 -0
lm_eval/tasks/aradice/openbookqa/openbookqa_msa.yaml      +24 -0
(Too many changes to show: only 1000 of 1000+ changed files are displayed to preserve performance.)
lm_eval/tasks/aradice/boolq/ENG/utils.py
(new file, mode 100644)

en_answer_mapping = {"true": "yes", "false": "no", True: "yes", False: "no"}


def process_docs(dataset):
    def remove_question_mark(text):
        text = text.strip()
        if text.endswith("?") or text.endswith("؟"):
            text = text[:-1]
        text = text.strip()
        return text

    def _helper(doc):
        doc["question"] = remove_question_mark(doc["question"])
        doc["target"] = en_answer_mapping[doc["answer"]]
        return doc

    return dataset.map(_helper)
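For orientation, here is a minimal sketch (not part of the commit) of what this preprocessing does to a single document. It assumes the file above is importable as `utils` and that the Hugging Face `datasets` package is installed; the sample doc is invented.

from datasets import Dataset

import utils  # the ENG utils.py shown above, assumed importable

docs = Dataset.from_list([
    {"passage": "Water boils at 100 C at sea level.",
     "question": "Does water boil at 100 C at sea level? ",
     "answer": True},
])
processed = utils.process_docs(docs)
print(processed[0]["question"])  # trailing whitespace and "?" stripped
print(processed[0]["target"])    # "yes", via en_answer_mapping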
lm_eval/tasks/aradice/boolq/LEV/boolq_lev.yaml
(new file, mode 100644)

task: AraDiCE_boolq_lev
dataset_path: QCRI/AraDiCE-BoolQ
dataset_name: BoolQ-lev
output_type: multiple_choice
training_split: null
validation_split: null
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{passage}}\nسؤال: {{question}}؟\nجواب:"
doc_to_target: target
doc_to_choice: ["لا", "نعم"]
should_decontaminate: true
doc_to_decontamination_query: passage
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
  - metric: f1
    higher_is_better: true
    aggregation: !function metrics.micro_f1_score
metadata:
  version: 1.0
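In these YAML configs, `!function utils.process_docs` resolves to the `process_docs` defined in the sibling `utils.py`, and `!function metrics.micro_f1_score` to the aggregation in `metrics.py`, so each task directory is self-contained. With the files in place, the task should be runnable through the harness CLI along these lines (the model argument is a placeholder, not taken from this commit):

lm_eval --model hf --model_args pretrained=<model-name> --tasks AraDiCE_boolq_lev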
lm_eval/tasks/aradice/boolq/LEV/metrics.py
(new file, mode 100644)

from sklearn.metrics import f1_score


def macro_f1_score(items):
    unzipped_list = list(zip(*items))
    golds = unzipped_list[0]
    preds = unzipped_list[1]
    fscore = f1_score(golds, preds, average="macro")
    return fscore


def micro_f1_score(items):
    unzipped_list = list(zip(*items))
    golds = unzipped_list[0]
    preds = unzipped_list[1]
    fscore = f1_score(golds, preds, average="micro")
    return fscore


def weighted_f1_score(items):
    unzipped_list = list(zip(*items))
    golds = unzipped_list[0]
    preds = unzipped_list[1]
    fscore = f1_score(golds, preds, average="weighted")
    return fscore
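As the `zip(*items)` unpacking shows, each aggregation receives a list of (gold, prediction) pairs accumulated per document for the f1 metric and hands them to sklearn. A standalone sketch of the micro variant, with made-up labels:

from sklearn.metrics import f1_score

items = [(1, 1), (0, 1), (1, 0), (0, 0)]  # (gold, pred) per document
golds, preds = zip(*items)
print(f1_score(golds, preds, average="micro"))  # 0.5; micro-F1 equals accuracy for single-label tasks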
lm_eval/tasks/aradice/boolq/LEV/utils.py
(new file, mode 100644)

lev_answer_mapping = {"true": "نعم", "false": "لا", True: "نعم", False: "لا"}


def process_docs(dataset):
    def remove_question_mark(text):
        text = text.strip()
        if text.endswith("?") or text.endswith("؟"):
            text = text[:-1]
        text = text.strip()
        return text

    def _helper(doc):
        doc["question"] = remove_question_mark(doc["question"])
        doc["target"] = lev_answer_mapping[doc["answer"]]
        return doc

    return dataset.map(_helper)
lm_eval/tasks/aradice/boolq/MSA/boolq_msa.yaml
(new file, mode 100644)

task: AraDiCE_boolq_msa
dataset_path: QCRI/AraDiCE-BoolQ
dataset_name: BoolQ-msa
output_type: multiple_choice
training_split: null
validation_split: null
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{passage}}\nسؤال: {{question}}؟\nجواب:"
doc_to_target: target
doc_to_choice: ["لا", "نعم"]
should_decontaminate: true
doc_to_decontamination_query: passage
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
  - metric: f1
    higher_is_better: true
    aggregation: !function metrics.micro_f1_score
metadata:
  version: 1.0
lm_eval/tasks/aradice/boolq/MSA/metrics.py
(new file, mode 100644)

from sklearn.metrics import f1_score


def macro_f1_score(items):
    unzipped_list = list(zip(*items))
    golds = unzipped_list[0]
    preds = unzipped_list[1]
    fscore = f1_score(golds, preds, average="macro")
    return fscore


def micro_f1_score(items):
    unzipped_list = list(zip(*items))
    golds = unzipped_list[0]
    preds = unzipped_list[1]
    fscore = f1_score(golds, preds, average="micro")
    return fscore


def weighted_f1_score(items):
    unzipped_list = list(zip(*items))
    golds = unzipped_list[0]
    preds = unzipped_list[1]
    fscore = f1_score(golds, preds, average="weighted")
    return fscore
lm_eval/tasks/aradice/boolq/MSA/utils.py
(new file, mode 100644)

msa_answer_mapping = {"true": "نعم", "false": "لا", True: "نعم", False: "لا"}


def process_docs(dataset):
    def remove_question_mark(text):
        text = text.strip()
        if text.endswith("?") or text.endswith("؟"):
            text = text[:-1]
        text = text.strip()
        return text

    def _helper(doc):
        doc["question"] = remove_question_mark(doc["question"])
        doc["target"] = msa_answer_mapping[doc["answer"]]
        return doc

    return dataset.map(_helper)
lm_eval/tasks/aradice/cultural-benchmark/egypt.yaml
(new file, mode 100644)

task: AraDiCE_egypt_cultural
dataset_path: QCRI/AraDiCE-Culture
dataset_name: Egypt
training_split: null
validation_split: null
test_split: test
output_type: multiple_choice
process_docs: !function utils.process_docs
doc_to_text: "سؤال : {{Question}}\nإجابة :"
doc_to_target: 0
doc_to_choice: choices
should_decontaminate: true
doc_to_decontamination_query: Question
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
  - metric: f1
    higher_is_better: true
    aggregation: !function metrics.micro_f1_score
metadata:
  version: 1.0
lm_eval/tasks/aradice/cultural-benchmark/jordan.yaml
(new file, mode 100644)

task: AraDiCE_jordan_cultural
dataset_path: QCRI/AraDiCE-Culture
dataset_name: Jordan
training_split: null
validation_split: null
test_split: test
output_type: multiple_choice
process_docs: !function utils.process_docs
doc_to_text: "سؤال : {{Question}}\nإجابة :"
doc_to_target: 0
doc_to_choice: choices
should_decontaminate: true
doc_to_decontamination_query: Question
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
  - metric: f1
    higher_is_better: true
    aggregation: !function metrics.micro_f1_score
metadata:
  version: 1.0
lm_eval/tasks/aradice/cultural-benchmark/lebanon.yaml
(new file, mode 100644)

task: AraDiCE_lebanon_cultural
dataset_path: QCRI/AraDiCE-Culture
dataset_name: Lebanon
training_split: null
validation_split: null
test_split: test
output_type: multiple_choice
process_docs: !function utils.process_docs
doc_to_text: "سؤال : {{Question}}\nإجابة :"
doc_to_target: 0
doc_to_choice: choices
should_decontaminate: true
doc_to_decontamination_query: Question
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
  - metric: f1
    higher_is_better: true
    aggregation: !function metrics.micro_f1_score
metadata:
  version: 1.0
lm_eval/tasks/aradice/cultural-benchmark/metrics.py
(new file, mode 100644)

from sklearn.metrics import f1_score


def macro_f1_score(items):
    unzipped_list = list(zip(*items))
    golds = unzipped_list[0]
    preds = unzipped_list[1]
    fscore = f1_score(golds, preds, average="macro")
    return fscore


def micro_f1_score(items):
    unzipped_list = list(zip(*items))
    golds = unzipped_list[0]
    preds = unzipped_list[1]
    fscore = f1_score(golds, preds, average="micro")
    return fscore


def weighted_f1_score(items):
    unzipped_list = list(zip(*items))
    golds = unzipped_list[0]
    preds = unzipped_list[1]
    fscore = f1_score(golds, preds, average="weighted")
    return fscore
lm_eval/tasks/aradice/cultural-benchmark/palestine.yaml
(new file, mode 100644)

task: AraDiCE_palestine_cultural
dataset_path: QCRI/AraDiCE-Culture
dataset_name: Palestine
training_split: null
validation_split: null
test_split: test
output_type: multiple_choice
process_docs: !function utils.process_docs
doc_to_text: "سؤال : {{Question}}\nإجابة :"
doc_to_target: 0
doc_to_choice: choices
should_decontaminate: true
doc_to_decontamination_query: Question
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
  - metric: f1
    higher_is_better: true
    aggregation: !function metrics.micro_f1_score
metadata:
  version: 1.0
lm_eval/tasks/aradice/cultural-benchmark/qatar.yaml
(new file, mode 100644)

task: AraDiCE_qatar_cultural
dataset_path: QCRI/AraDiCE-Culture
dataset_name: Qatar
training_split: null
validation_split: null
test_split: test
output_type: multiple_choice
process_docs: !function utils.process_docs
doc_to_text: "سؤال : {{Question}}\nإجابة :"
doc_to_target: 0
doc_to_choice: choices
should_decontaminate: true
doc_to_decontamination_query: Question
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
  - metric: f1
    higher_is_better: true
    aggregation: !function metrics.micro_f1_score
metadata:
  version: 1.0
lm_eval/tasks/aradice/cultural-benchmark/syria.yaml
(new file, mode 100644)

task: AraDiCE_syria_cultural
dataset_path: QCRI/AraDiCE-Culture
dataset_name: Syria
training_split: null
validation_split: null
test_split: test
output_type: multiple_choice
process_docs: !function utils.process_docs
doc_to_text: "سؤال : {{Question}}\nإجابة :"
doc_to_target: 0
doc_to_choice: choices
should_decontaminate: true
doc_to_decontamination_query: Question
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
  - metric: f1
    higher_is_better: true
    aggregation: !function metrics.micro_f1_score
metadata:
  version: 1.0
lm_eval/tasks/aradice/cultural-benchmark/utils.py
(new file, mode 100644)

def process_docs(dataset):
    def _helper(doc):
        doc["choices"] = [doc["Option A"], doc["Option B"], doc["Option C"]]
        return doc

    return dataset.map(_helper)
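Taken together with the cultural-benchmark YAML files above, `doc_to_choice: choices` reads the list built here and `doc_to_target: 0` indexes into it, so "Option A" is treated as the gold answer for every document. A minimal sketch with an invented doc:

doc = {"Question": "...", "Option A": "gold answer",
       "Option B": "distractor", "Option C": "distractor"}
doc["choices"] = [doc["Option A"], doc["Option B"], doc["Option C"]]
assert doc["choices"][0] == "gold answer"  # doc_to_target: 0 selects Option A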
lm_eval/tasks/aradice/openbookqa/metrics.py
(new file, mode 100644)

from sklearn.metrics import f1_score


def macro_f1_score(items):
    unzipped_list = list(zip(*items))
    golds = unzipped_list[0]
    preds = unzipped_list[1]
    fscore = f1_score(golds, preds, average="macro")
    return fscore


def micro_f1_score(items):
    unzipped_list = list(zip(*items))
    golds = unzipped_list[0]
    preds = unzipped_list[1]
    fscore = f1_score(golds, preds, average="micro")
    return fscore


def weighted_f1_score(items):
    unzipped_list = list(zip(*items))
    golds = unzipped_list[0]
    preds = unzipped_list[1]
    fscore = f1_score(golds, preds, average="weighted")
    return fscore
lm_eval/tasks/aradice/openbookqa/openbookqa_egy.yaml
(new file, mode 100644)

task: AraDiCE_openbookqa_egy
dataset_path: QCRI/AraDiCE-OpenBookQA
dataset_name: OBQA-egy
training_split: null
validation_split: null
test_split: test
output_type: multiple_choice
doc_to_text: !function utils.doc_to_text
doc_to_target: !function utils.doc_to_target
doc_to_choice: !function utils.doc_to_choice
should_decontaminate: true
doc_to_decontamination_query: "{{question.stem}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
  - metric: f1
    higher_is_better: true
    aggregation: !function metrics.micro_f1_score
metadata:
  version: 1.0
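Unlike the BoolQ and cultural tasks, these configs delegate `doc_to_text`, `doc_to_target`, and `doc_to_choice` to an `openbookqa/utils.py` that is not among the twenty files shown on this page. The decontamination query `{{question.stem}}` does suggest the usual nested OpenBookQA schema; a representative doc shape (assumed from the standard OBQA format, not taken from this diff):

doc = {
    "question": {
        "stem": "The sun is responsible for",
        "choices": [{"text": "puppies learning new tricks", "label": "A"},
                    {"text": "plants sprouting, blooming and wilting", "label": "B"}],
    },
    "answerKey": "B",
}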
lm_eval/tasks/aradice/openbookqa/openbookqa_eng.yaml
(new file, mode 100644)

task: AraDiCE_openbookqa_eng
dataset_path: QCRI/AraDiCE-OpenBookQA
dataset_name: OBQA-eng
training_split: null
validation_split: null
test_split: test
output_type: multiple_choice
doc_to_text: !function utils.doc_to_text
doc_to_target: !function utils.doc_to_target
doc_to_choice: !function utils.doc_to_choice
should_decontaminate: true
doc_to_decontamination_query: "{{question.stem}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
  - metric: f1
    higher_is_better: true
    aggregation: !function metrics.micro_f1_score
metadata:
  version: 1.0
lm_eval/tasks/aradice/openbookqa/openbookqa_lev.yaml
(new file, mode 100644)

task: AraDiCE_openbookqa_lev
dataset_path: QCRI/AraDiCE-OpenBookQA
dataset_name: OBQA-lev
training_split: null
validation_split: null
test_split: test
output_type: multiple_choice
doc_to_text: !function utils.doc_to_text
doc_to_target: !function utils.doc_to_target
doc_to_choice: !function utils.doc_to_choice
should_decontaminate: true
doc_to_decontamination_query: "{{question.stem}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
  - metric: f1
    higher_is_better: true
    aggregation: !function metrics.micro_f1_score
metadata:
  version: 1.0
lm_eval/tasks/aradice/openbookqa/openbookqa_msa.yaml
(new file, mode 100644)

task: AraDiCE_openbookqa_msa
dataset_path: QCRI/AraDiCE-OpenBookQA
dataset_name: OBQA-msa
training_split: null
validation_split: null
test_split: test
output_type: multiple_choice
doc_to_text: !function utils.doc_to_text
doc_to_target: !function utils.doc_to_target
doc_to_choice: !function utils.doc_to_choice
should_decontaminate: true
doc_to_decontamination_query: "{{question.stem}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
  - metric: f1
    higher_is_better: true
    aggregation: !function metrics.micro_f1_score
metadata:
  version: 1.0