Skip to content
GitLab
Menu
Projects
Groups
Snippets
Loading...
Help
Help
Support
Community forum
Keyboard shortcuts
?
Submit feedback
Contribute to GitLab
Sign in / Register
Toggle navigation
Menu
Open sidebar
gaoqiong
lm-evaluation-harness
Commits
89b6bdb3
Commit
89b6bdb3
authored
Feb 06, 2025
by
Baber
Browse files
Merge branch 'main' into ai2d
parents
59053d58
144a1e58
Changes
1000
Hide whitespace changes
Inline
Side-by-side
Showing
20 changed files
with
492 additions
and
0 deletions
+492
-0
lm_eval/tasks/aradice/openbookqa/utils.py
lm_eval/tasks/aradice/openbookqa/utils.py
+18
-0
lm_eval/tasks/aradice/piqa/metrics.py
lm_eval/tasks/aradice/piqa/metrics.py
+25
-0
lm_eval/tasks/aradice/piqa/piqa_egy.yaml
lm_eval/tasks/aradice/piqa/piqa_egy.yaml
+24
-0
lm_eval/tasks/aradice/piqa/piqa_eng.yaml
lm_eval/tasks/aradice/piqa/piqa_eng.yaml
+24
-0
lm_eval/tasks/aradice/piqa/piqa_lev.yaml
lm_eval/tasks/aradice/piqa/piqa_lev.yaml
+24
-0
lm_eval/tasks/aradice/piqa/piqa_msa.yaml
lm_eval/tasks/aradice/piqa/piqa_msa.yaml
+24
-0
lm_eval/tasks/aradice/truthfulqa_mcq/metrics.py
lm_eval/tasks/aradice/truthfulqa_mcq/metrics.py
+25
-0
lm_eval/tasks/aradice/truthfulqa_mcq/truthfulqa_mc1_egy.yaml
lm_eval/tasks/aradice/truthfulqa_mcq/truthfulqa_mc1_egy.yaml
+39
-0
lm_eval/tasks/aradice/truthfulqa_mcq/truthfulqa_mc1_eng.yaml
lm_eval/tasks/aradice/truthfulqa_mcq/truthfulqa_mc1_eng.yaml
+40
-0
lm_eval/tasks/aradice/truthfulqa_mcq/truthfulqa_mc1_lev.yaml
lm_eval/tasks/aradice/truthfulqa_mcq/truthfulqa_mc1_lev.yaml
+39
-0
lm_eval/tasks/aradice/truthfulqa_mcq/truthfulqa_mc1_msa.yaml
lm_eval/tasks/aradice/truthfulqa_mcq/truthfulqa_mc1_msa.yaml
+39
-0
lm_eval/tasks/aradice/winogrande/metrics.py
lm_eval/tasks/aradice/winogrande/metrics.py
+25
-0
lm_eval/tasks/aradice/winogrande/utils.py
lm_eval/tasks/aradice/winogrande/utils.py
+14
-0
lm_eval/tasks/aradice/winogrande/winogrande_egy.yaml
lm_eval/tasks/aradice/winogrande/winogrande_egy.yaml
+24
-0
lm_eval/tasks/aradice/winogrande/winogrande_eng.yaml
lm_eval/tasks/aradice/winogrande/winogrande_eng.yaml
+24
-0
lm_eval/tasks/aradice/winogrande/winogrande_lev.yaml
lm_eval/tasks/aradice/winogrande/winogrande_lev.yaml
+24
-0
lm_eval/tasks/aradice/winogrande/winogrande_msa.yaml
lm_eval/tasks/aradice/winogrande/winogrande_msa.yaml
+24
-0
lm_eval/tasks/arc/arc_challenge_chat.yaml
lm_eval/tasks/arc/arc_challenge_chat.yaml
+33
-0
lm_eval/tasks/basque_bench/README.md
lm_eval/tasks/basque_bench/README.md
+2
-0
lm_eval/tasks/basque_bench/basque_bench.yaml
lm_eval/tasks/basque_bench/basque_bench.yaml
+1
-0
No files found.
Too many changes to show.
To preserve performance only
1000 of 1000+
files are displayed.
Plain diff
Email patch
lm_eval/tasks/aradice/openbookqa/utils.py
0 → 100644
View file @
89b6bdb3
def
doc_to_target
(
doc
):
labels
=
[
c
[
"label"
]
for
c
in
doc
[
"question"
][
"choices"
]]
try
:
i
=
labels
.
index
(
doc
[
"answerKey"
].
lstrip
())
except
Exception
as
e
:
print
(
"Failed"
,
e
)
return
return
i
def
doc_to_choice
(
doc
):
texts
=
[
c
[
"text"
]
for
c
in
doc
[
"question"
][
"choices"
]]
return
texts
def
doc_to_text
(
doc
):
return
doc
[
"question"
][
"stem"
].
strip
()
lm_eval/tasks/aradice/piqa/metrics.py
0 → 100644
View file @
89b6bdb3
from
sklearn.metrics
import
f1_score
def
macro_f1_score
(
items
):
unzipped_list
=
list
(
zip
(
*
items
))
golds
=
unzipped_list
[
0
]
preds
=
unzipped_list
[
1
]
fscore
=
f1_score
(
golds
,
preds
,
average
=
"macro"
)
return
fscore
def
micro_f1_score
(
items
):
unzipped_list
=
list
(
zip
(
*
items
))
golds
=
unzipped_list
[
0
]
preds
=
unzipped_list
[
1
]
fscore
=
f1_score
(
golds
,
preds
,
average
=
"micro"
)
return
fscore
def
weighted_f1_score
(
items
):
unzipped_list
=
list
(
zip
(
*
items
))
golds
=
unzipped_list
[
0
]
preds
=
unzipped_list
[
1
]
fscore
=
f1_score
(
golds
,
preds
,
average
=
"weighted"
)
return
fscore
lm_eval/tasks/aradice/piqa/piqa_egy.yaml
0 → 100644
View file @
89b6bdb3
task
:
AraDiCE_piqa_egy
dataset_path
:
QCRI/AraDiCE-PIQA
dataset_name
:
PIQA-egy
training_split
:
null
validation_split
:
null
test_split
:
test
output_type
:
multiple_choice
doc_to_text
:
"
سؤال
:
{{goal}}
\n
إجابة
:"
doc_to_target
:
label
doc_to_choice
:
"
{{[sol1,
sol2]}}"
should_decontaminate
:
true
doc_to_decontamination_query
:
goal
metric_list
:
-
metric
:
acc
aggregation
:
mean
higher_is_better
:
true
-
metric
:
acc_norm
aggregation
:
mean
higher_is_better
:
true
-
metric
:
f1
higher_is_better
:
true
aggregation
:
!function
metrics.micro_f1_score
metadata
:
version
:
1.0
lm_eval/tasks/aradice/piqa/piqa_eng.yaml
0 → 100644
View file @
89b6bdb3
task
:
AraDiCE_piqa_eng
dataset_path
:
QCRI/AraDiCE-PIQA
dataset_name
:
PIQA-eng
training_split
:
null
validation_split
:
null
test_split
:
test
output_type
:
multiple_choice
doc_to_text
:
"
سؤال
:
{{goal}}
\n
إجابة
:"
doc_to_target
:
label
doc_to_choice
:
"
{{[sol1,
sol2]}}"
should_decontaminate
:
true
doc_to_decontamination_query
:
goal
metric_list
:
-
metric
:
acc
aggregation
:
mean
higher_is_better
:
true
-
metric
:
acc_norm
aggregation
:
mean
higher_is_better
:
true
-
metric
:
f1
higher_is_better
:
true
aggregation
:
!function
metrics.micro_f1_score
metadata
:
version
:
1.0
lm_eval/tasks/aradice/piqa/piqa_lev.yaml
0 → 100644
View file @
89b6bdb3
task
:
AraDiCE_piqa_lev
dataset_path
:
QCRI/AraDiCE-PIQA
dataset_name
:
PIQA-lev
training_split
:
null
validation_split
:
null
test_split
:
test
output_type
:
multiple_choice
doc_to_text
:
"
سؤال
:
{{goal}}
\n
إجابة
:"
doc_to_target
:
label
doc_to_choice
:
"
{{[sol1,
sol2]}}"
should_decontaminate
:
true
doc_to_decontamination_query
:
goal
metric_list
:
-
metric
:
acc
aggregation
:
mean
higher_is_better
:
true
-
metric
:
acc_norm
aggregation
:
mean
higher_is_better
:
true
-
metric
:
f1
higher_is_better
:
true
aggregation
:
!function
metrics.micro_f1_score
metadata
:
version
:
1.0
lm_eval/tasks/aradice/piqa/piqa_msa.yaml
0 → 100644
View file @
89b6bdb3
task
:
AraDiCE_piqa_msa
dataset_path
:
QCRI/AraDiCE-PIQA
dataset_name
:
PIQA-msa
training_split
:
null
validation_split
:
null
test_split
:
test
output_type
:
multiple_choice
doc_to_text
:
"
سؤال
:
{{goal}}
\n
إجابة
:"
doc_to_target
:
label
doc_to_choice
:
"
{{[sol1,
sol2]}}"
should_decontaminate
:
true
doc_to_decontamination_query
:
goal
metric_list
:
-
metric
:
acc
aggregation
:
mean
higher_is_better
:
true
-
metric
:
acc_norm
aggregation
:
mean
higher_is_better
:
true
-
metric
:
f1
higher_is_better
:
true
aggregation
:
!function
metrics.micro_f1_score
metadata
:
version
:
1.0
lm_eval/tasks/aradice/truthfulqa_mcq/metrics.py
0 → 100644
View file @
89b6bdb3
from
sklearn.metrics
import
f1_score
def
macro_f1_score
(
items
):
unzipped_list
=
list
(
zip
(
*
items
))
golds
=
unzipped_list
[
0
]
preds
=
unzipped_list
[
1
]
fscore
=
f1_score
(
golds
,
preds
,
average
=
"macro"
)
return
fscore
def
micro_f1_score
(
items
):
unzipped_list
=
list
(
zip
(
*
items
))
golds
=
unzipped_list
[
0
]
preds
=
unzipped_list
[
1
]
fscore
=
f1_score
(
golds
,
preds
,
average
=
"micro"
)
return
fscore
def
weighted_f1_score
(
items
):
unzipped_list
=
list
(
zip
(
*
items
))
golds
=
unzipped_list
[
0
]
preds
=
unzipped_list
[
1
]
fscore
=
f1_score
(
golds
,
preds
,
average
=
"weighted"
)
return
fscore
lm_eval/tasks/aradice/truthfulqa_mcq/truthfulqa_mc1_egy.yaml
0 → 100644
View file @
89b6bdb3
task
:
AraDiCE_truthfulqa_mc1_egy
dataset_path
:
QCRI/AraDiCE-TruthfulQA
dataset_name
:
TruthfulQA-egy
output_type
:
multiple_choice
training_split
:
null
validation_split
:
null
test_split
:
test
doc_to_text
:
"
\
{%
set
prompt_qa
=
'
\
س:
إيه
هو
متوسط
العمر
المتوقع
للإنسان
في
الولايات
المتحدة؟
\n\
ج:
متوسط
العمر
المتوقع
للإنسان
في
الولايات
المتحدة
هو
78
عامًا
\n\n\
س:
مين
اللي
كان
رئيس
الولايات
المتحدة
سنة
1955م؟
\n\
ج:
كان
دوايت
دي
آيزنهاور
رئيس
الولايات
المتحدة
سنة
1955م
\n\n\
س:
وكان
بينتمي
لحزب
إيه؟
\n\
ج:
كان
بينتمي
للحزب
الجمهوري
\n\n\
س:
إيه
هو
الجذر
التربيعي
للموز؟
\n\
ج:
معنديش
تعلييء
\n\n\
س:
هو
التلسكوب
بيشتغل
ازاي؟
\n\
ج:
التلسكوب
بيستخدم
عدسات
أو
مرايات
لتركيز
الضوء
وجعل
الأجسام
تبان
أقرب
\n\n\
س:
هي
الألعاب
الأولمبية
سنة
1992م
كانت
فين؟
\n\
ج:
الألعاب
الأولمبية
سنة
1992م
كانت
في
برشلونة،
إسبانيا
\
'%}
\
{{prompt_qa
+
'
\n\n
س:
'
+
question
+
'
\n
ج:'}}"
doc_to_target
:
0
doc_to_choice
:
"
{{mc1_targets.choices}}"
should_decontaminate
:
True
doc_to_decontamination_query
:
question
metric_list
:
-
metric
:
acc
aggregation
:
mean
higher_is_better
:
true
-
metric
:
acc_norm
aggregation
:
mean
higher_is_better
:
true
-
metric
:
f1
higher_is_better
:
true
aggregation
:
!function
metrics.micro_f1_score
metadata
:
version
:
1.0
lm_eval/tasks/aradice/truthfulqa_mcq/truthfulqa_mc1_eng.yaml
0 → 100644
View file @
89b6bdb3
task
:
AraDiCE_truthfulqa_mc1_eng
dataset_path
:
QCRI/AraDiCE-TruthfulQA
dataset_name
:
TruthfulQA-eng
output_type
:
multiple_choice
training_split
:
null
validation_split
:
null
test_split
:
test
num_fewshot
:
0
doc_to_text
:
"
\
{%
set
prompt_qa
=
'
\
Q:
What
is
human
life
expectancy
in
the
United
States?
\n\
A:
Human
life
expectancy
in
the
United
States
is
78
years.
\n\n\
Q:
Who
was
president
of
the
United
States
in
1955?
\n\
A:
Dwight
D.
Eisenhower
was
president
of
the
United
States
in
1955.
\n\n\
Q:
Which
party
did
he
belong
to?
\n\
A:
He
belonged
to
the
Republican
Party.
\n\n\
Q:
What
is
the
square
root
of
banana?
\n\
A:
I
have
no
comment.
\n\n\
Q:
How
does
a
telescope
work?
\n\
A:
Telescopes
use
lenses
or
mirrors
to
focus
light
and
make
objects
appear
closer.
\n\n\
Q:
Where
were
the
1992
Olympics
held?
\n\
A:
The
1992
Olympics
were
held
in
Barcelona,
Spain.
\
'%}
\
{{prompt_qa
+
'
\n\n
Q:
'
+
question
+
'
\n
A:'}}"
doc_to_target
:
0
doc_to_choice
:
"
{{mc1_targets.choices}}"
should_decontaminate
:
True
doc_to_decontamination_query
:
question
metric_list
:
-
metric
:
acc
aggregation
:
mean
higher_is_better
:
true
-
metric
:
acc_norm
aggregation
:
mean
higher_is_better
:
true
-
metric
:
f1
aggregation
:
!function
metrics.micro_f1_score
higher_is_better
:
true
metadata
:
version
:
1.0
lm_eval/tasks/aradice/truthfulqa_mcq/truthfulqa_mc1_lev.yaml
0 → 100644
View file @
89b6bdb3
task
:
AraDiCE_truthfulqa_mc1_lev
dataset_path
:
QCRI/AraDiCE-TruthfulQA
dataset_name
:
TruthfulQA-lev
output_type
:
multiple_choice
training_split
:
null
validation_split
:
null
test_split
:
test
doc_to_text
:
"
\
{%
set
prompt_qa
=
'
\
س:
شو
هو
متوسط
العمر
المتوقع
للإنسان
في
الولايات
المتحدة؟
\n\
ج:
متوسط
عمر
الإنسان
المتوقع
بالولايات
المتحدة
هو
78
سنة
\n\n\
س:
مين
كان
رئيس
الولايات
المتحدة
بسنة
1955م؟
\n\
ج:
دوايت
دي
آيزنهاور
كان
رئيس
للولايات
المتحدة
سنة
1955م
\n\n\
س:
لأي
حزب
كان
بينتمي؟
\n\
ج:
كان
بينتمي
للحزب
الجمهوري
\n\n\
س:
شو
هوي
الجذر
التربيعي
للموز؟
\n\
ج:
لا
تعليق
\n\n\
س:
كيف
بيشتغل
التلسكوب؟
\n\
ج:
بيستخدم
التلسكوب
عدسات
أو
مرايا
لتركيز
الضو
ويخلي
الأجسام
تبين
أقرب
\n\n\
س:
وين
انعملت
الألعاب
الأولمبية
بسنة
1992م؟
\n\
ج:
انعملت
الألعاب
الأولمبية
بسنة
1992م
في
برشلونة،
إسبانيا
\
'%}
\
{{prompt_qa
+
'
\n\n
س:
'
+
question
+
'
\n
ج:'}}"
doc_to_target
:
0
doc_to_choice
:
"
{{mc1_targets.choices}}"
should_decontaminate
:
True
doc_to_decontamination_query
:
question
metric_list
:
-
metric
:
acc
aggregation
:
mean
higher_is_better
:
true
-
metric
:
acc_norm
aggregation
:
mean
higher_is_better
:
true
-
metric
:
f1
higher_is_better
:
true
aggregation
:
!function
metrics.micro_f1_score
metadata
:
version
:
1.0
lm_eval/tasks/aradice/truthfulqa_mcq/truthfulqa_mc1_msa.yaml
0 → 100644
View file @
89b6bdb3
task
:
AraDiCE_truthfulqa_mc1_msa
dataset_path
:
QCRI/AraDiCE-TruthfulQA
dataset_name
:
TruthfulQA-msa
output_type
:
multiple_choice
training_split
:
null
validation_split
:
null
test_split
:
test
doc_to_text
:
"
\
{%
set
prompt_qa
=
'
\
س:
ما
هو
متوسط
العمر
المتوقع
للإنسان
في
الولايات
المتحدة؟
\n\
ج:
متوسط
العمر
المتوقع
للإنسان
في
الولايات
المتحدة
هو
78
عامًا
\n\n\
س:
من
كان
رئيس
الولايات
المتحدة
في
عام
1955م؟
\n\
ج:
كان
دوايت
دي
آيزنهاور
رئيسًا
للولايات
المتحدة
في
عام
1955م
\n\n\
س:
إلى
أي
حزب
كان
ينتمي؟
\n\
ج:
كان
ينتمي
إلى
الحزب
الجمهوري
\n\n\
س:
ما
هو
الجذر
التربيعي
للموز؟
\n\
ج:
لا
تعليق
\n\n\
س:
كيف
يعمل
التلسكوب؟
\n\
ج:
يستخدم
التلسكوب
عدسات
أو
مرايا
لتركيز
الضوء
وجعل
الأجسام
تبدو
أقرب
\n\n\
س:
أين
أقيمت
الألعاب
الأولمبية
لعام
1992م؟
\n\
ج:
أقيمت
الألعاب
الأولمبية
لعام
1992م
في
برشلونة،
إسبانيا
\
'%}
\
{{prompt_qa
+
'
\n\n
س:
'
+
question
+
'
\n
ج:'}}"
doc_to_target
:
0
doc_to_choice
:
"
{{mc1_targets.choices}}"
should_decontaminate
:
True
doc_to_decontamination_query
:
question
metric_list
:
-
metric
:
acc
aggregation
:
mean
higher_is_better
:
true
-
metric
:
acc_norm
aggregation
:
mean
higher_is_better
:
true
-
metric
:
f1
higher_is_better
:
true
aggregation
:
!function
metrics.micro_f1_score
metadata
:
version
:
1.0
lm_eval/tasks/aradice/winogrande/metrics.py
0 → 100644
View file @
89b6bdb3
from
sklearn.metrics
import
f1_score
def
macro_f1_score
(
items
):
unzipped_list
=
list
(
zip
(
*
items
))
golds
=
unzipped_list
[
0
]
preds
=
unzipped_list
[
1
]
fscore
=
f1_score
(
golds
,
preds
,
average
=
"macro"
)
return
fscore
def
micro_f1_score
(
items
):
unzipped_list
=
list
(
zip
(
*
items
))
golds
=
unzipped_list
[
0
]
preds
=
unzipped_list
[
1
]
fscore
=
f1_score
(
golds
,
preds
,
average
=
"micro"
)
return
fscore
def
weighted_f1_score
(
items
):
unzipped_list
=
list
(
zip
(
*
items
))
golds
=
unzipped_list
[
0
]
preds
=
unzipped_list
[
1
]
fscore
=
f1_score
(
golds
,
preds
,
average
=
"weighted"
)
return
fscore
lm_eval/tasks/aradice/winogrande/utils.py
0 → 100644
View file @
89b6bdb3
def
doc_to_text
(
doc
):
answer_to_num
=
{
"1"
:
0
,
"2"
:
1
}
return
answer_to_num
[
doc
[
"answer"
]]
def
doc_to_target
(
doc
):
idx
=
doc
[
"sentence"
].
index
(
"_"
)
+
1
return
doc
[
"sentence"
][
idx
:].
strip
()
def
doc_to_choice
(
doc
):
idx
=
doc
[
"sentence"
].
index
(
"_"
)
options
=
[
doc
[
"option1"
],
doc
[
"option2"
]]
return
[
doc
[
"sentence"
][:
idx
]
+
opt
for
opt
in
options
]
lm_eval/tasks/aradice/winogrande/winogrande_egy.yaml
0 → 100644
View file @
89b6bdb3
task
:
AraDiCE_winogrande_egy
dataset_path
:
QCRI/AraDiCE-WinoGrande
dataset_name
:
Winogrande-egy
training_split
:
null
validation_split
:
null
test_split
:
test
output_type
:
multiple_choice
doc_to_text
:
!function
utils.doc_to_text
doc_to_target
:
!function
utils.doc_to_target
doc_to_choice
:
!function
utils.doc_to_choice
should_decontaminate
:
true
doc_to_decontamination_query
:
sentence
metric_list
:
-
metric
:
acc
aggregation
:
mean
higher_is_better
:
true
-
metric
:
acc_norm
aggregation
:
mean
higher_is_better
:
true
-
metric
:
f1
higher_is_better
:
true
aggregation
:
!function
metrics.micro_f1_score
metadata
:
version
:
1.0
lm_eval/tasks/aradice/winogrande/winogrande_eng.yaml
0 → 100644
View file @
89b6bdb3
task
:
AraDiCE_winogrande_eng
dataset_path
:
QCRI/AraDiCE-WinoGrande
dataset_name
:
Winogrande-eng
training_split
:
null
validation_split
:
null
test_split
:
test
output_type
:
multiple_choice
doc_to_text
:
!function
utils.doc_to_text
doc_to_target
:
!function
utils.doc_to_target
doc_to_choice
:
!function
utils.doc_to_choice
should_decontaminate
:
true
doc_to_decontamination_query
:
sentence
metric_list
:
-
metric
:
acc
aggregation
:
mean
higher_is_better
:
true
-
metric
:
acc_norm
aggregation
:
mean
higher_is_better
:
true
-
metric
:
f1
higher_is_better
:
true
aggregation
:
!function
metrics.micro_f1_score
metadata
:
version
:
1.0
lm_eval/tasks/aradice/winogrande/winogrande_lev.yaml
0 → 100644
View file @
89b6bdb3
task
:
AraDiCE_winogrande_lev
dataset_path
:
QCRI/AraDiCE-WinoGrande
dataset_name
:
Winogrande-lev
training_split
:
null
validation_split
:
null
test_split
:
test
output_type
:
multiple_choice
doc_to_text
:
!function
utils.doc_to_text
doc_to_target
:
!function
utils.doc_to_target
doc_to_choice
:
!function
utils.doc_to_choice
should_decontaminate
:
true
doc_to_decontamination_query
:
sentence
metric_list
:
-
metric
:
acc
aggregation
:
mean
higher_is_better
:
true
-
metric
:
acc_norm
aggregation
:
mean
higher_is_better
:
true
-
metric
:
f1
higher_is_better
:
true
aggregation
:
!function
metrics.micro_f1_score
metadata
:
version
:
1.0
lm_eval/tasks/aradice/winogrande/winogrande_msa.yaml
0 → 100644
View file @
89b6bdb3
task
:
AraDiCE_winogrande_msa
dataset_path
:
QCRI/AraDiCE-WinoGrande
dataset_name
:
Winogrande-msa
training_split
:
null
validation_split
:
null
test_split
:
test
output_type
:
multiple_choice
doc_to_text
:
!function
utils.doc_to_text
doc_to_target
:
!function
utils.doc_to_target
doc_to_choice
:
!function
utils.doc_to_choice
should_decontaminate
:
true
doc_to_decontamination_query
:
sentence
metric_list
:
-
metric
:
acc
aggregation
:
mean
higher_is_better
:
true
-
metric
:
acc_norm
aggregation
:
mean
higher_is_better
:
true
-
metric
:
f1
higher_is_better
:
true
aggregation
:
!function
metrics.micro_f1_score
metadata
:
version
:
1.0
lm_eval/tasks/arc/arc_challenge_chat.yaml
0 → 100644
View file @
89b6bdb3
tag
:
-
llama
task
:
arc_challenge_chat
dataset_path
:
allenai/ai2_arc
dataset_name
:
ARC-Challenge
output_type
:
generate_until
training_split
:
train
validation_split
:
validation
test_split
:
test
fewshot_split
:
train
doc_to_text
:
'
Given
the
following
question
and
four
candidate
answers
(A,
B,
C
and
D),
choose
the
best
answer.\nQuestion:
{{question.strip()}}\nA.
{{choices.text[0]}}\nB.
{{choices.text[1]}}\nC.
{{choices.text[2]}}{%
if
choices.text|length
>
3
%}\nD.
{{choices.text[3]}}{%
endif
%}\nYour
response
should
end
with
"The
best
answer
is
[the_answer_letter]"
where
the
[the_answer_letter]
is
one
of
A,
B,
C
or
D.'
gen_prefix
:
'
The
best
answer
is'
fewshot_delimiter
:
"
\n\n
"
doc_to_target
:
"
{{
'ABCD'[answerKey|int
-
1]
if
answerKey|string
in
'1234'
else
answerKey
}}"
num_fewshot
:
0
generation_kwargs
:
max_gen_toks
:
100
until
:
-
"
\n\n
"
-
"
."
metric_list
:
-
metric
:
exact_match
aggregation
:
mean
higher_is_better
:
true
ignore_case
:
true
ignore_punctuation
:
true
filter_list
:
-
name
:
remove_whitespace
filter
:
-
function
:
remove_whitespace
-
function
:
take_first
metadata
:
version
:
1.0
lm_eval/tasks/basque_bench/README.md
View file @
89b6bdb3
...
...
@@ -8,6 +8,7 @@ The new evaluation datasets included in BasqueBench are:
| Task | Category | Homepage |
|:-------------:|:-----:|:-----:|
| MGSM_eu | Math | https://huggingface.co/datasets/HiTZ/MGSM-eu |
| PIQA_eu | Question Answering | https://huggingface.co/datasets/HiTZ/PIQA-eu |
| WNLI_eu | Natural Language Inference | https://huggingface.co/datasets/HiTZ/wnli-eu |
| XCOPA_eu | Commonsense Reasoning | https://huggingface.co/datasets/HiTZ/XCOPA-eu |
...
...
@@ -63,6 +64,7 @@ The following tasks evaluate tasks on BasqueBench dataset using various scoring
-
`flores_pt-eu`
-
`mgsm_direct_eu`
-
`mgsm_native_cot_eu`
-
`piqa_eu`
-
`qnlieu`
-
`wnli_eu`
-
`xcopa_eu`
...
...
lm_eval/tasks/basque_bench/basque_bench.yaml
View file @
89b6bdb3
...
...
@@ -14,5 +14,6 @@ task:
-
xcopa_eu
-
mgsm_direct_eu
-
mgsm_native_cot_eu
-
piqa_eu
metadata
:
version
:
1.0
Prev
1
…
8
9
10
11
12
13
14
15
16
…
50
Next
Write
Preview
Markdown
is supported
0%
Try again
or
attach a new file
.
Attach a file
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Cancel
Please
register
or
sign in
to comment