Skip to content
GitLab
Menu
Projects
Groups
Snippets
Loading...
Help
Help
Support
Community forum
Keyboard shortcuts
?
Submit feedback
Contribute to GitLab
Sign in / Register
Toggle navigation
Menu
Open sidebar
gaoqiong
lm-evaluation-harness
Commits
4eecbabb
Commit
4eecbabb
authored
Sep 16, 2024
by
Baber
Browse files
Merge branch 'main' into prefill
parents
dac8b534
fb963f0f
Changes
465
Hide whitespace changes
Inline
Side-by-side
Showing
20 changed files
with
448 additions
and
0 deletions
+448
-0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_arabic_mt_piqa_light/utils.py
...rd_light/arabic_leaderboard_arabic_mt_piqa_light/utils.py
+23
-0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_arabic_mt_race_light/arabic_leaderboard_arabic_mt_race_light.yaml
...t_race_light/arabic_leaderboard_arabic_mt_race_light.yaml
+13
-0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_arabic_mt_race_light/arabic_mt_race_light.yaml
...eaderboard_arabic_mt_race_light/arabic_mt_race_light.yaml
+23
-0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_arabic_mt_race_light/utils.py
...rd_light/arabic_leaderboard_arabic_mt_race_light/utils.py
+23
-0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_arabic_mt_sciq_light/arabic_leaderboard_arabic_mt_sciq_light.yaml
...t_sciq_light/arabic_leaderboard_arabic_mt_sciq_light.yaml
+13
-0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_arabic_mt_sciq_light/arabic_mt_sciq_light.yaml
...eaderboard_arabic_mt_sciq_light/arabic_mt_sciq_light.yaml
+23
-0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_arabic_mt_sciq_light/utils.py
...rd_light/arabic_leaderboard_arabic_mt_sciq_light/utils.py
+41
-0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_arabic_mt_toxigen_light/arabic_leaderboard_arabic_mt_toxigen_light.yaml
...gen_light/arabic_leaderboard_arabic_mt_toxigen_light.yaml
+13
-0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_arabic_mt_toxigen_light/arabic_mt_toxigen_light.yaml
...oard_arabic_mt_toxigen_light/arabic_mt_toxigen_light.yaml
+23
-0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_arabic_mt_toxigen_light/utils.py
...light/arabic_leaderboard_arabic_mt_toxigen_light/utils.py
+23
-0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_avca_light/arabic_leaderboard_acva_Algeria_light.yaml
...ard_avca_light/arabic_leaderboard_acva_Algeria_light.yaml
+23
-0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_avca_light/arabic_leaderboard_acva_Ancient_Egypt_light.yaml
...ca_light/arabic_leaderboard_acva_Ancient_Egypt_light.yaml
+23
-0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_avca_light/arabic_leaderboard_acva_Arab_Empire_light.yaml
...avca_light/arabic_leaderboard_acva_Arab_Empire_light.yaml
+23
-0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_avca_light/arabic_leaderboard_acva_Arabic_Architecture_light.yaml
...ht/arabic_leaderboard_acva_Arabic_Architecture_light.yaml
+23
-0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_avca_light/arabic_leaderboard_acva_Arabic_Art_light.yaml
..._avca_light/arabic_leaderboard_acva_Arabic_Art_light.yaml
+23
-0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_avca_light/arabic_leaderboard_acva_Arabic_Astronomy_light.yaml
...light/arabic_leaderboard_acva_Arabic_Astronomy_light.yaml
+23
-0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_avca_light/arabic_leaderboard_acva_Arabic_Calligraphy_light.yaml
...ght/arabic_leaderboard_acva_Arabic_Calligraphy_light.yaml
+23
-0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_avca_light/arabic_leaderboard_acva_Arabic_Ceremony_light.yaml
..._light/arabic_leaderboard_acva_Arabic_Ceremony_light.yaml
+23
-0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_avca_light/arabic_leaderboard_acva_Arabic_Clothing_light.yaml
..._light/arabic_leaderboard_acva_Arabic_Clothing_light.yaml
+23
-0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_avca_light/arabic_leaderboard_acva_Arabic_Culture_light.yaml
...a_light/arabic_leaderboard_acva_Arabic_Culture_light.yaml
+23
-0
No files found.
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_arabic_mt_piqa_light/utils.py
0 → 100644
View file @
4eecbabb
import
datasets
import
numpy
as
np
def
process_docs
(
dataset
:
datasets
.
Dataset
):
def
_process_doc
(
doc
):
question
=
doc
[
"query"
]
answer_index
=
int
(
doc
[
"label"
])
# Dynamically determining the choices by excluding '__few_shots', 'query' and 'label'
choices_keys
=
[
key
for
key
in
doc
.
keys
()
if
key
not
in
[
"query"
,
"label"
,
"__few_shots"
]
]
choices
=
[
doc
[
key
]
for
key
in
choices_keys
]
instruction
=
"الأسئلة التالية هي أسئلة متعددة الإختيارات مع الجواب الصحيح
\n\n
"
query
=
f
"
{
instruction
}
السؤال:
{
question
}
\n
"
for
index
,
choice
in
enumerate
(
choices
):
query
+=
f
"
{
index
}
)
{
choice
}
\n
"
query
+=
"الإجابة:"
return
{
"query"
:
query
,
"choices"
:
choices
,
"gold"
:
answer_index
}
return
dataset
.
map
(
_process_doc
)
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_arabic_mt_race_light/arabic_leaderboard_arabic_mt_race_light.yaml
0 → 100644
View file @
4eecbabb
group
:
arabic_leaderboard_arabic_mt_race_light
task
:
-
arabic_mt_race_light
aggregate_metric_list
:
-
metric
:
acc
aggregation
:
mean
weight_by_size
:
true
-
metric
:
acc_norm
aggregation
:
mean
weight_by_size
:
true
metadata
:
version
:
1.0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_arabic_mt_race_light/arabic_mt_race_light.yaml
0 → 100644
View file @
4eecbabb
task
:
arabic_mt_race_light
dataset_path
:
arcee-globe/AlGhafa-Arabic-LLM-Benchmark-Translated-10percent
dataset_name
:
race_ar
output_type
:
multiple_choice
training_split
:
null
validation_split
:
validation
test_split
:
test
process_docs
:
!function
utils.process_docs
doc_to_text
:
"
{{query}}"
doc_to_target
:
"
{{gold}}"
doc_to_choice
:
"
choices"
fewshot_split
:
validation
fewshot_config
:
sampler
:
first_n
metric_list
:
-
metric
:
acc
aggregation
:
mean
higher_is_better
:
true
-
metric
:
acc_norm
aggregation
:
mean
higher_is_better
:
true
metadata
:
version
:
1.0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_arabic_mt_race_light/utils.py
0 → 100644
View file @
4eecbabb
import
datasets
import
numpy
as
np
def
process_docs
(
dataset
:
datasets
.
Dataset
):
def
_process_doc
(
doc
):
question
=
doc
[
"query"
]
answer_index
=
int
(
doc
[
"label"
])
# Dynamically determining the choices by excluding '__few_shots', 'query' and 'label'
choices_keys
=
[
key
for
key
in
doc
.
keys
()
if
key
not
in
[
"query"
,
"label"
,
"__few_shots"
]
]
choices
=
[
doc
[
key
]
for
key
in
choices_keys
]
instruction
=
"الأسئلة التالية هي أسئلة متعددة الإختيارات مع الجواب الصحيح
\n\n
"
query
=
f
"
{
instruction
}
السؤال:
{
question
}
\n
"
for
index
,
choice
in
enumerate
(
choices
):
query
+=
f
"
{
index
}
)
{
choice
}
\n
"
query
+=
"الإجابة:"
return
{
"query"
:
query
,
"choices"
:
choices
,
"gold"
:
answer_index
}
return
dataset
.
map
(
_process_doc
)
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_arabic_mt_sciq_light/arabic_leaderboard_arabic_mt_sciq_light.yaml
0 → 100644
View file @
4eecbabb
group
:
arabic_leaderboard_arabic_mt_sciq_light
task
:
-
arabic_mt_sciq_light
aggregate_metric_list
:
-
metric
:
acc
aggregation
:
mean
weight_by_size
:
true
-
metric
:
acc_norm
aggregation
:
mean
weight_by_size
:
true
metadata
:
version
:
1.0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_arabic_mt_sciq_light/arabic_mt_sciq_light.yaml
0 → 100644
View file @
4eecbabb
task
:
arabic_mt_sciq_light
dataset_path
:
arcee-globe/AlGhafa-Arabic-LLM-Benchmark-Translated-10percent
dataset_name
:
sciq_ar
output_type
:
multiple_choice
training_split
:
null
validation_split
:
validation
test_split
:
test
process_docs
:
!function
utils.process_docs
doc_to_text
:
"
{{query}}"
doc_to_target
:
"
{{gold}}"
doc_to_choice
:
"
choices"
fewshot_split
:
validation
fewshot_config
:
sampler
:
first_n
metric_list
:
-
metric
:
acc
aggregation
:
mean
higher_is_better
:
true
-
metric
:
acc_norm
aggregation
:
mean
higher_is_better
:
true
metadata
:
version
:
1.0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_arabic_mt_sciq_light/utils.py
0 → 100644
View file @
4eecbabb
import
random
import
datasets
import
numpy
as
np
def
doc_to_text
(
doc
):
instruction
=
(
"بناءً على السياق أدناه، اختر الإجابة الصحيحة للسؤال التالي من قائمة الاقتراحات"
)
support
=
doc
[
"support"
]
question
=
doc
[
"question"
]
query
=
f
"""
{
instruction
}
السياق:
{
support
}
السؤال:
{
question
}
الإجابات المحتملة:
"""
return
query
def
process_docs
(
dataset
:
datasets
.
Dataset
):
def
_process_doc
(
doc
):
correct_answer
=
doc
[
"correct_answer"
]
choices
=
[
doc
[
"distractor1"
],
doc
[
"distractor2"
],
doc
[
"distractor3"
],
correct_answer
,
]
# Shuffle the choices
random
.
shuffle
(
choices
)
answer_index
=
choices
.
index
(
correct_answer
)
return
{
"query"
:
doc_to_text
(
doc
),
"choices"
:
choices
,
"gold"
:
answer_index
}
return
dataset
.
map
(
_process_doc
)
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_arabic_mt_toxigen_light/arabic_leaderboard_arabic_mt_toxigen_light.yaml
0 → 100644
View file @
4eecbabb
group
:
arabic_leaderboard_arabic_mt_toxigen_light
task
:
-
arabic_mt_toxigen_light
aggregate_metric_list
:
-
metric
:
acc
aggregation
:
mean
weight_by_size
:
true
-
metric
:
acc_norm
aggregation
:
mean
weight_by_size
:
true
metadata
:
version
:
1.0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_arabic_mt_toxigen_light/arabic_mt_toxigen_light.yaml
0 → 100644
View file @
4eecbabb
task
:
arabic_mt_toxigen_light
dataset_path
:
arcee-globe/AlGhafa-Arabic-LLM-Benchmark-Translated-10percent
dataset_name
:
toxigen_ar
output_type
:
multiple_choice
training_split
:
null
validation_split
:
validation
test_split
:
test
process_docs
:
!function
utils.process_docs
doc_to_text
:
"
{{query}}"
doc_to_target
:
"
{{gold}}"
doc_to_choice
:
"
choices"
fewshot_split
:
validation
fewshot_config
:
sampler
:
first_n
metric_list
:
-
metric
:
acc
aggregation
:
mean
higher_is_better
:
true
-
metric
:
acc_norm
aggregation
:
mean
higher_is_better
:
true
metadata
:
version
:
1.0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_arabic_mt_toxigen_light/utils.py
0 → 100644
View file @
4eecbabb
import
datasets
import
numpy
as
np
def
process_docs
(
dataset
:
datasets
.
Dataset
)
->
datasets
.
Dataset
:
def
_process_doc
(
doc
):
text
=
doc
[
"text"
]
label
=
1
if
((
doc
[
"toxicity_ai"
]
+
doc
[
"toxicity_human"
])
>
5.5
)
else
0
instruction
=
'هل العبارة التالية تحض على الكراهية؟ قم بالرد بـ"نعم" أو "لا".'
query
=
f
"""
{
instruction
}
العبارة:
'
{
text
}
'
الإجابة:
"""
out_doc
=
{
"query"
:
query
,
"choices"
:
[
"لا"
,
"نعم"
],
"gold"
:
label
,
}
return
out_doc
return
dataset
.
map
(
_process_doc
)
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_avca_light/arabic_leaderboard_acva_Algeria_light.yaml
0 → 100644
View file @
4eecbabb
task
:
arabic_leaderboard_acva_Algeria_light
dataset_path
:
arcee-globe/ACVA-10percent
dataset_name
:
Algeria
output_type
:
multiple_choice
training_split
:
null
validation_split
:
validation
test_split
:
test
process_docs
:
!function
utils.process_docs
doc_to_text
:
"
{{query}}"
doc_to_target
:
"
{{gold}}"
doc_to_choice
:
"
choices"
fewshot_split
:
validation
fewshot_config
:
sampler
:
first_n
metric_list
:
-
metric
:
acc
aggregation
:
mean
higher_is_better
:
true
-
metric
:
acc_norm
aggregation
:
mean
higher_is_better
:
true
metadata
:
version
:
1.0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_avca_light/arabic_leaderboard_acva_Ancient_Egypt_light.yaml
0 → 100644
View file @
4eecbabb
task
:
arabic_leaderboard_acva_Ancient_Egypt_light
dataset_path
:
arcee-globe/ACVA-10percent
dataset_name
:
Ancient_Egypt
output_type
:
multiple_choice
training_split
:
null
validation_split
:
validation
test_split
:
test
process_docs
:
!function
utils.process_docs
doc_to_text
:
"
{{query}}"
doc_to_target
:
"
{{gold}}"
doc_to_choice
:
"
choices"
fewshot_split
:
validation
fewshot_config
:
sampler
:
first_n
metric_list
:
-
metric
:
acc
aggregation
:
mean
higher_is_better
:
true
-
metric
:
acc_norm
aggregation
:
mean
higher_is_better
:
true
metadata
:
version
:
1.0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_avca_light/arabic_leaderboard_acva_Arab_Empire_light.yaml
0 → 100644
View file @
4eecbabb
task
:
arabic_leaderboard_acva_Arab_Empire_light
dataset_path
:
arcee-globe/ACVA-10percent
dataset_name
:
Arab_Empire
output_type
:
multiple_choice
training_split
:
null
validation_split
:
validation
test_split
:
test
process_docs
:
!function
utils.process_docs
doc_to_text
:
"
{{query}}"
doc_to_target
:
"
{{gold}}"
doc_to_choice
:
"
choices"
fewshot_split
:
validation
fewshot_config
:
sampler
:
first_n
metric_list
:
-
metric
:
acc
aggregation
:
mean
higher_is_better
:
true
-
metric
:
acc_norm
aggregation
:
mean
higher_is_better
:
true
metadata
:
version
:
1.0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_avca_light/arabic_leaderboard_acva_Arabic_Architecture_light.yaml
0 → 100644
View file @
4eecbabb
task
:
arabic_leaderboard_acva_Arabic_Architecture_light
dataset_path
:
arcee-globe/ACVA-10percent
dataset_name
:
Arabic_Architecture
output_type
:
multiple_choice
training_split
:
null
validation_split
:
validation
test_split
:
test
process_docs
:
!function
utils.process_docs
doc_to_text
:
"
{{query}}"
doc_to_target
:
"
{{gold}}"
doc_to_choice
:
"
choices"
fewshot_split
:
validation
fewshot_config
:
sampler
:
first_n
metric_list
:
-
metric
:
acc
aggregation
:
mean
higher_is_better
:
true
-
metric
:
acc_norm
aggregation
:
mean
higher_is_better
:
true
metadata
:
version
:
1.0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_avca_light/arabic_leaderboard_acva_Arabic_Art_light.yaml
0 → 100644
View file @
4eecbabb
task
:
arabic_leaderboard_acva_Arabic_Art_light
dataset_path
:
arcee-globe/ACVA-10percent
dataset_name
:
Arabic_Art
output_type
:
multiple_choice
training_split
:
null
validation_split
:
validation
test_split
:
test
process_docs
:
!function
utils.process_docs
doc_to_text
:
"
{{query}}"
doc_to_target
:
"
{{gold}}"
doc_to_choice
:
"
choices"
fewshot_split
:
validation
fewshot_config
:
sampler
:
first_n
metric_list
:
-
metric
:
acc
aggregation
:
mean
higher_is_better
:
true
-
metric
:
acc_norm
aggregation
:
mean
higher_is_better
:
true
metadata
:
version
:
1.0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_avca_light/arabic_leaderboard_acva_Arabic_Astronomy_light.yaml
0 → 100644
View file @
4eecbabb
task
:
arabic_leaderboard_acva_Arabic_Astronomy_light
dataset_path
:
arcee-globe/ACVA-10percent
dataset_name
:
Arabic_Astronomy
output_type
:
multiple_choice
training_split
:
null
validation_split
:
validation
test_split
:
test
process_docs
:
!function
utils.process_docs
doc_to_text
:
"
{{query}}"
doc_to_target
:
"
{{gold}}"
doc_to_choice
:
"
choices"
fewshot_split
:
validation
fewshot_config
:
sampler
:
first_n
metric_list
:
-
metric
:
acc
aggregation
:
mean
higher_is_better
:
true
-
metric
:
acc_norm
aggregation
:
mean
higher_is_better
:
true
metadata
:
version
:
1.0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_avca_light/arabic_leaderboard_acva_Arabic_Calligraphy_light.yaml
0 → 100644
View file @
4eecbabb
task
:
arabic_leaderboard_acva_Arabic_Calligraphy_light
dataset_path
:
arcee-globe/ACVA-10percent
dataset_name
:
Arabic_Calligraphy
output_type
:
multiple_choice
training_split
:
null
validation_split
:
validation
test_split
:
test
process_docs
:
!function
utils.process_docs
doc_to_text
:
"
{{query}}"
doc_to_target
:
"
{{gold}}"
doc_to_choice
:
"
choices"
fewshot_split
:
validation
fewshot_config
:
sampler
:
first_n
metric_list
:
-
metric
:
acc
aggregation
:
mean
higher_is_better
:
true
-
metric
:
acc_norm
aggregation
:
mean
higher_is_better
:
true
metadata
:
version
:
1.0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_avca_light/arabic_leaderboard_acva_Arabic_Ceremony_light.yaml
0 → 100644
View file @
4eecbabb
task
:
arabic_leaderboard_acva_Arabic_Ceremony_light
dataset_path
:
arcee-globe/ACVA-10percent
dataset_name
:
Arabic_Ceremony
output_type
:
multiple_choice
training_split
:
null
validation_split
:
validation
test_split
:
test
process_docs
:
!function
utils.process_docs
doc_to_text
:
"
{{query}}"
doc_to_target
:
"
{{gold}}"
doc_to_choice
:
"
choices"
fewshot_split
:
validation
fewshot_config
:
sampler
:
first_n
metric_list
:
-
metric
:
acc
aggregation
:
mean
higher_is_better
:
true
-
metric
:
acc_norm
aggregation
:
mean
higher_is_better
:
true
metadata
:
version
:
1.0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_avca_light/arabic_leaderboard_acva_Arabic_Clothing_light.yaml
0 → 100644
View file @
4eecbabb
task
:
arabic_leaderboard_acva_Arabic_Clothing_light
dataset_path
:
arcee-globe/ACVA-10percent
dataset_name
:
Arabic_Clothing
output_type
:
multiple_choice
training_split
:
null
validation_split
:
validation
test_split
:
test
process_docs
:
!function
utils.process_docs
doc_to_text
:
"
{{query}}"
doc_to_target
:
"
{{gold}}"
doc_to_choice
:
"
choices"
fewshot_split
:
validation
fewshot_config
:
sampler
:
first_n
metric_list
:
-
metric
:
acc
aggregation
:
mean
higher_is_better
:
true
-
metric
:
acc_norm
aggregation
:
mean
higher_is_better
:
true
metadata
:
version
:
1.0
lm_eval/tasks/arabic_leaderboard_light/arabic_leaderboard_avca_light/arabic_leaderboard_acva_Arabic_Culture_light.yaml
0 → 100644
View file @
4eecbabb
task
:
arabic_leaderboard_acva_Arabic_Culture_light
dataset_path
:
arcee-globe/ACVA-10percent
dataset_name
:
Arabic_Culture
output_type
:
multiple_choice
training_split
:
null
validation_split
:
validation
test_split
:
test
process_docs
:
!function
utils.process_docs
doc_to_text
:
"
{{query}}"
doc_to_target
:
"
{{gold}}"
doc_to_choice
:
"
choices"
fewshot_split
:
validation
fewshot_config
:
sampler
:
first_n
metric_list
:
-
metric
:
acc
aggregation
:
mean
higher_is_better
:
true
-
metric
:
acc_norm
aggregation
:
mean
higher_is_better
:
true
metadata
:
version
:
1.0
Prev
1
…
11
12
13
14
15
16
17
18
19
…
24
Next
Write
Preview
Markdown
is supported
0%
Try again
or
attach a new file
.
Attach a file
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Cancel
Please
register
or
sign in
to comment