gaoqiong / lm-evaluation-harness
Commit 601be343, authored Jun 23, 2025 by Baber
Merge branch 'main' into feature/eval_from_config
Parents: d0884a96, 68c3a811
Showing 20 changed files with 395 additions and 0 deletions (+395 -0)
lm_eval/tasks/afrobench/afriqa/prompt_2/afriqa_kin.yaml  +13 -0
lm_eval/tasks/afrobench/afriqa/prompt_2/afriqa_swa.yaml  +16 -0
lm_eval/tasks/afrobench/afriqa/prompt_2/afriqa_twi.yaml  +13 -0
lm_eval/tasks/afrobench/afriqa/prompt_2/afriqa_yor.yaml  +13 -0
lm_eval/tasks/afrobench/afriqa/prompt_2/afriqa_zul.yaml  +13 -0
lm_eval/tasks/afrobench/afriqa/prompt_2/utils.py  +53 -0
lm_eval/tasks/afrobench/afriqa/prompt_3/afriqa  +42 -0
lm_eval/tasks/afrobench/afriqa/prompt_3/afriqa_bem.yaml  +12 -0
lm_eval/tasks/afrobench/afriqa/prompt_3/afriqa_fon.yaml  +12 -0
lm_eval/tasks/afrobench/afriqa/prompt_3/afriqa_hau.yaml  +12 -0
lm_eval/tasks/afrobench/afriqa/prompt_3/afriqa_ibo.yaml  +12 -0
lm_eval/tasks/afrobench/afriqa/prompt_3/afriqa_kin.yaml  +12 -0
lm_eval/tasks/afrobench/afriqa/prompt_3/afriqa_swa.yaml  +15 -0
lm_eval/tasks/afrobench/afriqa/prompt_3/afriqa_twi.yaml  +12 -0
lm_eval/tasks/afrobench/afriqa/prompt_3/afriqa_yor.yaml  +12 -0
lm_eval/tasks/afrobench/afriqa/prompt_3/afriqa_zul.yaml  +12 -0
lm_eval/tasks/afrobench/afriqa/prompt_3/utils.py  +53 -0
lm_eval/tasks/afrobench/afriqa/prompt_4/afriqa  +42 -0
lm_eval/tasks/afrobench/afriqa/prompt_4/afriqa_bem.yaml  +13 -0
lm_eval/tasks/afrobench/afriqa/prompt_4/afriqa_fon.yaml  +13 -0
Too many changes to show. To preserve performance, only 1000 of 1000+ files are displayed.
lm_eval/tasks/afrobench/afriqa/prompt_2/afriqa_kin.yaml
0 → 100644
# Generated by utils.py
dataset_name: kin
doc_to_text: 'Your task is to answer a question given a context. The question is in
  Kinyarwanda, while the context is in English or French.Make sure you respond with
  the shortest span in the context that contains the answer.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
task: afriqa_kin_prompt_2
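To see what the model would receive, the `doc_to_text` template can be filled in for one document. The sketch below is only an illustration: it substitutes the two `{{...}}` placeholders with a small regex rather than a real template engine, the helper `render_prompt` and the sample document are hypothetical, and the line breaks between the instruction, `Question:`, `Context:` and `Answer:` are assumed from the layout of the YAML string.

```python
import re

# Prompt text copied from afriqa_kin.yaml (prompt_2).
# Assumption: the parts are separated by single newlines.
TEMPLATE = (
    "Your task is to answer a question given a context. "
    "The question is in Kinyarwanda, while the context is in English or French."
    "Make sure you respond with the shortest span in the context "
    "that contains the answer.\n"
    "Question: {{question_lang}}\n"
    "Context: {{context}}\n"
    "Answer:"
)


def render_prompt(template: str, doc: dict) -> str:
    """Hypothetical helper: replace each {{name}} with doc[name]."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda match: str(doc[match.group(1)]),
        template,
    )


# Invented document; only the field names come from the template.
doc = {
    "question_lang": "<question in Kinyarwanda>",
    "context": "Kigali is the capital of Rwanda.",
}
prompt = render_prompt(TEMPLATE, doc)
print(prompt)
```

The prompt ends with `Answer:` so that the generated continuation is the answer span itself.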
lm_eval/tasks/afrobench/afriqa/prompt_2/afriqa_swa.yaml
0 → 100644
# Generated by utils.py
dataset_name: swa
doc_to_text: 'Your task is to answer a question given a context. The question is in
  Swahili, while the context is in English or French.Make sure you respond with
  the shortest span in the context that contains the answer.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
fewshot_split: test
fewshot_config:
  sampler: first_n
task: afriqa_swa_prompt_2
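The Swahili config differs from the others by overriding `fewshot_split` and selecting the `first_n` sampler. A minimal sketch of what such a sampler presumably does is shown here, assuming it takes the first `n` documents of the few-shot split in order; the function name `first_n` and the toy documents are hypothetical and are not the harness's own implementation.

```python
def first_n(docs: list, n: int) -> list:
    """Hypothetical stand-in for a `first_n` sampler.

    Returns the first n documents in their original order, with no shuffling.
    """
    if n > len(docs):
        raise ValueError(f"requested {n} shots but only {len(docs)} docs available")
    return docs[:n]


# Toy few-shot split; real documents would carry question_lang, context, etc.
fewshot_docs = ["doc0", "doc1", "doc2", "doc3"]
shots = first_n(fewshot_docs, 2)
print(shots)  # ['doc0', 'doc1']
```

Since `fewshot_split` is set to `test` in this file, the shots would be drawn from the same split that is evaluated, so a deterministic sampler makes it predictable which test documents appear as examples.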
lm_eval/tasks/afrobench/afriqa/prompt_2/afriqa_twi.yaml
0 → 100644
# Generated by utils.py
dataset_name: twi
doc_to_text: 'Your task is to answer a question given a context. The question is in
  Twi, while the context is in English or French.Make sure you respond with
  the shortest span in the context that contains the answer.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
task: afriqa_twi_prompt_2
lm_eval/tasks/afrobench/afriqa/prompt_2/afriqa_yor.yaml
0 → 100644
# Generated by utils.py
dataset_name: yor
doc_to_text: 'Your task is to answer a question given a context. The question is in
  Yoruba, while the context is in English or French.Make sure you respond with
  the shortest span in the context that contains the answer.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
task: afriqa_yor_prompt_2
lm_eval/tasks/afrobench/afriqa/prompt_2/afriqa_zul.yaml
0 → 100644
# Generated by utils.py
dataset_name: zul
doc_to_text: 'Your task is to answer a question given a context. The question is in
  Zulu, while the context is in English or French.Make sure you respond with
  the shortest span in the context that contains the answer.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
task: afriqa_zul_prompt_2
lm_eval/tasks/afrobench/afriqa/prompt_2/utils.py
0 → 100644
import re
import string
from collections import Counter


def normalize_answer(s):
    """
    Taken from the official evaluation script for v1.1 of the SQuAD dataset.
    Lower text and remove punctuation, articles and extra whitespace.
    """

    def remove_articles(text):
        return re.sub(r"\b(a|an|the)\b", " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))


def f1(items):
    """
    Taken from the official evaluation script for v1.1 of the SQuAD dataset.
    """

    unzipped_list = list(zip(*items))
    golds = unzipped_list[0]
    preds = unzipped_list[1]
    f1_list = []
    for i in range(len(golds)):
        prediction_tokens = normalize_answer(preds[i]).split()
        references_tokens = normalize_answer(golds[i]).split()
        common = Counter(prediction_tokens) & Counter(references_tokens)
        num_same = sum(common.values())
        if num_same == 0:
            f1_score = 0
        else:
            precision = 1.0 * num_same / len(prediction_tokens)
            recall = 1.0 * num_same / len(references_tokens)
            f1_score = (2 * precision * recall) / (precision + recall)
        f1_list.append(f1_score)

    return sum(f1_list) / len(f1_list)
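A small worked example may help make the token-level F1 computed by `f1` concrete. The sketch below re-implements the same normalization and overlap logic for a single gold/prediction pair; the names `normalize` and `token_f1` and the sample strings are hypothetical and chosen only for illustration.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(gold: str, pred: str) -> float:
    """Token-overlap F1 for one (gold, prediction) pair."""
    gold_tokens, pred_tokens = normalize(gold), normalize(pred)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(gold_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# gold tokens: [eiffel, tower]; prediction tokens: [eiffel, tower, in, paris]
# precision = 2/4, recall = 2/2, F1 = 2 * 0.5 * 1.0 / 1.5
score = token_f1("the Eiffel Tower", "Eiffel Tower in Paris")
print(round(score, 4))  # 0.6667
```

The aggregation in `f1` then averages such per-pair scores over all `(gold, prediction)` items it receives.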
lm_eval/tasks/afrobench/afriqa/prompt_3/afriqa
0 → 100644
tag:
  - afrobench_xqa_tasks
  - afriqa_prompt_3
dataset_kwargs: {trust_remote_code: True}
dataset_path: masakhane/afriqa-gold-passages
dataset_name: null
output_type: generate_until
test_split: test
fewshot_split: train
doc_to_target: answer_pivot
should_decontaminate: true
doc_to_decontamination_query: question_lang
generation_kwargs:
  until:
    - "\n"
  do_sample: false
  temperature: 0.0
filter_list:
  - name: remove_whitespace
    filter:
      - function: remove_whitespace
      - function: take_first
target_delimiter: " "
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
      - "."
      - ","
      - "\\$"
  - metric: f1
    aggregation: !function utils.f1
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
      - "."
      - ","
      - "\\$"
metadata:
  version: 1.0
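The `generation_kwargs` and `filter_list` entries together determine what string is finally scored. The following is a rough sketch of that post-processing, assuming that generation is cut at the first `until` sequence, that `remove_whitespace` strips leading whitespace, and that `take_first` keeps the first response per document. The function `postprocess` is hypothetical and the harness's own filter classes may behave differently.

```python
def postprocess(responses: list[str], until: list[str]) -> str:
    """Hypothetical approximation of until + remove_whitespace + take_first."""
    cleaned = []
    for resp in responses:
        # Truncate at the earliest stop sequence, if any occurs.
        cut = len(resp)
        for stop in until:
            idx = resp.find(stop)
            if idx != -1:
                cut = min(cut, idx)
        # Assumed behaviour of remove_whitespace: drop leading whitespace.
        cleaned.append(resp[:cut].lstrip())
    # Assumed behaviour of take_first: keep only the first response.
    return cleaned[0]


raw = [" Kigali\nQuestion: another question"]
answer = postprocess(raw, until=["\n"])
print(answer)  # Kigali
```

Stopping at the first newline keeps the model from continuing into a new `Question:` block, which would otherwise hurt both exact match and F1.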
lm_eval/tasks/afrobench/afriqa/prompt_3/afriqa_bem.yaml
0 → 100644
# Generated by utils.py
dataset_name: bem
doc_to_text: 'Given the context, provide the answer to the following question.Ensure
  your response is concise and directly from the context.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
task: afriqa_bem_prompt_3
lm_eval/tasks/afrobench/afriqa/prompt_3/afriqa_fon.yaml
0 → 100644
# Generated by utils.py
dataset_name: fon
doc_to_text: 'Given the context, provide the answer to the following question.Ensure
  your response is concise and directly from the context.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
task: afriqa_fon_prompt_3
lm_eval/tasks/afrobench/afriqa/prompt_3/afriqa_hau.yaml
0 → 100644
# Generated by utils.py
dataset_name: hau
doc_to_text: 'Given the context, provide the answer to the following question.Ensure
  your response is concise and directly from the context.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
task: afriqa_hau_prompt_3
lm_eval/tasks/afrobench/afriqa/prompt_3/afriqa_ibo.yaml
0 → 100644
# Generated by utils.py
dataset_name: ibo
doc_to_text: 'Given the context, provide the answer to the following question.Ensure
  your response is concise and directly from the context.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
task: afriqa_ibo_prompt_3
lm_eval/tasks/afrobench/afriqa/prompt_3/afriqa_kin.yaml
0 → 100644
# Generated by utils.py
dataset_name: kin
doc_to_text: 'Given the context, provide the answer to the following question.Ensure
  your response is concise and directly from the context.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
task: afriqa_kin_prompt_3
lm_eval/tasks/afrobench/afriqa/prompt_3/afriqa_swa.yaml
0 → 100644
# Generated by utils.py
dataset_name: swa
doc_to_text: 'Given the context, provide the answer to the following question.Ensure
  your response is concise and directly from the context.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
fewshot_split: test
fewshot_config:
  sampler: first_n
task: afriqa_swa_prompt_3
lm_eval/tasks/afrobench/afriqa/prompt_3/afriqa_twi.yaml
0 → 100644
# Generated by utils.py
dataset_name: twi
doc_to_text: 'Given the context, provide the answer to the following question.Ensure
  your response is concise and directly from the context.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
task: afriqa_twi_prompt_3
lm_eval/tasks/afrobench/afriqa/prompt_3/afriqa_yor.yaml
0 → 100644
# Generated by utils.py
dataset_name: yor
doc_to_text: 'Given the context, provide the answer to the following question.Ensure
  your response is concise and directly from the context.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
task: afriqa_yor_prompt_3
lm_eval/tasks/afrobench/afriqa/prompt_3/afriqa_zul.yaml
0 → 100644
# Generated by utils.py
dataset_name: zul
doc_to_text: 'Given the context, provide the answer to the following question.Ensure
  your response is concise and directly from the context.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
task: afriqa_zul_prompt_3
lm_eval/tasks/afrobench/afriqa/prompt_3/utils.py
0 → 100644
import re
import string
from collections import Counter


def normalize_answer(s):
    """
    Taken from the official evaluation script for v1.1 of the SQuAD dataset.
    Lower text and remove punctuation, articles and extra whitespace.
    """

    def remove_articles(text):
        return re.sub(r"\b(a|an|the)\b", " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))


def f1(items):
    """
    Taken from the official evaluation script for v1.1 of the SQuAD dataset.
    """

    unzipped_list = list(zip(*items))
    golds = unzipped_list[0]
    preds = unzipped_list[1]
    f1_list = []
    for i in range(len(golds)):
        prediction_tokens = normalize_answer(preds[i]).split()
        references_tokens = normalize_answer(golds[i]).split()
        common = Counter(prediction_tokens) & Counter(references_tokens)
        num_same = sum(common.values())
        if num_same == 0:
            f1_score = 0
        else:
            precision = 1.0 * num_same / len(prediction_tokens)
            recall = 1.0 * num_same / len(references_tokens)
            f1_score = (2 * precision * recall) / (precision + recall)
        f1_list.append(f1_score)

    return sum(f1_list) / len(f1_list)
lm_eval/tasks/afrobench/afriqa/prompt_4/afriqa
0 → 100644
tag:
  - afrobench_xqa_tasks
  - afriqa_prompt_4
dataset_kwargs: {trust_remote_code: True}
dataset_path: masakhane/afriqa-gold-passages
dataset_name: null
output_type: generate_until
test_split: test
fewshot_split: train
doc_to_target: answer_pivot
should_decontaminate: true
doc_to_decontamination_query: question_lang
generation_kwargs:
  until:
    - "\n"
  do_sample: false
  temperature: 0.0
filter_list:
  - name: remove_whitespace
    filter:
      - function: remove_whitespace
      - function: take_first
target_delimiter: " "
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
      - "."
      - ","
      - "\\$"
  - metric: f1
    aggregation: !function utils.f1
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
      - "."
      - ","
      - "\\$"
metadata:
  version: 1.0
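The `exact_match` entry in `metric_list` combines a per-document comparison with a `mean` aggregation. A minimal sketch of that combination follows, assuming `ignore_case` lowercases both strings and `ignore_punctuation` removes every character in `string.punctuation` before comparing. The helper names and the sample pairs are hypothetical, and the harness's own metric may normalize differently.

```python
import string

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def exact_match(gold: str, pred: str,
                ignore_case: bool = True,
                ignore_punctuation: bool = True) -> float:
    """Hypothetical per-document exact match with the two config flags."""
    def norm(text: str) -> str:
        if ignore_case:
            text = text.lower()
        if ignore_punctuation:
            text = text.translate(_PUNCT_TABLE)
        return text

    return float(norm(gold) == norm(pred))


def mean(scores: list[float]) -> float:
    """Plain arithmetic mean, standing in for `aggregation: mean`."""
    return sum(scores) / len(scores)


# Invented (gold, prediction) pairs for illustration.
pairs = [("Kigali", "kigali."), ("Paris", "Lyon")]
scores = [exact_match(gold, pred) for gold, pred in pairs]
print(mean(scores))  # 0.5
```

Exact match is all-or-nothing per document, which is why the config pairs it with the token-level `f1` metric that gives partial credit for overlapping spans.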
lm_eval/tasks/afrobench/afriqa/prompt_4/afriqa_bem.yaml
0 → 100644
# Generated by utils.py
dataset_name: bem
doc_to_text: 'You are an AI assistant and your task is to answer the question based
  on the provided context.Your answer should be the shortest span that contains the
  answer within the context.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
task: afriqa_bem_prompt_4
lm_eval/tasks/afrobench/afriqa/prompt_4/afriqa_fon.yaml
0 → 100644
# Generated by utils.py
dataset_name: fon
doc_to_text: 'You are an AI assistant and your task is to answer the question based
  on the provided context.Your answer should be the shortest span that contains the
  answer within the context.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
task: afriqa_fon_prompt_4