Unverified Commit de496b80 authored by priverabsc, committed by GitHub

Add eqbench tasks in Spanish and Catalan (#3168)

* Add eqbench tasks in Spanish and Catalan

* Incremented the catalan_bench and spanish_bench versions. Added a 'multilingual' folder inside 'eq_bench' and moved eqbench_ca.yaml and eqbench_es.yaml into it. Updated the tasks README with eqbench_es and eqbench_ca, making each description explicit about both the Hugging Face link and the translation method.

* Fixed tasks table.

* Remove test_task.sh and results folder

* Add utils.py to multilingual folder
parent a4752ccd
lm_eval/tasks/README.md
@@ -7,6 +7,8 @@ provided to the individual README.md files for each subfolder.
 | Task Family | Description | Language(s) |
 |-------------|-------------|-------------|
+| [eq-bench_es](eq_bench/README.md) | Spanish version of EQ-Bench (EN). Task for evaluating emotional reasoning through dialogue-based prompts. [Hugging Face](https://huggingface.co/datasets/BSC-LT/EQ-bench_es) | Spanish **Human Translated** |
+| [eq-bench_ca](eq_bench/README.md) | Catalan version of EQ-Bench (EN). Task for evaluating emotional reasoning through dialogue-based prompts. [Hugging Face](https://huggingface.co/datasets/BSC-LT/EQ-bench_ca) | Catalan **Human Translated** |
 | [aclue](aclue/README.md) | Tasks focusing on ancient Chinese language understanding and cultural aspects. | Ancient Chinese |
 | [acp_bench](acpbench/README.md) | Tasks evaluating the reasoning ability about Action, Change, and Planning | English |
 | [acp_bench_hard](acpbench/README.md) | Tasks evaluating the reasoning ability about Action, Change, and Planning | English |
......
catalan_bench.yaml
@@ -6,6 +6,7 @@ task:
   - copa_ca
   - openbookqa_ca
   - parafraseja
+  - eqbench_ca
   - paws_ca
   - piqa_ca
   - siqa_ca
......
eq_bench/multilingual/eqbench_ca.yaml
task: eqbench_ca
dataset_path: BSC-LT/EQ-bench_ca
output_type: generate_until
validation_split: test
doc_to_text: prompt
doc_to_target: reference_answer_fullscale
process_results: !function utils.calculate_score_fullscale
generation_kwargs:
  do_sample: false
  temperature: 0.0
  max_gen_toks: 80
metric_list:
  - metric: eqbench
    aggregation: mean
    higher_is_better: true
  - metric: percent_parseable
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
eq_bench/multilingual/eqbench_es.yaml
task: eqbench_es
dataset_path: BSC-LT/EQ-bench_es
output_type: generate_until
validation_split: test
doc_to_text: prompt
doc_to_target: reference_answer_fullscale
process_results: !function utils.calculate_score_fullscale
generation_kwargs:
  do_sample: false
  temperature: 0.0
  max_gen_toks: 80
metric_list:
  - metric: eqbench
    aggregation: mean
    higher_is_better: true
  - metric: percent_parseable
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
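
Once these files are registered, the new tasks can be exercised end to end. A minimal sketch, assuming a recent lm-evaluation-harness install where lm_eval.simple_evaluate is available; the model name below is only a placeholder:

import lm_eval

# Placeholder model for illustration; substitute any Hugging Face causal LM.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["eqbench_ca", "eqbench_es"],
)

# Per-task metrics: the eqbench score (0-100) and percent_parseable.
print(results["results"])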
eq_bench/multilingual/utils.py
import math
import re


def calculate_score_fullscale(docs, results):
    reference = eval(docs["reference_answer_fullscale"])
    user = dict(re.findall(r"(\w+):\s+(\d+)", results[0]))
    # First check that the emotions specified in the answer match those in the reference
    if len(user.items()) != 4:
        # print('! Error: 4 emotions were not returned')
        # print(user)
        return {"eqbench": 0, "percent_parseable": 0}
    emotions_dict = {}
    for emotion, user_emotion_score in user.items():
        for i in range(1, 5):
            if emotion == reference[f"emotion{i}"]:
                emotions_dict[emotion] = True
    if len(emotions_dict) != 4:
        print("! Error: emotions did not match reference")
        print(user)
        return {"eqbench": 0, "percent_parseable": 0}

    difference_tally = 0  # Tally of difference from reference answers for this question

    # Iterate over each emotion in the user's answers.
    for emotion, user_emotion_score in user.items():
        # If this emotion is in the reference, calculate the difference between
        # the user's score and the reference score.
        for i in range(1, 5):
            if emotion == reference[f"emotion{i}"]:
                d = abs(
                    float(user_emotion_score) - float(reference[f"emotion{i}_score"])
                )
                # this will be a value between 0 and 10
                if d == 0:
                    scaled_difference = 0
                elif d <= 5:
                    # S-shaped scaling function:
                    # scaled = 6.5 / (1 + e^(-1.2 * (d - 4)))
                    # (see https://www.desmos.com/calculator)
                    scaled_difference = 6.5 * (1 / (1 + math.e ** (-1.2 * (d - 4))))
                else:
                    scaled_difference = d
                difference_tally += scaled_difference

    # Invert the difference tally so that the closer the answer is to the
    # reference, the higher the score. The adjustment constant is chosen such
    # that answering randomly produces a score of zero.
    adjust_const = 0.7477
    final_score = 10 - (difference_tally * adjust_const)
    final_score_percent = final_score * 10
    return {"eqbench": final_score_percent, "percent_parseable": 100}
spanish_bench.yaml
@@ -11,8 +11,9 @@ task:
   - xlsum_es
   - paws_es_spanish_bench
   - mgsm_direct_es_spanish_bench
+  - eqbench_es
   - flores_es
   - phrases_es
   - cocoteros_es
 metadata:
-  version: 1.0
+  version: 1.1