Unverified Commit de496b80 authored by priverabsc, committed by GitHub

Add eqbench tasks in Spanish and Catalan (#3168)

* Add eqbench tasks in Spanish and Catalan

* Incremented the catalan_bench and spanish_bench versions. Added a 'multilingual' folder inside 'eq_bench' and moved eqbench_ca.yaml and eqbench_es.yaml into it. Updated the tasks README with eqbench_es and eqbench_ca, making each description explicit about both the Hugging Face link and the translation method.

* Fixed tasks table.

* Remove test_task.sh and results folder

* Add utils.py to multilingual folder
parent a4752ccd
lm_eval/tasks/README.md
@@ -7,6 +7,8 @@ provided to the individual README.md files for each subfolder.
 | Task Family | Description | Language(s) |
 |-------------|-------------|-------------|
+| [eq-bench_es](eq_bench/README.md) | Spanish version of EQ-Bench (EN). Task for evaluating emotional reasoning through dialogue-based prompts. [Hugging Face](https://huggingface.co/datasets/BSC-LT/EQ-bench_es) | Spanish **Human Translated** |
+| [eq-bench_ca](eq_bench/README.md) | Catalan version of EQ-Bench (EN). Task for evaluating emotional reasoning through dialogue-based prompts. [Hugging Face](https://huggingface.co/datasets/BSC-LT/EQ-bench_ca) | Catalan **Human Translated** |
 | [aclue](aclue/README.md) | Tasks focusing on ancient Chinese language understanding and cultural aspects. | Ancient Chinese |
 | [acp_bench](acpbench/README.md) | Tasks evaluating the reasoning ability about Action, Change, and Planning | English |
 | [acp_bench_hard](acpbench/README.md) | Tasks evaluating the reasoning ability about Action, Change, and Planning | English |
......
catalan_bench.yaml
@@ -6,6 +6,7 @@ task:
   - copa_ca
   - openbookqa_ca
   - parafraseja
+  - eqbench_ca
   - paws_ca
   - piqa_ca
   - siqa_ca
......
eq_bench/multilingual/eqbench_ca.yaml
task: eqbench_ca
dataset_path: BSC-LT/EQ-bench_ca
output_type: generate_until
validation_split: test
doc_to_text: prompt
doc_to_target: reference_answer_fullscale
process_results: !function utils.calculate_score_fullscale
generation_kwargs:
  do_sample: false
  temperature: 0.0
  max_gen_toks: 80
metric_list:
  - metric: eqbench
    aggregation: mean
    higher_is_better: true
  - metric: percent_parseable
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
eq_bench/multilingual/eqbench_es.yaml
task: eqbench_es
dataset_path: BSC-LT/EQ-bench_es
output_type: generate_until
validation_split: test
doc_to_text: prompt
doc_to_target: reference_answer_fullscale
process_results: !function utils.calculate_score_fullscale
generation_kwargs:
  do_sample: false
  temperature: 0.0
  max_gen_toks: 80
metric_list:
  - metric: eqbench
    aggregation: mean
    higher_is_better: true
  - metric: percent_parseable
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
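
Once these files are registered, the new tasks can be exercised end to end. A minimal sketch, assuming a recent lm-evaluation-harness install where lm_eval.simple_evaluate is available; the model name below is only a placeholder:

import lm_eval

# Placeholder model for illustration; substitute any Hugging Face causal LM.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["eqbench_ca", "eqbench_es"],
)

# Per-task metrics: the eqbench score (0-100) and percent_parseable.
print(results["results"])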
eq_bench/multilingual/utils.py
import math
import re


def calculate_score_fullscale(docs, results):
    reference = eval(docs["reference_answer_fullscale"])
    user = dict(re.findall(r"(\w+):\s+(\d+)", results[0]))
    # First check that the emotions specified in the answer match those in the reference
    if len(user.items()) != 4:
        # print('! Error: 4 emotions were not returned')
        # print(user)
        return {"eqbench": 0, "percent_parseable": 0}
    emotions_dict = {}
    for emotion, user_emotion_score in user.items():
        for i in range(1, 5):
            if emotion == reference[f"emotion{i}"]:
                emotions_dict[emotion] = True
    if len(emotions_dict) != 4:
        print("! Error: emotions did not match reference")
        print(user)
        return {"eqbench": 0, "percent_parseable": 0}

    difference_tally = 0  # Tally of difference from reference answers for this question

    # Iterate over each emotion in the user's answers.
    for emotion, user_emotion_score in user.items():
        # If this emotion is in the reference, calculate the difference between
        # the user's score and the reference score.
        for i in range(1, 5):
            if emotion == reference[f"emotion{i}"]:
                d = abs(
                    float(user_emotion_score) - float(reference[f"emotion{i}_score"])
                )
                # this will be a value between 0 and 10
                if d == 0:
                    scaled_difference = 0
                elif d <= 5:
                    # S-shaped scaling function:
                    # scaled = 6.5 / (1 + e^(-1.2 * (d - 4)))
                    # (see https://www.desmos.com/calculator)
                    scaled_difference = 6.5 * (1 / (1 + math.e ** (-1.2 * (d - 4))))
                else:
                    scaled_difference = d
                difference_tally += scaled_difference

    # Invert the difference tally so that the closer the answer is to the
    # reference, the higher the score. The adjustment constant is chosen such
    # that answering randomly produces a score of zero.
    adjust_const = 0.7477
    final_score = 10 - (difference_tally * adjust_const)
    final_score_percent = final_score * 10
    return {"eqbench": final_score_percent, "percent_parseable": 100}
spanish_bench.yaml
@@ -11,8 +11,9 @@ task:
   - xlsum_es
   - paws_es_spanish_bench
   - mgsm_direct_es_spanish_bench
+  - eqbench_es
   - flores_es
   - phrases_es
   - cocoteros_es
 metadata:
-  version: 1.0
+  version: 1.1