Unverified Commit 22bd2bcb authored by Michele Resta, committed by GitHub

Optimization for evalita-llm rouge computation (#2878)



* feat: initial commit with templates for evalita evaluation

* fix: change rule for generate_until

* feat: modified yaml to use reduced version of NER test datasets

* feat: added templates to use reduced dataset for summarization (fanpage and ilpost)

* Add Six Prompts for Each Multiple-Choice Task

* fix: fastest eval for summarization

* chore: linted with ruff

* chore: linted with ruff

---------
Co-authored-by: rzanoli <zanoli@fbk.eu>
parent 19ba1b16
@@ -5,7 +5,7 @@ task_alias: prompt-1
 #doc_to_text: >
 #  "Crea un sommario del seguente testo. Testo: {{source}}\nSommario: "
 doc_to_text: "Riassumi il seguente articolo di giornale: '{{source}}'\nRiassunto:"
-process_results: !function utils.process_results_sum
+process_results: !function sum_utils.process_results_sum
 metric_list:
   - metric: rouge1
     higher_is_better: true
...
@@ -5,7 +5,7 @@ task_alias: prompt-2
 #doc_to_text: >
 #  "Crea un sommario del seguente testo. Testo: {{source}}\nSommario: "
 doc_to_text: "Devi risolvere un compito di sintesi automatica del testo. Riassumi il seguente articolo di giornale: '{{source}}'\nRiassunto:"
-process_results: !function utils.process_results_sum
+process_results: !function sum_utils.process_results_sum
 metric_list:
   - metric: rouge1
     higher_is_better: true
...
@@ -3,7 +3,7 @@ include: _sum_template_fp_yaml
 task: evalita-sp_sum_task_fp_p1
 task_alias: prompt-1
 doc_to_text: "Riassumi il seguente articolo di giornale: '{{source}}'\nRiassunto:"
-process_results: !function utils.process_results_sum
+process_results: !function sum_utils.process_results_sum
 metric_list:
   - metric: rouge1
     higher_is_better: true
...
@@ -3,7 +3,7 @@ include: _sum_template_fp_yaml
 task: evalita-sp_sum_task_fp_p2
 task_alias: prompt-2
 doc_to_text: "Devi risolvere un compito di sintesi automatica del testo. Riassumi il seguente articolo di giornale: '{{source}}'\nRiassunto:"
-process_results: !function utils.process_results_sum
+process_results: !function sum_utils.process_results_sum
 metric_list:
   - metric: rouge1
     higher_is_better: true
...
# New file: sum_utils.py (module name inferred from the `!function sum_utils.*`
# references in the task YAMLs above).
from evaluate import load

# Load the ROUGE metric once at import time and keep it in memory,
# instead of reloading it for every scored document.
rouge = load("rouge", keep_in_memory=True)


def rouge1_score(references, predictions, **kwargs):
    """
    Optimized ROUGE-1 computation using a single loaded metric instance.
    """
    return rouge.compute(predictions=predictions, references=references, **kwargs)[
        "rouge1"
    ]


def process_results_sum(doc, results):
    """
    Process the results of the summarization task efficiently.
    """
    ref = doc.get("summary", doc.get("target"))  # Get the reference summary
    return {"rouge1": rouge.compute(predictions=results, references=[ref])["rouge1"]}
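A minimal usage sketch of the new entry point (the `doc` and `results` values below are illustrative assumptions, not taken from the Evalita datasets): `process_results_sum` receives one document dict and the model's generations, and now scores them without reloading the metric.

    # Illustrative inputs (assumed shapes): a doc carrying a "target"
    # reference and a single generated summary from the model.
    doc = {"source": "Un articolo di giornale ...", "target": "Il riassunto di riferimento."}
    results = ["Il riassunto generato dal modello."]

    print(process_results_sum(doc, results))  # e.g. {"rouge1": 0.57...}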
@@ -523,33 +523,6 @@ def split_text_with_regex(text, pattern):
     return result
-# ---------------------- SUMMARIZATION ----------------------
-def rouge1_score(references, predictions, **kwargs):
-    """
-    suboptimal way of compute rouge because of the following issue:
-    https://github.com/EleutherAI/lm-evaluation-harness/issues/1302
-    """
-    rouge = load("rouge")
-    return rouge.compute(predictions=predictions, references=references, **kwargs)[
-        "rouge1"
-    ]
-
-
-def process_results_sum(doc, results):
-    """
-    Process the results of the Evalita summarization task
-    """
-    ref = doc["summary"] if "summary" in doc.keys() else doc["target"]
-    rouge_scorer = load("rouge", keep_in_memory=True)
-    r1score = rouge_scorer.compute(predictions=results, references=[ref])["rouge1"]
-    return {
-        "rouge1": r1score,
-    }
 def faq_doc_to_target(x):
     if x["correct_answer"] == "A":
         return 0
...
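The removed helpers called evaluate.load("rouge") inside each per-document invocation, so the scorer was rebuilt for every example; the new module builds it once at import. A rough self-contained comparison (a sketch assuming the `evaluate` and `rouge_score` packages are installed; timings are machine-dependent and illustrative only):

    import timeit

    from evaluate import load

    def score_reload(pred, ref):
        # Old pattern: re-load the metric on every call.
        return load("rouge").compute(predictions=[pred], references=[ref])["rouge1"]

    rouge = load("rouge", keep_in_memory=True)

    def score_cached(pred, ref):
        # New pattern: reuse the single module-level instance.
        return rouge.compute(predictions=[pred], references=[ref])["rouge1"]

    pred, ref = "il gatto dorme sul divano", "il gatto dorme"
    print("reload:", timeit.timeit(lambda: score_reload(pred, ref), number=10))
    print("cached:", timeit.timeit(lambda: score_cached(pred, ref), number=10))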