Commit 88486e57 authored by lintangsutawika

Merge branch 'group-agg-rework' of https://github.com/EleutherAI/lm-evaluation-harness into multiprompt
parents 5971f2ca ba73d131
 include: "_default_template_yaml"
-group:
+tag:
   - french_bench
   - french_bench_gen
 description: "D'après l'information dans le contexte donné, quelle question a été posée pour obtenir la réponse donnée ?"
...
 include: "_default_template_yaml"
-group:
+tag:
   - french_bench
   - french_bench_gen
 description: "D'après l'information dans le contexte donné, donne la réponse à la question en citant quelques mots du contexte. Si il est impossible de répondre avec les informations du contexte, répond 'Impossible'."
...
 include: "_default_template_yaml"
-group:
+tag:
   - french_bench
   - french_bench_mc
 description: "Répond au mieux en complétant la question avec une des réponses proposées."
...
-group:
+tag:
   - french_bench
   - french_bench_mc
 task: french_bench_hellaswag
...
 include: "_default_template_yaml"
-group:
+tag:
   - french_bench
   - french_bench_gen
 description: "D'après l'information dans le contexte donné, donne la réponse à la question en citant quelques extraits du contexte."
...
-group:
+tag:
   - french_bench_perplexity
 task: french_bench_opus_perplexity
 dataset_path: manu/opus100-en-fr
...
 include: "_default_template_yaml"
-group:
+tag:
   - french_bench
   - french_bench_gen
 description: "Résume l'article en une phrase."
...
 include: "_default_template_yaml"
-group:
+tag:
   - french_bench
   - french_bench_extra
 description: "Trouve le titre de l'article."
...
 include: "_default_template_yaml"
-group:
+tag:
   - french_bench
   - french_bench_extra
 # description: "Répond au mieux en complétant la question avec une des réponses proposées."
...
 include: "_default_template_yaml"
-group:
+tag:
   - french_bench
   - french_bench_extra
 description: "A propos du thème spécifié, l'avis client est il positif, négatif, ou neutre ?"
...
 include: "_default_template_yaml"
-group:
+tag:
   - french_bench
   - french_bench_gen
 task: french_bench_trivia
...
 include: "_default_template_yaml"
-group:
+tag:
   - french_bench
   - french_bench_mc
 # description: "Répond au mieux en complétant la question avec une des réponses proposées."
...
-group:
+tag:
   - french_bench_perplexity
 task: french_bench_wikitext_fr
 dataset_path: asi/wikitext_fr
...
 include: "_default_template_yaml"
-group:
+tag:
   - french_bench
   - french_bench_extra
 description: "La prémisse et l'hypothèse sont elles en accord, neutres en elles, ou en contradiction ?"
...
# Glianorex
The goal of this benchmark is to isolate a model's test-answering ability from its content knowledge.
### Paper
Title: Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data
Abstract: https://arxiv.org/abs/2406.02394
To test whether MCQs can assess LLM performance without prior data exposure, we created a fictional medical benchmark and knowledge base about a non-existent gland, the Glianorex. Using GPT-4, we generated a comprehensive textbook on the Glianorex in both English and French, and created multiple-choice questions in both languages.
### Tasks
All tasks are multiple-choice questions with 4 options, of which only one is correct; a minimal evaluation sketch follows the task list.
- `glianorex`: Evaluates all tasks listed below.
- `glianorex_en`: Evaluates the accuracy on 264 questions in English.
- `glianorex_fr`: Evaluates the accuracy on 264 questions in French.
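
Any of the task or tag names above can be handed to the harness's evaluator. Below is a minimal sketch, assuming the standard `lm_eval.simple_evaluate` Python entry point; the checkpoint name is only a placeholder, and argument and result-key names may vary slightly between harness versions.

```python
import lm_eval  # lm-evaluation-harness package (pip install lm-eval)

# Minimal sketch: score a small placeholder model on the English questions.
# "EleutherAI/pythia-160m" is just an example checkpoint, not a recommendation.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["glianorex_en"],
    batch_size=8,
)

# Per-task metrics such as acc and acc_norm are reported under results["results"].
print(results["results"]["glianorex_en"])
```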
task: glianorex
dataset_path: maximegmd/glianorex
output_type: multiple_choice
test_split: train
doc_to_text: !function preprocess_glianorex.doc_to_text
doc_to_target: !function preprocess_glianorex.doc_to_target
doc_to_choice: [ 'A', 'B', 'C', 'D' ]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true

task: glianorex_en
dataset_path: maximegmd/glianorex
output_type: multiple_choice
test_split: train
doc_to_text: !function preprocess_glianorex.doc_to_text
doc_to_target: !function preprocess_glianorex.doc_to_target
process_docs: !function preprocess_glianorex.filter_english
doc_to_choice: [ 'A', 'B', 'C', 'D' ]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true

task: glianorex_fr
dataset_path: maximegmd/glianorex
output_type: multiple_choice
test_split: train
doc_to_text: !function preprocess_glianorex.doc_to_text
doc_to_target: !function preprocess_glianorex.doc_to_target
process_docs: !function preprocess_glianorex.filter_french
doc_to_choice: [ 'A', 'B', 'C', 'D' ]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
import datasets

# Render a question and its lettered options into the prompt fed to the model.
def doc_to_text(doc) -> str:
    option_choices = doc["options"]
    answers = "".join(f"{k}. {v}\n" for k, v in option_choices.items())
    return f"Question: {doc['question']}\n{answers}Answer:"

# The gold answer comes straight from the dataset's answer_idx field.
def doc_to_target(doc) -> int:
    return doc["answer_idx"]

# Keep only the examples whose language field starts with the given prefix.
def filter_dataset(dataset: datasets.Dataset, lang: str) -> datasets.Dataset:
    return dataset.filter(lambda example: example["language"].startswith(lang))

def filter_french(dataset: datasets.Dataset) -> datasets.Dataset:
    return filter_dataset(dataset, "fr")

def filter_english(dataset: datasets.Dataset) -> datasets.Dataset:
    return filter_dataset(dataset, "en")
@@ -41,10 +41,14 @@ Homepage: https://gluebenchmark.com/
 }
 ```
-### Groups and Tasks
+### Groups, Tags, and Tasks
 #### Groups
+None.
+#### Tags
 * `glue`: Run all Glue subtasks.
 #### Tasks
...