Add The Arabic version of the PICA benchmark (#1917)

Commit 923852b0 authored Jun 07, 2024 by khalil; committed by GitHub Jun 07, 2024
Showing with 64 additions and 0 deletions

lm_eval/tasks/piqa_ar/README.md +43 -0

lm_eval/tasks/piqa_ar/piqa_ar.yaml +21 -0

--- a/lm_eval/tasks/piqa_ar/README.md
+++ b/lm_eval/tasks/piqa_ar/README.md
+# Arabic PIQA
+
+### Paper
+
+Original Title: `PIQA: Reasoning about Physical Commonsense in Natural Language`
+
+Original paper: [PIQA](https://arxiv.org/abs/1911.11641)
+
+Physical Interaction: Question Answering (PIQA) is a physical commonsense
+reasoning task with a corresponding benchmark dataset. PIQA was designed to
+investigate the physical knowledge of existing models: to what extent are
+current approaches actually learning about the world?
+
+[Homepage](https://yonatanbisk.com/piqa)
+
+AlGhafa has translated this dataset into Arabic: [AlGhafa](https://aclanthology.org/2023.arabicnlp-1.21.pdf)
+
+The Arabic version of the dataset is available here: [Arabic PIQA](https://gitlab.com/tiiuae/alghafa/-/tree/main/arabic-eval/pica_ar)
+
+### Citation
+
+### Groups and Tasks
+
+#### Groups
+
+* Not part of a group yet.
+
+#### Tasks
+
+* `piqa_ar`
+
+### Checklist
+
+For adding novel benchmarks/datasets to the library:
+* [x] Is the task an existing benchmark in the literature?
+  * [x] Have you referenced the original paper that introduced the task?
+  * [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+If other tasks on this dataset are already supported:
+* [x] Is the "Main" variant of this task clearly denoted?
+* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
--- a/lm_eval/tasks/piqa_ar/piqa_ar.yaml
+++ b/lm_eval/tasks/piqa_ar/piqa_ar.yaml
+task: piqa_ar
+dataset_path: Hennara/pica_ar
+dataset_name: null
+output_type: multiple_choice
+training_split: null
+validation_split: null
+test_split: test
+doc_to_text: "السؤال: {{goal}}\nالجواب:"
+doc_to_choice: "{{[sol1, sol2]}}"
+doc_to_target: label
+should_decontaminate: true
+doc_to_decontamination_query: goal
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
+  - metric: acc_norm
+    aggregation: mean
+    higher_is_better: true
+metadata:
+  version: 1.0
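
For reviewers, a rough sketch of what the config above asks the harness to do for one document. The harness renders `doc_to_text` and `doc_to_choice` with Jinja2; this sketch uses plain Python as a stand-in. The example record and the log-likelihood values are hypothetical (and in English for readability), and normalizing by the character length of each choice is an assumption about how `acc_norm` is computed, not something this diff states.

```python
# Plain-Python stand-in for the templates in piqa_ar.yaml (assumption: the
# real harness renders them with Jinja2 and scores choices with a model).

def doc_to_text(doc):
    # Mirrors doc_to_text: the Arabic "Question: ... / Answer:" prompt.
    return f"السؤال: {doc['goal']}\nالجواب:"

def doc_to_choice(doc):
    # Mirrors doc_to_choice: "{{[sol1, sol2]}}".
    return [doc["sol1"], doc["sol2"]]

def score(doc, loglikelihoods):
    """Return (acc, acc_norm) for one document.

    loglikelihoods holds one value per choice, as a model would produce
    for each continuation of the prompt.
    """
    choices = doc_to_choice(doc)
    gold = doc["label"]  # doc_to_target: label
    indices = range(len(choices))
    # acc: pick the choice with the highest raw log-likelihood.
    pred = max(indices, key=lambda i: loglikelihoods[i])
    # acc_norm (assumed): divide by the choice length before comparing,
    # so longer answers are not penalized just for having more tokens.
    pred_norm = max(indices, key=lambda i: loglikelihoods[i] / len(choices[i]))
    return float(pred == gold), float(pred_norm == gold)

# Hypothetical record with the fields the config reads.
example = {
    "goal": "To open a jar",
    "sol1": "twist the lid",
    "sol2": "push the lid down with a hammer until it breaks",
    "label": 0,
}

if __name__ == "__main__":
    print(doc_to_text(example))
    print(score(example, [-10.0, -12.0]))  # (1.0, 0.0)
```

With these made-up scores the raw comparison picks the correct short answer while the length-normalized one picks the longer answer, which is why the config reports both `acc` and `acc_norm`.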