Commit 28bb45fb authored by JorgeDeCorte, committed by GitHub

Add multilingual HellaSwag task (#1228)



* add hellaswag_nl

* add other languages and update readme to hellaswag

* refactor as new task

* update readme

* add endline to yaml files and readme.md

* add group, change folder location and update yaml file

* rename default hellaswag yaml file

* fix whitespace error in some labels

* downgrade log level of whitespace checking

---------
Co-authored-by: JorgeDeCorte <jorge.decorte@ravago.be>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
parent e7c03d0c
include: _hellaswag_yaml
task: hellaswag_mr
dataset_path: alexandrainst/m_hellaswag
dataset_name: mr
training_split: null
validation_split: val

include: _hellaswag_yaml
task: hellaswag_ne
dataset_path: alexandrainst/m_hellaswag
dataset_name: ne
training_split: null
validation_split: val

include: _hellaswag_yaml
task: hellaswag_nl
dataset_path: alexandrainst/m_hellaswag
dataset_name: nl
training_split: null
validation_split: val

include: _hellaswag_yaml
task: hellaswag_pt
dataset_path: alexandrainst/m_hellaswag
dataset_name: pt
training_split: null
validation_split: val

include: _hellaswag_yaml
task: hellaswag_ro
dataset_path: alexandrainst/m_hellaswag
dataset_name: ro
training_split: null
validation_split: val

include: _hellaswag_yaml
task: hellaswag_ru
dataset_path: alexandrainst/m_hellaswag
dataset_name: ru
training_split: null
validation_split: val

include: _hellaswag_yaml
task: hellaswag_sk
dataset_path: alexandrainst/m_hellaswag
dataset_name: sk
training_split: null
validation_split: val

include: _hellaswag_yaml
task: hellaswag_sr
dataset_path: alexandrainst/m_hellaswag
dataset_name: sr
training_split: null
validation_split: val

include: _hellaswag_yaml
task: hellaswag_sv
dataset_path: alexandrainst/m_hellaswag
dataset_name: sv
training_split: null
validation_split: val

include: _hellaswag_yaml
task: hellaswag_ta
dataset_path: alexandrainst/m_hellaswag
dataset_name: ta
training_split: null
validation_split: val

include: _hellaswag_yaml
task: hellaswag_te
dataset_path: alexandrainst/m_hellaswag
dataset_name: te
training_split: null
validation_split: val

include: _hellaswag_yaml
task: hellaswag_uk
dataset_path: alexandrainst/m_hellaswag
dataset_name: uk
training_split: null
validation_split: val

include: _hellaswag_yaml
task: hellaswag_vi
dataset_path: alexandrainst/m_hellaswag
dataset_name: vi
training_split: null
validation_split: val

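
The per-language configs above differ only in the language code used for `task` and `dataset_name`. The following is a minimal sketch of how such stanzas could be rendered from one template. It is not part of this commit: `TEMPLATE`, `render_config`, `LANGS`, and the `hellaswag_{lang}.yaml` file names are hypothetical names chosen for illustration, and the language list covers only the codes visible in this portion of the diff.

```python
# Hypothetical template mirroring the six keys shown in each config above.
TEMPLATE = """include: _hellaswag_yaml
task: hellaswag_{lang}
dataset_path: alexandrainst/m_hellaswag
dataset_name: {lang}
training_split: null
validation_split: val
"""

# Language codes visible in this portion of the diff only.
LANGS = ["mr", "ne", "nl", "pt", "ro", "ru", "sk", "sr", "sv", "ta", "te", "uk", "vi"]


def render_config(lang: str) -> str:
    """Return the YAML text for one language (illustrative helper)."""
    return TEMPLATE.format(lang=lang)


# Map an assumed file name to its rendered content; nothing is written to disk.
configs = {f"hellaswag_{lang}.yaml": render_config(lang) for lang in LANGS}
```

Keeping the shared settings in the included `_hellaswag_yaml` file means each stanza only has to override the task name, dataset name, and splits.
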
import re

import datasets


def preprocess(text):
    text = text.strip()
    # NOTE: Brackets are artifacts of the WikiHow dataset portion of HellaSwag.
    text = text.replace(" [title]", ". ")
    # Drop any remaining bracketed tags.
    text = re.sub("\\[.*?\\]", "", text)
    # Collapse the double spaces left behind by the replacements above.
    text = text.replace("  ", " ")
    return text


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process_doc(doc):
        # Join the two context halves, capitalizing the second one.
        ctx = doc["ctx_a"] + " " + doc["ctx_b"].capitalize()
        out_doc = {
            "query": preprocess(doc["activity_label"] + ": " + ctx),
            "choices": [preprocess(ending) for ending in doc["endings"]],
            "gold": int(doc["label"]),
        }
        return out_doc

    return dataset.map(_process_doc)
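
To make the effect of `preprocess` and `_process_doc` concrete, here is a small sketch that applies the same steps to a single record without loading the dataset. The record is invented for illustration and is not taken from `alexandrainst/m_hellaswag`. `preprocess` is repeated so the snippet runs on its own, and the double-space replacement is written out explicitly on the assumption that this is what the cleanup step intends.

```python
import re


def preprocess(text: str) -> str:
    text = text.strip()
    text = text.replace(" [title]", ". ")
    text = re.sub(r"\[.*?\]", "", text)
    text = text.replace("  ", " ")  # assumed: collapse double spaces
    return text


# Hypothetical record, invented for illustration only.
doc = {
    "activity_label": "Baking",
    "ctx_a": "[header] How to bake bread [title] Mix the dough.",
    "ctx_b": "then",
    "endings": ["[step] Knead it well.", "Let it rest."],
    "label": "1",
}

# Same steps as _process_doc, applied to the single record.
ctx = doc["ctx_a"] + " " + doc["ctx_b"].capitalize()
query = preprocess(doc["activity_label"] + ": " + ctx)
choices = [preprocess(ending) for ending in doc["endings"]]
gold = int(doc["label"])

print(repr(query))    # 'Baking: How to bake bread. Mix the dough. Then'
print(repr(choices))  # [' Knead it well.', 'Let it rest.']
print(gold)           # 1
```

Because `strip()` runs before the bracketed tags are removed, an ending that starts with a tag keeps a leading space, as the first choice shows.
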