Unverified Commit 26f607f5 authored by mtkachenko, committed by GitHub

Add Japanese Leaderboard (#2439)

* add jaqket_v2 and jcommonsenseqa

* remove comments

* remove num_beams as it is incompatible with vllm

* add jnli + refactor

* rename jnla -> jnli

* add jsquad + replace colon chars with the Japanese unicode

* ignore whitespaces in generation tasks

* add marc_ja

* add xwinograd + simplify other yamls

* add mgsm and xlsum

* refactor xlsum

* add ja_leaderboard tag

* edit README.md

* update README.md

* add credit + minor changes

* run ruff format

* address review comments + add group

* remove aggregate_metric_list

* remove tags

* update tasks/README.md
parent fb2e4b59
@@ -55,6 +55,7 @@
| [hendrycks_math](hendrycks_math/README.md) | Mathematical problem-solving tasks to test numerical reasoning and problem-solving. | English |
| [ifeval](ifeval/README.md) | Instruction-following evaluation tasks with automatically verifiable constraints. | English |
| [inverse_scaling](inverse_scaling/README.md) | Multiple-choice tasks from the Inverse Scaling Prize, designed to find settings where larger language models perform worse. | English |
| [japanese_leaderboard](japanese_leaderboard/README.md) | Japanese language understanding tasks to benchmark model performance on various linguistic aspects. | Japanese |
| [kmmlu](kmmlu/README.md) | Knowledge-based multi-subject multiple choice questions for academic evaluation. | Korean |
| [kobest](kobest/README.md) | A collection of tasks designed to evaluate understanding in Korean language. | Korean |
| [kormedmcqa](kormedmcqa/README.md) | Medical question answering tasks in Korean to test specialized domain knowledge. | Korean |
# Japanese Leaderboard
The Japanese LLM Leaderboard evaluates language models on a wide range of NLP tasks that reflect the characteristics of the Japanese language.
### Groups, Tags, and Tasks
#### Groups
* `japanese_leaderboard`: runs all tasks defined in this directory
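
A minimal sketch of running the whole group through the harness's Python API (the checkpoint name below is only a placeholder, not a recommendation):

```python
# Sketch: evaluate a model on the full Japanese Leaderboard group.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder checkpoint
    tasks=["japanese_leaderboard"],  # or a single task, e.g. "ja_leaderboard_jnli"
)
print(results["results"])
```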
#### Tasks
##### Generation Evaluation
* `ja_leaderboard_jaqket_v2`: The JAQKET dataset is designed for Japanese question answering research, featuring quiz-like questions with answers derived from Wikipedia article titles. [Source](https://github.com/kumapo/JAQKET-dataset)
* `ja_leaderboard_mgsm`: Multilingual Grade School Math (MGSM) is a benchmark of grade-school math problems, proposed in the paper "Language Models are Multilingual Chain-of-Thought Reasoners". [Source](https://huggingface.co/datasets/juletxara/mgsm)
* `ja_leaderboard_xlsum`: This is the filtered Japanese subset of XL-Sum. [Source](https://github.com/csebuetnlp/xl-sum)
* `ja_leaderboard_jsquad`: JSQuAD is a Japanese version of SQuAD, a reading comprehension dataset. Each instance in the dataset consists of a question regarding a given context (Wikipedia article) and its answer. JSQuAD is based on SQuAD 1.1 (there are no unanswerable questions). [Source](https://github.com/yahoojapan/JGLUE)
##### Multi-Choice/Classification Evaluation
* `ja_leaderboard_jcommonsenseqa`: JCommonsenseQA is a Japanese version of CommonsenseQA, which is a multiple-choice question answering dataset that requires commonsense reasoning ability. [Source](https://github.com/yahoojapan/JGLUE)
* `ja_leaderboard_jnli`: JNLI is a Japanese Natural Language Inference (NLI) dataset. The inference relations are entailment (含意), contradiction (矛盾), and neutral (中立). [Source](https://github.com/yahoojapan/JGLUE)
* `ja_leaderboard_marc_ja`: MARC-ja is a text classification dataset based on the Japanese portion of the Multilingual Amazon Reviews Corpus (MARC). [Source](https://github.com/yahoojapan/JGLUE)
* `ja_leaderboard_xwinograd`: This is the Japanese portion of XWinograd. [Source](https://huggingface.co/datasets/polm-stability/xwinograd-ja)
### Citation
```bibtex
@inproceedings{ja_leaderboard_jaqket_v2,
title = {JAQKET: クイズを題材にした日本語 QA データセットの構築},
author = {鈴木正敏 and 鈴木潤 and 松田耕史 and 西田京介 and 井之上直也},
year = 2020,
booktitle = {言語処理学会第26回年次大会},
url = {https://www.anlp.jp/proceedings/annual_meeting/2020/pdf_dir/P2-24.pdf}
}
@article{ja_leaderboard_mgsm_1,
title = {Training Verifiers to Solve Math Word Problems},
author = {
Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and
Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro
and Hesse, Christopher and Schulman, John
},
year = 2021,
journal = {arXiv preprint arXiv:2110.14168}
}
@misc{ja_leaderboard_mgsm_2,
title = {Language Models are Multilingual Chain-of-Thought Reasoners},
author = {
Freda Shi and Mirac Suzgun and Markus Freitag and Xuezhi Wang and Suraj Srivats and Soroush
Vosoughi and Hyung Won Chung and Yi Tay and Sebastian Ruder and Denny Zhou and Dipanjan Das and
Jason Wei
},
year = 2022,
eprint = {2210.03057},
archiveprefix = {arXiv},
primaryclass = {cs.CL}
}
@inproceedings{ja_leaderboard_xlsum,
title = {{XL}-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages},
author = {
Hasan, Tahmid and Bhattacharjee, Abhik and Islam, Md. Saiful and Mubasshir, Kazi and Li,
Yuan-Fang and Kang, Yong-Bin and Rahman, M. Sohel and Shahriyar, Rifat
},
year = 2021,
month = aug,
booktitle = {Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021},
publisher = {Association for Computational Linguistics},
address = {Online},
pages = {4693--4703},
url = {https://aclanthology.org/2021.findings-acl.413}
}
@article{jglue_2023,
title = {JGLUE: 日本語言語理解ベンチマーク},
author = {栗原 健太郎 and 河原 大輔 and 柴田 知秀},
year = 2023,
journal = {自然言語処理},
volume = 30,
number = 1,
pages = {63--87},
doi = {10.5715/jnlp.30.63},
url = {https://www.jstage.jst.go.jp/article/jnlp/30/1/30_63/_article/-char/ja}
}
@inproceedings{jglue_kurihara-etal-2022-jglue,
title = {{JGLUE}: {J}apanese General Language Understanding Evaluation},
author = {Kurihara, Kentaro and Kawahara, Daisuke and Shibata, Tomohide},
year = 2022,
month = jun,
booktitle = {Proceedings of the Thirteenth Language Resources and Evaluation Conference},
publisher = {European Language Resources Association},
address = {Marseille, France},
pages = {2957--2966},
url = {https://aclanthology.org/2022.lrec-1.317}
}
@inproceedings{jglue_kurihara_nlp2022,
title = {JGLUE: 日本語言語理解ベンチマーク},
author = {栗原健太郎 and 河原大輔 and 柴田知秀},
year = 2022,
booktitle = {言語処理学会第28回年次大会},
url = {https://www.anlp.jp/proceedings/annual_meeting/2022/pdf_dir/E8-4.pdf},
note = {in Japanese}
}
@misc{xwinograd_muennighoff2022crosslingual,
title = {Crosslingual Generalization through Multitask Finetuning},
author = {
Niklas Muennighoff and Thomas Wang and Lintang Sutawika and Adam Roberts and Stella Biderman
and Teven Le Scao and M Saiful Bari and Sheng Shen and Zheng-Xin Yong and Hailey Schoelkopf and
Xiangru Tang and Dragomir Radev and Alham Fikri Aji and Khalid Almubarak and Samuel Albanie and
Zaid Alyafeai and Albert Webson and Edward Raff and Colin Raffel
},
year = 2022,
eprint = {2211.01786},
archiveprefix = {arXiv},
primaryclass = {cs.CL}
}
@misc{xwinograd_tikhonov2021heads,
title = {
It's All in the Heads: Using Attention Heads as a Baseline for Cross-Lingual Transfer in
Commonsense Reasoning
},
author = {Alexey Tikhonov and Max Ryabinin},
year = 2021,
eprint = {2106.12066},
archiveprefix = {arXiv},
primaryclass = {cs.CL}
}
```
### Credit
* Prompts: https://github.com/Stability-AI/lm-evaluation-harness/tree/jp-stable/lm_eval/tasks/ja
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group: japanese_leaderboard
task:
  - ja_leaderboard_jaqket_v2
  - ja_leaderboard_jcommonsenseqa
  - ja_leaderboard_jnli
  - ja_leaderboard_jsquad
  - ja_leaderboard_marc_ja
  - ja_leaderboard_mgsm
  - ja_leaderboard_xlsum
  - ja_leaderboard_xwinograd
metadata:
  version: 1.0
task: ja_leaderboard_jaqket_v2
dataset_path: kumapo/JAQKET
dataset_name: v2.0
training_split: train
validation_split: validation
test_split: null
fewshot_split: train
num_fewshot: 1
description: "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。\n\n"
doc_to_text: "### 指示:\n与えられた文脈から、質問に対する答えを抜き出してください。\n\n### 入力:\n文脈:{{ ctxs['text'] | join('\n') }}\n質問:{{ question }}\n\n### 応答:"
doc_to_target: "{{ answers['text'][0] }}"
target_delimiter: "\n"
output_type: generate_until
generation_kwargs:
  until:
    - "\n\n"
  do_sample: false
metric_list:
  - metric: exact_match
    regexes_to_ignore:
      - '^\s+'
      - '\s+$'
    aggregation: mean
    higher_is_better: true
filter_list:
  - name: whitespaces
    filter:
      - function: remove_whitespace
      - function: take_first
metadata:
  version: 2.0
def process_docs(dataset):
    # Gather the five JCommonsenseQA answer columns (choice0..choice4) into a
    # single "choices" list so the task config can reference it via doc_to_choice.
    def _add_choices(doc):
        doc["choices"] = [doc[f"choice{i}"] for i in range(5)]
        return doc

    return dataset.map(_add_choices)
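
# Illustrative example (hypothetical doc, for exposition only): given
#   {"question": "...", "choice0": "駅", "choice1": "公園", ..., "choice4": "学校", "label": 1}
# process_docs adds
#   doc["choices"] == ["駅", "公園", ..., "学校"]
# so the yaml's `doc_to_choice: choices` can index it with the gold `label`.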
task: ja_leaderboard_jcommonsenseqa
dataset_path: Rakuten/JGLUE
dataset_name: JCommonsenseQA
training_split: train
validation_split: validation
test_split: null
fewshot_split: train
num_fewshot: 3
description: "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。\n\n"
process_docs: !function ja_leaderboard_jcommonsenseqa.process_docs
doc_to_text: "### 指示:\n出力は以下から選択してください:\n{% for choice in choices %}- {{ choice }}\n{% endfor %}\n### 入力:\n{{ question }}\n\n### 応答:"
doc_to_target: label
doc_to_choice: choices
target_delimiter: "\n"
output_type: multiple_choice
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
task: ja_leaderboard_jnli
dataset_path: Rakuten/JGLUE
dataset_name: JNLI
training_split: train
validation_split: validation
test_split: null
fewshot_split: train
num_fewshot: 3
description: "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。\n\n"
doc_to_text: "### 指示:\n与えられた前提と仮説の関係を回答してください。\n\n出力は以下から選択してください:\n含意\n矛盾\n中立\n\n### 入力:\n前提:{{ sentence1 }}\n仮説:{{ sentence2 }}\n\n### 応答:"
doc_to_target: label
doc_to_choice: ["含意", "矛盾", "中立"]
target_delimiter: "\n"
output_type: multiple_choice
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
task: ja_leaderboard_jsquad
dataset_path: Rakuten/JGLUE
dataset_name: JSQuAD
training_split: train
validation_split: validation
test_split: null
fewshot_split: train
num_fewshot: 2
description: "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。\n\n"
# JSQuAD contexts are formatted as "title [SEP] passage"; the template keeps only the passage.
doc_to_text: "### 指示:\n与えられた文脈から、質問に対する答えを抜き出してください。\n\n### 入力:\n文脈:{% set _context = context.split('[SEP]')[-1] %}{{ _context | trim }}\n質問:{{ question }}\n\n### 応答:"
doc_to_target: "{{ answers['text'][0] }}"
target_delimiter: "\n"
output_type: generate_until
generation_kwargs:
  until:
    - "\n\n"
  do_sample: false
metric_list:
  - metric: exact_match
    regexes_to_ignore:
      - '^\s+'
      - '\s+$'
    aggregation: mean
    higher_is_better: true
filter_list:
  - name: whitespaces
    filter:
      - function: remove_whitespace
      - function: take_first
metadata:
  version: 1.0
task: ja_leaderboard_marc_ja
dataset_path: Rakuten/JGLUE
dataset_name: MARC-ja
training_split: train
validation_split: validation
test_split: null
fewshot_split: train
num_fewshot: 3
description: "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。\n\n"
doc_to_text: "### 指示:\n以下の製品レビューを、ポジティブまたはネガティブの感情クラスのいずれかに分類してください。\n\n### 入力:\n{{ sentence }}\n\n### 応答:"
doc_to_target: label
doc_to_choice: ["ポジティブ", "ネガティブ"]
target_delimiter: "\n"
output_type: multiple_choice
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
import re

_INVALID_ANSWER = "[invalid]"
_ANSWER_REGEX = re.compile(r"(\-?[0-9\.\,]+)")


def _extract_answer(completion):
    matches = _ANSWER_REGEX.findall(completion)
    if matches:
        # Take the last number in the completion; drop stray dots and commas.
        match_str = matches[-1].strip(".")
        match_str = match_str.replace(",", "")
        try:
            match_float = float(match_str)
        except ValueError:
            return _INVALID_ANSWER
        # MGSM gold answers are integers, so non-integer values count as invalid.
        if match_float.is_integer():
            return int(match_float)
    return _INVALID_ANSWER


def process_results(doc, results):
    assert (
        len(results) == 1
    ), f"results should be a list with 1 str element, but is {results}"
    completion = results[0]
    extracted_answer = _extract_answer(completion)
    answer = doc["answer_number"]
    acc = extracted_answer == answer
    return {
        "acc": acc,
    }
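
# Illustrative behavior of the extraction above (hypothetical completions):
#   _extract_answer("よって、答えは 1,234 です")  -> 1234          (last number wins; commas stripped)
#   _extract_answer("答えはおよそ 3.5 です")      -> "[invalid]"   (non-integer results are rejected)
#   _extract_answer("わかりません")               -> "[invalid]"   (no number found)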
task: ja_leaderboard_mgsm
dataset_path: juletxara/mgsm
dataset_name: ja
training_split: train
validation_split: null
test_split: test
fewshot_split: train
num_fewshot: 5
description: "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。\n\n"
doc_to_text: "### 指示:\n与えられた問題に対して、ステップごとに答えを導き出してください。\n\n### 入力:\n{{ question | replace('問題:', '') }}\n\n### 応答:"
doc_to_target: "{{ answer | replace('ステップごとの答え:', '') }}"
target_delimiter: "\n"
output_type: generate_until
process_results: !function ja_leaderboard_mgsm.process_results
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
generation_kwargs:
  until:
    - "\n\n"
  do_sample: false
metadata:
  version: 1.0
import re


def _missing_module_message(name):
    return (
        f"`{name}` is required for `japanese_leaderboard`, please install `{name}` "
        "via pip install lm_eval[japanese_leaderboard] or pip install -e .[japanese_leaderboard]"
    )


try:
    import emoji
    import neologdn
    from fugashi import Tagger
    from rouge_score import rouge_scorer, scoring
except ModuleNotFoundError as err:
    raise ModuleNotFoundError(_missing_module_message(err.name)) from err


class MecabTokenizer:
    def __init__(self) -> None:
        # "-Owakati" makes MeCab emit space-separated tokens.
        self.tagger = Tagger("-Owakati")

    def normalize_answer(self, text):
        """Normalize text: strip emoji, unify width variants, and collapse whitespace."""

        def white_space_fix(text):
            return " ".join(text.split())

        def remove_emoji(text):
            text = "".join(["" if emoji.is_emoji(c) else c for c in text])
            emoji_pattern = re.compile(
                "["
                "\U0001f600-\U0001f64f"  # emoticons
                "\U0001f300-\U0001f5ff"  # symbols & pictographs
                "\U0001f680-\U0001f6ff"  # transport & map symbols
                "\U0001f1e0-\U0001f1ff"  # flags (iOS)
                "\U00002702-\U000027b0"
                "]+",
                flags=re.UNICODE,
            )
            return emoji_pattern.sub(r"", text)

        text = remove_emoji(text)
        # see neologdn docs for details, but handles things like full/half width variation
        text = neologdn.normalize(text)
        text = white_space_fix(text)
        return text

    def tokenize(self, text):
        return self.tagger.parse(self.normalize_answer(text)).split()


def rouge2(items):
    # Pass (reference, prediction) pairs through; scoring happens in the aggregator.
    return items


def rouge2_agg(items):
    tokenizer = MecabTokenizer()
    refs, preds = zip(*items)
    rouge_type = "rouge2"
    # mecab-based rouge
    scorer = rouge_scorer.RougeScorer(
        rouge_types=[rouge_type],
        tokenizer=tokenizer,
    )
    # Accumulate confidence intervals.
    aggregator = scoring.BootstrapAggregator()
    for ref, pred in zip(refs, preds):
        aggregator.add_scores(scorer.score(ref, pred))
    result = aggregator.aggregate()
    return result[rouge_type].mid.fmeasure
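
# Illustrative usage (hypothetical strings): the harness collects
# (reference, prediction) items via `rouge2`, and `rouge2_agg` returns the
# bootstrap-aggregated, MeCab-tokenized ROUGE-2 F-measure, e.g.:
#   rouge2_agg([("東京で大雨が降った。", "東京で大雨。"), ("試合は中止になった。", "試合は中止。")])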
task: ja_leaderboard_xlsum
dataset_path: mkshing/xlsum_ja
dataset_name: null
training_split: train
validation_split: validation
test_split: test
fewshot_split: train
num_fewshot: 1
description: "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。\n\n"
doc_to_text: "### 指示:\n与えられたニュース記事を要約してください。\n\n### 入力:\nニュース記事:{{ text }}\n\n### 応答:"
doc_to_target: "{{ summary }}"
target_delimiter: "\n"
output_type: generate_until
metric_list:
  - metric: !function ja_leaderboard_xlsum.rouge2
    aggregation: !function ja_leaderboard_xlsum.rouge2_agg
    higher_is_better: true
filter_list:
  - name: whitespaces
    filter:
      - function: remove_whitespace
      - function: take_first
generation_kwargs:
  until:
    - "\n\n"
  do_sample: false
metadata:
  version: 1.0
def process_docs(dataset):
    # XWinograd stores a 1-indexed answer ("1" or "2"); convert it to a
    # 0-indexed label and expose the two candidate sentences as choices.
    def _add_choices_and_label(doc):
        doc["label"] = int(doc["answer"]) - 1
        doc["choices"] = [doc["sentence1"].strip(), doc["sentence2"].strip()]
        return doc

    return dataset.map(_add_choices_and_label)
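
# Illustrative example (hypothetical doc): {"sentence1": "...", "sentence2": "...",
# "answer": "2"} becomes label == 1 with choices == [sentence1, sentence2], which
# the multiple_choice task scores by comparing per-choice loglikelihoods.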
task: ja_leaderboard_xwinograd
dataset_path: polm-stability/xwinograd-ja
dataset_name: null
training_split: null
validation_split: null
test_split: test
num_fewshot: null
process_docs: !function ja_leaderboard_xwinograd.process_docs
doc_to_target: "label"
doc_to_choice: "choices"
doc_to_text: ""
target_delimiter: ""
output_type: multiple_choice
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
emoji==2.14.0
fugashi[unidic-lite]
neologdn==0.5.3
rouge_score>=0.1.2
@@ -77,6 +77,7 @@ vllm = ["vllm>=0.4.2"]
zeno = ["pandas", "zeno-client"]
wandb = ["wandb>=0.16.3", "pandas", "numpy"]
gptqmodel = ["gptqmodel>=1.0.9"]
japanese_leaderboard = ["emoji==2.14.0", "neologdn==0.5.3", "fugashi[unidic-lite]", "rouge_score>=0.1.2"]
all = [
"lm_eval[anthropic]",
"lm_eval[dev]",
@@ -96,6 +97,7 @@ all = [
"lm_eval[vllm]",
"lm_eval[zeno]",
"lm_eval[wandb]",
"lm_eval[japanese_leaderboard]",
]
[tool.ruff.lint]