Unverified Commit 26f607f5 authored by mtkachenko, committed by GitHub

Add Japanese Leaderboard (#2439)

* add jaqket_v2 and jcommonsenseqa

* remove comments

* remove num_beams as it is incompatible with vllm

* add jnli + refactor

* rename jnla -> jnli

* add jsquad + replace colon chars with the Japanese unicode

* ignore whitespaces in generation tasks

* add marc_ja

* add xwinograd + simplify other yamls

* add mgsm and xlsum

* refactor xlsum

* add ja_leaderboard tag

* edit README.md

* update README.md

* add credit + minor changes

* run ruff format

* address review comments + add group

* remove aggregate_metric_list

* remove tags

* update tasks/README.md
parent fb2e4b59
@@ -55,6 +55,7 @@
| [hendrycks_math](hendrycks_math/README.md) | Mathematical problem-solving tasks to test numerical reasoning and problem-solving. | English |
| [ifeval](ifeval/README.md) | Instruction-following evaluation tasks with automatically verifiable constraints. | English |
| [inverse_scaling](inverse_scaling/README.md) | Multiple-choice tasks from the Inverse Scaling Prize, designed to find settings where larger language models perform worse. | English |
| [japanese_leaderboard](japanese_leaderboard/README.md) | Japanese language understanding tasks to benchmark model performance on various linguistic aspects. | Japanese |
| [kmmlu](kmmlu/README.md) | Knowledge-based multi-subject multiple choice questions for academic evaluation. | Korean |
| [kobest](kobest/README.md) | A collection of tasks designed to evaluate understanding in Korean language. | Korean |
| [kormedmcqa](kormedmcqa/README.md) | Medical question answering tasks in Korean to test specialized domain knowledge. | Korean |
# Japanese Leaderboard
The Japanese LLM Leaderboard evaluates language models on a wide range of NLP tasks that reflect the characteristics of the Japanese language.
### Groups, Tags, and Tasks
#### Groups
* `japanese_leaderboard`: runs all tasks defined in this directory
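
A minimal sketch of running the whole group through the harness's Python API (the checkpoint name below is only a placeholder, not a recommendation):

```python
# Sketch: evaluate a model on the full Japanese Leaderboard group.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder checkpoint
    tasks=["japanese_leaderboard"],  # or a single task, e.g. "ja_leaderboard_jnli"
)
print(results["results"])
```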
#### Tasks
##### Generation Evaluation
* `ja_leaderboard_jaqket_v2`: The JAQKET dataset is designed for Japanese question answering research, featuring quiz-like questions with answers derived from Wikipedia article titles. [Source](https://github.com/kumapo/JAQKET-dataset)
* `ja_leaderboard_mgsm`: Multilingual Grade School Math (MGSM) is a benchmark of grade-school math problems, proposed in the paper "Language Models are Multilingual Chain-of-Thought Reasoners". [Source](https://huggingface.co/datasets/juletxara/mgsm)
* `ja_leaderboard_xlsum`: This is the filtered Japanese subset of XL-Sum. [Source](https://github.com/csebuetnlp/xl-sum)
* `ja_leaderboard_jsquad`: JSQuAD is a Japanese version of SQuAD, a reading comprehension dataset. Each instance in the dataset consists of a question regarding a given context (Wikipedia article) and its answer. JSQuAD is based on SQuAD 1.1 (there are no unanswerable questions). [Source](https://github.com/yahoojapan/JGLUE)
##### Multi-Choice/Classification Evaluation
* `ja_leaderboard_jcommonsenseqa`: JCommonsenseQA is a Japanese version of CommonsenseQA, which is a multiple-choice question answering dataset that requires commonsense reasoning ability. [Source](https://github.com/yahoojapan/JGLUE)
* `ja_leaderboard_jnli`: JNLI is a Japanese Natural Language Inference (NLI) dataset. The inference relations are entailment (含意), contradiction (矛盾), and neutral (中立). [Source](https://github.com/yahoojapan/JGLUE)
* `ja_leaderboard_marc_ja`: MARC-ja is a text classification dataset based on the Japanese portion of the Multilingual Amazon Reviews Corpus (MARC). [Source](https://github.com/yahoojapan/JGLUE)
* `ja_leaderboard_xwinograd`: This is the Japanese portion of XWinograd. [Source](https://huggingface.co/datasets/polm-stability/xwinograd-ja)
### Citation
```bibtex
@inproceedings{ja_leaderboard_jaqket_v2,
title = {JAQKET: クイズを題材にした日本語 QA データセットの構築},
author = {鈴木正敏 and 鈴木潤 and 松田耕史 and 西田京介 and 井之上直也},
year = 2020,
booktitle = {言語処理学会第26回年次大会},
url = {https://www.anlp.jp/proceedings/annual_meeting/2020/pdf_dir/P2-24.pdf}
}
@article{ja_leaderboard_mgsm_1,
title = {Training Verifiers to Solve Math Word Problems},
author = {
Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and
Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro
and Hesse, Christopher and Schulman, John
},
year = 2021,
journal = {arXiv preprint arXiv:2110.14168}
}
@misc{ja_leaderboard_mgsm_2,
title = {Language Models are Multilingual Chain-of-Thought Reasoners},
author = {
Freda Shi and Mirac Suzgun and Markus Freitag and Xuezhi Wang and Suraj Srivats and Soroush
Vosoughi and Hyung Won Chung and Yi Tay and Sebastian Ruder and Denny Zhou and Dipanjan Das and
Jason Wei
},
year = 2022,
eprint = {2210.03057},
archiveprefix = {arXiv},
primaryclass = {cs.CL}
}
@inproceedings{ja_leaderboard_xlsum,
title = {{XL}-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages},
author = {
Hasan, Tahmid and Bhattacharjee, Abhik and Islam, Md. Saiful and Mubasshir, Kazi and Li,
Yuan-Fang and Kang, Yong-Bin and Rahman, M. Sohel and Shahriyar, Rifat
},
year = 2021,
month = aug,
booktitle = {Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021},
publisher = {Association for Computational Linguistics},
address = {Online},
pages = {4693--4703},
url = {https://aclanthology.org/2021.findings-acl.413}
}
@article{jglue_2023,
title = {JGLUE: 日本語言語理解ベンチマーク},
author = {栗原 健太郎 and 河原 大輔 and 柴田 知秀},
year = 2023,
journal = {自然言語処理},
volume = 30,
number = 1,
pages = {63--87},
doi = {10.5715/jnlp.30.63},
url = {https://www.jstage.jst.go.jp/article/jnlp/30/1/30_63/_article/-char/ja}
}
@inproceedings{jglue_kurihara-etal-2022-jglue,
title = {{JGLUE}: {J}apanese General Language Understanding Evaluation},
author = {Kurihara, Kentaro and Kawahara, Daisuke and Shibata, Tomohide},
year = 2022,
month = jun,
booktitle = {Proceedings of the Thirteenth Language Resources and Evaluation Conference},
publisher = {European Language Resources Association},
address = {Marseille, France},
pages = {2957--2966},
url = {https://aclanthology.org/2022.lrec-1.317}
}
@inproceedings{jglue_kurihara_nlp2022,
title = {JGLUE: 日本語言語理解ベンチマーク},
author = {栗原健太郎 and 河原大輔 and 柴田知秀},
year = 2022,
booktitle = {言語処理学会第28回年次大会},
url = {https://www.anlp.jp/proceedings/annual_meeting/2022/pdf_dir/E8-4.pdf},
note = {in Japanese}
}
@misc{xwinograd_muennighoff2022crosslingual,
title = {Crosslingual Generalization through Multitask Finetuning},
author = {
Niklas Muennighoff and Thomas Wang and Lintang Sutawika and Adam Roberts and Stella Biderman
and Teven Le Scao and M Saiful Bari and Sheng Shen and Zheng-Xin Yong and Hailey Schoelkopf and
Xiangru Tang and Dragomir Radev and Alham Fikri Aji and Khalid Almubarak and Samuel Albanie and
Zaid Alyafeai and Albert Webson and Edward Raff and Colin Raffel
},
year = 2022,
eprint = {2211.01786},
archiveprefix = {arXiv},
primaryclass = {cs.CL}
}
@misc{xwinograd_tikhonov2021heads,
title = {
It's All in the Heads: Using Attention Heads as a Baseline for Cross-Lingual Transfer in
Commonsense Reasoning
},
author = {Alexey Tikhonov and Max Ryabinin},
year = 2021,
eprint = {2106.12066},
archiveprefix = {arXiv},
primaryclass = {cs.CL}
}
```
### Credit
* Prompts: https://github.com/Stability-AI/lm-evaluation-harness/tree/jp-stable/lm_eval/tasks/ja
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group: japanese_leaderboard
task:
  - ja_leaderboard_jaqket_v2
  - ja_leaderboard_jcommonsenseqa
  - ja_leaderboard_jnli
  - ja_leaderboard_jsquad
  - ja_leaderboard_marc_ja
  - ja_leaderboard_mgsm
  - ja_leaderboard_xlsum
  - ja_leaderboard_xwinograd
metadata:
  version: 1.0
task: ja_leaderboard_jaqket_v2
dataset_path: kumapo/JAQKET
dataset_name: v2.0
training_split: train
validation_split: validation
test_split: null
fewshot_split: train
num_fewshot: 1
description: "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。\n\n"
doc_to_text: "### 指示:\n与えられた文脈から、質問に対する答えを抜き出してください。\n\n### 入力:\n文脈:{{ ctxs['text'] | join('\n') }}\n質問:{{ question }}\n\n### 応答:"
doc_to_target: "{{ answers['text'][0] }}"
target_delimiter: "\n"
output_type: generate_until
generation_kwargs:
  until:
    - "\n\n"
  do_sample: false
metric_list:
  - metric: exact_match
    regexes_to_ignore:
      - '^\s+'
      - '\s+$'
    aggregation: mean
    higher_is_better: true
filter_list:
  - name: whitespaces
    filter:
      - function: remove_whitespace
      - function: take_first
metadata:
  version: 2.0
def process_docs(dataset):
    # Gather the five JCommonsenseQA answer columns (choice0..choice4) into a
    # single "choices" list so the task config can reference it via doc_to_choice.
    def _add_choices(doc):
        doc["choices"] = [doc[f"choice{i}"] for i in range(5)]
        return doc

    return dataset.map(_add_choices)
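
# Illustrative example (hypothetical doc, for exposition only): given
#   {"question": "...", "choice0": "駅", "choice1": "公園", ..., "choice4": "学校", "label": 1}
# process_docs adds
#   doc["choices"] == ["駅", "公園", ..., "学校"]
# so the yaml's `doc_to_choice: choices` can index it with the gold `label`.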
task: ja_leaderboard_jcommonsenseqa
dataset_path: Rakuten/JGLUE
dataset_name: JCommonsenseQA
training_split: train
validation_split: validation
test_split: null
fewshot_split: train
num_fewshot: 3
description: "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。\n\n"
process_docs: !function ja_leaderboard_jcommonsenseqa.process_docs
doc_to_text: "### 指示:\n出力は以下から選択してください:\n{% for choice in choices %}- {{ choice }}\n{% endfor %}\n### 入力:\n{{ question }}\n\n### 応答:"
doc_to_target: label
doc_to_choice: choices
target_delimiter: "\n"
output_type: multiple_choice
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
task: ja_leaderboard_jnli
dataset_path: Rakuten/JGLUE
dataset_name: JNLI
training_split: train
validation_split: validation
test_split: null
fewshot_split: train
num_fewshot: 3
description: "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。\n\n"
doc_to_text: "### 指示:\n与えられた前提と仮説の関係を回答してください。\n\n出力は以下から選択してください:\n含意\n矛盾\n中立\n\n### 入力:\n前提:{{ sentence1 }}\n仮説:{{ sentence2 }}\n\n### 応答:"
doc_to_target: label
doc_to_choice: ["含意", "矛盾", "中立"]
target_delimiter: "\n"
output_type: multiple_choice
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
task: ja_leaderboard_jsquad
dataset_path: Rakuten/JGLUE
dataset_name: JSQuAD
training_split: train
validation_split: validation
test_split: null
fewshot_split: train
num_fewshot: 2
description: "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。\n\n"
# JSQuAD contexts are formatted as "title [SEP] passage"; the template keeps only the passage.
doc_to_text: "### 指示:\n与えられた文脈から、質問に対する答えを抜き出してください。\n\n### 入力:\n文脈:{% set _context = context.split('[SEP]')[-1] %}{{ _context | trim }}\n質問:{{ question }}\n\n### 応答:"
doc_to_target: "{{ answers['text'][0] }}"
target_delimiter: "\n"
output_type: generate_until
generation_kwargs:
  until:
    - "\n\n"
  do_sample: false
metric_list:
  - metric: exact_match
    regexes_to_ignore:
      - '^\s+'
      - '\s+$'
    aggregation: mean
    higher_is_better: true
filter_list:
  - name: whitespaces
    filter:
      - function: remove_whitespace
      - function: take_first
metadata:
  version: 1.0
task: ja_leaderboard_marc_ja
dataset_path: Rakuten/JGLUE
dataset_name: MARC-ja
training_split: train
validation_split: validation
test_split: null
fewshot_split: train
num_fewshot: 3
description: "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。\n\n"
doc_to_text: "### 指示:\n以下の製品レビューを、ポジティブまたはネガティブの感情クラスのいずれかに分類してください。\n\n### 入力:\n{{ sentence }}\n\n### 応答:"
doc_to_target: label
doc_to_choice: ["ポジティブ", "ネガティブ"]
target_delimiter: "\n"
output_type: multiple_choice
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
import re

_INVALID_ANSWER = "[invalid]"
_ANSWER_REGEX = re.compile(r"(\-?[0-9\.\,]+)")


def _extract_answer(completion):
    matches = _ANSWER_REGEX.findall(completion)
    if matches:
        # Take the last number in the completion; drop stray dots and commas.
        match_str = matches[-1].strip(".")
        match_str = match_str.replace(",", "")
        try:
            match_float = float(match_str)
        except ValueError:
            return _INVALID_ANSWER
        # MGSM gold answers are integers, so non-integer values count as invalid.
        if match_float.is_integer():
            return int(match_float)
    return _INVALID_ANSWER


def process_results(doc, results):
    assert (
        len(results) == 1
    ), f"results should be a list with 1 str element, but is {results}"
    completion = results[0]
    extracted_answer = _extract_answer(completion)
    answer = doc["answer_number"]
    acc = extracted_answer == answer
    return {
        "acc": acc,
    }
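
# Illustrative behavior of the extraction above (hypothetical completions):
#   _extract_answer("よって、答えは 1,234 です")  -> 1234          (last number wins; commas stripped)
#   _extract_answer("答えはおよそ 3.5 です")      -> "[invalid]"   (non-integer results are rejected)
#   _extract_answer("わかりません")               -> "[invalid]"   (no number found)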
task: ja_leaderboard_mgsm
dataset_path: juletxara/mgsm
dataset_name: ja
training_split: train
validation_split: null
test_split: test
fewshot_split: train
num_fewshot: 5
description: "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。\n\n"
doc_to_text: "### 指示:\n与えられた問題に対して、ステップごとに答えを導き出してください。\n\n### 入力:\n{{ question | replace('問題:', '') }}\n\n### 応答:"
doc_to_target: "{{ answer | replace('ステップごとの答え:', '') }}"
target_delimiter: "\n"
output_type: generate_until
process_results: !function ja_leaderboard_mgsm.process_results
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
generation_kwargs:
  until:
    - "\n\n"
  do_sample: false
metadata:
  version: 1.0
import re


def _missing_module_message(name):
    return (
        f"`{name}` is required for `japanese_leaderboard`, please install `{name}` "
        "via pip install lm_eval[japanese_leaderboard] or pip install -e .[japanese_leaderboard]"
    )


try:
    import emoji
    import neologdn
    from fugashi import Tagger
    from rouge_score import rouge_scorer, scoring
except ModuleNotFoundError as err:
    raise ModuleNotFoundError(_missing_module_message(err.name)) from err


class MecabTokenizer:
    def __init__(self) -> None:
        # "-Owakati" makes MeCab emit space-separated tokens.
        self.tagger = Tagger("-Owakati")

    def normalize_answer(self, text):
        """Normalize text: strip emoji, unify width variants, and collapse whitespace."""

        def white_space_fix(text):
            return " ".join(text.split())

        def remove_emoji(text):
            text = "".join(["" if emoji.is_emoji(c) else c for c in text])
            emoji_pattern = re.compile(
                "["
                "\U0001f600-\U0001f64f"  # emoticons
                "\U0001f300-\U0001f5ff"  # symbols & pictographs
                "\U0001f680-\U0001f6ff"  # transport & map symbols
                "\U0001f1e0-\U0001f1ff"  # flags (iOS)
                "\U00002702-\U000027b0"
                "]+",
                flags=re.UNICODE,
            )
            return emoji_pattern.sub(r"", text)

        text = remove_emoji(text)
        # see neologdn docs for details, but handles things like full/half width variation
        text = neologdn.normalize(text)
        text = white_space_fix(text)
        return text

    def tokenize(self, text):
        return self.tagger.parse(self.normalize_answer(text)).split()


def rouge2(items):
    # Pass (reference, prediction) pairs through; scoring happens in the aggregator.
    return items


def rouge2_agg(items):
    tokenizer = MecabTokenizer()
    refs, preds = zip(*items)
    rouge_type = "rouge2"
    # mecab-based rouge
    scorer = rouge_scorer.RougeScorer(
        rouge_types=[rouge_type],
        tokenizer=tokenizer,
    )
    # Accumulate confidence intervals.
    aggregator = scoring.BootstrapAggregator()
    for ref, pred in zip(refs, preds):
        aggregator.add_scores(scorer.score(ref, pred))
    result = aggregator.aggregate()
    return result[rouge_type].mid.fmeasure
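
# Illustrative usage (hypothetical strings): the harness collects
# (reference, prediction) items via `rouge2`, and `rouge2_agg` returns the
# bootstrap-aggregated, MeCab-tokenized ROUGE-2 F-measure, e.g.:
#   rouge2_agg([("東京で大雨が降った。", "東京で大雨。"), ("試合は中止になった。", "試合は中止。")])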
task: ja_leaderboard_xlsum
dataset_path: mkshing/xlsum_ja
dataset_name: null
training_split: train
validation_split: validation
test_split: test
fewshot_split: train
num_fewshot: 1
description: "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。\n\n"
doc_to_text: "### 指示:\n与えられたニュース記事を要約してください。\n\n### 入力:\nニュース記事:{{ text }}\n\n### 応答:"
doc_to_target: "{{ summary }}"
target_delimiter: "\n"
output_type: generate_until
metric_list:
  - metric: !function ja_leaderboard_xlsum.rouge2
    aggregation: !function ja_leaderboard_xlsum.rouge2_agg
    higher_is_better: true
filter_list:
  - name: whitespaces
    filter:
      - function: remove_whitespace
      - function: take_first
generation_kwargs:
  until:
    - "\n\n"
  do_sample: false
metadata:
  version: 1.0
def process_docs(dataset):
    # XWinograd stores a 1-indexed answer ("1" or "2"); convert it to a
    # 0-indexed label and expose the two candidate sentences as choices.
    def _add_choices_and_label(doc):
        doc["label"] = int(doc["answer"]) - 1
        doc["choices"] = [doc["sentence1"].strip(), doc["sentence2"].strip()]
        return doc

    return dataset.map(_add_choices_and_label)
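
# Illustrative example (hypothetical doc): {"sentence1": "...", "sentence2": "...",
# "answer": "2"} becomes label == 1 with choices == [sentence1, sentence2], which
# the multiple_choice task scores by comparing per-choice loglikelihoods.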
task: ja_leaderboard_xwinograd
dataset_path: polm-stability/xwinograd-ja
dataset_name: null
training_split: null
validation_split: null
test_split: test
num_fewshot: null
process_docs: !function ja_leaderboard_xwinograd.process_docs
doc_to_target: "label"
doc_to_choice: "choices"
doc_to_text: ""
target_delimiter: ""
output_type: multiple_choice
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
emoji==2.14.0
fugashi[unidic-lite]
neologdn==0.5.3
rouge_score>=0.1.2
@@ -77,6 +77,7 @@ vllm = ["vllm>=0.4.2"]
zeno = ["pandas", "zeno-client"]
wandb = ["wandb>=0.16.3", "pandas", "numpy"]
gptqmodel = ["gptqmodel>=1.0.9"]
japanese_leaderboard = ["emoji==2.14.0", "neologdn==0.5.3", "fugashi[unidic-lite]", "rouge_score>=0.1.2"]
all = [
"lm_eval[anthropic]",
"lm_eval[dev]",
@@ -96,6 +97,7 @@ all = [
"lm_eval[vllm]",
"lm_eval[zeno]",
"lm_eval[wandb]",
"lm_eval[japanese_leaderboard]",
]
[tool.ruff.lint]