Commit b58e5556 authored by Baber

Merge branch 'main' into tasklist

# Conflicts:
#	pyproject.toml
parents 6e1866f5 4f8195f1
tag:
- multiple_choice
task: hellaswag
-dataset_path: hellaswag
+dataset_path: Rowan/hellaswag
dataset_name: null
output_type: multiple_choice
training_split: train
@@ -20,5 +20,3 @@ metric_list:
higher_is_better: true
metadata:
version: 1.0
-dataset_kwargs:
-  trust_remote_code: true
@@ -7,7 +7,7 @@ dataset_name: algebra
output_type: generate_until
training_split: train
test_split: test
doc_to_text: "Problem: {{problem}}\nAnswer:"
doc_to_text: "Problem: {{problem}}\nAnswer:"
process_results: !function utils.process_results
doc_to_target: "{{answer}}"
generation_kwargs:
@@ -21,5 +21,3 @@ metric_list:
higher_is_better: true
metadata:
version: 1.0
-dataset_kwargs:
-  trust_remote_code: true
@@ -8,6 +8,15 @@ Large language models (LLMs) demonstrate exceptional performance on complex reas
Homepage: https://huggingface.co/datasets/HAERAE-HUB/HRM8K
### Evaluation Settings
The authors suggest (Appendix B) using:
* **Sampling temperature:** `0.7`
* **Top‑p:** `0.95`
* **Output length:** *min* `8` tokens, *max* `2048` tokens (`max_gen_toks`)
We default to greedy decoding.
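For reference, a minimal sketch of what the authors' suggested sampling setup would look like as `generation_kwargs` (illustrative only, not part of this change; the shipped configs keep the greedy defaults shown further down in this diff, and the paper's minimum output length of 8 tokens is not encoded here):
```
# Hypothetical sketch of the Appendix B sampling settings; the actual task
# configs in this commit use greedy decoding (do_sample: false, temperature: 0).
generation_kwargs:
  do_sample: true
  temperature: 0.7
  top_p: 0.95
  max_gen_toks: 2048
```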
### Citation
@@ -20,7 +29,7 @@ Homepage: https://huggingface.co/datasets/HAERAE-HUB/HRM8K
}
```
-### Groups and and Tasks
+### Groups and Tasks
#### Groups
@@ -37,10 +46,13 @@ Homepage: https://huggingface.co/datasets/HAERAE-HUB/HRM8K
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
-* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
-* [ ] Is the "Main" variant of this task clearly denoted?
-* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
-* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
+* [x] Is the "Main" variant of this task clearly denoted?
+* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
+### Changelog
+- 2025-07-22: v1.1: increased `max_gen_toks` to 2048
@@ -11,7 +11,7 @@ generation_kwargs:
- "<|end_of_text|>"
- "<|endoftext|>"
- "<|im_end|>"
-max_gen_toks: 512
+max_gen_toks: 2048
do_sample: false
temperature: 0
metric_list:
@@ -19,4 +19,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
-version: 1.0
+version: 1.1
@@ -111,7 +111,7 @@ def parse_math_answer(raw_string):
return retval
def get_answer_with_dollar_sign(s):
first_pattern = "\$(.*)\$"
first_pattern = r"\$(.*)\$"
last_match = None
matches = re.findall(first_pattern, s)
if matches:
@@ -127,7 +127,7 @@ def parse_math_answer(raw_string):
if "\\n" in last_match:
last_match = last_match.split("\\n")[0]
else:
pattern = "(?:\\$)?\d+(?:\.\d+)?(?![\w\d])"
pattern = "(?:\\$)?\\d+(?:\\.\\d+)?(?![\\w\\d])"
matches = re.findall(pattern, s)
if matches:
last_match = matches[-1]
@@ -250,7 +250,7 @@ def _strip_string(string):
# remove percentage
string = string.replace("\\%", "")
string = string.replace("\%", "")
string = string.replace(r"\%", "")
# " 0." equivalent to " ." and "{0." equivalent to "{." Alternatively, add "0" if "." is the start of the string
string = string.replace(" .", " 0.")
@@ -11,7 +11,7 @@ generation_kwargs:
- "<|end_of_text|>"
- "<|endoftext|>"
- "<|im_end|>"
-max_gen_toks: 512
+max_gen_toks: 2048
do_sample: false
temperature: 0
metric_list:
@@ -19,4 +19,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
-version: 1.0
+version: 1.1
@@ -111,7 +111,7 @@ def parse_math_answer(raw_string):
return retval
def get_answer_with_dollar_sign(s):
first_pattern = "\$(.*)\$"
first_pattern = r"\$(.*)\$"
last_match = None
matches = re.findall(first_pattern, s)
if matches:
@@ -127,7 +127,7 @@ def parse_math_answer(raw_string):
if "\\n" in last_match:
last_match = last_match.split("\\n")[0]
else:
pattern = "(?:\\$)?\d+(?:\.\d+)?(?![\w\d])"
pattern = "(?:\\$)?\\d+(?:\\.\\d+)?(?![\\w\\d])"
matches = re.findall(pattern, s)
if matches:
last_match = matches[-1]
@@ -250,7 +250,7 @@ def _strip_string(string):
# remove percentage
string = string.replace("\\%", "")
string = string.replace("\%", "")
string = string.replace(r"\%", "")
# " 0." equivalent to " ." and "{0." equivalent to "{." Alternatively, add "0" if "." is the start of the string
string = string.replace(" .", " 0.")
@@ -14,7 +14,5 @@ metric_list:
- metric: acc_norm
aggregation: mean
higher_is_better: true
-dataset_kwargs:
-  trust_remote_code: true
metadata:
version: 0
@@ -19,5 +19,3 @@ metric_list:
higher_is_better: True
metadata:
version: 1.0
-dataset_kwargs:
-  trust_remote_code: true
@@ -19,5 +19,3 @@ metric_list:
higher_is_better: True
metadata:
version: 1.0
-dataset_kwargs:
-  trust_remote_code: true
@@ -29,5 +29,3 @@ generation_kwargs:
max_gen_toks: 1024
metadata:
version: 2.0
-dataset_kwargs:
-  trust_remote_code: true
@@ -18,5 +18,3 @@ metric_list:
higher_is_better: true
metadata:
version: 1.0
-dataset_kwargs:
-  trust_remote_code: true
@@ -22,8 +22,6 @@ metric_list:
num_fewshot: 4
metadata:
version: 3.0
-dataset_kwargs:
-  trust_remote_code: true
fewshot_config:
sampler: first_n
samples: !function utils.list_fewshot_samples
# LIBRA
### Paper
Title: `LIBRA: Long Input Benchmark for Russian Analysis`
Abstract: `Datasets for proper evaluation of long-context understanding in Russian. For the Russian language LIBRA comprises 21 adapted datasets to study the LLM's abilities to understand long texts thoroughly. The tests are divided into four complexity groups and allow the evaluation of models across various context lengths ranging from 4k up to 128k tokens.`
Homepage: `https://huggingface.co/datasets/ai-forever/LIBRA`
### Citation
```
@misc{churin2024longinputbenchmarkrussian,
title={Long Input Benchmark for Russian Analysis},
author={Igor Churin and Murat Apishev and Maria Tikhonova and Denis Shevelev and Aydar Bulatov and Yuri Kuratov and Sergei Averkiev and Alena Fenogenova},
year={2024},
eprint={2408.02439},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2408.02439},
}
```
### Groups, Tags, and Tasks
#### Groups
* `libra_simple_information_retrieval`
* `libra_question_answering_and_multiple_choice`
* `libra_multi_hop_question_answering`
* `libra_complex_reasoning_and_mathematical_problems`
#### Tags
* `libra`
#### Tasks
* `passkey`
* `passkey_with_librusec`
* `matreshka_yes_no`
* `matreshka_names`
* `librusec_history`
* `ru_trec`
* `ru_sci_abstract_retrieval`
* `ru_sci_fi`
* `ru_quality`
* `ru_tpo`
* `ru_babilong_qa1`
* `ru_babilong_qa2`
* `ru_babilong_qa3`
* `ru_babilong_qa4`
* `ru_babilong_qa5`
* `long_context_multiq`
* `librusec_mhqa`
* `ru_2wikimultihopqa`
* `ru_sci_passage_count`
* `ru_qasper`
* `ru_gsm100`
# Variants
Usage (**all with `--apply_chat_template`**):
```
lm_eval --model hf --model_args pretrained=EleutherAI/gpt-j-6B --tasks libra_simple_information_retrieval --device cpu --apply_chat_template --log_samples --output_path test_libra_simple_information_retrieval
```
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group: libra_complex_reasoning_and_mathematical_problems
group_alias: Complex Reasoning and Mathematical Problems
task:
- ru_sci_passage_count
- ru_qasper
- ru_gsm100
group: libra_multi_hop_question_answering
group_alias: Multi-hop Question Answering
task:
- ru_babilong_qa1
- ru_babilong_qa2
- ru_babilong_qa3
- ru_babilong_qa4
- ru_babilong_qa5
- long_context_multiq
- librusec_mhqa
- ru_2wikimultihopqa
group: libra_question_answering_and_multiple_choice
group_alias: Question Answering and Multiple Choice
task:
- matreshka_yes_no
- matreshka_names
- librusec_history
- ru_sci_abstract_retrieval
- ru_quality
group: libra_simple_information_retrieval
group_alias: Simple Information Retrieval
task:
- passkey
- passkey_with_librusec
dataset_path: ai-forever/LIBRA
custom_dataset: !function utils.filter_dataset_by_page_lengths
test_split: test
output_type: generate_until
doc_to_target: positive_outputs
process_results: !function utils.process_results
tag:
- libra
task: librusec_history
task_alias: LibrusecHistory
dataset_name: librusec_history
dataset_kwargs:
dataset_name: librusec_history
include: _template_yaml
doc_to_text: 'Тебе предоставляется длинный текст, в котором нужно найти ответ на вопрос.
{{context}}
Найди ответ в тексте на следующий вопрос.
Вопрос:{{input}}
Ответ:'
generation_kwargs:
do_sample: false
temperature: 0.0
max_gen_toks: 32
metric_list:
- metric: libra_score
aggregation: !function utils.aggregate_results_em
higher_is_better: true
weight_by_size: true
metadata:
version: 0.0