Commit e6b798f9 authored by Baber

Merge branch 'main' into metrics

# Conflicts:
#	.pre-commit-config.yaml
#	lm_eval/api/task.py
#	lm_eval/models/huggingface.py
#	lm_eval/models/vllm_causallms.py
#	pyproject.toml
parents 14a29ade 4f8195f1
@@ -22,5 +22,3 @@ metric_list:
higher_is_better: true
metadata:
version: 3.0
dataset_kwargs:
trust_remote_code: true
@@ -20,5 +20,3 @@ metric_list:
higher_is_better: true
metadata:
version: 0.0
dataset_kwargs:
trust_remote_code: true
@@ -23,5 +23,3 @@ metric_list:
higher_is_better: true
metadata:
version: 1.0
dataset_kwargs:
trust_remote_code: true
@@ -14,5 +14,4 @@ metric_list:
aggregation: mean
higher_is_better: true
dataset_kwargs:
trust_remote_code: true
streaming: true
tag:
- multiple_choice
task: hellaswag
dataset_path: hellaswag
dataset_path: Rowan/hellaswag
dataset_name: null
output_type: multiple_choice
training_split: train
@@ -20,5 +20,3 @@ metric_list:
higher_is_better: true
metadata:
version: 1.0
dataset_kwargs:
trust_remote_code: true
@@ -7,7 +7,7 @@ dataset_name: algebra
output_type: generate_until
training_split: train
test_split: test
doc_to_text: "Problem: {{problem}}\nAnswer:"
doc_to_text: "Problem: {{problem}}\nAnswer:"
process_results: !function utils.process_results
doc_to_target: "{{answer}}"
generation_kwargs:
@@ -21,5 +21,3 @@ metric_list:
higher_is_better: true
metadata:
version: 1.0
dataset_kwargs:
trust_remote_code: true
@@ -8,6 +8,15 @@ Large language models (LLMs) demonstrate exceptional performance on complex reas
Homepage: https://huggingface.co/datasets/HAERAE-HUB/HRM8K
### Evaluation Settings
The authors suggest (Appendix B) using:
* **Sampling temperature:** `0.7`
* **Top‑p:** `0.95`
* **Output length:** *min* `8` tokens, *max* `2048` tokens (`max_gen_toks`)
We default to greedy decoding.
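To reproduce the authors' suggested sampling setup instead of the greedy default, the task's `generation_kwargs` could be overridden along these lines (a sketch; `top_p` assumes the standard Hugging Face nucleus-sampling keyword, while the other keys appear in the configs below):
```yaml
generation_kwargs:
  do_sample: true     # sample rather than decode greedily
  temperature: 0.7
  top_p: 0.95         # assumed HF keyword for nucleus sampling
  max_gen_toks: 2048
```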
### Citation
@@ -20,7 +29,7 @@ Homepage: https://huggingface.co/datasets/HAERAE-HUB/HRM8K
}
```
### Groups and and Tasks
### Groups and Tasks
#### Groups
@@ -37,10 +46,13 @@ Homepage: https://huggingface.co/datasets/HAERAE-HUB/HRM8K
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
### Changelog
- 2025-07-22: v1.1: increased `max_gen_toks` to 2048
@@ -11,7 +11,7 @@ generation_kwargs:
- "<|end_of_text|>"
- "<|endoftext|>"
- "<|im_end|>"
max_gen_toks: 512
max_gen_toks: 2048
do_sample: false
temperature: 0
metric_list:
@@ -19,4 +19,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
version: 1.1
@@ -111,7 +111,7 @@ def parse_math_answer(raw_string):
return retval
def get_answer_with_dollar_sign(s):
first_pattern = "\$(.*)\$"
first_pattern = r"\$(.*)\$"
last_match = None
matches = re.findall(first_pattern, s)
if matches:
@@ -127,7 +127,7 @@ def parse_math_answer(raw_string):
if "\\n" in last_match:
last_match = last_match.split("\\n")[0]
else:
pattern = "(?:\\$)?\d+(?:\.\d+)?(?![\w\d])"
pattern = "(?:\\$)?\\d+(?:\\.\\d+)?(?![\\w\\d])"
matches = re.findall(pattern, s)
if matches:
last_match = matches[-1]
@@ -250,7 +250,7 @@ def _strip_string(string):
# remove percentage
string = string.replace("\\%", "")
string = string.replace("\%", "")
string = string.replace(r"\%", "")
# " 0." equivalent to " ." and "{0." equivalent to "{." Alternatively, add "0" if "." is the start of the string
string = string.replace(" .", " 0.")
@@ -11,7 +11,7 @@ generation_kwargs:
- "<|end_of_text|>"
- "<|endoftext|>"
- "<|im_end|>"
max_gen_toks: 512
max_gen_toks: 2048
do_sample: false
temperature: 0
metric_list:
@@ -19,4 +19,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
version: 1.1
@@ -111,7 +111,7 @@ def parse_math_answer(raw_string):
return retval
def get_answer_with_dollar_sign(s):
first_pattern = "\$(.*)\$"
first_pattern = r"\$(.*)\$"
last_match = None
matches = re.findall(first_pattern, s)
if matches:
@@ -127,7 +127,7 @@ def parse_math_answer(raw_string):
if "\\n" in last_match:
last_match = last_match.split("\\n")[0]
else:
pattern = "(?:\\$)?\d+(?:\.\d+)?(?![\w\d])"
pattern = "(?:\\$)?\\d+(?:\\.\\d+)?(?![\\w\\d])"
matches = re.findall(pattern, s)
if matches:
last_match = matches[-1]
@@ -250,7 +250,7 @@ def _strip_string(string):
# remove percentage
string = string.replace("\\%", "")
string = string.replace("\%", "")
string = string.replace(r"\%", "")
# " 0." equivalent to " ." and "{0." equivalent to "{." Alternatively, add "0" if "." is the start of the string
string = string.replace(" .", " 0.")
@@ -14,7 +14,5 @@ metric_list:
- metric: acc_norm
aggregation: mean
higher_is_better: true
dataset_kwargs:
trust_remote_code: true
metadata:
version: 0
@@ -19,5 +19,3 @@ metric_list:
higher_is_better: True
metadata:
version: 1.0
dataset_kwargs:
trust_remote_code: true
@@ -19,5 +19,3 @@ metric_list:
higher_is_better: True
metadata:
version: 1.0
dataset_kwargs:
trust_remote_code: true
@@ -29,5 +29,3 @@ generation_kwargs:
max_gen_toks: 1024
metadata:
version: 2.0
dataset_kwargs:
trust_remote_code: true
@@ -18,5 +18,3 @@ metric_list:
higher_is_better: true
metadata:
version: 1.0
dataset_kwargs:
trust_remote_code: true
@@ -22,8 +22,6 @@ metric_list:
num_fewshot: 4
metadata:
version: 3.0
dataset_kwargs:
trust_remote_code: true
fewshot_config:
sampler: first_n
samples: !function utils.list_fewshot_samples
# LIBRA
### Paper
Title: `LIBRA: Long Input Benchmark for Russian Analysis`
Abstract: `Datasets for proper evaluation of long-context understanding in Russian. For the Russian language LIBRA comprises 21 adapted datasets to study the LLM's abilities to understand long texts thoroughly. The tests are divided into four complexity groups and allow the evaluation of models across various context lengths ranging from 4k up to 128k tokens.`
Homepage: `https://huggingface.co/datasets/ai-forever/LIBRA`
### Citation
```
@misc{churin2024longinputbenchmarkrussian,
title={Long Input Benchmark for Russian Analysis},
author={Igor Churin and Murat Apishev and Maria Tikhonova and Denis Shevelev and Aydar Bulatov and Yuri Kuratov and Sergei Averkiev and Alena Fenogenova},
year={2024},
eprint={2408.02439},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2408.02439},
}
```
### Groups, Tags, and Tasks
#### Groups
* `libra_simple_information_retrieval`
* `libra_question_answering_and_multiple_choice`
* `libra_multi_hop_question_answering`
* `libra_complex_reasoning_and_mathematical_problems`
#### Tags
* `libra`
#### Tasks
* `passkey`
* `passkey_with_librusec`
* `matreshka_yes_no`
* `matreshka_names`
* `librusec_history`
* `ru_trec`
* `ru_sci_abstract_retrieval`
* `ru_sci_fi`
* `ru_quality`
* `ru_tpo`
* `ru_babilong_qa1`
* `ru_babilong_qa2`
* `ru_babilong_qa3`
* `ru_babilong_qa4`
* `ru_babilong_qa5`
* `long_context_multiq`
* `librusec_mhqa`
* `ru_2wikimultihopqa`
* `ru_sci_passage_count`
* `ru_qasper`
* `ru_gsm100`
### Variants
Usage (**all with `--apply_chat_template`**):
```
lm_eval --model hf --model_args pretrained=EleutherAI/gpt-j-6B --tasks libra_simple_information_retrieval --device cpu --apply_chat_template --log_samples --output_path test_libra_simple_information_retrieval
```
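The same run can also be scripted; a minimal sketch, assuming the harness's Python entry point `lm_eval.simple_evaluate` accepts the same options as the CLI flags above:
```python
# Sketch: the CLI invocation above, driven from Python.
# Assumes lm_eval.simple_evaluate mirrors the CLI options used here.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/gpt-j-6B",
    tasks=["libra_simple_information_retrieval"],
    device="cpu",
    apply_chat_template=True,  # required for all LIBRA variants
    log_samples=True,
)
print(results["results"])
```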
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group: libra_complex_reasoning_and_mathematical_problems
group_alias: Complex Reasoning and Mathematical Problems
task:
- ru_sci_passage_count
- ru_qasper
- ru_gsm100
group: libra_multi_hop_question_answering
group_alias: Multi-hop Question Answering
task:
- ru_babilong_qa1
- ru_babilong_qa2
- ru_babilong_qa3
- ru_babilong_qa4
- ru_babilong_qa5
- long_context_multiq
- librusec_mhqa
- ru_2wikimultihopqa