Commit 2106fbeb authored by Baber

Merge branch 'main' into mathvista

# Conflicts:
#	lm_eval/models/openai_completions.py
parents 4354fe46 703fbffd
# Generated by utils.py
dataset_name: ja
-doc_to_choice: '{{[sentence1+", ですね? はい, "+sentence2, sentence1+", ですね? いいえ, "+sentence2]}}'
+doc_to_choice: '{{[sentence1+", ですね? いいえ, "+sentence2, sentence1+", ですね? はい, "+sentence2]}}'
doc_to_text: ''
include: pawsx_template_yaml
task: paws_ja
# Generated by utils.py
dataset_name: ko
-doc_to_choice: '{{[sentence1+", 맞죠? 예, "+sentence2, sentence1+", 맞죠? 아니요, "+sentence2]}}'
+doc_to_choice: '{{[sentence1+", 맞죠? 아니요, "+sentence2, sentence1+", 맞죠? 예, "+sentence2]}}'
doc_to_text: ''
include: pawsx_template_yaml
task: paws_ko
# Generated by utils.py
dataset_name: zh
-doc_to_choice: '{{[sentence1+", 对吧? 是, "+sentence2, sentence1+", 对吧? 不是, "+sentence2]}}'
+doc_to_choice: '{{[sentence1+", 对吧? 不是, "+sentence2, sentence1+", 对吧? 是, "+sentence2]}}'
doc_to_text: ''
include: pawsx_template_yaml
task: paws_zh
@@ -16,4 +16,4 @@ metric_list:
     aggregation: mean
     higher_is_better: true
 metadata:
-  version: 0.0
+  version: 1.0
```
Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```
# Non Greedy Evaluation
This task checks the model's consistency under seed changes during generation.
Specifically, it evaluates the model's accuracy and consistency rate across 5
different seeds (seed = 1, 2, ..., 5) for a fixed prompt with temperature set to 0.7.
## How to run the Non-Greedy evaluation of SCORE?
Evaluating the non-greedy tasks differs slightly from the other SCORE tasks because the seeds must be passed manually as arguments. The step-by-step guide below describes how to run the **Score Non-Greedy** evaluation correctly.
To run the Non-Greedy tasks with 5 different seeds:
1. For a given dataset, run the evaluation by
    * specifying the task as `score_non_greedy_robustness_{DATASET_NAME}` (`DATASET_NAME` being one of `agieval`, `mmlu_pro` or `math`)
    * fixing the seed with the run argument `--seed=1`
    * passing the `--log_samples` argument\*
    * specifying an output path with `--output_path=SOME_OUTPUT_PATH/seed_1`
    * if running with vLLM, also setting the seed in `--model_args` through the `seed` parameter
2. Repeat the process 5 times\*\*, changing the `--seed` and `--output_path` arguments accordingly from 1 to 5.
3. When all 5 runs have finished and the logs are saved, run the `./lm_eval/tasks/score/non_greedy_summarizer.py` script, passing the output directory of the above runs to the `--log_dir` argument\*\*\* and specifying the dataset the evaluations were run on with the `--dataset` argument (`agieval`, `mmlu_pro` or `math`).
4. The script returns the default lm_evaluation_harness table, which reports the accuracy for each seed and the consistency rate.
\* _As this evaluation requires `--log_samples` to be enabled, it needs some extra disk space to save the prediction results for each seed._
\*\* _Refer to [`./lm_eval/tasks/score/non_greedy.sh`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/score/non_greedy.sh) for an example of the non-greedy evaluation command for each seed._
\*\*\* _The `--log_dir` argument expects the path of the parent folder of the `"seed_1", "seed_2", ...` directories, which is not necessarily the `--output_path` passed to the evaluator in step 1._
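
The steps above can be sketched as a small Python helper that only builds the command lines (it does not run them). This is a minimal sketch under stated assumptions: the helper names, the `vllm` model type and the `pretrained=MODEL_NAME` placeholder are illustrative and not part of the repository; the flags are the ones named in the guide.

```python
from pathlib import PurePosixPath


def build_seed_commands(dataset, model, model_args, output_root, seeds=range(1, 6)):
    """Return one lm_eval argument list per seed (hypothetical helper)."""
    commands = []
    for seed in seeds:
        # Each seed writes to its own seed_<n> folder under the same parent.
        out_dir = PurePosixPath(output_root) / f"seed_{seed}"
        commands.append([
            "lm_eval",
            "--model", model,
            # The seed is repeated inside --model_args, as the guide asks for vLLM.
            "--model_args", f"{model_args},seed={seed}",
            "--tasks", f"score_non_greedy_robustness_{dataset}",
            f"--seed={seed}",
            "--log_samples",
            f"--output_path={out_dir}",
        ])
    return commands


def build_summarizer_command(dataset, output_root):
    """Return the summarizer call; --log_dir is the parent of the seed_<n> folders."""
    return [
        "python", "./lm_eval/tasks/score/non_greedy_summarizer.py",
        "--log_dir", str(output_root),
        "--dataset", dataset,
    ]


if __name__ == "__main__":
    for cmd in build_seed_commands("math", "vllm", "pretrained=MODEL_NAME", "SOME_OUTPUT_PATH"):
        print(" ".join(cmd))
    print(" ".join(build_summarizer_command("math", "SOME_OUTPUT_PATH")))
```

Each printed line can be pasted into a shell or passed to `subprocess.run`; the summarizer command should only be run once all five seed runs have written their logs.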
```
Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```
# SCORE: Systematic COnsistency and Robustness Evaluation for Large Language Models
## Citation
```bib
[Citation placeholder]
```
## Groups
- `score_robustness_mmlu_pro`: two 0-shot robustness tasks on the MMLU-PRO dataset [[1](#mmlu_pro)]
- `score_robustness_agieval`: two 0-shot robustness tasks on the multiple-choice subsets of the AGIEVAL datasets [[2](#agi_eval)]: `'agieval-sat-math'`, `'agieval-lsat-lr'`, `'agieval-lsat-rc'`, `'agieval-logiqa-en'`, `'agieval-aqua-rat'`, `'agieval-sat-en'`, `'agieval-lsat-ar'`
- `score_robustness_math`: one 0-shot robustness task on Hendrycks' MATH dataset [[3](#math)]
## Tasks
Both `score_robustness_mmlu_pro` and `score_robustness_agieval` contain the following 3 tasks:
* Option order robustness:
`score_option_order_robustness_mmlu_pro`,
`score_option_order_robustness_agieval`
* Prompt robustness:
`score_prompt_robustness_mmlu_pro`,
`score_prompt_robustness_agieval`
* Non greedy robustness:
`score_non_greedy_robustness_mmlu_pro`,
`score_non_greedy_robustness_agieval`
Whereas math contains the following 2:
* Prompt robustness:
`score_prompt_robustness_math`
* Non greedy robustness:
`score_non_greedy_robustness_math`
### Option order robustness
Measures the model's robustness to the placement of the correct answer in the options list by swapping the correct answer with all the other possible options.
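
As an illustration of the swapping idea, the sketch below generates one variant of a question per answer position. It is a minimal sketch, not the implementation in `utils_agieval`: the function name `swap_correct_answer` and the list-plus-index representation of a question are assumptions made for the example.

```python
def swap_correct_answer(options, answer_index):
    """Return one (options, answer_index) variant per position of the correct answer.

    Hypothetical helper: the correct option is swapped with the option at each
    target position, so the variant for the original position is unchanged.
    """
    variants = []
    for target in range(len(options)):
        swapped = list(options)  # copy so the input list is not modified
        swapped[answer_index], swapped[target] = swapped[target], swapped[answer_index]
        variants.append((swapped, target))
    return variants


if __name__ == "__main__":
    for opts, idx in swap_correct_answer(["4", "5", "6", "7"], answer_index=1):
        print(idx, opts)
```

A model that is robust to option order should pick the same underlying answer in every variant, which is what the per-option accuracies and the consistency rate are meant to capture.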
### Prompt robustness
Measures the model's robustness to 10 different prompts. The list of prompts can be found in the `./prompt_templates.json` file under the key `prompt_robustness`.
### Non greedy robustness
Measures the model's robustness to 5 different seeds: seeds = \[1-5\]. To evaluate the non-greedy task, please refer to [NON_GREEDY.md](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/score/NON_GREEDY.md).
## Metrics
All robustness tasks calculate 2 metrics: *Accuracy* and *Consistency Rate (CR)* [[4](#cr)].
$CR = \frac{1}{|Q|} \sum_{Q_k \in Q} \sum_{y_i \in Y_k} \sum_{\substack{y_j \in Y_k \\ j \neq i}}\frac{\text{sim}(y_i, y_j)}{\binom{|Y_k|}{2}}$
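
A minimal Python sketch of this metric is given below, under two assumptions that are not stated in the README: `sim` is taken to be exact match between two final answers, and each question is averaged over its unordered answer pairs (the $\binom{|Y_k|}{2}$ pairs in the denominator) so that the result lies in $[0, 1]$. The function name `consistency_rate` is hypothetical and is not the aggregation function used by the task configs.

```python
from itertools import combinations


def consistency_rate(answers_per_question, sim=lambda a, b: float(a == b)):
    """Average pairwise agreement of the answers given to each question.

    answers_per_question: one list of answers per question, e.g. one answer
    per seed, prompt or option order.
    """
    rates = []
    for answers in answers_per_question:
        pairs = list(combinations(answers, 2))  # unordered pairs (y_i, y_j), i < j
        if not pairs:
            continue  # fewer than two answers: no pair to compare (assumption: skip)
        rates.append(sum(sim(a, b) for a, b in pairs) / len(pairs))
    return sum(rates) / len(rates) if rates else 0.0


if __name__ == "__main__":
    # Question 1 is fully consistent; question 2 agrees on 1 of its 3 pairs.
    print(consistency_rate([["A", "A", "A"], ["A", "B", "A"]]))
```

Passing a different `sim` function, for example one that compares normalized math expressions, changes only how two answers are judged to agree, not the averaging.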
## Notes
- All tasks are designed for **Instruct** models, for which we recommend passing the `--apply_chat_template` flag.
## References
<a name=mmlu_pro></a>[1] Wang, et al. "Mmlu-pro: A more robust and challenging multi-task language understanding benchmark." arXiv preprint arXiv:2406.01574 (2024).
<a name=agi_eval></a>[2] Zhong, et al. "Agieval: A human-centric benchmark for evaluating foundation models." arXiv preprint arXiv:2304.06364 (2023).
<a name=math></a>[3] Hendrycks et al. "Measuring Mathematical Problem Solving With the MATH Dataset." arXiv:2103.03874 (2021).
<a name=cr></a>[4] Yukun et al. "Improving the robustness of large language models via consistency alignment." arXiv:2403.14221 (2024).
## Checklist
For adding novel benchmarks/datasets to the library:
* [-] Is the task an existing benchmark in the literature?
* [-] Have you referenced the original paper that introduced the task? - Will be referenced as soon as the paper is published
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
task: non_greedy_robustness_agieval_aqua_rat
dataset_path: hails/agieval-aqua-rat
dataset_name: default
output_type: generate_until
test_split: test
process_docs: !function utils_agieval.non_greedy_robustness_process_docs
doc_to_text: !function utils_agieval.agi_eval_robustness_doc_to_text
doc_to_target: answer
generation_kwargs:
  max_gen_toks: 1024
  do_sample: true
  temperature: 0.7
  until: []
process_results: !function utils_agieval.non_greedy_robustness_process_results
metric_list:
  - metric: non_greedy_accuracy
    aggregation: !function utils_agieval.non_greedy_accuracy
    higher_is_better: true
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include: non_greedy_robustness_agieval_aqua_rat.yaml
task: non_greedy_robustness_agieval_logiqa_en
dataset_path: hails/agieval-logiqa-en
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include: non_greedy_robustness_agieval_aqua_rat.yaml
task: non_greedy_robustness_agieval_lsat_rc
dataset_path: hails/agieval-lsat-rc
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include: non_greedy_robustness_agieval_aqua_rat.yaml
task: non_greedy_robustness_agieval_lsat_ar
dataset_path: hails/agieval-lsat-ar
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include: non_greedy_robustness_agieval_aqua_rat.yaml
task: non_greedy_robustness_agieval_lsat_lr
dataset_path: hails/agieval-lsat-lr
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include: non_greedy_robustness_agieval_aqua_rat.yaml
task: non_greedy_robustness_agieval_sat_en
dataset_path: hails/agieval-sat-en
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include: non_greedy_robustness_agieval_aqua_rat.yaml
task: non_greedy_robustness_agieval_sat_math
dataset_path: hails/agieval-sat-math
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
task: option_order_robustness_agieval_aqua_rat
dataset_path: hails/agieval-aqua-rat
dataset_name: default
output_type: generate_until
test_split: test
process_docs: !function utils_agieval.option_order_robustness_process_docs
doc_to_text: !function utils_agieval.agi_eval_robustness_doc_to_text
doc_to_target: answer
generation_kwargs:
  until: []
  max_gen_toks: 1024
  do_sample: False
process_results: !function utils_agieval.option_order_robustness_process_results
metric_list:
  - metric: per_option_accuracy_A
    aggregation: !function utils_agieval.per_option_accuracy_a
    higher_is_better: true
  - metric: per_option_accuracy_B
    aggregation: !function utils_agieval.per_option_accuracy_b
    higher_is_better: true
  - metric: per_option_accuracy_C
    aggregation: !function utils_agieval.per_option_accuracy_c
    higher_is_better: true
  - metric: per_option_accuracy_D
    aggregation: !function utils_agieval.per_option_accuracy_d
    higher_is_better: true
  - metric: options_consistency_rate
    aggregation: !function utils_agieval.options_consistency_rate
    higher_is_better: true
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include: option_order_robustness_agieval_aqua_rat.yaml
task: option_order_robustness_agieval_logiqa_en
dataset_path: hails/agieval-logiqa-en
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include: option_order_robustness_agieval_aqua_rat.yaml
task: option_order_robustness_agieval_lsat_ar
dataset_path: hails/agieval-lsat-ar
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include: option_order_robustness_agieval_aqua_rat.yaml
task: option_order_robustness_agieval_lsat_lr
dataset_path: hails/agieval-lsat-lr
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include: option_order_robustness_agieval_aqua_rat.yaml
task: option_order_robustness_agieval_lsat_rc
dataset_path: hails/agieval-lsat-rc
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include: option_order_robustness_agieval_aqua_rat.yaml
task: option_order_robustness_agieval_sat_en
dataset_path: hails/agieval-sat-en
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include: option_order_robustness_agieval_aqua_rat.yaml
task: option_order_robustness_agieval_sat_math
dataset_path: hails/agieval-sat-math