Commit 976d8a0b authored by Rima Shahbazyan, committed by GitHub

Adding new subtask to SCORE tasks: non greedy robustness (#2558)

* score readme added

* generate until task's "until" parameter's default value fixed.

* score mmlu-pro and agieval added

* changed macro accuracy to micro for agieval

* Always E removed from agi eval

* redundancies removed

* MATH added

* minor cosmetic changes for math

* Licenses added Readme updated

* changes for flake8 + license header on math

* Score added to readme and precommit was run.

* Score added to readme and precommit was run.

* Import error fixed

* math task bugfix
postprocess minor fix

* CR for math added

* math CR

* math task bugfix
postprocess minor fix

CR for math added

* Math cr fixed

* mmlu_pro non_greedy task added

* non greedy summarizer added

* Non greedy for all score tasks

* Bugfixes for non-greedy

* fixing the until argument

* undoing the change to "until" arguments default behaviour

* minor fix in summarizer

* log naming changes for better readability

* math subtasks naming fix

* agieval subtask naming fix

* logging added for debugging

* path issue fixed

* minor fix

* path fix

* path fix

* non_greedy_math minor fix

* final changes

* changed readme for non-greedy
added Nvidia header
added example script for non_greedy
changed prompts to match those of trt runs

* non greedy summarizer bugfix

* non_greedy summarizer fixed
parent 8de772f9
```
Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```
# Non Greedy Evaluation
This task checks the model's consistency under seed changes during generation.
More specifically, it evaluates the model's accuracy and consistency rate with 5
different seeds (seed = 1, 2,...,5) for a fixed prompt with temperature set to 0.7.
## How to run the Non-Greedy evaluation of SCORE?
Evaluating the non greedy tasks differs slightly from the other SCORE tasks because the seeds must be passed manually as arguments. Below is a step-by-step guide on how to correctly run the **Score Non-Greedy** evaluation.
To run the evaluation of the Non-Greedy tasks with 5 different seeds you should:
1. For a given dataset run the evaluation by
* specifying the task as `score_non_greedy_robustness_{DATASET_NAME}` (`DATASET_NAME` being either `agieval`, `mmlu_pro` or `math`)
* fixing the seed with the run argument `--seed=1`
* passing the `--log_samples` argument*
* specifying an output with `--output_path=SOME_OUTPUT_PATH/seed_1`
* if running with vllm, also setting the seed in `--model_args` by specifying the `seed` parameter
2. Repeat the process 5 times**, changing the `--seed` and the `--output_path` arguments accordingly from 1 to 5.
3. When all 5 runs are finished and the logs are saved, run the `./lm_eval/tasks/score/non_greedy_summarizer.py` script, passing the output directory of the above runs to the `--log_dir` argument*** and specifying the dataset for which the evaluations were run with the `--dataset` argument (`agieval`, `mmlu_pro` or `math`).
4. The script will return the default lm_evaluation_harness table where accuracies for each seed and the consistency rate are calculated.
\* _As this evaluation requires `--log_samples` to be True, it will need some extra disk space to save the prediction results for each seed._
\*\* _Refer to [`./lm_eval/tasks/score/non_greedy.sh`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/score/non_greedy.sh) to see an example of non greedy evaluation command for each seed._
\*\*\* _The `--log_dir` argument should be the path of the parent folder of the `"seed_1", "seed_2", ...` directories, which is not necessarily the `--output_path` passed to the evaluator in the 1st step._
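The steps above can be sketched as a small Python helper that only builds the argument lists for the five runs and for the summarizer, without executing anything. This is a minimal sketch under stated assumptions: the helper names, the `hf` backend default, and the `pretrained=MODEL_NAME` placeholder are illustrative and not part of this commit; the example script referenced in the second footnote remains the authoritative source for the exact commands.

```python
from typing import List

DATASETS = {"agieval", "mmlu_pro", "math"}


def build_seed_commands(
    dataset: str,
    output_root: str,
    model: str = "hf",  # assumed backend; replace with your own
    model_args: str = "pretrained=MODEL_NAME",  # placeholder, not a real model
    seeds=range(1, 6),
) -> List[List[str]]:
    """Return one lm_eval argument list per seed (steps 1 and 2)."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset}")
    commands = []
    for seed in seeds:
        args = model_args
        if model == "vllm":
            # With vllm the seed is also passed through --model_args.
            args = f"{model_args},seed={seed}"
        commands.append([
            "lm_eval",
            "--model", model,
            "--model_args", args,
            "--tasks", f"score_non_greedy_robustness_{dataset}",
            f"--seed={seed}",
            "--log_samples",
            f"--output_path={output_root}/seed_{seed}",
        ])
    return commands


def build_summarizer_command(dataset: str, output_root: str) -> List[str]:
    """Return the summarizer argument list (step 3)."""
    return [
        "python",
        "./lm_eval/tasks/score/non_greedy_summarizer.py",
        "--log_dir", output_root,  # parent folder of seed_1 ... seed_5
        "--dataset", dataset,
    ]
```

Each returned list can be handed to `subprocess.run` once the placeholders are replaced. Note that the summarizer receives the parent folder of the `seed_*` directories, not any single `--output_path`.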
......@@ -31,7 +31,7 @@ limitations under the License.
## Tasks
Both `score_robustness_mmlu_pro` and `score_robustness_agieval` contain the following 2 tasks:
Both `score_robustness_mmlu_pro` and `score_robustness_agieval` contain the following 3 tasks:
* Option order robustness:
`score_option_order_robustness_mmlu_pro`,
......@@ -41,10 +41,14 @@ Both `score_robustness_mmlu_pro` and `score_robustness_agieval` contain the foll
`score_prompt_robustness_mmlu_pro`,
`score_prompt_robustness_agieval`,
Whereas math contains only
* Non greedy robustness
`score_non_greedy_robustness_mmlu_pro`,
`score_non_greedy_robustness_agieval`,
Whereas math contains the following 2:
* Prompt robustness:
`score_prompt_robustness_math`
`score_non_greedy_robustness_math`,
### Option order robustness
......@@ -55,6 +59,10 @@ Measures the model's robustness to the placement of the correct answer in the op
Measures the model's robustness to 10 different prompts. The list of prompts can be found in the `./prompt_templates.json` file under the key `prompt_robustness`.
### Non greedy robustness
Measures the model's robustness to 5 different seeds (seeds 1 through 5). To evaluate on the non greedy task, please refer to [NON_GREEDY.md](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/score/NON_GREEDY.md).
## Metrics
All robustness tasks calculate 2 metrics: *Accuracy* and *Consistency Rate(CR)* [[4](#cr)].
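To make the second metric concrete, here is a minimal sketch of a consistency rate over per-seed predictions. It assumes CR is the pairwise agreement described in the cited reference [4]: for each question, the fraction of prediction pairs that match, averaged over questions. The function name and input shape are hypothetical; the harness's own implementation lives in the task utilities and is not shown in this diff.

```python
from itertools import combinations
from typing import Dict, List


def consistency_rate(predictions: Dict[str, List[str]]) -> float:
    """Mean fraction of agreeing prediction pairs per question.

    `predictions` maps a question id to the answers produced for it,
    one answer per seed (assumed input shape, for illustration only).
    """
    per_question = []
    for answers in predictions.values():
        pairs = list(combinations(answers, 2))
        if not pairs:
            continue  # a single prediction has no pair to compare
        agreeing = sum(a == b for a, b in pairs)
        per_question.append(agreeing / len(pairs))
    return sum(per_question) / len(per_question) if per_question else 0.0


example = {"q1": ["A", "A", "A"], "q2": ["A", "B", "A"]}
rate = consistency_rate(example)  # q1 contributes 1.0, q2 contributes 1/3
```

A model can have the same accuracy on every seed and still score a low CR if it gets different questions right each time, which is why both metrics are reported.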
......
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
task: non_greedy_robustness_agieval_aqua_rat
dataset_path: hails/agieval-aqua-rat
dataset_name: default
output_type: generate_until
test_split: test
process_docs: !function utils_agieval.non_greedy_robustness_process_docs
doc_to_text: !function utils_agieval.agi_eval_robustness_doc_to_text
doc_to_target: answer
generation_kwargs:
  max_gen_toks: 1024
  do_sample: true
  temperature: 0.7
  until: []
process_results: !function utils_agieval.non_greedy_robustness_process_results
metric_list:
  - metric: non_greedy_accuracy
    aggregation: !function utils_agieval.non_greedy_accuracy
    higher_is_better: true
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include: non_greedy_robustness_agieval_aqua_rat.yaml
task: non_greedy_robustness_agieval_logiqa_en
dataset_path: hails/agieval-logiqa-en
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include: non_greedy_robustness_agieval_aqua_rat.yaml
task: non_greedy_robustness_agieval_lsat_rc
dataset_path: hails/agieval-lsat-rc
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include: non_greedy_robustness_agieval_aqua_rat.yaml
task: non_greedy_robustness_agieval_lsat_ar
dataset_path: hails/agieval-lsat-ar
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include: non_greedy_robustness_agieval_aqua_rat.yaml
task: non_greedy_robustness_agieval_lsat_lr
dataset_path: hails/agieval-lsat-lr
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include: non_greedy_robustness_agieval_aqua_rat.yaml
task: non_greedy_robustness_agieval_sat_en
dataset_path: hails/agieval-sat-en
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include: non_greedy_robustness_agieval_aqua_rat.yaml
task: non_greedy_robustness_agieval_sat_math
dataset_path: hails/agieval-sat-math
{
"option_order_robustness":{
"prompt": "For the multiple-choice question, which option (A-E) is correct?.\n\nQuestion: {question}{options}\n\nEnd the answer with the following:\nThe best answer is (the_answer_letter) where the (the_answer_letter) is one of 'A', 'B', 'C', 'D' or 'E'.",
"prompt": "For the multiple-choice question, which option (A-E) is correct?.\n\nQuestion:{question}{options}\nEnd the answer with the following:\nThe best answer is (the_answer_letter) where the (the_answer_letter) is one of 'A', 'B', 'C', 'D' or 'E'.",
"options_format": "\n{letter}: {option}"
},
"non_greedy_robustness":{
"prompt": "For the multiple-choice question, which option (A-E) is correct?.\n\nQuestion:{question}{options}\nEnd the answer with the following:\nThe best answer is (the_answer_letter) where the (the_answer_letter) is one of 'A', 'B', 'C', 'D' or 'E'.",
"options_format": "\n{letter}: {option}"
},
"prompt_robustness":[
{
......
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
group: score_non_greedy_robustness_agieval
task:
- non_greedy_robustness_agieval_aqua_rat
- non_greedy_robustness_agieval_logiqa_en
- non_greedy_robustness_agieval_lsat_ar
- non_greedy_robustness_agieval_lsat_lr
- non_greedy_robustness_agieval_lsat_rc
- non_greedy_robustness_agieval_sat_en
- non_greedy_robustness_agieval_sat_math
aggregate_metric_list:
  - metric: non_greedy_accuracy
    aggregation: mean
    weight_by_size: true
metadata:
  version: 1.0
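The `weight_by_size: true` setting in the group config above corresponds to the commit note about switching from macro to micro accuracy for agieval. The following sketch contrasts the two aggregations on made-up numbers; the function name and the subtask sizes are hypothetical, and the harness's actual aggregation code is not reproduced here.

```python
from typing import List, Tuple


def aggregate(subtasks: List[Tuple[float, int]], weight_by_size: bool) -> float:
    """Combine (accuracy, n_samples) pairs into one group score."""
    if weight_by_size:
        # Micro average: every sample counts equally.
        total = sum(n for _, n in subtasks)
        return sum(acc * n for acc, n in subtasks) / total
    # Macro average: every subtask counts equally.
    return sum(acc for acc, _ in subtasks) / len(subtasks)


# Illustrative values only: a small subtask at 0.5 and a large one at 1.0.
subtasks = [(0.5, 100), (1.0, 300)]
micro = aggregate(subtasks, weight_by_size=True)
macro = aggregate(subtasks, weight_by_size=False)
```

With size weighting the larger subtask dominates the group score, so a small subtask with unusual accuracy shifts the result less than it would under a plain mean.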
......@@ -16,5 +16,6 @@ group: score_robustness_agieval
task:
- score_prompt_robustness_agieval
- score_option_order_robustness_agieval
- score_non_greedy_robustness_agieval
metadata:
  version: 1.0
......@@ -29,6 +29,7 @@ TEMPLATE_FILE_PATH = os.path.join(os.path.dirname(__file__), "prompt_templates.j
PROMPT_ROBUSTNESS_TEMPLATE_KEY = "prompt_robustness"
OPTION_ORDER_ROBUSTNESS_TEMPLATE_KEY = "option_order_robustness"
NON_GREEDY_ROBUSTNESS_TEMPLATE_KEY = "non_greedy_robustness"
QUESTION_KEY = "query"
ANSWER_INDEX_KEY = "gold"
......@@ -93,6 +94,13 @@ option_order_robustness_process_docs = partial(
    dataset_specific_preprocess=initial_process_docs,
)

non_greedy_robustness_process_docs = partial(
    utils.non_greedy_robustness_process_docs,
    templates_key=NON_GREEDY_ROBUSTNESS_TEMPLATE_KEY,
    template_file_path=TEMPLATE_FILE_PATH,
    dataset_specific_preprocess=initial_process_docs,
)


def prompt_robustness_process_results(doc, results) -> Dict[str, float]:
    final_answer = utils.__postprocess_pred(results[0])
......@@ -135,6 +143,17 @@ def option_order_robustness_process_results(doc, results) -> Dict[str, float]:
}
def non_greedy_robustness_process_results(doc, results) -> Dict[str, float]:
    final_answer = utils.__postprocess_pred(results[0])
    final_answer = utils.translate_model_answer_to_labels(
        final_answer, option_format=doc["options_format"], labels=LABELS
    )
    question_id = doc["question_id"]
    gt = LABELS[doc["answer_index"]]
    return {"non_greedy_accuracy": (question_id, final_answer, gt, None)}


def per_prompt_accuracy(results: List[Dict[str, Any]], p_id=0) -> float:
    accuracies = []
    for result in results:
......@@ -181,3 +200,16 @@ per_option_accuracy_c = partial(per_option_accuracy, always_opt="C")
per_option_accuracy_d = partial(per_option_accuracy, always_opt="D")
options_consistency_rate = partial(utils.options_consistency_rate, labels=LABELS)
def non_greedy_accuracy(results: List[Dict[str, Any]]) -> float:
    accuracies = []
    for result in results:
        question_id, final_answer, gt, category = result
        accuracies.append(final_answer == gt)
    accuracy = sum(accuracies) / len(accuracies)
    eval_logger.info(f"Non greedy accuracy: {accuracy}")
    return np.round(accuracy, 4)
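As a quick illustration of the data flowing into this aggregation, the sketch below re-implements the same idea without the harness: each item is the `(question_id, final_answer, gt, category)` tuple emitted by the process-results function. It is a standalone approximation, not the committed code: it uses Python's built-in `round` instead of `np.round`, omits logging, and adds an empty-input guard that the original does not have. The sample tuples are invented.

```python
from typing import List, Optional, Tuple

Result = Tuple[str, str, str, Optional[str]]


def accuracy_from_tuples(results: List[Result]) -> float:
    """Fraction of results whose predicted label equals the gold label."""
    if not results:
        return 0.0  # assumed fallback; the original would divide by zero
    correct = sum(final_answer == gt for _, final_answer, gt, _ in results)
    return round(correct / len(results), 4)


# Invented sample: two of three predictions match the gold label.
sample = [
    ("q1", "A", "A", None),
    ("q2", "B", "C", None),
    ("q3", "D", "D", None),
]
score = accuracy_from_tuples(sample)
```

Carrying the question id in each tuple is what lets the summarizer later line up predictions for the same question across the five seed runs.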
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
task: non_greedy_robustness_math_algebra
dataset_path: EleutherAI/hendrycks_math
dataset_name: algebra
output_type: generate_until
test_split: test
process_docs: !function utils_math.non_greedy_robustness_process_docs
doc_to_text: !function utils_math.math_robustness_doc_to_text
doc_to_target: answer
generation_kwargs:
  max_gen_toks: 1024
  do_sample: true
  temperature: 0.7
  until: []
process_results: !function utils_math.non_greedy_robustness_process_results
metric_list:
  - metric: non_greedy_accuracy
    aggregation: !function utils_math.non_greedy_accuracy
    higher_is_better: true
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include: non_greedy_robustness_math_algebra.yaml
dataset_name: counting_and_probability
task: non_greedy_robustness_math_counting_and_prob
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include: non_greedy_robustness_math_algebra.yaml
dataset_name: geometry
task: non_greedy_robustness_math_geometry
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include: non_greedy_robustness_math_algebra.yaml
dataset_name: intermediate_algebra
task: non_greedy_robustness_math_intermediate_algebra
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include: non_greedy_robustness_math_algebra.yaml
dataset_name: number_theory
task: non_greedy_robustness_math_num_theory
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include: non_greedy_robustness_math_algebra.yaml
dataset_name: prealgebra
task: non_greedy_robustness_math_prealgebra
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include: non_greedy_robustness_math_algebra.yaml
dataset_name: precalculus
task: non_greedy_robustness_math_precalc