Unverified Commit d855d0ba authored by Hanwool Albert Lee, committed by GitHub

#1442 inverse scaling tasks implementation (#1589)



* initial_implementation (test has to be proceeded)

* minor fix

* revised task name and implemented new task

* minor fixes

* new tasks implement

* minor fix

* added 'prompt injection' task

* delete prompt injection task (will be implemented at next PR)

* trust remote code

* Update lm_eval/tasks/inverse_scaling/README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* added readme

* Update lm_eval/tasks/README.md

* Update lm_eval/tasks/inverse_scaling/_inverse_scaling_mc_yaml

* Update lm_eval/tasks/inverse_scaling/README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/tasks/inverse_scaling/_inverse_scaling_mc_yaml
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update README.md

* precommit?

* run precommit on readme

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
parent 3c8db1bb
@@ -49,6 +49,7 @@
| [hendrycks_ethics](hendrycks_ethics/README.md) | Tasks designed to evaluate the ethical reasoning capabilities of models. | English |
| [hendrycks_math](hendrycks_math/README.md) | Mathematical problem-solving tasks to test numerical reasoning and problem-solving. | English |
| [ifeval](ifeval/README.md) | Interactive fiction evaluation tasks for narrative understanding and reasoning. | English |
| [inverse_scaling](inverse_scaling/README.md) | Multiple-choice tasks from the Inverse Scaling Prize, designed to find settings where larger language models perform worse. | English |
| [kmmlu](kmmlu/README.md) | Knowledge-based multi-subject multiple choice questions for academic evaluation. | Korean |
| [kobest](kobest/README.md) | A collection of tasks designed to evaluate understanding in Korean language. | Korean |
| [kormedmcqa](kormedmcqa/README.md) | Medical question answering tasks in Korean to test specialized domain knowledge. | Korean |
# inverse_scaling
### Paper
Title: `Inverse Scaling: When Bigger Isn't Better`
Abstract: `Work on scaling laws has found that large language models (LMs) show predictable improvements to overall loss with increased scale (model size, training data, and compute). Here, we present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale, e.g., due to flaws in the training objective and data. We present empirical evidence of inverse scaling on 11 datasets collected by running a public contest, the Inverse Scaling Prize, with a substantial prize pool. Through analysis of the datasets, along with other examples found in the literature, we identify four potential causes of inverse scaling: (i) preference to repeat memorized sequences over following in-context instructions, (ii) imitation of undesirable patterns in the training data, (iii) tasks containing an easy distractor task which LMs could focus on, rather than the harder real task, and (iv) correct but misleading few-shot demonstrations of the task. We release the winning datasets at this https URL to allow for further investigation of inverse scaling. Our tasks have helped drive the discovery of U-shaped and inverted-U scaling trends, where an initial trend reverses, suggesting that scaling trends are less reliable at predicting the behavior of larger-scale models than previously understood. Overall, our results suggest that there are tasks for which increased model scale alone may not lead to progress, and that more careful thought needs to go into the data and objectives for training language models.`
Note: This is not an official implementation of the Inverse Scaling Prize. It was implemented by h-albert-lee with permission from the authors of the paper.
Homepage: https://github.com/inverse-scaling/prize
### Citation
```
@article{mckenzie2023inverse,
    title={Inverse Scaling: When Bigger Isn't Better},
    author={Ian R. McKenzie and Alexander Lyzhov and Michael Pieler and Alicia Parrish and Aaron Mueller and Ameya Prabhu and Euan McLean and Aaron Kirtland and Alexis Ross and Alisa Liu and Andrew Gritsevskiy and Daniel Wurgaft and Derik Kauffman and Gabriel Recchia and Jiacheng Liu and Joe Cavanagh and Max Weiss and Sicong Huang and The Floating Droid and Tom Tseng and Tomasz Korbak and Xudong Shen and Yuhui Zhang and Zhengping Zhou and Najoung Kim and Samuel R. Bowman and Ethan Perez},
    journal={arXiv preprint arXiv:2306.09479},
    year={2023}
}
```
### Groups and Tasks
#### Groups
* `inverse_scaling_mc`: runs all multiple-choice Inverse Scaling Prize tasks (currently excluding Prompt Injection), matching their implementations on OPT for multiple-choice classification. **These match the published dataset versions from the prize, which may differ slightly from the numbers in the paper, but they have been tested for equivalence to the OPT numbers reported at https://huggingface.co/inverse-scaling/opt-1.3b_eval for multiple model sizes.** A usage sketch is given after the task list below.
#### Tasks
- `inverse_scaling_hindsight_neglect_10shot`
- `inverse_scaling_redefine_math`
- `inverse_scaling_quote_repetition`
- `inverse_scaling_neqa`
- `inverse_scaling_winobias_antistereotype`: not an official Inverse Scaling Prize winner, but evaluation results for it are reported at https://huggingface.co/inverse-scaling/opt-1.3b_eval .
- `inverse_scaling_into_the_unknown`
- `inverse_scaling_memo_trap`
- `inverse_scaling_modus_tollens`
- `inverse_scaling_pattern_matching_suppression`
- `inverse_scaling_repetitive_algebra`
- `inverse_scaling_sig_figs`
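To run the whole group or a single task, one option is the harness's Python API. The snippet below is a minimal, hedged sketch; the model checkpoint and batch size are illustrative placeholders.

```python
# Minimal sketch: evaluate a Hugging Face model on the inverse_scaling_mc group.
# "facebook/opt-1.3b" and batch_size=8 are illustrative placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=facebook/opt-1.3b",
    tasks=["inverse_scaling_mc"],  # or a single task, e.g. ["inverse_scaling_memo_trap"]
    batch_size=8,
)
print(results["results"])
```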
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group:
  - inverse_scaling_mc
output_type: multiple_choice
test_split: train
doc_to_text: prompt
doc_to_choice: classes
doc_to_target: answer_index
target_delimiter: ""
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 0
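The shared `_inverse_scaling_mc_yaml` config above drives the harness's generic multiple-choice pipeline: each document supplies a `prompt`, a list of `classes`, and an `answer_index`. The sketch below is a rough illustration of how `acc` and `acc_norm` are computed for such a document, assuming a loglikelihood-scoring callable; it is not the harness internals.

```python
# Rough sketch of how one document from these configs is scored (not the
# harness internals). `score_continuation` is a hypothetical callable that
# returns log P(continuation | context) under the model being evaluated.
from typing import Callable, Dict, List


def score_doc(doc: Dict, score_continuation: Callable[[str, str], float]) -> Dict[str, float]:
    context: str = doc["prompt"]          # doc_to_text: prompt
    choices: List[str] = doc["classes"]   # doc_to_choice: classes
    gold: int = doc["answer_index"]       # doc_to_target: answer_index
    # target_delimiter is "", so each choice is appended directly to the prompt.
    lls = [score_continuation(context, choice) for choice in choices]
    # acc: argmax over raw loglikelihoods.
    acc = float(max(range(len(lls)), key=lls.__getitem__) == gold)
    # acc_norm: loglikelihoods normalized by choice length (the harness's exact
    # normalization may differ, e.g. byte vs. character length).
    lls_norm = [ll / len(c) for ll, c in zip(lls, choices)]
    acc_norm = float(max(range(len(lls_norm)), key=lls_norm.__getitem__) == gold)
    return {"acc": acc, "acc_norm": acc_norm}
```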
include: _inverse_scaling_mc_yaml
task: inverse_scaling_hindsight_neglect_10shot
dataset_path: inverse-scaling/hindsight-neglect-10shot

include: _inverse_scaling_mc_yaml
task: inverse_scaling_into_the_unknown
dataset_path: Albertmade/into-the-unknown

include: _inverse_scaling_mc_yaml
task: inverse_scaling_memo_trap
dataset_path: Albertmade/memo-trap

include: _inverse_scaling_mc_yaml
task: inverse_scaling_modus_tollens
dataset_path: Albertmade/modus-tollens

include: _inverse_scaling_mc_yaml
task: inverse_scaling_neqa
dataset_path: inverse-scaling/NeQA

include: _inverse_scaling_mc_yaml
task: inverse_scaling_pattern_matching_suppression
dataset_path: Albertmade/pattern-matching-suppression

include: _inverse_scaling_mc_yaml
task: inverse_scaling_quote_repetition
dataset_path: inverse-scaling/quote-repetition

include: _inverse_scaling_mc_yaml
task: inverse_scaling_redefine_math
dataset_path: inverse-scaling/redefine-math

include: _inverse_scaling_mc_yaml
task: inverse_scaling_repetitive_algebra
dataset_path: Albertmade/repetitive-algebra

include: _inverse_scaling_mc_yaml
task: inverse_scaling_sig_figs
dataset_path: Albertmade/sig-figs
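Each per-task file above is just the shared `_inverse_scaling_mc_yaml` plus a `task` name and a `dataset_path`. The sketch below illustrates the merge semantics of that `include` key under the assumption that the including file's keys take precedence; it is not the harness's actual config loader.

```python
# Sketch of the `include` composition: load the shared base config and let the
# per-task keys (task, dataset_path) override it. Illustrative only.
import os

import yaml


def load_task_config(path: str) -> dict:
    with open(path) as f:
        cfg = yaml.safe_load(f)
    base_name = cfg.pop("include", None)
    if base_name is not None:
        base = load_task_config(os.path.join(os.path.dirname(path), base_name))
        base.update(cfg)  # keys in the including file take precedence
        cfg = base
    return cfg
```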
group:
  - inverse_scaling_mc
task: inverse_scaling_winobias_antistereotype
dataset_path: mathemakitten/winobias_antistereotype_test_v5
output_type: multiple_choice
test_split: test
doc_to_text: text
doc_to_choice: classes
doc_to_target: target
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
dataset_kwargs:
  trust_remote_code: true
metadata:
  version: 0
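The `mathemakitten/winobias_antistereotype_test_v5` dataset is built by a repository script, which is why this config sets `dataset_kwargs: trust_remote_code: true`. Loading it directly would look roughly as follows (field names taken from the config above; this is an illustrative sketch, not part of the task code).

```python
# Sketch: the trust_remote_code flag is passed through to datasets.load_dataset,
# which needs it because this dataset is defined by a repository script.
from datasets import load_dataset

ds = load_dataset(
    "mathemakitten/winobias_antistereotype_test_v5",
    split="test",
    trust_remote_code=True,
)
example = ds[0]
# Fields used by the task config: doc_to_text -> "text",
# doc_to_choice -> "classes", doc_to_target -> "target".
print(example["text"], example["classes"], example["target"])
```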