Commits · d272c19fe7a35fac79b15a236fdbfe7096dd1ad1 · gaoqiong / lm-evaluation-harness

01 Mar, 2024 1 commit
- Add multilingual truthfulqa targets (#1499) · d272c19f
  Zehan Li authored Mar 01, 2024
  
  d272c19f
27 Feb, 2024 2 commits
- update name of val split in truthfulqa multilingual (#1488) · a08eb870
  Hailey Schoelkopf authored Feb 27, 2024
  
  a08eb870
- add multilingual mmlu eval (#1484) · 7cd004c4
  Zehan Li authored Feb 27, 2024
  
  7cd004c4
26 Feb, 2024 6 commits

Lintang Sutawika authored Feb 26, 2024



* add brier_score

* process brier_score

* brier score is working for N-sized class

* fixed brier score

* add TED to BigBench and Brier score to MMLU

* format

* Update metrics.py

* Update task.py

* Update generate_until_template_yaml

* Delete lm_eval/tasks/bigbench/aux_metric.py

* Update generate_until_template_yaml

* Update _default_template_yaml

* Update _generate_configs.py

* Update _generate_configs.py

* Update _generate_configs.py

* fix (format?)

* format?

* format, once more

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

96d185fa
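
For readers unfamiliar with the metric named in this commit: the multi-class Brier score is the mean squared distance between a predicted probability distribution and the one-hot encoding of the correct class. A minimal sketch follows, assuming plain Python lists. The function name `brier_score` and its signature are illustrative and are not taken from `lm_eval/metrics.py`. This sketch sums over classes without normalizing; some definitions divide by the number of classes.

```python
from typing import Sequence


def brier_score(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean multi-class Brier score (illustrative, not the harness's code).

    probs[i] is the predicted distribution over classes for example i,
    labels[i] is the index of the correct class for example i.
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if not probs:
        raise ValueError("need at least one example")

    total = 0.0
    for dist, gold in zip(probs, labels):
        # Squared error against the one-hot target, summed over all classes.
        total += sum(
            (p - (1.0 if k == gold else 0.0)) ** 2 for k, p in enumerate(dist)
        )
    return total / len(probs)
```

Under this definition a perfect prediction scores 0 and a fully confident wrong prediction scores 2, so lower is better.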

Create a means for caching task registration and request building. Ad… (#1372) · 1e6c9272

Aaron V authored Feb 26, 2024



* Create a means for caching task registration and request building. Add the ability to specify an args dict for simple_evaluate().

* Remove extra S in cache path in caching module
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Rename requests cache args, make model_args polymorphic so that a dict can also be accepted.

* Update docs to reflect new caching behavior, add CLI args for requests caching. Create a function for deleting items in the cache.

* Update documentation, fix minor bug with arg parsing for requests caching where an undefined variable was used.

* Remove line from gitignore, add to cli for caching datasets.

* Add hashing suffix to .pickles. Update test script typo.

* Favor isinstance() over type() in evaluator.py

* Add tests for caching, gets tests working, remove unneeded arg from build_all_requests().

* Update arg description to simple_evaluate.

* Update pyproject.toml

* Fix typehint

* Remove the use of random() for creating default cache pickle hash.

* Check that cache dir exists before clearing it in request cache tests.

* Fix linting problems.

* Fix additional formatting errors.

* Remove trailing whitespace.

* Add new line to the end of .gitignore.

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

1e6c9272

Revert "setting trust_remote_code (#1467)" (#1474) · f6befdb9
Hailey Schoelkopf authored Feb 26, 2024
```
This reverts commit c1145dfd.
```
f6befdb9
add arabic mmlu (#1402) · 7de7b27e
khalil authored Feb 26, 2024
```
* add arabic mmlu

* update the description

* add readme file
```
7de7b27e
setting trust_remote_code (#1467) · c1145dfd
Vicki Boykis authored Feb 26, 2024

c1145dfd
Apply code autoformatting with Ruff to tasks/*.py and *__init__.py (#1469) · d27c0c08
LSinev authored Feb 26, 2024

d27c0c08

23 Feb, 2024 1 commit
- update parsing logic of mgsm following gsm8k (#1462) · 8371662c
  thnkinbtfly authored Feb 24, 2024
  
  8371662c
22 Feb, 2024 1 commit

PR fixing the issue #1391 (wrong contexts in the mgsm task) (#1440) · a72babbf

Lei Chen authored Feb 22, 2024



* fix the issue #1391, wrong contexts in mgsm tasks

* fix yaml issue for having two target_delimiter lines. For COT tasks, keep the one with a space (default)

* regenerate all task yaml files
- change naming so that file name will match with task name
- task and file names follow a consistent scheme, mgsm_(mode)_(lang), for three modes: direct, en_cot, and native_cot

* English CoTs should have a space as target_delimiter

* Update utils.py

* Apply suggestions from code review

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

a72babbf

21 Feb, 2024 1 commit

Added KMMLU evaluation method and changed ReadMe (#1447) · c26a6ac7

Hanwool Albert Lee authored Feb 21, 2024



* update kmmlu default formatting

* Update _default_kmmlu_yaml

* Delete lm_eval/tasks/kmmlu/utils.py

* new tasks implemented

* add direct tasks

* update direct evaluate

* update direct eval

* add cot sample

* update cot

* add cot

* Update _cot_kmmlu_yaml

* add kmmlu90

* Update and rename _cot_kmmlu.yaml to _cot_kmmlu_yaml

* Create kmmlu90.yaml

* Update _cot_kmmlu_yaml

* add direct

* Update _cot_kmmlu_yaml

* Update and rename kmmlu90.yaml to kmmlu90_cot.yaml

* Update kmmlu90_direct.yaml

* add kmmlu hard

* Update _cot_kmmlu_yaml

* Update _cot_kmmlu_yaml

* update cot

* update cot

* erase typo

* Update _cot_kmmlu_yaml

* update cot

* Rename dataset to match k-mmlu-hard

* removed kmmlu90

* fixed name 'kmmlu_cot' to 'kmmlu_hard_cot' and revised README

* applied pre-commit before pull requests

* rename datasets and add notes

* Remove DS_Store cache

* Update lm_eval/tasks/kmmlu/README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Change citations and reflect reviews on version

* Added kmmlu_hard and fixed other errors

* fixing minor errors

* remove duplicated

* Rename files

* try ".index"

* minor fix

* minor fix again

* fix revert.

* minor fix. thanks to hailey

---------
Co-authored-by: GUIJIN SON <spthsrbwls123@yonsei.ac.kr>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

c26a6ac7

20 Feb, 2024 3 commits

Add a new task GPQA (the part without CoT) (#1434) · 5ab295c8

Uanu authored Feb 21, 2024



* add new task GPQA_n_shot

* add new task GPQA_zeroshot

* correct GPQA_zeroshot filename

* Randomly shuffle choices

* Correct missing parentheses

* delete wrong tasks

* Add README

* Update lm_eval/tasks/gpqa/zeroshot/_gpqa_zeroshot_yaml

* Update lm_eval/tasks/gpqa/n_shot/utils.py

* Update lm_eval/tasks/gpqa/n_shot/utils.py

* Update lm_eval/tasks/gpqa/README.md

* placate linter

* linter

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

5ab295c8

Group reqs by context (#1425) · 45941c67

Baber Abbasi authored Feb 20, 2024



* add key lookup for same contexts

* nit

* appease pre-commit

* nit

* use `expand` (in-place view) rather than `repeat`

* try mixed grouping

* add docs.

* nit

* nit

* nits

* fix tests

* Move greedy_tokens calculation out of cache loop

* nit

* nits

* add test

* nits

* fix name conflict

* fix name conflict

* chunk tensor

* move Collator

* nits/docstring

* fixup

* fixup

* group contexts only for decoders

* pre-commit

* fix `generate_until` test

* fix `generate_until` test

* Update lm_eval/models/huggingface.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* add docs

* nit

* add docs

* add docs

* add 'logits_cache' arg

* bugfix

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

45941c67

Add a new task HaeRae-Bench (#1445) · 8680e938

Hanwool Albert Lee authored Feb 20, 2024

* haerae_reimplementation

* edited Readme and add few_shot settings

* edited readme

* newlines at end of each file

* Modifying the README file

* applied pre-commit

8680e938

19 Feb, 2024 2 commits

update bbh, gsm8k, mmlu parsing logic and prompts (Orca2 bbh_cot_zeroshot 0% -> 42%) (#1356) · 89deeeaf

thnkinbtfly authored Feb 20, 2024



* update bbh, gsm8k, mmlu parsing logic and prompts

* remove the formatting prompt (bbh) + minor update (mmlu)

* update bbh, gsm8k, mmlu zeroshot, revert fewshots

* update bbh, gsm8k, mmlu version, forward changes to gsm8k-cot

* remove take_last, update to use docs parameters

* add newline

* ruff formatting

* Update pyproject.toml

* fix format

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

89deeeaf

Correct typo in task name (#1443) · 19cbb292
larekrow authored Feb 19, 2024

19cbb292

15 Feb, 2024 1 commit
- Update README.md (#1430) · f3b79170
  David Hoffmann authored Feb 15, 2024
  
  f3b79170
13 Feb, 2024 1 commit
- Fix: task weighting by subtask size ; update Pooled Stderr formula slightly (#1427) · 620d6a15
  Hailey Schoelkopf authored Feb 13, 2024
```
* fix weight_by_size condition

* add tests, update stderr formula slightly

* apply pre-commit
```
  620d6a15
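
The `weight_by_size` fix above concerns how a group score is aggregated from subtasks of different sizes. As a rough illustration only, a size-weighted mean can be sketched as below. The helper name `size_weighted_mean` is hypothetical, and this is not the harness's aggregation or pooled-stderr code.

```python
from typing import Sequence


def size_weighted_mean(scores: Sequence[float], sizes: Sequence[int]) -> float:
    """Average subtask scores, weighting each by its number of examples.

    Hypothetical helper for illustration; names are not from lm_eval.
    """
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")
    total = sum(sizes)
    if total <= 0:
        raise ValueError("total size must be positive")
    # Each subtask contributes in proportion to its example count.
    return sum(score * n for score, n in zip(scores, sizes)) / total
```

With scores 0.5 and 1.0 on subtasks of 100 and 300 examples, the weighted mean is 0.875, whereas the unweighted mean would be 0.75.
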
12 Feb, 2024 1 commit
- [m_mmlu] added multilingual evaluation from alexandrainst/m_mmlu (#1358) · b69c67c1
  Alessandro Ercolani authored Feb 12, 2024
  
  b69c67c1
11 Feb, 2024 2 commits
- Add multilingual TruthfulQA task (#1420) · 7397b965
  Uanu authored Feb 11, 2024
  
  7397b965
- Add multilingual ARC task (#1419) · 0256c682
  Uanu authored Feb 11, 2024
  
  0256c682
02 Feb, 2024 1 commit
- Fix for https://github.com/EleutherAI/lm-evaluation-harness/issues/1383 (#1384) · 9a902155
  Pasquale Minervini authored Feb 02, 2024
```
Fixes https://github.com/EleutherAI/lm-evaluation-harness/issues/1383

If this is okay, it will need to be propagated to SCROLLS
```
  9a902155
01 Feb, 2024 2 commits

Faster Task and Group Loading, Allow Recursive Groups (#1321) · d714fc95

Lintang Sutawika authored Feb 01, 2024



* add trust_remote_code as default

* task for testing recursive

* changed source of ALL_TASKS

* tasks should only accept TaskObjects

* initialize_tasks returns list of tasks and groups

* remove trust_remote_code for now

* moved constructor process to inside load_yaml_config

* more comprehensive way to index tasks and groups

* pre-commit format

* add exit after error

* adjust how task objects are called

* no need to use get_task_dict

* load_task_or_group works but only for tasks

* pre-commit format

* half working for nested groups

* changed variable names

* allow groups and tasks to work

* temp save

* indexing and loading are part of a task_manager object

* adapted initialize_tasks

* iron out bugs

* fixed typo

* fixed typo

* simplified code

* further tidy up

* remove lines for testing

* removed test lines

* removed unused code

* remove unused import

* fixed bug

* removed comments

* a group in a list of groups can accept parameter changes like `num_fewshot`

* check if config is task update

* add GroupConfig object

* edit test yaml

* remove args

* testing returning to python task list

* add weight_by_size config

* describe weight_by_size in docs

* fix weight by size potential error

* can load individual custom python class task

* moved import_function into the config loading file

* remove print lines

* add squadv2 yaml

* temporary scroll implementation

* revert back to use load_yaml_config but with modes

* fix group being loaded with a None

* reformat

* can load unregistered tasks from a group

* update scrolls

* edit scrolls multiplechoice task

* adjust class initialization

* fix initialization

* changed how to identify group and python tasks, fix logger

* allow loading "include" that is nested in a group config

* reworked flan benchmark

* allow duplicate task in the same group to co-exist

* process group_alias

* removed group_alias

* allow parameters set in group_config to apply to all tasks in tasklist

* add function, but comment for now

* reworked processing dict-based config

* fixed how configs in group are processed

* update to allow root group to have its alias used

* remove unused classes

* remove unused classes

* revert some parts to original

* forgot to change one variable

* adapt the new process to use get_task_dict

* fix for singular group call

* fix variable names

* add TaskManager into the evaluator

* format

* changed how dict tasks are loaded

* add docs

* Update docs/new_task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update evaluator.py

* Update evaluator.py

* remove groupconfig for now

* changed _config to config

* update interface.md to explain TaskManager

* added property functions

* adjusted logger

* update write_out.py

* updated tests

* added documentation and some modifications

* added docstring documentation

* precommit format

* updated task loading for tests

* updates tests

* changed arg order for load_yaml_config

* update to handle scrolls and edit log message

* remove unused lines

* return a list of task classes and not a dict

* Update __init__.py

* Delete lm_eval/tasks/benchmarks/test.yaml

* Update task.py

* Update lm_eval/utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update utils.py

* re-added old functions with new log message

* Update docs/new_task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update new_task_guide.md

* added info regarding `get_task_dict` and documentation

* add get_config for Task

* pre-commit formatting

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

d714fc95

Enable override of printed `n-shot` in table (#1379) · 17191063
Hailey Schoelkopf authored Feb 01, 2024
```
* allow tasks to specify printed fewshot val

* fix to belebele

* update metadata field's documentation
```
17191063

31 Jan, 2024 1 commit

Fix unintuitive `--gen_kwargs` behavior (#1329) · bd7d265a

Hailey Schoelkopf authored Jan 31, 2024

* don't override do_sample if no value for it is passed

* Update gen_kwargs override condition

* Update huggingface.py

* Update huggingface.py

* run linters

* silence an erroneous warning

bd7d265a

28 Jan, 2024 1 commit

Apply some best practices and guideline recommendations to code (#1363) · 488759d2

LSinev authored Jan 28, 2024

* raise Exception, not a string

Additional info https://peps.python.org/pep-0352/#exception-hierarchy-changes
https://docs.python.org/3.8/tutorial/errors.html#raising-exceptions

* Apply PEP8 recommendation to prefer isinstance

"Object type comparisons should always use isinstance() instead of comparing types directly"
https://peps.python.org/pep-0008/

* Remove dangerous default mutable values in arguments

https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/dangerous-default-value.html

* Format logging messages with fstring (not with format)

Additional info
https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/logging-format-interpolation.html
There are also discussions about the speed of formatting while logging or some unintended code executions
https://github.com/pylint-dev/pylint/issues/2395
https://stackoverflow.com/a/54368109
but at least one format (fstring one) will be used throughout the project

* Specify utf-8 encoding for `open` explicitly

If not specified, the default encoding may differ across environments, OSes, and Python versions. See
https://peps.python.org/pep-0597/
https://docs.python.org/3.11/library/locale.html#locale.getencoding
https://docs.python.org/3.10/library/os.html#utf8-mode
https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/unspecified-encoding.html

This also helps when code from English-language tasks is used as a starting point for tasks in non-English languages.

* Use inline-ignoring comments to pass pre-commit instead of identity process

https://flake8.pycqa.org/en/3.0.1/user/ignoring-errors.html#in-line-ignoring-errors
https://www.flake8rules.com/rules/F841.html

flake8 comments are supported by ruff: https://docs.astral.sh/ruff/linter/#error-suppression

488759d2
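
Two of the guidelines in this commit, avoiding mutable default arguments and preferring `isinstance()` over direct type comparison, are easy to show in a few lines. The sketch below is illustrative only; `append_buggy`, `append_safe`, and `TaskList` are hypothetical names and do not come from the harness.

```python
from typing import List, Optional


def append_buggy(item: str, bucket: List[str] = []) -> List[str]:
    # The default list is created once, when the function is defined,
    # so every call that omits `bucket` shares the same list object.
    bucket.append(item)
    return bucket


def append_safe(item: str, bucket: Optional[List[str]] = None) -> List[str]:
    # Use None as the sentinel and build a fresh list on each call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


class TaskList(list):
    """Hypothetical list subclass used to contrast isinstance() with type()."""


def is_list_like(obj: object) -> bool:
    # isinstance() accepts subclasses; `type(obj) == list` would reject TaskList.
    return isinstance(obj, list)
```

Calling `append_buggy` twice without a second argument returns a list containing both items, while `append_safe` returns a one-element list each time.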

25 Jan, 2024 1 commit

`Filter` docs not offset by `doc_id` (#1349) · a0f1cacd

Baber Abbasi authored Jan 26, 2024

* get `doc` from instance

* accelerate bugfix: get ground doc from instance

* convert filter to `process_result`

* get docs from instances in `FilterEnsemble`

* rename

* nit

* better looping

* fix typehint

a0f1cacd

23 Jan, 2024 2 commits

Don't use `get_task_dict()` in task registration / initialization (#1331) · 969b48bf

Hailey Schoelkopf authored Jan 23, 2024



* don't use get_task_dict() as a helper, it will download the dataset!

* pre-commit

* Update README.md

---------
Co-authored-by: lintangsutawika <lintang@eleuther.ai>

969b48bf

Update migrated HF dataset paths (#1332) · 45a8f709

Hailey Schoelkopf authored Jan 22, 2024



* Update arc_easy.yaml

* Update flan_cot.yaml

* update HF dataset path

* Update freeform.yaml

* Update flan_cot.yaml

---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

45a8f709

19 Jan, 2024 1 commit
- Update polemo2_in.yaml (#1318) · 4477d572
  Lintang Sutawika authored Jan 20, 2024
  
  4477d572
18 Jan, 2024 3 commits
- Fix group register (#1315) · 72ea626e
  Lintang Sutawika authored Jan 19, 2024
```
* tuple should be considered as well

* set option to keep callable as callable
```
  72ea626e
- Fix polemo2_in.yaml config name (#1313) · b8cbc425
  Quentin Lhoest authored Jan 18, 2024
  
  b8cbc425
- Update nq_open.yaml (#1305) · 10488d0d
  Hannibal046 authored Jan 18, 2024
```
* Update nq_open.yaml

change regex

* Bump NQ version

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
```
  10488d0d
16 Jan, 2024 1 commit
- Update nq_open.yaml (#1289) · 032e879b
  Hailey Schoelkopf authored Jan 15, 2024
  
  032e879b
15 Jan, 2024 2 commits
- Allow parameter edits for registered tasks when listed in a benchmark (#1273) · 03e7df51
  Lintang Sutawika authored Jan 15, 2024
```
* benchmark yamls allow minor edits of already registered tasks

* add documentation

* removed print
```
  03e7df51
- fix whitespace in target + prompt for CoT gsm8k (#1275) · ace4393e
  Hailey Schoelkopf authored Jan 15, 2024
  
  ace4393e
12 Jan, 2024 1 commit

add Kobest (#1263) · 653217a7

jp authored Jan 12, 2024

* Add: kobest config file

* Add: kobest utils

* Add: README

* Update utils.py

653217a7

11 Jan, 2024 2 commits

Fix bug in multi-token Stop Sequences (#1268) · ff739414
Hailey Schoelkopf authored Jan 11, 2024
```
* fix incorrect lookback protections

* bump generate_until task versions
```
ff739414

MultiMedQA (#1198) · 818c056b

Tanishq Abraham authored Jan 10, 2024



* multimedqa

* Update medqa.yaml

* move to benchmarks folder

* add README.md

---------
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>

818c056b