Commits · a72babbfbddd9195748351892dced4f82fccbc0d · gaoqiong / lm-evaluation-harness

22 Feb, 2024 1 commit

PR fixing the issue #1391 (wrong contexts in the mgsm task) (#1440) · a72babbf

Lei Chen authored Feb 22, 2024



* fix the issue #1391, wrong contexts in mgsm tasks

* fix yaml issue for having two target_delimiter lines. For COT tasks, keep the one with a space (default)

* regenerate all task yaml files
- change naming so that file name will match with task name
- task and file names follow one consistent scheme, mgsm_(mode)_(lang), for three modes: direct, en_cot, and native_cot

* English CoTs should have a space as target_delimiter

* Update utils.py

* Apply suggestions from code review

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

a72babbf

21 Feb, 2024 1 commit

Added KMMLU evaluation method and changed ReadMe (#1447) · c26a6ac7

Hanwool Albert Lee authored Feb 21, 2024



* update kmmlu default formatting

* Update _default_kmmlu_yaml

* Delete lm_eval/tasks/kmmlu/utils.py

* new tasks implemented

* add direct tasks

* update direct evaluate

* update direct eval

* add cot sample

* update cot

* add cot

* Update _cot_kmmlu_yaml

* add kmmlu90

* Update and rename _cot_kmmlu.yaml to _cot_kmmlu_yaml

* Create kmmlu90.yaml

* Update _cot_kmmlu_yaml

* add direct

* Update _cot_kmmlu_yaml

* Update and rename kmmlu90.yaml to kmmlu90_cot.yaml

* Update kmmlu90_direct.yaml

* add kmmlu hard

* Update _cot_kmmlu_yaml

* Update _cot_kmmlu_yaml

* update cot

* update cot

* erase typo

* Update _cot_kmmlu_yaml

* update cot

* Rename dataset to match k-mmlu-hard

* removed kmmlu90

* fixed name 'kmmlu_cot' to 'kmmlu_hard_cot' and revised README

* applied pre-commit before pull requests

* rename datasets and add notes

* Remove DS_Store cache

* Update lm_eval/tasks/kmmlu/README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Change citations and reflect reviews on version

* Added kmmlu_hard and fixed other errors

* fixing minor errors

* remove duplicated

* Rename files

* try ".index"

* minor fix

* minor fix again

* fix revert.

* minor fix. thanks to hailey

---------
Co-authored-by: GUIJIN SON <spthsrbwls123@yonsei.ac.kr>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

c26a6ac7

20 Feb, 2024 3 commits

Add a new task GPQA (the part without CoT) (#1434) · 5ab295c8

Uanu authored Feb 21, 2024



* add new task GPQA_n_shot

* add new task GPQA_zeroshot

* correct GPQA_zeroshot filename

* Randomly shuffle choices

* Correct missing parentheses

* delete wrong tasks

* Add README

* Update lm_eval/tasks/gpqa/zeroshot/_gpqa_zeroshot_yaml

* Update lm_eval/tasks/gpqa/n_shot/utils.py

* Update lm_eval/tasks/gpqa/n_shot/utils.py

* Update lm_eval/tasks/gpqa/README.md

* placate linter

* linter

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
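
The "randomly shuffle choices" step above could look roughly like the following. This is a minimal sketch, not the code from the PR: the function name `shuffle_choices`, its parameters, and the sample data are hypothetical. It shows why the index of the correct answer has to be recomputed after shuffling.

```python
import random
from typing import List, Tuple


def shuffle_choices(
    correct: str, distractors: List[str], seed: int
) -> Tuple[List[str], int]:
    """Shuffle answer options and return them with the correct answer's new index.

    Hypothetical helper for illustration; assumes `correct` does not also
    appear among the distractors.
    """
    rng = random.Random(seed)  # seeded so the ordering is reproducible
    choices = [correct] + list(distractors)
    rng.shuffle(choices)
    return choices, choices.index(correct)


choices, answer_idx = shuffle_choices("Paris", ["Rome", "Berlin", "Madrid"], seed=0)
```

Seeding a local `random.Random` instance keeps the ordering stable across runs without touching the global random state.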

5ab295c8

Group reqs by context (#1425) · 45941c67

Baber Abbasi authored Feb 20, 2024



* add key lookup for same contexts

* nit

* appease pre-commit

* nit

* use `expand` (in-place view) rather than `repeat`

* try mixed grouping

* add docs.

* nit

* nit

* nits

* fix tests

* Move greedy_tokens calculation out of cache loop

* nit

* nits

* add test

* nits

* fix name conflict

* fix name conflict

* chunk tensor

* move Collator

* nits/docstring

* fixup

* fixup

* group contexts only for decoders

* pre-commit

* fix `generate_until` test

* fix `generate_until` test

* Update lm_eval/models/huggingface.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* add docs

* nit

* add docs

* add docs

* add 'logits_cache' arg

* bugfix

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
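
The "key lookup for same contexts" idea can be sketched as follows. This is an illustration under stated assumptions, not the harness's Collator: `group_by_context` and the sample requests are hypothetical, and requests are modelled as plain `(context, continuation)` string pairs.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_context(requests: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (context, continuation) pairs by their shared context string.

    Hypothetical helper: with this grouping, a model would only need to
    encode each distinct context once and could score all of its
    continuations against that single forward pass.
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for context, continuation in requests:
        grouped[context].append(continuation)
    return dict(grouped)


requests = [
    ("Q: 2+2=", " 4"),
    ("Q: 2+2=", " 5"),
    ("Q: capital of France?", " Paris"),
]
groups = group_by_context(requests)
```

Here three requests collapse into two groups, so the repeated context is looked up by key rather than processed twice.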

45941c67

Add a new task HaeRae-Bench (#1445) · 8680e938

Hanwool Albert Lee authored Feb 20, 2024

* haerae_reimplementation

* edited Readme and add few_shot settings

* edited readme

* newlines at end of each file

* Modifying the README file

* applied pre-commit

8680e938

19 Feb, 2024 2 commits

update bbh, gsm8k, mmlu parsing logic and prompts (Orca2 bbh_cot_zeroshot 0% -> 42%) (#1356) · 89deeeaf

thnkinbtfly authored Feb 20, 2024



* update bbh, gsm8k, mmlu parsing logic and prompts

* remove the formatting prompt (bbh) + minor update (mmlu)

* update bbh, gsm8k, mmlu zeroshot, revert fewshots

* update bbh, gsm8k, mmlu version, forward changes to gsm8k-cot

* remove take_last, update to use docs parameters

* add newline

* ruff formatting

* Update pyproject.toml

* fix format

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

89deeeaf

Correct typo in task name (#1443) · 19cbb292
larekrow authored Feb 19, 2024

19cbb292

15 Feb, 2024 1 commit
- Update README.md (#1430) · f3b79170
  David Hoffmann authored Feb 15, 2024
  
  f3b79170
13 Feb, 2024 1 commit
- Fix: task weighting by subtask size ; update Pooled Stderr formula slightly (#1427) · 620d6a15
  Hailey Schoelkopf authored Feb 13, 2024
```
* fix weight_by_size condition

* add tests, update stderr formula slightly

* apply pre-commit
```
  620d6a15
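
Weighting by subtask size can be sketched as below. The commit message does not state the formulas, so this is only an assumed, generic size-weighted mean; `weighted_mean` and the sample numbers are hypothetical, and the pooled stderr update is not reproduced here.

```python
from typing import List


def weighted_mean(scores: List[float], sizes: List[int]) -> float:
    """Average subtask scores, weighting each by its number of documents.

    Hypothetical helper for illustration only.
    """
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")
    total = sum(sizes)
    if total == 0:
        raise ValueError("total size must be positive")
    return sum(score * size for score, size in zip(scores, sizes)) / total


# Two subtasks: 300 docs at 0.5 accuracy, 100 docs at 1.0 accuracy.
group_score = weighted_mean([0.5, 1.0], [300, 100])
unweighted = sum([0.5, 1.0]) / 2
```

With these numbers the size-weighted score is 0.625, while the plain average of the two subtask scores is 0.75, which is why the weighting condition matters for group results.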
12 Feb, 2024 1 commit
- [m_mmlu] added multilingual evaluation from alexandrainst/m_mmlu (#1358) · b69c67c1
  Alessandro Ercolani authored Feb 12, 2024
  
  b69c67c1
11 Feb, 2024 2 commits
- Add multilingual TruthfulQA task (#1420) · 7397b965
  Uanu authored Feb 11, 2024
  
  7397b965
- Add multilingual ARC task (#1419) · 0256c682
  Uanu authored Feb 11, 2024
  
  0256c682
02 Feb, 2024 1 commit
- Fix for https://github.com/EleutherAI/lm-evaluation-harness/issues/1383 (#1384) · 9a902155
  Pasquale Minervini authored Feb 02, 2024
```
Fixes https://github.com/EleutherAI/lm-evaluation-harness/issues/1383

If this is okay, it will need to be propagated to SCROLLS
```
  9a902155
01 Feb, 2024 2 commits

Faster Task and Group Loading, Allow Recursive Groups (#1321) · d714fc95

Lintang Sutawika authored Feb 01, 2024

* add trust_remote_code as default

* task for testing recursive

* changed source of ALL_TASKS

* tasks should only accept TaskObjects

* initialize_tasks returns list of tasks and groups

* remove trust_remote_code for now

* moved constructor process to inside load_yaml_config

* more comprehensive way to index tasks and groups

* pre-commit format

* add exit after error

* adjust how task objects are called

* no need to use get_task_dict

* load_task_or_group works but only for tasks

* pre-commit format

* half working for nested groups

* changed variable names

* allow groups and tasks to work

* temp save

* indexing and loading are part of a task_manager object

* adapted initialize_tasks

* iron out bugs

* fixed typo

* fixed typo

* simplified code

* further tidy up

* remove lines for testing

* removed test lines

* removed unused code

* remove unused import

* fixed bu...

d714fc95

Enable override of printed `n-shot` in table (#1379) · 17191063
Hailey Schoelkopf authored Feb 01, 2024
```
* allow tasks to specify printed fewshot val

* fix to belebele

* update metadata field's documentation
```
17191063

31 Jan, 2024 1 commit

Fix unintuitive `--gen_kwargs` behavior (#1329) · bd7d265a

Hailey Schoelkopf authored Jan 31, 2024

* don't override do_sample if no value for it is passed

* Update gen_kwargs override condition

* Update huggingface.py

* Update huggingface.py

* run linters

* silence an erroneous warning
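
The first bullet above ("don't override do_sample if no value for it is passed") can be illustrated with a small sketch. It assumes greedy decoding should only be forced when the temperature is zero and the user gave no `do_sample` value; `resolve_do_sample` is a hypothetical name, not a function in the harness.

```python
from typing import Any, Dict


def resolve_do_sample(gen_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of gen_kwargs with `do_sample` filled in only when safe.

    Hypothetical helper: an explicit `do_sample` from the user is never
    overwritten; greedy decoding is chosen only for temperature 0.
    """
    kwargs = dict(gen_kwargs)  # do not mutate the caller's dict
    if kwargs.get("temperature") == 0.0 and "do_sample" not in kwargs:
        kwargs["do_sample"] = False
    return kwargs


greedy = resolve_do_sample({"temperature": 0.0})
explicit = resolve_do_sample({"temperature": 0.0, "do_sample": True})
sampling = resolve_do_sample({"temperature": 0.7})
```

Only the first call gains `do_sample=False`; the other two dictionaries come back unchanged.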

bd7d265a

28 Jan, 2024 1 commit

Apply some best practices and guideline recommendations to code (#1363) · 488759d2

LSinev authored Jan 28, 2024

* raise Exception, not a string

Additional info https://peps.python.org/pep-0352/#exception-hierarchy-changes
https://docs.python.org/3.8/tutorial/errors.html#raising-exceptions

* Apply PEP8 recommendation to prefer isinstance

"Object type comparisons should always use isinstance() instead of comparing types directly"
https://peps.python.org/pep-0008/

* Remove dangerous default mutable values in arguments

https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/dangerous-default-value.html

* Format logging messages with f-strings (not with `format`)

Additional info
https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/logging-format-interpolation.html
There are also discussions about formatting speed while logging and about unintended code execution
https://github.com/pylint-dev/pylint/issues/2395
https://stackoverflow.com/a/54368109
but at least a single format (f-strings) will be used throughout the project

* Specify utf-8 encoding for `open` explicitly

If not specified, the default encoding may differ across environments, OSes, and Python versions. See
https://peps.python.org/pep-0597/
https://docs.python.org/3.11/library/locale.html#locale.getencoding
https://docs.python.org/3.10/library/os.html#utf8-mode
https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/unspecified-encoding.html

This also helps when code from English-language tasks is reused as a starting point for tasks in non-English languages.

* Use inline-ignoring comments to pass pre-commit instead of identity process

https://flake8.pycqa.org/en/3.0.1/user/ignoring-errors.html#in-line-ignoring-errors
https://www.flake8rules.com/rules/F841.html

flake8 comments are supported by ruff: https://docs.astral.sh/ruff/linter/#error-suppression
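
Several of the recommendations above can be shown together in one short sketch. The function names and sample data are hypothetical and are not taken from the harness; the block only illustrates the general Python practices the commit message cites.

```python
import logging
import os
import tempfile
from typing import List, Optional

logger = logging.getLogger(__name__)


def append_item(item: str, items: Optional[List[str]] = None) -> List[str]:
    """Use None instead of a mutable default such as `items=[]`."""
    if items is None:
        items = []  # a fresh list per call, never shared between calls
    items.append(item)
    return items


def describe(value: object) -> str:
    """Prefer isinstance() over comparing types directly."""
    if isinstance(value, str):
        return "string"
    # Raise an exception instance, never a bare string.
    raise TypeError(f"unsupported type: {type(value).__name__}")


def roundtrip(text: str) -> str:
    """Write and read a file with an explicit utf-8 encoding."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        with open(path, encoding="utf-8") as f:
            content = f.read()
    logger.info(f"read {len(content)} characters")  # f-string style logging
    return content
```

Calling `append_item("a")` and then `append_item("b")` returns two independent one-element lists, which is the behaviour a mutable default argument would break.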

488759d2

25 Jan, 2024 1 commit

`Filter` docs not offset by `doc_id` (#1349) · a0f1cacd

Baber Abbasi authored Jan 26, 2024

* get `doc` from instance

* accelerate bugfix: get ground doc from instance

* convert filter to `process_result`

* get docs from instances in `FilterEnsemble`

* rename

* nit

* better looping

* fix typehint

a0f1cacd

23 Jan, 2024 2 commits

Don't use `get_task_dict()` in task registration / initialization (#1331) · 969b48bf

Hailey Schoelkopf authored Jan 23, 2024



* don't use get_task_dict() as a helper, it will download the dataset!

* pre-commit

* Update README.md

---------
Co-authored-by: lintangsutawika <lintang@eleuther.ai>

969b48bf

Update migrated HF dataset paths (#1332) · 45a8f709

Hailey Schoelkopf authored Jan 22, 2024



* Update arc_easy.yaml

* Update flan_cot.yaml

* update HF dataset path

* Update freeform.yaml

* Update flan_cot.yaml

---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

45a8f709

19 Jan, 2024 1 commit
- Update polemo2_in.yaml (#1318) · 4477d572
  Lintang Sutawika authored Jan 20, 2024
  
  4477d572
18 Jan, 2024 3 commits
- Fix group register (#1315) · 72ea626e
  Lintang Sutawika authored Jan 19, 2024
```
* tuple should be considered as well

* set option to keep callable as callable
```
  72ea626e
- Fix polemo2_in.yaml config name (#1313) · b8cbc425
  Quentin Lhoest authored Jan 18, 2024
  
  b8cbc425
- Update nq_open.yaml (#1305) · 10488d0d
  Hannibal046 authored Jan 18, 2024
```
* Update nq_open.yaml

change regex

* Bump NQ version

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
```
  10488d0d
16 Jan, 2024 1 commit
- Update nq_open.yaml (#1289) · 032e879b
  Hailey Schoelkopf authored Jan 15, 2024
  
  032e879b
15 Jan, 2024 2 commits
- Allow parameter edits for registered tasks when listed in a benchmark (#1273) · 03e7df51
  Lintang Sutawika authored Jan 15, 2024
```
* benchmark yamls allow minor edits of already registered tasks

* add documentation

* removed print
```
  03e7df51
- fix whitespace in target + prompt for CoT gsm8k (#1275) · ace4393e
  Hailey Schoelkopf authored Jan 15, 2024
  
  ace4393e
12 Jan, 2024 1 commit

add Kobest (#1263) · 653217a7

jp authored Jan 12, 2024

* Add: kobest config file

* Add: kobest utils

* Add: README

* Update utils.py

653217a7

11 Jan, 2024 2 commits

Fix bug in multi-token Stop Sequences (#1268) · ff739414
Hailey Schoelkopf authored Jan 11, 2024
```
* fix incorrect lookback protections

* bump generate_until task versions
```
ff739414

MultiMedQA (#1198) · 818c056b

Tanishq Abraham authored Jan 10, 2024



* multimedqa

* Update medqa.yaml

* move to benchmarks folder

* add README.md

---------
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>

818c056b

10 Jan, 2024 1 commit
- fixed belebele (#1267) · 9b0b15b1
  James A. Michaelov authored Jan 10, 2024
  
  9b0b15b1
05 Jan, 2024 1 commit

Add multilingual HellaSwag task (#1228) · 28bb45fb

JorgeDeCorte authored Jan 05, 2024



* add hellaswag_nl

* add other languages and update readme to hellaswag

* refactor as new task

* update readme

* add endline to yaml files and readme.md

* add group, change folder location and update yaml file

* rename default hellaswag yaml file

* fix whitespace error in some labels

* downgrade log level of whitespace checking

---------
Co-authored-by: JorgeDeCorte <jorge.decorte@ravago.be>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

28bb45fb

02 Jan, 2024 1 commit
- Update README.md (#1230) · a12ef445
  Pasquale Minervini authored Jan 02, 2024
  
  a12ef445
29 Dec, 2023 1 commit

Don't silence errors when loading tasks (#1148) · 34b563b1

Paul McCann authored Dec 30, 2023



* Add example failing task

This task includes an invalid import. This will cause an exception and
the task will not be loaded. But this just results in a DEBUG level log
message, so in normal usage you'll see no error, and will be told the
task doesn't exist.

Here's an example command line to run the task:

    python -m lm_eval --model hf --model_args pretrained=rinna/japanese-gpt-1b --tasks fail

This task is based on a Japanese Winograd task, but that's not
important, and was just used due to familiarity.

* Do not ignore errors when loading tasks

* Change how task errors are logged

This makes the changes proposed in the PR discussion.

1. Exceptions not related to missing modules/imports are logged as
   warnings.

2. module/import related exceptions are still logged at debug level, but
   if any of them happen there is a warning about it with instructions
   on how to show logs.

* Remove intentionally failing task

---------
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
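
The two-tier logging policy described above can be sketched like this. It is an illustration under assumptions, not the harness's task loader: `load_modules` and the logger name are hypothetical, and plain module imports stand in for task loading.

```python
import importlib
import logging
from typing import Dict, List

logger = logging.getLogger("task_loader_sketch")


def load_modules(names: List[str]) -> Dict[str, object]:
    """Import modules by name, reporting failures instead of hiding them.

    Hypothetical helper: import-related errors stay at DEBUG level but
    trigger one summary warning; any other exception is logged as a warning.
    """
    loaded: Dict[str, object] = {}
    import_failures = 0
    for name in names:
        try:
            loaded[name] = importlib.import_module(name)
        except ImportError as err:  # also covers ModuleNotFoundError
            import_failures += 1
            logger.debug(f"Skipping {name}: {err}")
        except Exception as err:
            logger.warning(f"Failed to load {name}: {err}")
    if import_failures:
        logger.warning(
            f"{import_failures} module(s) could not be imported; "
            "enable DEBUG logging to see details."
        )
    return loaded


loaded = load_modules(["json", "definitely_not_a_real_module_xyz"])
```

The summary warning means a user running at the default log level still learns that something was skipped and how to find out why.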

34b563b1

27 Dec, 2023 1 commit

nits + fix siqa (#1216) · 6a1c19ed

Baber Abbasi authored Dec 27, 2023

* fix group

* siqa: default.yml -> default.yaml

* max_gen_toks -> self.max_gen_toks

* add ids to task tests

* fix siqa

* fix gen_kwargs for openai-chat

6a1c19ed

24 Dec, 2023 1 commit

Add remove_whitespace to FLD benchmark (#1206) · 8ffbe58a

MorishT authored Dec 24, 2023



* Add remove_whitespace to FLD benchmark

* bump task version

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

8ffbe58a

21 Dec, 2023 1 commit

Correctly Print Task Versioning (#1173) · 9cd79897

Hailey Schoelkopf authored Dec 21, 2023

* change version field formatting in metadata

* mention versioning in new task guide

* add instructions for changelog

* run linters

9cd79897

20 Dec, 2023 2 commits

Error in --num_fewshot option for K-MMLU Evaluation Harness (#1178) · 12f2c5ea
GUIJIN SON authored Dec 21, 2023
```
* update kmmlu default formatting

* Update _default_kmmlu_yaml

* Delete lm_eval/tasks/kmmlu/utils.py
```
12f2c5ea

Switch Linting to `ruff` (#1166) · 65b8761d

Baber Abbasi authored Dec 20, 2023

* add ruff and isort. remove black and flake8

* remove unnecessary dependencies

* remove dependency from table

* change order

* ran ruff

* check 3.9

* exclude evaluator

* update CI workflow

* use ruff config in pyproject.toml

* test

* add isort rules to ruff

* sort imports

* import `make_table`

* try stages for no-commit-to-branch

* turn on mypy for pre-commit

* test

* test

* test

* change no-commit-to-branch to default

* nits

* fixed dependency

65b8761d

19 Dec, 2023 1 commit

Fix Column Naming and Dataset Naming Conventions in K-MMLU Evaluation (#1171) · 9e03d9d0

seungduk.kim.2304 authored Dec 20, 2023



* Correct column names and dataset names

* Remove kmmlu_general_physics.yaml and kmmlu_korean_language.yaml

* Update _default_kmmlu_yaml

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

9e03d9d0