Commits · e6b798f941c5fb5129fb6e936e992e81b0acf05d · gaoqiong / lm-evaluation-harness

23 Jul, 2025 1 commit
- remove trust-remote-code in configs; fix escape sequences (#3180) · 314f7176
  Baber Abbasi authored Jul 23, 2025
```
* remove trust-remote-code

* add W605 rule
```
  314f7176
16 Jul, 2025 1 commit

`bbh_cot_fewshot`: Removed repeated "Let''s think step by step." text from bbh cot prompts (#3140) · c2be7211

philipdoldo authored Jul 16, 2025



* Removed the 'Let''s think step by step.' text from the start of the target entry in each of the samples to prevent this phrase from being repeated twice in the few-shot prompts and to match the behavior from the original bbh repository. Worth noting that this applied to only 26 out of 27 subtasks, the only one it did not apply to is boolean_expressions.yaml. When it comes to boolean_expressions.yaml, in my opinion there is an error in that it doesn't say the 'Remember that (i) ...' text after the final 'A: Let's think step by step.' in the prompt. Models like EleutherAI/gpt-neo-125m seem to always begin answers with this string anyway (copying what was done in the few-shot prompts), but I think it really should've been part of the prompt, much like how 'A: Let's think step by step.' is included in the prompt for all of the cot tasks. However, the original bbh repo also has this issue, so I think it is fine to keep it this way for consistency, but just thought I'd point it out anyway.

* feat: remove extra space from answers; add changelog

---------
Co-authored-by: Baber <baber@hey.com>

c2be7211

26 Mar, 2025 1 commit
- changed dataset to parquet version (#2845) · 908ac2b2
  Baber Abbasi authored Mar 26, 2025
  
  908ac2b2
09 Aug, 2024 1 commit

keep new line for task description (#2116) · 8ad598df

Jungwhan Kim authored Aug 10, 2024



* add keep trailing newline

* apply ruff-format

* add prompt unit test

* increment the version of tasks that have description with whitespace

* remove white spaces of leaderboard bbh

* update MMLU expected versions in output

* CI run does display the expected version=1 for mmlu subtasks, fix expected test output again

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

8ad598df

08 Jul, 2024 1 commit

Group agg rework (#1741) · 517aadc4

Lintang Sutawika authored Jul 08, 2024



* add greoup_config arg

* add a group config that allows disabling table for group score and group aggregate in general

* fixed size configuration

* adjust config

* add group config

* adjust mmlu to use group_config

* fixed args input in aggregate_subtask_metrics

* fixed issues related to printing alias of group and updated yaml

* update all mmlu variants to include group_config

* edit format

* modify mmlu tasks

* adjust group to also be a configurable group

* add configurable group

* simplify get_task_list

* adjust group scoring with using ConfigurableGroup

* adjust args

* update mmlu

* update mmlu

* update to work with new group and task configuration

* readd group_agg

* readd files

* move prepare_print_tasks to evaluator_utils

* sort set to False by default, fix predict_only arg

* add version for groups

* reversed task list

* update additional condition when loading a group in a group yaml

* update truthfulqa

* add description regarding tags replacing group

* replace group to tag

* fixed conditional statement

* remove warning

* update loading of task group and newly added tags

* reformat with pre-commit

* fixed info log

* update

* fix bug

* fix bug

* use task id to differentiate tasks

* convert all groups to configurable groups

* use task_id

* reformat

* add task_id for python tasks as well

* add task_id for python tasks as well

* add task_id for python tasks as well

* revert truthfulqa

* revert mmlu tasks

* new mmlu config

* new group config parameter `tag_to_task`

* Update truthfulqa_mc2.yaml

* reformate

* add _process_group_config

* adjust task_id

* add get_subtask_list function to get proper subtask list

* group config to_dict update

* remove tag check

* update mmlu

* fix config passing issues

* add test yaml

* format fix

* add documentation

* corner case for single tag being called

* fix indentation

* formatting

* update all mmlu variants

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove group_alias

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove version for metadata

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* update mmlu/

* removed " " in make_table

* change how aggregate_metric is loaded

* change how aggregate_metric is loaded

* update aggregate_metric arg

* update format

* update format

* some docs fixes

* add groups for agieval, aexams, aclue

* add more explicit aggregation groups

* add more groupings / tags distinctions

* add more groupings

* more groupings

* add many explicit group configs

* add many explicit group configs

* add more explicit group configs

* add more explicit group configs

* add more error msgs, agg_metric -> agg_metric_list

* some docs updates

* update task_id to be updateable and uses group:task format

* make KMMLU a tag for now

* update docs

* don't duplicate task names

* fix merge conflicts?

* giving this a try

* clean up diff

* switch mmlu variants over to using

* don't use to-be-deprecated group: config field in overview notebook

* Python tasks which subclass ConfigurableTask now run

* update mmlu

* pre-commit format

* fixed sorting for multi-level printing

* move group api to separate file

* fix bbh aggregation filter usage

* track api/group.py

* adjust group and tags loading

* make explicit group configs for leaderboard and other newer tasks

* fix arabicmmlu

* update

* change arabicmmlu template name???

* update group alias

* fix printing bugs

* check table printing is correct ; update tests

* use mmlu_stem to have a group included in print tests

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

517aadc4

13 Jun, 2024 1 commit

`samples` is newline delimited (#1930) · 3850e21a

Baber Abbasi authored Jun 13, 2024



* `samples` is newline delimited

* updated git and pre-commit

* appease pre-commit

* nit

* Revert back for now

* Revert for now

---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

3850e21a

31 May, 2024 1 commit

Making hardcoded few shots compatible with the chat template mechanism (#1895) · 4902aaaf

Clémentine Fourrier authored May 31, 2024



* init test 1

* fix

* this format seems to be working - need to update all other tasks with the new format

* bbh with few shot format

* fix fewshot bbh

* add mmlu flan cot

* samples of cot

* kmmlu

* fix gsm8k

* update keys for mmlu

* minerva math

* bbh

* fix

* fix samples

* small fixes to templates

* last prompt format change

* fixing prompt

* fixed minerva math format

* rm accidental commited file

* added doc for few shot samples

* Update lm_eval/loggers/evaluation_tracker.py

* Update lm_eval/loggers/evaluation_tracker.py

* Update docs/new_task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* added check in sampler per code review

* added the system from a function, plus an example in minerva math

* style

* Apply suggestions from code review
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix unit tests 1

* forcing use of test split

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

4902aaaf

26 Feb, 2024 1 commit
- Apply code autoformatting with Ruff to tasks/*.py an *__init__.py (#1469) · d27c0c08
  LSinev authored Feb 26, 2024
  
  d27c0c08
20 Feb, 2024 1 commit

Group reqs by context (#1425) · 45941c67

Baber Abbasi authored Feb 20, 2024



* add key lookup for same contexts

* nit

* appease pre-commit

* nit

* use `expand` (in-place view) rather than `repeat`

* try mixed grouping

* add docs.

* nit

* nit

* nits

* fix tests

* Move greedy_tokens calculation out of cache loop

* nit

* nits

* add test

* nits

* fix name conflict

* fix name conflict

* chunk tensor

* move Collator

* nits/docstring

* fixup

* fixup

* group contexts only for decoders

* pre-commit

* fix `generate_until` test

* fix `generate_until` test

* Update lm_eval/models/huggingface.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* add docs

* nit

* add docs

* add docs

* add 'logits_cache' arg

* bugfix

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

45941c67

19 Feb, 2024 1 commit

update bbh, gsm8k, mmlu parsing logic and prompts (Orca2 bbh_cot_zeroshot 0% -> 42%) (#1356) · 89deeeaf

thnkinbtfly authored Feb 20, 2024



* update bbh, gsm8k, mmlu parsing logic and prompts

* remove the formatting prompt (bbh) + minor update (mmlu)

* update bbh, gsm8k, mmlu zeroshot, revert fewshots

* update bbh, gsm8k, mmlu version, forward changes to gsm8k-cot

* remove take_last, update to use docs parameters

* add newline

* ruff formatting

* Update pyproject.toml

* fix format

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

89deeeaf

01 Feb, 2024 1 commit
- Enable override of printed `n-shot` in table (#1379) · 17191063
  Hailey Schoelkopf authored Feb 01, 2024
```
* allow tasks to specify printed fewshot val

* fix to belebele

* update metadata field's documentation
```
  17191063
28 Jan, 2024 1 commit

Apply some best practices and guideline recommendations to code (#1363) · 488759d2

LSinev authored Jan 28, 2024

* raise Exception, not a string

Additional info https://peps.python.org/pep-0352/#exception-hierarchy-changes
https://docs.python.org/3.8/tutorial/errors.html#raising-exceptions

* Apply PEP8 recommendation to prefer isinstance

"Object type comparisons should always use isinstance() instead of comparing types directly"
https://peps.python.org/pep-0008/

* Remove dangerous default mutable values in arguments

https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/dangerous-default-value.html

* Format logging messages with fstring (not with format)

Additional info
https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/logging-format-interpolation.html
There are also discussions about the speed of formatting while logging or some unintended code executions
https://github.com/pylint-dev/pylint/issues/2395
https://stackoverflow.com/a/54368109
but at least one format (fstring one) will be used throughout the project

* Specify utf-8 encoding for `open` explicitly

If not specified, it may be supposed differently in different environments, OSes, and Python versions. See
https://peps.python.org/pep-0597/
https://docs.python.org/3.11/library/locale.html#locale.getencoding
https://docs.python.org/3.10/library/os.html#utf8-mode
https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/unspecified-encoding.html

Helps also if some code from English language tasks is taken as inspiration for tasks in non-English languages.

* Use inline-ignoring comments to pass pre-commit instead of identity process

https://flake8.pycqa.org/en/3.0.1/user/ignoring-errors.html#in-line-ignoring-errors
https://www.flake8rules.com/rules/F841.html

flake8 comments are supported by ruff: https://docs.astral.sh/ruff/linter/#error-suppression

488759d2

11 Jan, 2024 1 commit
- Fix bug in multi-token Stop Sequences (#1268) · ff739414
  Hailey Schoelkopf authored Jan 11, 2024
```
* fix incorrect lookback protections

* bump generate_until task versions
```
  ff739414
21 Dec, 2023 1 commit

Correctly Print Task Versioning (#1173) · 9cd79897

Hailey Schoelkopf authored Dec 21, 2023

* change version field formatting in metadata

* mention versioning in new task guide

* add instructions for changelog

* run linters

9cd79897

20 Dec, 2023 1 commit

Switch Linting to `ruff` (#1166) · 65b8761d

Baber Abbasi authored Dec 20, 2023

* add ruff and isort. remove black and flake8

* remove unnecessary dependencies

* remove dependency from table

* change order

* ran ruff

* check 3.9

* exclude evaluator

* update CI workflow

* use ruff config in pyproject.toml

* test

* add isort rules to ruff

* sort imports

* import `make_table`

* try stages for no-commit-to-branch

* turn on mypy for pre-commit

* test

* test

* test

* change no-commit-to-branch to default

* nits

* fixed dependency

65b8761d

14 Dec, 2023 1 commit
- fix: _generate_configs.py · c314246d
  momotori authored Dec 14, 2023
  
  c314246d
13 Dec, 2023 4 commits
- bump version on cot_fewshot tasks · a7707c76
  haileyschoelkopf authored Dec 13, 2023
  
  a7707c76
- update regeneration script, bump bbh_cot_fewshot version · 3fbdfea1
  haileyschoelkopf authored Dec 13, 2023
  
  3fbdfea1
- fix: enlarge max_gen_toks to make output of bbh_cot_fewshot complete · 7ec42165
  momotori authored Dec 13, 2023
  
  7ec42165
- fix: fix bug in the "doc_to_text" of BBH_cot_fewshot · 33dcbd49
  momotori authored Dec 13, 2023
  
  33dcbd49
07 Dec, 2023 4 commits
- Update _cot_zeroshot_template_yaml · ba7ba910
  Hailey Schoelkopf authored Dec 06, 2023
  
  ba7ba910
- Update _fewshot_template_yaml · a2cc877b
  Hailey Schoelkopf authored Dec 06, 2023
  
  a2cc877b
- Update _zeroshot_template_yaml · 361ba192
  Hailey Schoelkopf authored Dec 06, 2023
  
  361ba192
- Update _cot_fewshot_template_yaml · a6d28ea7
  Lintang Sutawika authored Dec 07, 2023
```
BBH cot fewshot already has fewshot examples in the description. So num_fewshot needs to be set to 0 so that users won't mistakenly set other num_fewshot values.
```
  a6d28ea7
28 Nov, 2023 3 commits
- reformat · 37a46351
  lintangsutawika authored Nov 28, 2023
  
  37a46351
- add versions · 0d03a9f3
  lintangsutawika authored Nov 28, 2023
  
  0d03a9f3
- removed flan from the naming · 3b9640b8
  lintangsutawika authored Nov 28, 2023
  
  3b9640b8
27 Nov, 2023 2 commits
- 'flan' in groupnames optional for all variants · 1e5ca7a4
  haileyschoelkopf authored Nov 27, 2023
  
  1e5ca7a4
- make 'bbh' a group name for flan cot 3-shot, add stopseqs · e408c2ce
  haileyschoelkopf authored Nov 27, 2023
  
  e408c2ce
17 Oct, 2023 1 commit
- change all mentions of `greedy_until` to `generate_until` · c64bf9a9
  lintangsutawika authored Oct 17, 2023
  
  c64bf9a9
05 Oct, 2023 1 commit
- removed print messages, added cot extraction strings · 20a54b3a
  lintangsutawika authored Oct 05, 2023
  
  20a54b3a
26 Sep, 2023 1 commit
- update loading prompts · b2d16321
  lintangsutawika authored Sep 26, 2023
  
  b2d16321
21 Sep, 2023 1 commit
- update · f0d8b559
  lintangsutawika authored Sep 21, 2023
  
  f0d8b559
05 Sep, 2023 1 commit
- edit to fix cot filter · 7601d828
  lintangsutawika authored Sep 05, 2023
  
  7601d828
04 Sep, 2023 6 commits
- fixed template · 4f5b72bc
  lintangsutawika authored Sep 04, 2023
  
  4f5b72bc
- udpate · 03be40e2
  lintangsutawika authored Sep 04, 2023
  
  03be40e2
- updates · e795efcf
  lintangsutawika authored Sep 04, 2023
  
  e795efcf
- add flan_zeroshot · 0d195e90
  lintangsutawika authored Sep 04, 2023
  
  0d195e90
- add flan_fewshot · 3531d9c1
  lintangsutawika authored Sep 04, 2023
  
  3531d9c1
- add flan_cot_zeroshot · c06b0d6e
  lintangsutawika authored Sep 04, 2023
  
  c06b0d6e