Commits · mmlu-test · gaoqiong / lm-evaluation-harness

20 Feb, 2024 3 commits
- updated to appease the pre-commit · f692caa9
  lintangsutawika authored Feb 20, 2024
  
  f692caa9
- merged with latest update · ab96fc7e
  lintangsutawika authored Feb 20, 2024
  
  ab96fc7e
- Add a new task HaeRae-Bench (#1445) · 8680e938
  Hanwool Albert Lee authored Feb 20, 2024
```
* haerae_reimplementation

* edited Readme and add few_shot settings

* edited readme

* newlines at end of each files

* Modifying the README file

* applied pre-commit
```
  8680e938
19 Feb, 2024 2 commits

update bbh, gsm8k, mmlu parsing logic and prompts (Orca2 bbh_cot_zeroshot 0% -> 42%) (#1356) · 89deeeaf

thnkinbtfly authored Feb 20, 2024



* update bbh, gsm8k, mmlu parsing logic and prompts

* remove the formatting prompt (bbh) + minor update (mmlu)

* update bbh, gsm8k, mmlu zeroshot, revert fewshots

* update bbh, gsm8k, mmlu version, forward changes to gsm8k-cot

* remove take_last, update to use docs parameters

* add newline

* ruff formatting

* Update pyproject.toml

* fix format

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

89deeeaf

Correct typo in task name (#1443) · 19cbb292
larekrow authored Feb 19, 2024

19cbb292

18 Feb, 2024 1 commit
- improve hf_hub activation (#1438) · a604f05c
  Michael Feil authored Feb 18, 2024
  
  a604f05c
15 Feb, 2024 1 commit
- Update README.md (#1430) · f3b79170
  David Hoffmann authored Feb 15, 2024
  
  f3b79170
14 Feb, 2024 1 commit
- Refactor utilities into a separate model utils file. (#1429) · 2d0a6460
  Baber Abbasi authored Feb 14, 2024
  
  2d0a6460
13 Feb, 2024 1 commit
- Fix: task weighting by subtask size ; update Pooled Stderr formula slightly (#1427) · 620d6a15
  Hailey Schoelkopf authored Feb 13, 2024
```
* fix weight_by_size condition

* add tests, update stderr formula slightly

* apply pre-commit
```
  620d6a15
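The effect of weighting a group score by subtask size can be illustrated with a size-weighted mean. This is a minimal sketch, assuming per-subtask scores and sample counts are already available; `aggregate_mean` is a hypothetical helper, not a function from the harness.

```python
def aggregate_mean(scores, sizes, weight_by_size=True):
    """Combine per-subtask scores into one group score.

    Hypothetical helper for illustration only.
    """
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")
    if not weight_by_size:
        # Every subtask counts equally, however many samples it has.
        sizes = [1] * len(scores)
    total = sum(sizes)
    return sum(score * size for score, size in zip(scores, sizes)) / total


# Two subtasks: 100 samples at 0.5 accuracy, 300 samples at 0.9 accuracy.
weighted = aggregate_mean([0.5, 0.9], [100, 300])
unweighted = aggregate_mean([0.5, 0.9], [100, 300], weight_by_size=False)
```

With weighting enabled the larger subtask dominates the group score; with it disabled both subtasks contribute equally.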
12 Feb, 2024 3 commits
- Added seeds to `evaluator.simple_evaluate` signature (#1412) · bfbd0325
  Amine Elhattami authored Feb 12, 2024
```
* Added seeds to `evaluator.simple_evaluate` signature

* Added  CLI argument

* Updated  to add  arg.
```
  bfbd0325
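Exposing seeds makes an evaluation run repeatable. The sketch below shows the general idea with the standard library only; `seeded_sample` is a hypothetical helper and does not reproduce the `evaluator.simple_evaluate` signature.

```python
import random


def seeded_sample(population, k, seed):
    """Draw k items reproducibly by using a private, seeded generator.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # local generator, leaves global state alone
    return rng.sample(population, k)


docs = list(range(100))
first = seeded_sample(docs, 5, seed=1234)
second = seeded_sample(docs, 5, seed=1234)
```

Both calls return the same five items because they start from the same seed.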
- [m_mmul] added multilingual evaluation from alexandrainst/m_mmlu (#1358) · b69c67c1
  Alessandro Ercolani authored Feb 12, 2024
  
  b69c67c1
- update latest · bf2517cc
  lintangsutawika authored Feb 12, 2024
  
  bf2517cc
11 Feb, 2024 3 commits

Add multilingual TruthfulQA task (#1420) · 7397b965
Uanu authored Feb 11, 2024

7397b965
Add multilingual ARC task (#1419) · 0256c682
Uanu authored Feb 11, 2024

0256c682

Evaluate (#1385) · 1ff84897

Baber Abbasi authored Feb 11, 2024

* un-exclude `evaluate.py` from linting

* readability

* readability

* add task name to build info message

* fix link

* nit

* add functions for var and mean pooling

* add functions for var and mean pooling

* metadata compatibility with task

* rename `override_config` to `set_config` and move to `Task`

* add unit test

* nit

* nit

* bugfix

* nit

* nit

* nit

* add docstrings

* fix metadata-fewshot

* revert metric refactor

* nit

* type checking

* type hints

* type hints

* move `override_metric` to `Task`

* change metadata

* change name

* pre-commit

* rename

* remove

* remove

* `override_metric` backwards compatible with `Task`

* type hints

* use generic

* type hint

1ff84897

10 Feb, 2024 2 commits

Fix watchdog timeout (#1404) · 1e6825da
Jeevan authored Feb 10, 2024
```
* Fix watchdog timeout

* Pre-commit fix

* Timedelta
```
1e6825da

Fixes https://github.com/EleutherAI/lm-evaluation-harness/issues/1416 (#1418) · 921eab86

Pasquale Minervini authored Feb 10, 2024

* Fixes https://github.com/EleutherAI/lm-evaluation-harness/issues/1416

Sets `do_sample = False` if `temperature == 0.0` and `do_sample = None`

* Update huggingface.py

* Update huggingface.py

making linter happy

921eab86
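The rule stated in this commit message can be sketched as a small function. The helper name `resolve_do_sample` and the plain-dict layout are assumptions made for illustration; the actual change lives in `huggingface.py`.

```python
def resolve_do_sample(generation_kwargs):
    """Return a copy of the kwargs with greedy decoding made explicit.

    Hypothetical helper: the name and dict layout are assumptions.
    """
    kwargs = dict(generation_kwargs)
    temperature = kwargs.get("temperature")
    do_sample = kwargs.get("do_sample")
    # Only fill in do_sample when the caller left it unset.
    if temperature == 0.0 and do_sample is None:
        kwargs["do_sample"] = False
    return kwargs


greedy = resolve_do_sample({"temperature": 0.0})
explicit = resolve_do_sample({"temperature": 0.0, "do_sample": True})
sampled = resolve_do_sample({"temperature": 0.7})
```

An explicit `do_sample` value from the caller is left untouched; only the unset case is resolved.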

09 Feb, 2024 1 commit
- use reversed task hierarchy for print (#1414) · ab4dba8f
  Hailey Schoelkopf authored Feb 09, 2024
  
  ab4dba8f
07 Feb, 2024 1 commit
- `batch_size` with `auto` defaults to 1 if `No executable batch size found` (#1405) · 4c17c55c
  Pasquale Minervini authored Feb 07, 2024
```
Fixes https://github.com/EleutherAI/lm-evaluation-harness/issues/1323
```
  4c17c55c
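The fallback described in this commit title can be sketched as a halving search that gives up gracefully. This is an illustration under stated assumptions: `detect_batch_size` and `probe` are hypothetical names, and the loop is not the batch-size finder the harness actually calls.

```python
def detect_batch_size(probe, start=64):
    """Halve the batch size until `probe` succeeds; fall back to 1.

    `probe` is a hypothetical callable that raises RuntimeError when a
    batch of the given size does not fit.
    """
    batch_size = start
    while batch_size >= 1:
        try:
            probe(batch_size)
            return batch_size
        except RuntimeError:
            batch_size //= 2
    # No size worked: default to 1 instead of propagating the error.
    return 1


def fits_up_to_eight(batch_size):
    if batch_size > 8:
        raise RuntimeError("out of memory")


def never_fits(batch_size):
    raise RuntimeError("No executable batch size found")


found = detect_batch_size(fits_up_to_eight)
fallback = detect_batch_size(never_fits)
```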
06 Feb, 2024 4 commits

adding hf_transfer (#1400) · 756eeb6f

Michael Feil authored Feb 06, 2024



* add hf_transfer

* update dependencies

* Delete stale `[linting]` extra

* Update README.md with extras table

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

756eeb6f

Use Pooled rather than Combined Variance for calculating stderr of task groupings (#1390) · 94cc1850

Hailey Schoelkopf authored Feb 06, 2024

* update formula for stderr aggregation

* hack: see what happens when using stderr_for_metric bootstrapping on a group

* undo bootstrap_for_stderr test

* factor out variance-aggregation formulas into api.metrics

* fix failing tests

* remove stray print

* update comment

* further detail in comment

* add back initialize_tasks() call

* fix format

94cc1850
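A pooled standard error for a task grouping can be sketched from per-subtask standard errors and sample counts. The code below uses the textbook pooled sample variance; it is an assumption that this matches the harness closely, and `pooled_stderr` is a hypothetical name rather than the function factored into `api.metrics`.

```python
import math


def pooled_stderr(stderrs, sizes):
    """Pooled standard error of the mean across subtasks (textbook form).

    Hypothetical helper for illustration only.
    """
    if len(stderrs) != len(sizes):
        raise ValueError("stderrs and sizes must have the same length")
    # Recover each subtask's sample variance: s_i^2 = stderr_i^2 * n_i.
    variances = [se ** 2 * n for se, n in zip(stderrs, sizes)]
    # Pooled variance: sum((n_i - 1) * s_i^2) / (sum(n_i) - k).
    dof = sum(sizes) - len(sizes)
    pooled_var = sum((n - 1) * var for var, n in zip(variances, sizes)) / dof
    # Standard error of the mean over all samples in the group.
    return math.sqrt(pooled_var / sum(sizes))


group_stderr = pooled_stderr([0.1, 0.1], [100, 100])
```

Two subtasks with equal variance and equal size pool to the same variance, so the group standard error shrinks only because the combined sample is larger.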

Fix confusing `write_out.py` instructions in README (#1371) · df01adf6
Hailey Schoelkopf authored Feb 06, 2024

df01adf6

Update README.md (#1398) · 4d7d2f64

Michael Chen authored Feb 05, 2024

Add instructions for non-MacOS users on how to compile janitor_util.cpp so that janitor.py can use it.

4d7d2f64

05 Feb, 2024 3 commits
- Support for Inf2 optimum class [WIP] (#1364) · d17dcea0
  Michael Feil authored Feb 05, 2024
```
* initial commit

* remove overwrite bs

* adding neuronx dependencies

* Update README.md

* update neuronx
```
  d17dcea0
- merged with latest update from main · 8bca751c
  lintangsutawika authored Feb 05, 2024
  
  8bca751c
- merged with latest update from main · cb8889cc
  lintangsutawika authored Feb 05, 2024
  
  cb8889cc
02 Feb, 2024 2 commits
- fix on --task list (#1387) · 74119471
  Lintang Sutawika authored Feb 02, 2024
  
  74119471
- Fix for https://github.com/EleutherAI/lm-evaluation-harness/issues/1383 (#1384) · 9a902155
  Pasquale Minervini authored Feb 02, 2024
```
Fixes https://github.com/EleutherAI/lm-evaluation-harness/issues/1383

If this is okay, it will need to be propagated to SCROLLS
```
  9a902155
01 Feb, 2024 4 commits

Faster Task and Group Loading, Allow Recursive Groups (#1321) · d714fc95

Lintang Sutawika authored Feb 01, 2024



* add trust_remote_code as default

* task for testing recursive

* changed source of ALL_TASKS

* tasks should only accept TaskObjects

* initialize_tasks returns list of tasks and groups

* remove trust_remote_code for now

* moved constructor process to inside load_yaml_config

* more comprehensive way to index tasks and groups

* pre-commit format

* add exit after error

* adjust how task objects are called

* no need to use get_task_dict

* load_task_or_group works but only for tasks

* pre-commit format

* half working for nested groups

* changed variable names

* allow groups and tasks to work

* temp save

* indexing and loading are part of a task_manager object

* adapted initialize_tasks

* iron out bugs

* fixed typo

* fixed typo

* simplified code

* further tidy up

* remove lines for testing

* removed test lines

* removed unused code

* remove unused import

* fixed bug

* removed comments

* a group in a list of groups can accept parameter changes like `num_fewshot`

* check if config is task update

* add GroupConfig object

* edit test yaml

* remove args

* testing returning to python task list

* add weight_by_size config

* describe weight_by_size in docs

* fix weight by size potential error

* can load individual custom python class task

* moved import_function into the config loading file

* remove print lines

* add squadv2 yaml

* temporary scroll implementation

* revert back to use load_yaml_config but with modes

* fix group being loaded with a None

* reformat

* can load unregistered tasks from a group

* update scrolls

* edit scrolls multiplechoice task

* adjust class initialization

* fix initialization

* changed how to identify group and python tasks, fix logger

* allow loading "include" that is nested in a group config

* reworked flan benchmark

* allow duplicate task in the same group to co-exist

* process group_alias

* removed group_alias

* allow parameters set in group_config to apply to all tasks in tasklist

* add function, but comment for now

* reworked processing dict-base config

* fixed how configs in group are processed

* update to allow root group to have its alias used

* remove unused classes

* remove unused classes

* revert some parts to original

* forgot to change one variable

* adapt the new process to use get_task_dict

* fix for singular group call

* fix variable names

* add TaskManager into the evaluator

* format

* changed how dict tasks are loaded

* add docs

* Update docs/new_task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update evaluator.py

* Update evaluator.py

* remove groupconfig for now

* changed _config to config

* update interface.md to explain TaskManager

* added property functions

* adjusted logger

* update write_out.py

* updated tests

* added documentation and some modifications

* added docstring documentation

* precommit format

* updated task loading for tests

* updates tests

* changed arg order for load_yaml_config

* update to handle scrolls and edit log message

* remove unused lines

* return a list of task classes and not a dict

* Update __init__.py

* Delete lm_eval/tasks/benchmarks/test.yaml

* Update task.py

* Update lm_eval/utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update utils.py

* re-added old functions with new log message

* Update docs/new_task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update new_task_guide.md

* added info regarding `get_task_dict` and documentation

* add get_config for Task

* pre-commit formatting

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

d714fc95
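Recursive groups mean that a group may list other groups as members. A minimal sketch of expanding such a structure into leaf tasks follows; the dict layout and the name `flatten_group` are assumptions for illustration and do not reflect how `TaskManager` indexes tasks.

```python
def flatten_group(name, groups):
    """Expand a (possibly nested) group name into its leaf task names.

    `groups` maps a group name to its members; a member that is itself a
    key in `groups` is treated as a nested group. Hypothetical layout.
    """
    if name not in groups:
        return [name]  # a plain task, nothing to expand
    tasks = []
    for member in groups[name]:
        tasks.extend(flatten_group(member, groups))
    return tasks


groups = {
    "suite": ["reasoning", "task_c"],
    "reasoning": ["task_a", "task_b"],
}
leaf_tasks = flatten_group("suite", groups)
```

The sketch has no cycle detection, so a group that contains itself would recurse until Python's recursion limit is hit.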

Enable override of printed `n-shot` in table (#1379) · 17191063
Hailey Schoelkopf authored Feb 01, 2024
```
* allow tasks to specify printed fewshot val

* fix to belebele

* update metadata field's documentation
```
17191063
Hf: minor egde cases (#1380) · 994bdb3f
Baber Abbasi authored Feb 01, 2024
```
* edge cases where variable might not be assigned.

* type hint
```
994bdb3f

Expand docs, update CITATION.bib (#1227) · f5408b6b

Hailey Schoelkopf authored Feb 01, 2024



* Update CITATION.bib

* Create CONTRIBUTING.md

* add disclaimer re: multi node

* flesh out some sections more

* Flesh out contributor guide

* revert CITATION.bib

* appease pre-commit

---------
Co-authored-by: lintangsutawika <lintang@eleuther.ai>

f5408b6b

31 Jan, 2024 5 commits

add bypass metric (#1156) · f8203de1

Baber Abbasi authored Feb 01, 2024

* add bypass metric

* fixed `bypass` metric.

* add task attributes if predict_only

* add `predict_only` checks

* add docs

* added `override_metric`, `override_config` to `Task`

* nits

* nit

* changed --predict_only to generations; nits

* nits

* nits

* change gen_kwargs warning

* add note about `--predict_only` in README.md

* added `predict_only`

* move table to bottom

* nit

* change null aggregation to bypass (conflict)

* bugfix; default `temp=0.0`

* typo

f8203de1

Add support for RWKV models with World tokenizer (#1374) · 084b7050

Eugene Cheah authored Jan 31, 2024



* Add support for RWKV models with World tokenizer

The RWKV line of models with the World tokenizer does not allow the padding token to be configured, and has its value preset to 0.

This, however, fails all the "if set" checks and would cause the tokenizer to crash.

A tokenizer class name check was added in addition to a model type check, as there exist RWKV models which use the neox tokenizers.

* Update huggingface.py

Genericized so that this supports any RWKVWorld tokenizer, and added a fall-back in case the HF implementation name changes.

* Comply with formatting guidelines

* fix format

---------
Co-authored-by: Stella Biderman <stellabiderman@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

084b7050
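The failure mode described in this commit message comes from truthiness checks on a token id of 0. The sketch below contrasts the two patterns; both function names are hypothetical and the code is not taken from `huggingface.py`.

```python
def has_pad_token_truthy(pad_token_id):
    # Fragile pattern: an id of 0 is falsy, so it looks "unset".
    return bool(pad_token_id)


def has_pad_token(pad_token_id):
    # Safer pattern: only None means "unset".
    return pad_token_id is not None


rwkv_world_pad_id = 0  # preset value mentioned in the commit message
```

With a padding id of 0, the truthiness check reports the token as missing while the `is not None` check reports it as present.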

Make dependencies compatible with PyPI (#1378) · a0a2fec8
Hailey Schoelkopf authored Jan 31, 2024
```
* make deps not point to github urls

* formatting

* try making PyPI only run on tag pushes
```
a0a2fec8
Publish to pypi (#1194) · 0da0dcba
Anjor Kanekar authored Jan 31, 2024
```
* publish to pypi

* lint

* Update publish.yml

* minor
```
0da0dcba

Fix unintuitive `--gen_kwargs` behavior (#1329) · bd7d265a

Hailey Schoelkopf authored Jan 31, 2024

* don't override do_sample if no value for it is passed

* Update gen_kwargs override condition

* Update huggingface.py

* Update huggingface.py

* run linters

* silence an erroneous warning

bd7d265a

30 Jan, 2024 1 commit
- delay filter init; remove `*args` (#1369) · 1554066c
  Baber Abbasi authored Jan 30, 2024
```
* delay filter init; remove `*args`

* bugfix

* optimize

* type hint
```
  1554066c
29 Jan, 2024 1 commit
- serialize callable functions in config (#1367) · 7fc43656
  Baber Abbasi authored Jan 29, 2024
  
  7fc43656
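Callables in a config cannot be written to JSON directly. One way to handle this, shown purely as an assumption about the general approach and not as the harness's implementation, is to substitute a readable placeholder for each callable; `serialize_config` is a hypothetical name.

```python
import json


def serialize_config(config):
    """Dump a config dict to JSON, replacing callables with a readable name.

    Hypothetical helper for illustration only.
    """

    def _default(obj):
        # json.dumps calls this for any object it cannot encode itself.
        if callable(obj):
            return f"<function {getattr(obj, '__name__', repr(obj))}>"
        raise TypeError(f"cannot serialize {type(obj).__name__}")

    return json.dumps(config, default=_default, sort_keys=True)


def doc_to_text(doc):
    return doc["question"]


dumped = serialize_config({"task": "demo", "doc_to_text": doc_to_text})
```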
28 Jan, 2024 1 commit

Apply some best practices and guideline recommendations to code (#1363) · 488759d2

LSinev authored Jan 28, 2024

* raise Exception, not a string

Additional info https://peps.python.org/pep-0352/#exception-hierarchy-changes
https://docs.python.org/3.8/tutorial/errors.html#raising-exceptions

* Apply PEP8 recommendation to prefer isinstance

"Object type comparisons should always use isinstance() instead of comparing types directly"
https://peps.python.org/pep-0008/

* Remove dangerous default mutable values in arguments

https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/dangerous-default-value.html

* Format logging messages with fstring (not with format)

Additional info
https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/logging-format-interpolation.html
There are also discussions about the speed of formatting while logging or some unintended code executions
https://github.com/pylint-dev/pylint/issues/2395
https://stackoverflow.com/a/54368109
but at least one format (fstring one) will be used throughout the project

* Specify utf-8 encoding for `open` explicitly

If not specified, the default encoding may differ across environments, OSes, and Python versions. See
https://peps.python.org/pep-0597/
https://docs.python.org/3.11/library/locale.html#locale.getencoding
https://docs.python.org/3.10/library/os.html#utf8-mode
https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/unspecified-encoding.html

Helps also if some code from English language tasks is taken as inspiration for tasks in non-English languages.

* Use inline-ignoring comments to pass pre-commit instead of identity process

https://flake8.pycqa.org/en/3.0.1/user/ignoring-errors.html#in-line-ignoring-errors
https://www.flake8rules.com/rules/F841.html

flake8 comments are supported by ruff: https://docs.astral.sh/ruff/linter/#error-suppression

488759d2
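Several of the recommendations in this commit message can be shown together in a short sketch. The function names and the file name below are hypothetical and chosen only for illustration.

```python
import os
import tempfile


def add_tag(tag, tags=None):
    """Append a tag to a list, creating a fresh list when none is given."""
    # A mutable default such as `tags=[]` would be shared between calls,
    # so None is used as the sentinel instead.
    if tags is None:
        tags = []
    # Prefer isinstance over comparing type(tag) to str directly.
    if not isinstance(tag, str):
        # Raise an exception object, never a bare string.
        raise TypeError(f"tag must be a str, got {type(tag).__name__}")
    tags.append(tag)
    return tags


def read_text(path):
    # Name the encoding explicitly so the result does not depend on locale.
    with open(path, encoding="utf-8") as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("héllo")
    round_trip = read_text(sample_path)
```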