Commits · 7d09b24c626edb1d4898ca73a5b53fe6adb3bc08 · gaoqiong / lm-evaluation-harness

03 Jul, 2024 2 commits
- move group api to separate file · 94673d40
  haileyschoelkopf authored Jul 03, 2024
  
  94673d40
- pre-commit format · 96dfe976
  lintangsutawika authored Jul 03, 2024
  
  96dfe976
02 Jul, 2024 2 commits
- clean up diff · c1d9e625
  haileyschoelkopf authored Jul 02, 2024
  
  c1d9e625
- fix merge conflicts? · d13b1f56
  haileyschoelkopf authored Jul 02, 2024
  
  d13b1f56
27 Jun, 2024 1 commit
- update task_id to be updateable and uses group:task format · 43765669
  lintangsutawika authored Jun 27, 2024
  
  43765669
25 Jun, 2024 2 commits
- Remove `LM` dependency from `build_all_requests` (#2011) · 9b6179b2
  Baber Abbasi authored Jun 25, 2024
```
* refactored `lm.apply_chat_template`

* nit

* fix weird type error

* fixed!

* skip failing test

* pre-commit run all

* add type hints

* nit

* nit

* fixup
```
  9b6179b2
- add more error msgs, agg_metric -> agg_metric_list · 93c17c57
  haileyschoelkopf authored Jun 25, 2024
  
  93c17c57
13 Jun, 2024 1 commit

`samples` is newline delimited (#1930) · 3850e21a

Baber Abbasi authored Jun 13, 2024



* `samples` is newline delimited

* updated git and pre-commit

* appease pre-commit

* nit

* Revert back for now

* Revert for now

---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

3850e21a

10 Jun, 2024 5 commits
- update format · 5b527a71
  lintangsutawika authored Jun 10, 2024
  
  5b527a71
- change how aggregate_metric is loaded · 9fa3b3f4
  lintangsutawika authored Jun 10, 2024
  
  9fa3b3f4
- change how aggregate_metric is loaded · 80d0f412
  lintangsutawika authored Jun 10, 2024
  
  80d0f412
- remove version for metadata · 1848d664
  lintangsutawika authored Jun 10, 2024
  
  1848d664
- remove group_alias · e8f49184
  lintangsutawika authored Jun 10, 2024
  
  e8f49184
07 Jun, 2024 1 commit
- formatting · e6b1581f
  lintangsutawika authored Jun 07, 2024
  
  e6b1581f
06 Jun, 2024 1 commit
- add documentation · be8b547b
  lintangsutawika authored Jun 06, 2024
  
  be8b547b
03 Jun, 2024 1 commit

Add chat template (#1873) · 070d31df

KonradSzafer authored Jun 03, 2024



* initial chat template

* tokenizer attribute check

* variable rename

* interface update

* system instruction

* system inst default update

* fewshot as multiturn

* typing update

* indent update

* added comments

* Adding a fewshot in a more readable way

* linting

* Moved apply chat template to LM

* multiturn alternation fix

* cache key update

* apply chat template method fix

* add system prompt hash to cache_key

* tokenizer name property for cache_key

* property name fix

* linting backward compatibility fix

* docs and errors update

* add documentation on adding chat template compatibility to model_guide

* fewshot as multiturn check fix

* saving system inst and chat template in results

* eval tracker update

* docs update

* Apply suggestions from code review
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

070d31df

31 May, 2024 1 commit

Making hardcoded few shots compatible with the chat template mechanism (#1895) · 4902aaaf

Clémentine Fourrier authored May 31, 2024



* init test 1

* fix

* this format seems to be working - need to update all other tasks with the new format

* bbh with few shot format

* fix fewshot bbh

* add mmlu flan cot

* samples of cot

* kmmlu

* fix gsm8k

* update keys for mmlu

* minerva math

* bbh

* fix

* fix samples

* small fixes to templates

* last prompt format change

* fixing prompt

* fixed minerva math format

* rm accidental commited file

* added doc for few shot samples

* Update lm_eval/loggers/evaluation_tracker.py

* Update lm_eval/loggers/evaluation_tracker.py

* Update docs/new_task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* added check in sampler per code review

* added the system from a function, plus an example in minerva math

* style

* Apply suggestions from code review
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix unit tests 1

* forcing use of test split

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

4902aaaf

16 May, 2024 1 commit
- group config to_dict update · e66c5f57
  lintangsutawika authored May 16, 2024
  
  e66c5f57
15 May, 2024 1 commit
- adjust task_id · c90655d5
  lintangsutawika authored May 15, 2024
  
  c90655d5
11 May, 2024 3 commits
- new group config parameter `tag_to_task` · 41e64b2e
  lintangsutawika authored May 11, 2024
  
  41e64b2e
- add task_id for python tasks as well · f4d2e6e0
  lintangsutawika authored May 11, 2024
  
  f4d2e6e0
- add task_id for python tasks as well · 1ef5b0bf
  lintangsutawika authored May 11, 2024
  
  1ef5b0bf
10 May, 2024 3 commits
- use task_id · 39c40277
  lintangsutawika authored May 10, 2024
  
  39c40277
- fixed info log · 7350958b
  lintangsutawika authored May 10, 2024
  
  7350958b
- remove warning · 1fae7283
  lintangsutawika authored May 10, 2024
  
  1fae7283
08 May, 2024 2 commits
- replace group to tag · 2f2322b9
  lintangsutawika authored May 08, 2024
  
  2f2322b9
- add description regarding tags replacing group · 3f770bb6
  lintangsutawika authored May 08, 2024
  
  3f770bb6
07 May, 2024 2 commits
- update to work with new group and task configuration · ad70d206
  lintangsutawika authored May 07, 2024
  
  ad70d206
- add configurable group · 86039e85
  lintangsutawika authored May 07, 2024
  
  86039e85
06 May, 2024 1 commit

Provide ability for custom sampler for ConfigurableTask (#1616) · ae72cebc

LSinev authored May 06, 2024

* Added fewshot sampling seeds to evaluator.simple_evaluate signature

Way to control seed of fewshot sampling
may help with #1591

* Added ability for custom sampler for ConfigurableTask

May be set in config like
```
fewshot_config:
  sampler: !function utils.MyFewshotSampler
```

* explicitly set fewshot random generator seed for HFLM generate_until_task test

* add backward compatibility for three args seed setup

* save seeds info to logs/reports

ae72cebc

25 Apr, 2024 1 commit
- Fix Parameter Propagation for Tasks that have `include` (#1749) · 0bafcef0
  Lintang Sutawika authored Apr 26, 2024
```
* Update task.py

* Update __init__.py
```
  0bafcef0
24 Apr, 2024 1 commit
- add group config · eb9c6a57
  lintangsutawika authored Apr 24, 2024
  
  eb9c6a57
23 Apr, 2024 1 commit
- add greoup_config arg · 2a2566e6
  lintangsutawika authored Apr 23, 2024
  
  2a2566e6
18 Mar, 2024 1 commit

Cleanup for v0.4.2 release (#1573) · 5627e819

Hailey Schoelkopf authored Mar 18, 2024

* Update interface.md

* fix: make caching reqs always work with accelerate launch

* remove stale task migration checklist

* remove deprecation warnings

* make informative TypeErrors for get_task_dict

* bump version metadata

* fix num_fewshot printing bug

* add fewshot value to cache key

5627e819

10 Mar, 2024 1 commit

Support jinja templating for task descriptions (#1553) · 3bdf25ec

Hisham Alyahya authored Mar 10, 2024



* Support jinja templating for "description"

* Update task_guide.md

* Update lm_eval/api/task.py

* fix format?

* whitespace errors

* fix whitespace

* fix bad variable reference

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

3bdf25ec

06 Mar, 2024 1 commit

Cleanup and fixes (Task, Instance, and a little bit of *evaluate) (#1533) · 4ee1b386

LSinev authored Mar 06, 2024



* Remove unused `decontamination_ngrams_path` and all mentions (still no alternative path provided)

* Fix improper import of LM and usage of evaluator in one of scripts

* update type hints in instance and task api

* raising errors in task.py instead of asserts

* Fix warnings from ruff

* raising errors in __main__.py instead of asserts

* raising errors in tasks/__init__.py instead of asserts

* raising errors in evaluator.py instead of asserts

* evaluator: update type hints and remove unused variables in code

* Update lm_eval/__main__.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/__main__.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/api/task.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/api/task.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/api/task.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/evaluator.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* pre-commit induced fixes

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

4ee1b386

27 Feb, 2024 1 commit

Refactor `evaluater.evaluate` (#1441) · 5ccd65d4

Baber Abbasi authored Feb 27, 2024



* change `all_gather` to `gather`

* add TaskOutput utility class

* Add FilterResults class and refactor task handling.

* Rename `key` to `filter_key` for clarity

* Add `print_writeout` function in utils.py

* Add function to calculate limit size.

* Add doc_iterator method to Task class

* Refactor `doc_iterator` and cleanup in Task class

* remove superfluous bits

* change `all_gather` to `gather`

* bugfix

* bugfix

* fix `gather`

* Refactor `gather` loop

* Refactor aggregate metrics calculation

* Refactor and simplify aggregate metrics calculation
Removed unused code

* Simplify metrics calculation and remove unused code.

* simplify the metrics calculation in `utils.py` and `evaluator.py`.

* Fix group metric

* change evaluate to hf_evaluate

* change evaluate to hf_evaluate

* add docs

* add docs

* nits

* make isslice keyword only

* nit

* add todo

* nit

* nit

* nit: swap order samples_metrics tuple

* move instance sorting outside loop

* nit

* nit

* Add __repr__ for ConfigurableTask

* nit

* nit

* Revert "nit"

This reverts commit dab8d9977a643752a17f840fd8cf7e4b107df28f.

* fix some logging

* nit

* fix `predict_only` bug. thanks to `@LSinev`!

* change `print_tasks` to `prepare_print_tasks`

* nits

* move eval utils

* move eval utils

* nit

* add comment

* added tqdm descriptions

* Update lm_eval/evaluator_utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix mgsm bug

* nit

* fix `build_all_requests`

* pre-commit

* add ceil to limit

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

5ccd65d4

26 Feb, 2024 2 commits

Cont metrics (#1475) · 96d185fa

Lintang Sutawika authored Feb 26, 2024



* add brier_score

* process brier_score

* brier score is working for N-sized class

* fxied brier score

* add TED to BigBench and Brier score to MMLU

* format

* Update metrics.py

* Update task.py

* Update generate_until_template_yaml

* Delete lm_eval/tasks/bigbench/aux_metric.py

* Update generate_until_template_yaml

* Update _default_template_yaml

* Update _generate_configs.py

* Update _generate_configs.py

* Update _generate_configs.py

* fix (format?)

* format?

* format, once more

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

96d185fa

Create a means for caching task registration and request building. Ad… (#1372) · 1e6c9272

Aaron V authored Feb 26, 2024



* Create a means for caching task registration and request building. Add the ability to specify an args dict for simple_evaluate().

* Remove extra S in cache path in caching module
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Rename requests cache args, make model_args polymorphic so that a dict can also be accepted.

* Update docs to reflect new caching behavior, add CLI args for requests caching. Create a function for deleting items in the cache.

* Update documentation, fix minor bug with arg parsing for requests caching where an undefined variable was used.

* Remove line from gitignore, add to cli for caching datasets.

* Add hashing suffix to .pickles. Update test script typo.

* Favor isinstance() over type() in evaluator.py

* Add tests for caching, gets tests working, remove unneeded arg from build_all_requests().

* Update arg description to simple_evaluate.

* Update pyproject.toml

* Fix typehint

* Remove the use of random() for creating default cache pickle hash.

* Check that cache dir exists before clearing it in request cache tests.

* Fix linting problems.

* Fix additional formatting errors.

* Remove trailing whitespace.

* Add new line to the end of .gitignore.

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

1e6c9272

22 Feb, 2024 1 commit

Fixed generation args issue affection OpenAI completion model (#1458) · 75ac1f47

Amine Elhattami authored Feb 23, 2024



* Fixed generation args issue affection openai completion model

* Fixed hf unit test; removed pop attributes in OpenAi completion.

* fix format

* fix format

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

75ac1f47