Commits · 9ae96cdfd816ded3c7c10f625f56aadfce708bdf · gaoqiong / lm-evaluation-harness

05 Apr, 2024 1 commit

TMMLU+ implementation (#1394) · 9ae96cdf

ZoneTwelve authored Apr 06, 2024



* implementation of TMMLU+

* implemented: TMMLU+

**TMMLU+: large-scale Traditional Chinese Massive Multitask Language Understanding**

- 4 categories
    - STEM
    - Social Science
    - Humanities
    - Other

The TMMLU+ dataset, encompassing over 67 subjects and 20160 tasks, is six times larger and more balanced than its predecessor, TMMLU. It includes benchmark results from closed-source models and from 20 open-weight Chinese large language models with 1.8B to 72B parameters. However, Traditional Chinese variants continue to underperform compared to major Simplified Chinese models.

```markdown
Total number of tasks in the 'test' sets: 20160
Total number of tasks in the 'validation' sets: 2247
Total number of tasks in the 'train' sets: 335
```

* Remove print from __init__.py

I mistakenly forgot to remove the debug print from the code.

* update: move TMMLU+ config generation program into default

* fix: use the training set for few-shot examples

* update: README for TMMLU+

* update: small changes to the TMMLU+ README file

* pre-commit run-through

* Add README for TMMLU+ dataset

* run precommit

* trigger precommit again

* trigger precommit again

* isort is fussy

* isort is fussy

* format, again

* oops

* oops

---------
Co-authored-by: lintang <lintang@eleuther.ai>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

9ae96cdf

04 Apr, 2024 1 commit
- Patch QQP prompt (#1661) · ff24e992
  Hailey Schoelkopf authored Apr 04, 2024
  
  ff24e992
01 Apr, 2024 1 commit

Add Latxa paper evaluation tasks for Basque (#1654) · c2c8e238

Julen Etxaniz authored Apr 01, 2024

* add basqueglue

* add eus_exams

* add eus_proficiency

* add eus_reading

* add eus_trivia

* run pre-commit

c2c8e238

28 Mar, 2024 1 commit
- Fix SuperGlue's ReCoRD task following regression in v0.4 refactoring (#1647) · ab7cc6b1
  Or Sharir authored Mar 28, 2024
  
  ab7cc6b1
21 Mar, 2024 1 commit
- Add ACLUE task (#1614) · 65546905
  Haonan Li authored Mar 21, 2024

* Add task ACLUE

* fix minor bug

* fix code style

* fix code style

  65546905
18 Mar, 2024 2 commits

Fix eval_logger import for mmlu/_generate_configs.py (#1593) · 4600d6bf

Nouf M. Alotaibi authored Mar 18, 2024



* Fix eval_logger import for mmlu/_generate_configs.py

* linter

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

4600d6bf

Cleanup for v0.4.2 release (#1573) · 5627e819

Hailey Schoelkopf authored Mar 18, 2024

* Update interface.md

* fix: make caching reqs always work with accelerate launch

* remove stale task migration checklist

* remove deprecation warnings

* make informative TypeErrors for get_task_dict

* bump version metadata

* fix num_fewshot printing bug

* add fewshot value to cache key

5627e819
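
The last item in this commit, adding the few-shot value to the cache key, can be illustrated with a small sketch. The function name `request_cache_key`, the key layout, and the choice of SHA-256 are assumptions made for illustration and are not taken from the harness's caching module. The point is only that two runs differing in `num_fewshot` should not share cached requests.

```python
import hashlib


def request_cache_key(task_name: str, num_fewshot: int) -> str:
    """Build a deterministic cache key (hypothetical layout, for illustration).

    Including num_fewshot keeps 0-shot and 5-shot requests for the same
    task from colliding in the cache.
    """
    raw = f"{task_name}|fewshot={num_fewshot}"
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]
    return f"requests-{task_name}-{num_fewshot}shot-{digest}"


zero_shot = request_cache_key("mmlu", 0)
five_shot = request_cache_key("mmlu", 5)
print(zero_shot != five_shot)  # True: different few-shot settings, different keys
```

Because the digest is derived only from its inputs, the same task and few-shot value always map to the same key across processes.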

15 Mar, 2024 1 commit
- Fix Jinja template for Advanced AI Risk (#1587) · dc90fecc
  Rylan Schaeffer authored Mar 15, 2024
  
  dc90fecc
13 Mar, 2024 1 commit

add manual tqdm disabling management (#1569) · e74ec966

achervyakov authored Mar 13, 2024



* add manual tqdm disabling management

* add typing to all new args

* apply precommit changes

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

e74ec966

11 Mar, 2024 4 commits

AGIEval (#1359) · a3e56afe

Hailey Schoelkopf authored Mar 11, 2024



* add agieval

* fix typo

* add cloze / math exactmatch agieval tasks, rename

* update exact-match agieval tasks, allow for multiple-correct answers

* add more detail to readme

* don't parse_math_answer twice

---------
Co-authored-by: Alex Bäuerle <alex@a13x.io>

a3e56afe

add Arabic EXAMS benchmark (#1498) · 4ab07597

khalil authored Mar 11, 2024



* add Arabic EXAMS benchmark

* fixed the linter issue and added more information to the readme

* Update README.md

---------
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>

4ab07597

Update ifeval.yaml (#1506) · 282b9e76
Hailey Schoelkopf authored Mar 11, 2024

282b9e76
Update generate_until_template_yaml (#1546) · a79a7c33
Hailey Schoelkopf authored Mar 11, 2024

a79a7c33

09 Mar, 2024 1 commit
- Fix incorrect `max_gen_toks` generation kwarg default in code2_text. (#1551) · f518228f
  Piyush Thakur authored Mar 09, 2024

* update gen_kwargs in code2-text-go.yaml

* update gen_kwargs in rest code2-text

  f518228f
06 Mar, 2024 5 commits

Cleanup and fixes (Task, Instance, and a little bit of *evaluate) (#1533) · 4ee1b386

LSinev authored Mar 06, 2024



* Remove unused `decontamination_ngrams_path` and all mentions (still no alternative path provided)

* Fix improper import of LM and usage of evaluator in one of the scripts

* update type hints in instance and task api

* raising errors in task.py instead of asserts

* Fix warnings from ruff

* raising errors in __main__.py instead of asserts

* raising errors in tasks/__init__.py instead of asserts

* raising errors in evaluator.py instead of asserts

* evaluator: update type hints and remove unused variables in code

* Update lm_eval/__main__.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/__main__.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/api/task.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/api/task.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/api/task.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/evaluator.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* pre-commit induced fixes

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

4ee1b386
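
Several items in this commit replace `assert` statements with raised errors. A minimal sketch of why that matters follows; `validate_limit` is a hypothetical helper, not a function from the harness. Python drops `assert` statements when run with the `-O` flag, so a check written as an assertion can silently disappear, while an explicit `raise` always executes.

```python
from typing import Optional


def validate_limit(limit: Optional[float]) -> Optional[float]:
    """Hypothetical argument check written with an explicit exception.

    An `assert limit > 0` would vanish when Python runs with -O, so
    invalid input would slip through; raising ValueError always runs.
    """
    if limit is not None and limit <= 0:
        raise ValueError(f"limit must be positive, got {limit!r}")
    return limit


try:
    validate_limit(-1)
except ValueError as err:
    print(f"rejected: {err}")  # rejected: limit must be positive, got -1
```

An explicit exception also lets callers catch a specific error type and gives users a readable message instead of a bare `AssertionError`.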

update printed num-fewshot ; prevent fewshots from erroneously being used by... · 02705057
Hailey Schoelkopf authored Mar 06, 2024
```
update printed num-fewshot ; prevent fewshots from erroneously being used by cot which hardcodes fewshot prompt (#1502)
```
02705057
Adding new task : KorMedMCQA (#1530) · faee1adf
sean0042 authored Mar 06, 2024

faee1adf

Add WMDP Multiple-choice (#1534) · 29b2b013

Long Phan authored Mar 05, 2024



* init wmdp yaml file

* Add WMDP Multiple-choice

* fix linter issues

* Delete lm_eval/tasks/wmdp/_wmdp.yaml

---------
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>

29b2b013

Add EQ-Bench as per #1459 (#1511) · c5acce0c

Peter Bevan authored Mar 06, 2024

* Start adding eq-bench

* Start adding to yaml and utils

* Get metric working

* Add README

* Handle cases where answer is not parseable

* Deal with unparseable answers and add percent_parseable metric

* Update README

c5acce0c

05 Mar, 2024 2 commits

Add a new task GPQA (the CoT and generative part) (#1482) · 01108aca

Uanu authored Mar 06, 2024



* Add new tasks of GPQA

* Add README

* Remove unused functions

* Remove unused functions

* Linters

* Add flexible match

* update

* Remove duplicate function

* Linter

* update

* Update lm_eval/filters/extraction.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* register multi_choice_regex

* Update

* run precommit

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

01108aca

Openllm benchmark (#1526) · 8a875e9a
Baber Abbasi authored Mar 05, 2024

8a875e9a

04 Mar, 2024 1 commit

French Bench (#1500) · 48476c4c

Manuel Faysse authored Mar 04, 2024



* add french-bench

* rename arc easy

* linting

* update datasets for no remote code exec

* fix string delimiter

* add info to readme

* trim trailing whitespace

* add detailed groups

* add info to readme

* remove orangesum title from fbench main

* Force PPL tasks to be 0-shot

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

48476c4c

03 Mar, 2024 1 commit

Setting trust_remote_code to True for HuggingFace datasets compatibility (#1487) · 95167926

Vicki Boykis authored Mar 03, 2024

* setting trust_remote_code

* dataset list no notebooks

* respect trust remote code

* Address changes, move cli options and change datasets

* fix task for tests

* headqa

* remove kobest

* pin datasets and address comments

* clean up space

95167926

01 Mar, 2024 1 commit
- Add multilingual truthfulqa targets (#1499) · d272c19f
  Zehan Li authored Mar 01, 2024
  
  d272c19f
27 Feb, 2024 2 commits
- update name of val split in truthfulqa multilingual (#1488) · a08eb870
  Hailey Schoelkopf authored Feb 27, 2024
  
  a08eb870
- add multilingual mmlu eval (#1484) · 7cd004c4
  Zehan Li authored Feb 27, 2024
  
  7cd004c4
26 Feb, 2024 6 commits

Cont metrics (#1475) · 96d185fa

Lintang Sutawika authored Feb 26, 2024



* add brier_score

* process brier_score

* brier score is working for N-sized class

* fixed brier score

* add TED to BigBench and Brier score to MMLU

* format

* Update metrics.py

* Update task.py

* Update generate_until_template_yaml

* Delete lm_eval/tasks/bigbench/aux_metric.py

* Update generate_until_template_yaml

* Update _default_template_yaml

* Update _generate_configs.py

* Update _generate_configs.py

* Update _generate_configs.py

* fix (format?)

* format?

* format, once more

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

96d185fa
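
For readers unfamiliar with the metric this commit adds, here is a minimal sketch of the standard multi-class Brier score: for N examples and K classes, the mean over examples of the squared distance between the predicted probability vector and the one-hot gold label. The function name and signature are assumptions for illustration, and this is not necessarily how `metrics.py` implements it.

```python
from typing import Sequence


def brier_score(probs: Sequence[Sequence[float]], gold: Sequence[int]) -> float:
    """Mean multi-class Brier score (illustrative sketch).

    probs[i][k] is the predicted probability of class k for example i;
    gold[i] is the index of the correct class for example i.
    """
    if len(probs) != len(gold):
        raise ValueError("probs and gold must have the same length")
    if not probs:
        raise ValueError("at least one example is required")
    total = 0.0
    for p, g in zip(probs, gold):
        if not 0 <= g < len(p):
            raise ValueError(f"gold index {g} out of range for {len(p)} classes")
        # Squared error against the one-hot target, summed over classes.
        total += sum((p_k - (1.0 if k == g else 0.0)) ** 2 for k, p_k in enumerate(p))
    return total / len(probs)


score = brier_score([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]], [0, 2])
print(round(score, 4))  # 0.1
```

Lower is better: a perfectly confident correct prediction scores 0, and a perfectly confident wrong one scores 2.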

Create a means for caching task registration and request building. Ad… (#1372) · 1e6c9272

Aaron V authored Feb 26, 2024



* Create a means for caching task registration and request building. Add the ability to specify an args dict for simple_evaluate().

* Remove extra S in cache path in caching module
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Rename requests cache args, make model_args polymorphic so that a dict can also be accepted.

* Update docs to reflect new caching behavior, add CLI args for requests caching. Create a function for deleting items in the cache.

* Update documentation, fix minor bug with arg parsing for requests caching where an undefined variable was used.

* Remove line from gitignore, add to cli for caching datasets.

* Add hashing suffix to .pickles. Update test script typo.

* Favor isinstance() over type() in evaluator.py

* Add tests for caching, get tests working, remove unneeded arg from build_all_requests().

* Update arg description to simple_evaluate.

* Update pyproject.toml

* Fix typehint

* Remove the use of random() for creating default cache pickle hash.

* Check that cache dir exists before clearing it in request cache tests.

* Fix linting problems.

* Fix additional formatting errors.

* Remove trailing whitespace.

* Add new line to the end of .gitignore.

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

1e6c9272
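
Two items in this commit, making `model_args` accept a dict as well as a string and favoring `isinstance()` over `type()`, fit together naturally. The sketch below shows the idea under stated assumptions: `normalize_model_args` is a hypothetical helper, the `key=value,key=value` string format is assumed from common CLI usage, and values are left as strings rather than converted to other types.

```python
from collections import OrderedDict
from typing import Dict, Mapping, Optional, Union


def normalize_model_args(
    model_args: Optional[Union[str, Mapping[str, str]]],
) -> Dict[str, str]:
    """Accept a "k1=v1,k2=v2" string or a dict (hypothetical helper)."""
    if model_args is None:
        return {}
    # isinstance() also accepts dict subclasses such as OrderedDict,
    # whereas `type(model_args) == dict` would reject them.
    if isinstance(model_args, dict):
        return dict(model_args)
    if isinstance(model_args, str):
        parsed: Dict[str, str] = {}
        for item in model_args.split(","):
            if not item.strip():
                continue  # tolerate empty segments such as a trailing comma
            if "=" not in item:
                raise ValueError(f"expected key=value, got {item!r}")
            key, value = item.split("=", 1)
            parsed[key.strip()] = value.strip()
        return parsed
    raise TypeError(f"model_args must be str or dict, got {type(model_args).__name__}")


print(normalize_model_args("pretrained=gpt2,dtype=float16"))
# {'pretrained': 'gpt2', 'dtype': 'float16'}
print(normalize_model_args(OrderedDict(pretrained="gpt2")))
# {'pretrained': 'gpt2'}
```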

Revert "setting trust_remote_code (#1467)" (#1474) · f6befdb9
Hailey Schoelkopf authored Feb 26, 2024
```
This reverts commit c1145dfd.
```
f6befdb9
add arabic mmlu (#1402) · 7de7b27e
khalil authored Feb 26, 2024

* add arabic mmlu

* update the description

* add readme file

7de7b27e
setting trust_remote_code (#1467) · c1145dfd
Vicki Boykis authored Feb 26, 2024

c1145dfd
Apply code autoformatting with Ruff to tasks/*.py and *__init__.py (#1469) · d27c0c08
LSinev authored Feb 26, 2024

d27c0c08

23 Feb, 2024 1 commit
- update parsing logic of mgsm following gsm8k (#1462) · 8371662c
  thnkinbtfly authored Feb 24, 2024
  
  8371662c
22 Feb, 2024 1 commit

PR fixing the issue #1391 (wrong contexts in the mgsm task) (#1440) · a72babbf

Lei Chen authored Feb 22, 2024



* fix the issue #1391, wrong contexts in mgsm tasks

* fix yaml issue for having two target_delimiter lines. For COT tasks, keep the one with a space (default)

* regenerate all task yaml files
- change naming so that the file name matches the task name
- task and file names follow a consistent pattern, mgsm_(mode)_(lang), for the three modes: direct, en_cot, and native_cot

* English CoTs should have a space as target_delimiter

* Update utils.py

* Apply suggestions from code review

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

a72babbf

21 Feb, 2024 1 commit

Added KMMLU evaluation method and changed ReadMe (#1447) · c26a6ac7

Hanwool Albert Lee authored Feb 21, 2024



* update kmmlu default formatting

* Update _default_kmmlu_yaml

* Delete lm_eval/tasks/kmmlu/utils.py

* new tasks implemented

* add direct tasks

* update direct evaluate

* update direct eval

* add cot sample

* update cot

* add cot

* Update _cot_kmmlu_yaml

* add kmmlu90

* Update and rename _cot_kmmlu.yaml to _cot_kmmlu_yaml

* Create kmmlu90.yaml

* Update _cot_kmmlu_yaml

* add direct

* Update _cot_kmmlu_yaml

* Update and rename kmmlu90.yaml to kmmlu90_cot.yaml

* Update kmmlu90_direct.yaml

* add kmmlu hard

* Update _cot_kmmlu_yaml

* Update _cot_kmmlu_yaml

* update cot

* update cot

* erase typo

* Update _cot_kmmlu_yaml

* update cot

* Rename dataset to match k-mmlu-hard

* removed kmmlu90

* fixed name 'kmmlu_cot' to 'kmmlu_hard_cot' and revised README

* applied pre-commit before pull requests

* rename datasets and add notes

* Remove DS_Store cache

* Update lm_eval/tasks/kmmlu/README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Change citations and reflect reviews on version

* Added kmmlu_hard and fixed other errors

* fixing minor errors

* remove duplicated

* Rename files

* try ".index"

* minor fix

* minor fix again

* fix revert.

* minor fix. thanks to hailey

---------
Co-authored-by: GUIJIN SON <spthsrbwls123@yonsei.ac.kr>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

c26a6ac7

20 Feb, 2024 3 commits

Add a new task GPQA (the part without CoT) (#1434) · 5ab295c8

Uanu authored Feb 21, 2024



* add new task GPQA_n_shot

* add new task GPQA_zeroshot

* correct GPQA_zeroshot filename

* Add random shuffling of choices

* Correct missing parentheses

* delete wrong tasks

* Add README

* Update lm_eval/tasks/gpqa/zeroshot/_gpqa_zeroshot_yaml

* Update lm_eval/tasks/gpqa/n_shot/utils.py

* Update lm_eval/tasks/gpqa/n_shot/utils.py

* Update lm_eval/tasks/gpqa/README.md

* placate linter

* linter

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

5ab295c8

Group reqs by context (#1425) · 45941c67

Baber Abbasi authored Feb 20, 2024



* add key lookup for same contexts

* nit

* appease pre-commit

* nit

* use `expand` (in-place view) rather than `repeat`

* try mixed grouping

* add docs.

* nit

* nit

* nits

* fix tests

* Move greedy_tokens calculation out of cache loop

* nit

* nits

* add test

* nits

* fix name conflict

* fix name conflict

* chunk tensor

* move Collator

* nits/docstring

* fixup

* fixup

* group contexts only for decoders

* pre-commit

* fix `generate_until` test

* fix `generate_until` test

* Update lm_eval/models/huggingface.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* add docs

* nit

* add docs

* add docs

* add 'logits_cache' arg

* bugfix

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

45941c67

Add a new task HaeRae-Bench (#1445) · 8680e938

Hanwool Albert Lee authored Feb 20, 2024

* haerae_reimplementation

* edited README and added few_shot settings

* edited readme

* newlines at end of each file

* Modifying the README file

* applied pre-commit

8680e938

19 Feb, 2024 2 commits

update bbh, gsm8k, mmlu parsing logic and prompts (Orca2 bbh_cot_zeroshot 0% -> 42%) (#1356) · 89deeeaf

thnkinbtfly authored Feb 20, 2024



* update bbh, gsm8k, mmlu parsing logic and prompts

* remove the formatting prompt (bbh) + minor update (mmlu)

* update bbh, gsm8k, mmlu zeroshot, revert fewshots

* update bbh, gsm8k, mmlu version, forward changes to gsm8k-cot

* remove take_last, update to use docs parameters

* add newline

* ruff formatting

* Update pyproject.toml

* fix format

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

89deeeaf

Correct typo in task name (#1443) · 19cbb292
larekrow authored Feb 19, 2024

19cbb292