Commits · 3f770bb616b58d602c4e392e7293e60cb3d1a142 · gaoqiong / lm-evaluation-harness

07 May, 2024 9 commits
- reversed task list · 56373978
  lintangsutawika authored May 07, 2024
  
- readd files · 75dfac43
  lintangsutawika authored May 07, 2024
  
- Re-add Hendrycks MATH (no sympy checking, no Minerva hardcoded prompt) variant (#1793) · d42a3e44
  Hailey Schoelkopf authored May 07, 2024
```
* add Hendrycks MATH (no sympy checking) variant

* add readmes for MATH tasks
```
- update to work with new group and task configuration · ad70d206
  lintangsutawika authored May 07, 2024
  
- Fix Caching Tests ; Remove `pretrained=gpt2` default (#1775) · 7fe2b93c
  Hailey Schoelkopf authored May 07, 2024
  
- update mmlu · c23c9305
  lintangsutawika authored May 07, 2024
  
- update mmlu · cb085b02
  lintangsutawika authored May 07, 2024
  
- adjust args · 09bc7c68
  lintangsutawika authored May 07, 2024
  
- adjust group to also be a configurable group · 3473e196
  lintangsutawika authored May 07, 2024
  
06 May, 2024 1 commit
- modify mmlu tasks · 96336194
  lintangsutawika authored May 06, 2024
  
01 May, 2024 4 commits

Simran Arora authored May 01, 2024



* upload new tasks

* add readmes

* run linters

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

caaf9ab6

- Fix m_arc choices (#1760) · f27c4050
  Zehan Li authored May 02, 2024



* Update utils.py

This is a 4-choice task, option_e is null for all but 3 samples

* Fix options

Adaptive choices

* add option e

* bump multilingual arc version

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>


- Pile 10k new task (#1758) · b898bdaa
  Gabriel Mukobi authored May 01, 2024
```
* Add Pile-10k readme

* Add Pile-10k task configuration file
```
- remove duplicated `num_fewshot: 0` (#1769) · 552eeae7
  Chujie Zheng authored May 01, 2024

26 Apr, 2024 1 commit
- Support individual scrolls datasets (#1740) · 9b49556a
  giorgossideris authored Apr 26, 2024
```
* Support individual scrolls datasets

* Add qmsum context

* Fix formatting
```
25 Apr, 2024 5 commits
- edit format · 0579b304
  lintangsutawika authored Apr 25, 2024
  
- update all mmlu variants to include group_config · 9ccec647
  lintangsutawika authored Apr 25, 2024
  
- Fix Parameter Propagation for Tasks that have `include` (#1749) · 0bafcef0
  Lintang Sutawika authored Apr 26, 2024
```
* Update task.py

* Update __init__.py
```
- fixed issues related to printing alias of group and updated yaml · bcc887ad
  lintangsutawika authored Apr 25, 2024
  
- Add XNLIeu: a dataset for cross-lingual NLI in Basque (#1694) · 594015b6
  Julen Etxaniz authored Apr 25, 2024
```
* add xnli_eu tasks

* update tasks readme

* update readme
```
24 Apr, 2024 1 commit
- adjust mmlu to use group_config · 4dd69062
  lintangsutawika authored Apr 24, 2024
  
23 Apr, 2024 1 commit
- add a group config that allows disabling table for group score and group aggregate in general · 6da6d187
  lintangsutawika authored Apr 23, 2024
  
18 Apr, 2024 1 commit
- Adding retries and rate limit to toxicity tasks (#1620) · 3196e907
  sator-labs authored Apr 18, 2024
  
05 Apr, 2024 1 commit

- TMMLU+ implementation (#1394) · 9ae96cdf
  ZoneTwelve authored Apr 06, 2024



* implementation of TMMLU+

* implemented: TMMLU+

**TMMLU+: large-scale Traditional Chinese Massive Multitask language Understanding**

- 4 categories
    - STEM
    - Social Science
    - Humanities
    - Other

The TMMLU+ dataset, encompassing over 67 subjects and 20160 tasks, is six times larger and more balanced than its predecessor, TMMLU, and includes benchmark results from both closed-source and 20 open-weight Chinese large language models with 1.8B to 72B parameters. However, Traditional Chinese variants continue to underperform compared to major Simplified Chinese models.

```markdown
Total number of tasks in the 'test' sets: 20160
Total number of tasks in the 'validation' sets: 2247
Total number of tasks in the 'train' sets: 335
```

* Remove print from __init__.py

There was my mistake in forgetting to remove the debug print from the code.

* update: move TMMLU+ config generation program into default

* fix: we should use training set as few shots example

* update: README for TMMLU+

* update: a small changes of TMMLU+ README file

* pre-commit run thought

* Add README for TMMLU+ dataset

* run precommit

* trigger precommit again

* trigger precommit again

* isort is fussy

* isort is fussy

* format, again

* oops

* oops

---------
Co-authored-by: lintang <lintang@eleuther.ai>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>


04 Apr, 2024 1 commit
- Patch QQP prompt (#1661) · ff24e992
  Hailey Schoelkopf authored Apr 04, 2024
  
01 Apr, 2024 1 commit

- Add Latxa paper evaluation tasks for Basque (#1654) · c2c8e238
  Julen Etxaniz authored Apr 01, 2024

* add basqueglue

* add eus_exams

* add eus_proficiency

* add eus_reading

* add eus_trivia

* run pre-commit


28 Mar, 2024 1 commit
- Fix SuperGlue's ReCoRD task following regression in v0.4 refactoring (#1647) · ab7cc6b1
  Or Sharir authored Mar 28, 2024
  
21 Mar, 2024 1 commit
- Add ACLUE task (#1614) · 65546905
  Haonan Li authored Mar 21, 2024
```
* Add task ACLUE

* fix minor bug

* fix code style

* fix code style
```
18 Mar, 2024 2 commits

- Fix eval_logger import for mmlu/_generate_configs.py (#1593) · 4600d6bf
  Nouf M. Alotaibi authored Mar 18, 2024



* Fix eval_logger import for mmlu/_generate_configs.py

* linter

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>


- Cleanup for v0.4.2 release (#1573) · 5627e819
  Hailey Schoelkopf authored Mar 18, 2024

* Update interface.md

* fix: make caching reqs always work with accelerate launch

* remove stale task migration checklist

* remove deprecation warnings

* make informative TypeErrors for get_task_dict

* bump version metadata

* fix num_fewshot printing bug

* add fewshot value to cache key


15 Mar, 2024 1 commit
- Fix Jinja template for Advanced AI Risk (#1587) · dc90fecc
  Rylan Schaeffer authored Mar 15, 2024
  
13 Mar, 2024 1 commit

- add manual tqdm disabling management (#1569) · e74ec966
  achervyakov authored Mar 13, 2024



* add manual tqdm disabling management

* add typing to all new args

* apply precommit changes

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>


11 Mar, 2024 4 commits

- AGIEval (#1359) · a3e56afe
  Hailey Schoelkopf authored Mar 11, 2024



* add agieval

* fix typo

* add cloze / math exactmatch agieval tasks, rename

* update exact-match agieval tasks, allow for multiple-correct answers

* add more detail to readme

* don't parse_math_answer twice

---------
Co-authored-by: Alex Bäuerle <alex@a13x.io>


- add Arabic EXAMS benchmark (#1498) · 4ab07597
  khalil authored Mar 11, 2024



* add Arabic EXAMS benchmark

* fixed the linter issue, and add more information on the readme

* Update README.md

---------
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>


- Update ifeval.yaml (#1506) · 282b9e76
  Hailey Schoelkopf authored Mar 11, 2024
- Update generate_until_template_yaml (#1546) · a79a7c33
  Hailey Schoelkopf authored Mar 11, 2024

09 Mar, 2024 1 commit
- Fix incorrect `max_gen_toks` generation kwarg default in code2_text. (#1551) · f518228f
  Piyush Thakur authored Mar 09, 2024
```
* update gen_kwargs in code2-text-go.yaml

* update gen_kwargs in rest code2-text
```
06 Mar, 2024 3 commits

- Cleanup and fixes (Task, Instance, and a little bit of *evaluate) (#1533) · 4ee1b386
  LSinev authored Mar 06, 2024



* Remove unused `decontamination_ngrams_path` and all mentions (still no alternative path provided)

* Fix improper import of LM and usage of evaluator in one of scripts

* update type hints in instance and task api

* raising errors in task.py instead of asserts

* Fix warnings from ruff

* raising errors in __main__.py instead of asserts

* raising errors in tasks/__init__.py instead of asserts

* raising errors in evaluator.py instead of asserts

* evaluator: update type hints and remove unused variables in code

* Update lm_eval/__main__.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/__main__.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/api/task.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/api/task.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/api/task.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/evaluator.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* pre-commit induced fixes

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>


- update printed num-fewshot ; prevent fewshots from erroneously being used by cot which hardcodes fewshot prompt (#1502) · 02705057
  Hailey Schoelkopf authored Mar 06, 2024
- Adding new task : KorMedMCQA (#1530) · faee1adf
  sean0042 authored Mar 06, 2024