Commits · cda25fef4e1df2f4bc2dab3ec6668ae9f5bf7296 · gaoqiong / lm-evaluation-harness

02 Jan, 2024 1 commit
- Update README.md (#1230) · a12ef445
  Pasquale Minervini authored Jan 02, 2024
  
29 Dec, 2023 1 commit

Don't silence errors when loading tasks (#1148) · 34b563b1

Paul McCann authored Dec 30, 2023



* Add example failing task

This task includes an invalid import. This will cause an exception and
the task will not be loaded. But this just results in a DEBUG level log
message, so in normal usage you'll see no error, and will be told the
task doesn't exist.

Here's an example command line to run the task:

    python -m lm_eval --model hf --model_args pretrained=rinna/japanese-gpt-1b --tasks fail

This task is based on a Japanese Winograd task, but that's not
important, and was just used due to familiarity.

* Do not ignore errors when loading tasks

* Change how task errors are logged

This makes the proposed changes from PR discussion.

1. Exceptions not related to missing modules/imports are logged as
   warnings.

2. module/import related exceptions are still logged at debug level, but
   if any of them happen there is a warning about it with instructions
   on how to show logs.

* Remove intentionally failing task

---------
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
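
The logging policy described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the function name `load_task_modules`, the logger name, and the message wording are all hypothetical. It assumes tasks are plain importable modules.

```python
import importlib
import logging

# Hypothetical logger name, chosen only for this sketch.
logger = logging.getLogger("task_loader_sketch")


def load_task_modules(module_names):
    """Import each module by name and return (loaded, failed) name lists.

    Import-related failures are logged at DEBUG level and summarized in a
    single WARNING; any other exception is logged as a WARNING right away.
    """
    loaded, failed = [], []
    import_failures = 0
    for name in module_names:
        try:
            importlib.import_module(name)
            loaded.append(name)
        except ImportError as exc:  # ModuleNotFoundError is a subclass
            import_failures += 1
            failed.append(name)
            logger.debug("Could not import %s: %s", name, exc)
        except Exception as exc:  # anything unrelated to imports
            failed.append(name)
            logger.warning("Unexpected error while loading %s: %s", name, exc)
    if import_failures:
        # One visible hint so debug-level failures are not silently lost.
        logger.warning(
            "%d task module(s) failed to import; enable DEBUG logging for details.",
            import_failures,
        )
    return loaded, failed
```

Calling `load_task_modules(["json", "no_such_module"])` would load `json`, record the second name as failed, and emit one summary warning.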


27 Dec, 2023 1 commit

nits + fix siqa (#1216) · 6a1c19ed

Baber Abbasi authored Dec 27, 2023

* fix group

* siqa: default.yml -> default.yaml

* max_gen_toks -> self.max_gen_toks

* add ids to task tests

* fix siqa

* fix gen_kwargs for openai-chat


24 Dec, 2023 1 commit

Add remove_whitespace to FLD benchmark (#1206) · 8ffbe58a

MorishT authored Dec 24, 2023



* Add remove_whitespace to FLD benchmark

* bump task version

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>


21 Dec, 2023 1 commit

Correctly Print Task Versioning (#1173) · 9cd79897

Hailey Schoelkopf authored Dec 21, 2023

* change version field formatting in metadata

* mention versioning in new task guide

* add instructions for changelog

* run linters


20 Dec, 2023 2 commits

Error in --num_fewshot option for K-MMLU Evaluation Harness (#1178) · 12f2c5ea

GUIJIN SON authored Dec 21, 2023

* update kmmlu default formatting

* Update _default_kmmlu_yaml

* Delete lm_eval/tasks/kmmlu/utils.py

Switch Linting to `ruff` (#1166) · 65b8761d

Baber Abbasi authored Dec 20, 2023

* add ruff and isort. remove black and flake8

* remove unnecessary dependencies

* remove dependency from table

* change order

* ran ruff

* check 3.9

* exclude evaluator

* update CI workflow

* use ruff config in pyproject.toml

* test

* add isort rules to ruff

* sort imports

* import `make_table`

* try stages for no-commit-to-branch

* turn on mypy for pre-commit

* test

* test

* test

* change no-commit-to-branch to default

* nits

* fixed dependency


19 Dec, 2023 1 commit

Fix Column Naming and Dataset Naming Conventions in K-MMLU Evaluation (#1171) · 9e03d9d0

seungduk.kim.2304 authored Dec 20, 2023



* Correct column names and dataset names

* Remove kmmlu_general_physics.yaml and kmmlu_korean_language.yaml

* Update _default_kmmlu_yaml

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>


18 Dec, 2023 1 commit
- bugfix (#1150) · 6f9630c8
  Baber Abbasi authored Dec 18, 2023
  
17 Dec, 2023 1 commit

[WIP] Add IFEval / Instruction-Following Eval (#1087) · aa61f940

Wis Kojohnjaratkul authored Dec 17, 2023

* Add IFEval task

* Check and download nltk punkt if not already downloaded

* Update gen_max_toks to 2048 to support "900 words+" instructions

* Resolve pre-commit linting issues

* Reduce max_gen_toks to 1280 to conserve token usage

* Add warning message in `process_results` call for non chat-finetuned models


15 Dec, 2023 1 commit

Add benchmark FLD (#1122) · 755bf6e8

MorishT authored Dec 15, 2023



* [fix] loading dataset from hub fails when the dataset name includes '.', as the program assumes it is on the local filesystem

* add FLD benchmark

* Update task.py

* [update] add group 'fld'

* [update] rename fld -> fld_default. add explanation to the readme

* Update README.md

---------
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>
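
The first bullet describes a dataset name containing '.' being mistaken for a local file. One way to avoid that kind of misclassification is to check whether the path exists on disk instead of inspecting the name. The sketch below assumes this approach; the helper name `is_local_dataset` and the example dataset name are hypothetical, and the commit's actual fix may differ.

```python
import os


def is_local_dataset(name: str) -> bool:
    """Return True only if `name` points at something that exists on disk.

    Hypothetical helper: a name such as "some-org/dataset.v2" contains a
    dot but is not a local file, so the presence of '.' alone is not used
    as a signal here.
    """
    return os.path.exists(name)
```

With this check, a hub-style name like `some-org/dataset.v2` is treated as remote unless a file or directory with that path actually exists.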


14 Dec, 2023 2 commits
- Additional process for doc_to_choice (#1093) · a2ed953f
  Lintang Sutawika authored Dec 14, 2023

  * Additional process for doc_to_choice

  * doc_to_choice can also parse a string
- fix: _generate_configs.py · c314246d
  momotori authored Dec 14, 2023
  
13 Dec, 2023 5 commits
- `qqp`, `mnli_mismatch`: remove unlabled test sets (#1114) · 057dc2d7
  Baber Abbasi authored Dec 14, 2023

  * remove unlabled test sets

  * add note to readme
- bump version on cot_fewshot tasks · a7707c76
  haileyschoelkopf authored Dec 13, 2023
  
- update regeneration script, bump bbh_cot_fewshot version · 3fbdfea1
  haileyschoelkopf authored Dec 13, 2023
  
- fix: enlarge max_gen_toks to make output of bbh_cot_fewshot complete · 7ec42165
  momotori authored Dec 13, 2023
  
- fix: fix bug in the "doc_to_text" of BBH_cot_fewshot · 33dcbd49
  momotori authored Dec 13, 2023
  
11 Dec, 2023 5 commits
- Delete lm_eval/tasks/glue/sst directory · ce4d9c68
  weijie authored Dec 11, 2023
  
- Change the sub-task name from sst to sst2 in glue · dc556837
  weijie authored Dec 11, 2023
  
- Update template of qqp dataset · 5dbdf582
  weijie authored Dec 11, 2023
  
- Debugging · c256eda8
  h-albert-lee authored Dec 11, 2023
  
- Change spaces to underline · 489814b7
  h-albert-lee authored Dec 11, 2023
  
10 Dec, 2023 6 commits
- Remove blanks at the end of a file · 4edd9c4c
  h-albert-lee authored Dec 10, 2023
  
- reformatted with Black · 520d8ce2
  h-albert-lee authored Dec 10, 2023
  
- Modify code for flake8 · 95908d8b
  h-albert-lee authored Dec 10, 2023
  
- bugfix #2 remove 'context' · 9221cb57
  h-albert-lee authored Dec 10, 2023
  
- bug fix (f-string) · 7bc9b82b
  h-albert-lee authored Dec 10, 2023
  
- update yaml config · c64da8ac
  h-albert-lee authored Dec 10, 2023
  
08 Dec, 2023 1 commit
- implementing kmmlu · 1b14602e
  h-albert-lee authored Dec 08, 2023
  
07 Dec, 2023 8 commits
- fixed enumeration · 12f260cf
  lintangsutawika authored Dec 07, 2023
  
- Update minerva_math_algebra.yaml · ce079c96
  Hailey Schoelkopf authored Dec 06, 2023
  
- Update _cot_zeroshot_template_yaml · ba7ba910
  Hailey Schoelkopf authored Dec 06, 2023
  
- Update _fewshot_template_yaml · a2cc877b
  Hailey Schoelkopf authored Dec 06, 2023
  
- Update _zeroshot_template_yaml · 361ba192
  Hailey Schoelkopf authored Dec 06, 2023
  
- Update _mmlu_flan_cot_zeroshot_template_yaml · 8c05c6be
  Hailey Schoelkopf authored Dec 06, 2023
  
- Update _mmlu_flan_cot_fewshot_template_yaml · 4e34a6e8
  Hailey Schoelkopf authored Dec 06, 2023
  
- Update _cot_fewshot_template_yaml · a6d28ea7
  Lintang Sutawika authored Dec 07, 2023

  BBH cot fewshot already has fewshot examples in the description. So num_fewshot needs to be set to 0 so that users won't mistakenly set other num_fewshot values.

04 Dec, 2023 2 commits
- add \n\n to end of description · c30b71c3
  lintangsutawika authored Dec 04, 2023
  
- add \n\n to end of description · b32b3793
  lintangsutawika authored Dec 04, 2023
  