Commits · b9efb08816ccc7940f29f5d09ca99366b48aa708 · gaoqiong / lm-evaluation-harness

23 Jan, 2024 17 commits
- removed unused code · b9efb088
  lintangsutawika authored Jan 20, 2024
  
  b9efb088
- removed test lines · 9e98d1a9
  lintangsutawika authored Jan 20, 2024
  
  9e98d1a9
- further tidy up · 73c8c87c
  lintangsutawika authored Jan 20, 2024
  
  73c8c87c
- simplified code · 52e0148a
  lintangsutawika authored Jan 20, 2024
  
  52e0148a
- fixed typo · 6475a2b5
  lintangsutawika authored Jan 20, 2024
  
  6475a2b5
- fixed typo · 007e5b80
  lintangsutawika authored Jan 20, 2024
  
  007e5b80
- iron out bugs · 8db2ec3e
  lintangsutawika authored Jan 20, 2024
  
  8db2ec3e
- adapted initialize_tasks · 153ea28e
  lintangsutawika authored Jan 20, 2024
  
  153ea28e
- indexing and loading are part of a task_manager object · e487748e
  lintangsutawika authored Jan 20, 2024
  
  e487748e
- temp save · f9c51329
  lintangsutawika authored Jan 20, 2024
  
  f9c51329
- allow groups and tasks to work · 72859bbc
  lintangsutawika authored Jan 19, 2024
  
  72859bbc
- half working for nested groups · fea304e0
  lintangsutawika authored Jan 19, 2024
  
  fea304e0
- load_task_or_group works but only for tasks · 702593ec
  lintangsutawika authored Jan 19, 2024
  
  702593ec
- pre-commit format · 57f158b3
  lintangsutawika authored Jan 19, 2024
  
  57f158b3
- more comprehensive way to index tasks and groups · fe9f0d46
  lintangsutawika authored Jan 19, 2024
  
  fe9f0d46
- initialize_tasks returns list of tasks and groups · 55ea2888
  lintangsutawika authored Jan 19, 2024
  
  55ea2888
- task for testing recursive · d5071b70
  lintangsutawika authored Jan 19, 2024
  
  d5071b70
19 Jan, 2024 1 commit
- Update polemo2_in.yaml (#1318) · 4477d572
  Lintang Sutawika authored Jan 20, 2024
  
  4477d572
18 Jan, 2024 3 commits
- Fix group register (#1315) · 72ea626e
  Lintang Sutawika authored Jan 19, 2024
```
* tuple should be considered as well

* set option to keep callable as callable
```
  72ea626e
- Fix polemo2_in.yaml config name (#1313) · b8cbc425
  Quentin Lhoest authored Jan 18, 2024
  
  b8cbc425
- Update nq_open.yaml (#1305) · 10488d0d
  Hannibal046 authored Jan 18, 2024
```
* Update nq_open.yaml

change regex

* Bump NQ version

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
```
  10488d0d
16 Jan, 2024 1 commit
- Update nq_open.yaml (#1289) · 032e879b
  Hailey Schoelkopf authored Jan 15, 2024
  
  032e879b
15 Jan, 2024 2 commits
- Allow parameter edits for registered tasks when listed in a benchmark (#1273) · 03e7df51
  Lintang Sutawika authored Jan 15, 2024
```
* benchmark yamls allow minor edits of already registered tasks

* add documentation

* removed print
```
  03e7df51
- fix whitespace in target + prompt for CoT gsm8k (#1275) · ace4393e
  Hailey Schoelkopf authored Jan 15, 2024
  
  ace4393e
12 Jan, 2024 1 commit

jp authored Jan 12, 2024

* Add: kobest config file

* Add: kobest utils

* Add: README

* Update utils.py

653217a7

11 Jan, 2024 2 commits

- Fix bug in multi-token Stop Sequences (#1268) · ff739414
  Hailey Schoelkopf authored Jan 11, 2024
```
* fix incorrect lookback protections

* bump generate_until task versions
```
  ff739414

- MultiMedQA (#1198) · 818c056b
  Tanishq Abraham authored Jan 10, 2024

* multimedqa

* Update medqa.yaml

* move to benchmarks folder

* add README.md

---------
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>

818c056b

10 Jan, 2024 1 commit
- fixed belebele (#1267) · 9b0b15b1
  James A. Michaelov authored Jan 10, 2024
  
  9b0b15b1
05 Jan, 2024 1 commit

- Add multilingual HellaSwag task (#1228) · 28bb45fb
  JorgeDeCorte authored Jan 05, 2024

* add hellaswag_nl

* add other languages and update readme to hellaswag

* refactor as new task

* update readme

* add endline to yaml files and readme.md

* add group, change folder location and update yaml file

* rename default hellaswag yaml file

* fix whitespace error in some labels

* downgrade log level of whitespace checking

---------
Co-authored-by: JorgeDeCorte <jorge.decorte@ravago.be>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

28bb45fb

02 Jan, 2024 1 commit
- Update README.md (#1230) · a12ef445
  Pasquale Minervini authored Jan 02, 2024
  
  a12ef445
29 Dec, 2023 1 commit

- Don't silence errors when loading tasks (#1148) · 34b563b1
  Paul McCann authored Dec 30, 2023

* Add example failing task

This task includes an invalid import. This will cause an exception and
the task will not be loaded. But this just results in a DEBUG level log
message, so in normal usage you'll see no error, and will be told the
task doesn't exist.

Here's an example command line to run the task:

    python -m lm_eval --model hf --model_args pretrained=rinna/japanese-gpt-1b --tasks fail

This task is based on a Japanese Winograd task, but that's not
important, and was just used due to familiarity.

* Do not ignore errors when loading tasks

* Change how task errors are logged

This makes the proposed changes from PR discussion.

1. Exceptions not related to missing modules/imports are logged as
   warnings.

2. module/import related exceptions are still logged at debug level, but
   if any of them happen there is a warning about it with instructions
   on how to show logs.

* Remove intentionally failing task

---------
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

34b563b1
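
The logging behaviour described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual implementation: the function name `load_task_modules`, the logger name, and the wording of the messages are all hypothetical. It assumes only what the message states, namely that import-related failures stay at debug level with one summary warning, and that every other exception is logged as a warning.

```python
import importlib
import logging

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("task_loader_sketch")


def load_task_modules(module_names):
    """Try to import each module; return (loaded_modules, import_failures)."""
    loaded, import_failures = [], []
    for name in module_names:
        try:
            loaded.append(importlib.import_module(name))
        except ImportError as exc:  # also covers ModuleNotFoundError
            # Missing modules/imports are still logged at DEBUG level.
            logger.debug("Could not import %s: %s", name, exc)
            import_failures.append(name)
        except Exception as exc:
            # Any other failure is surfaced immediately as a warning.
            logger.warning("Failed to load %s: %s", name, exc)
    if import_failures:
        # One summary warning so the user knows something was skipped
        # and how to find out why.
        logger.warning(
            "%d task module(s) failed to import; enable DEBUG logging for details.",
            len(import_failures),
        )
    return loaded, import_failures
```

Calling `load_task_modules(["json", "no_such_module"])` would return the `json` module in the first list and the unknown name in the second, emitting a single summary warning.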

27 Dec, 2023 1 commit

- nits + fix siqa (#1216) · 6a1c19ed
  Baber Abbasi authored Dec 27, 2023

* fix group

* siqa: default.yml -> default.yaml

* max_gen_toks -> self.max_gen_toks

* add ids to task tests

* fix siqa

* fix gen_kwargs for openai-chat

6a1c19ed

24 Dec, 2023 1 commit

- Add remove_whitespace to FLD benchmark (#1206) · 8ffbe58a
  MorishT authored Dec 24, 2023

* Add remove_whitespace to FLD benchmark

* bump task version

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

8ffbe58a

21 Dec, 2023 1 commit

- Correctly Print Task Versioning (#1173) · 9cd79897
  Hailey Schoelkopf authored Dec 21, 2023

* change version field formatting in metadata

* mention versioning in new task guide

* add instructions for changelog

* run linters

9cd79897

20 Dec, 2023 2 commits

- Error in --num_fewshot option for K-MMLU Evaluation Harness (#1178) · 12f2c5ea
  GUIJIN SON authored Dec 21, 2023
```
* update kmmlu default formatting

* Update _default_kmmlu_yaml

* Delete lm_eval/tasks/kmmlu/utils.py
```
  12f2c5ea

- Switch Linting to `ruff` (#1166) · 65b8761d
  Baber Abbasi authored Dec 20, 2023

* add ruff and isort. remove black and flake8

* remove unnecessary dependencies

* remove dependency from table

* change order

* ran ruff

* check 3.9

* exclude evaluator

* update CI workflow

* use ruff config in pyproject.toml

* test

* add isort rules to ruff

* sort imports

* import `make_table`

* try stages for no-commit-to-branch

* turn on mypy for pre-commit

* test

* test

* test

* change no-commit-to-branch to default

* nits

* fixed dependency

65b8761d

19 Dec, 2023 1 commit

- Fix Column Naming and Dataset Naming Conventions in K-MMLU Evaluation (#1171) · 9e03d9d0
  seungduk.kim.2304 authored Dec 20, 2023

* Correct column names and dataset names

* Remove kmmlu_general_physics.yaml and kmmlu_korean_language.yaml

* Update _default_kmmlu_yaml

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

9e03d9d0

18 Dec, 2023 1 commit
- bugfix (#1150) · 6f9630c8
  Baber Abbasi authored Dec 18, 2023
  
  6f9630c8
17 Dec, 2023 1 commit

- [WIP] Add IFEval / Instruction-Following Eval (#1087) · aa61f940
  Wis Kojohnjaratkul authored Dec 17, 2023

* Add IFEval task

* Check and download nltk punkt if not already downloaded

* Update max_gen_toks to 2048 to support "900 words+" instructions

* Resolve pre-commit linting issues

* Reduce max_gen_toks to 1280 to conserve token usage

* Add warning message in `process_results` call for non chat-finetuned models

aa61f940

15 Dec, 2023 1 commit

- Add benchmark FLD (#1122) · 755bf6e8
  MorishT authored Dec 15, 2023

* [fix] loading dataset from hub fails when the dataset name includes '.', as the program assumes it is on the local filesystem

* add FLD benchmark

* Update task.py

* [update] add group 'fld'

* [update] rename fld -> fld_default. add explanation to the readme

* Update README.md

---------
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>

755bf6e8