Commits · 56a4e7943fca2959aff06bb96a3fe5ec63255aaa · gaoqiong / lm-evaluation-harness

15 Jul, 2024 1 commit
- formatting (#2104) · 56a4e794
  Lintang Sutawika authored Jul 15, 2024
  
12 Jul, 2024 1 commit

Irokobench: Benchmark Dataset for African languages (#2042) · 383bbd54

Jess authored Jul 12, 2024



* add afrixnli to task

* add chat completion

* remove chat completion -untested

* afrimmlu added

* afrimmlu folder update

* afrimmlu folder update

* updated prompt

* remove print

* add afrimgsm -direct

* add squad metric

* fix bash script

* remove direct util, update common yaml

* remove print

* add few shot. metric fixes

* fix direct path, add bash script for gpt models

* added translate test

* update afrixnli tasks

* update afrixnli tasks

* update metrics for afrixnli

* prompt translations fix

* prompt translations fix

* filter and metric fix -mgsm

* remove squad metric

* remove squad metric

* add f1 score to mgsm

* add f1 score to mgsm

* update native-direct with lin

* change f1 function

* add lin to utils

* add utils

* remove test limit

* remove test configs

* add swahili to mmlu

* change eng to ewe in ewe yaml mmlu

* add squad metric to mgsm, remove whitespace filter

* added translate test

* added afrixnli_translate

* fix exact match valueError

* fix exact match valueError

* restructure mmlu folder

* spacing

* remove afrimmlu_translate folder

* add utility

* format task name, clean ups

* modified mgsm

* update on afrimgsm

* update on afrimgsm

* removed utils

* other mgsm varieties

* other mgsm varieties

* adding translate direct

* Update translate_direct_yaml

* add manual xnli prompt, add multichoice for openai models, and adapt multichoice metric for openai model

* edit for open models

* Update translate_direct_yaml

* add verbalizer for xnli

* change xnli from multiple choice to generate

* add manual accuracy scores

* revert xnli to multiple choice

* change afrimgsm utils

* revert xnli to multiple_choice

* cleanups and readmes

* remove openai fixes and unused regex

* pr review changes

* revert metrics.py, task.py and extraction.py to main version

---------
Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de>
Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>


01 Jul, 2024 1 commit
- ship with exact_match function already used; don't call evaluate.load() on import (#2045) · a8ac0446
  Hailey Schoelkopf authored Jul 01, 2024
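
One generic way to keep a slow loader from running at import time is to defer it to the first call and cache the result. The sketch below illustrates that pattern only; it is not the code from this commit, which ships a local `exact_match` instead. `_expensive_load`, `get_metric`, and `LOAD_CALLS` are hypothetical names, and `_expensive_load` merely stands in for a call such as `evaluate.load()`.

```python
import functools

# Records every simulated load so the deferral is observable (illustration only).
LOAD_CALLS = []


def _expensive_load(name):
    """Hypothetical stand-in for a slow loader such as `evaluate.load(name)`."""
    LOAD_CALLS.append(name)

    def metric(predictions, references):
        # Fraction of predictions that equal their reference exactly.
        hits = sum(p == r for p, r in zip(predictions, references))
        return {name: hits / len(references)}

    return metric


@functools.lru_cache(maxsize=None)
def get_metric(name):
    """Load a metric on first use and reuse the cached object afterwards."""
    return _expensive_load(name)


def exact_match(predictions, references):
    # The loader runs here, at call time, not when this module is imported.
    return get_metric("exact_match")(predictions, references)
```

Importing a module written this way triggers no load; the first call to `exact_match` loads once, and later calls reuse the cached metric.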
  
24 May, 2024 2 commits

Fix for bootstrap_iters = 0 case (#1715) (#1789) · b043b050
Hailey Schoelkopf authored May 24, 2024

* add handling for bootstrap_iters=0 case

* add more detail to docstring

* run precommit

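
As a rough illustration of what handling `bootstrap_iters=0` can look like, the sketch below skips resampling and returns `None` when no iterations are requested. The function name, the `None` return value, and the seed are assumptions made for this example, not the harness's actual behavior.

```python
import random
import statistics


def bootstrap_stderr(values, bootstrap_iters, seed=1234):
    """Estimate the standard error of the mean by bootstrap resampling.

    Returns None when bootstrap_iters is 0 or negative, meaning the
    stderr computation is skipped entirely (an assumption of this sketch).
    """
    if bootstrap_iters <= 0:
        return None

    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(bootstrap_iters):
        # Draw n items with replacement and record the resample mean.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.mean(resample))

    # Spread of the resample means approximates the standard error.
    return statistics.pstdev(means)
```

Callers then need to treat a `None` stderr as "not computed" rather than as zero.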

Fix Brier Score (#1847) · 7d747ea9

Lintang Sutawika authored May 25, 2024

`gold_one_hot` needs to match the dimensions of the predictions so that it still works when `--limit` is used and the indices present in gold do not cover every class index.
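
The issue described here can be sketched in plain Python: the one-hot width is taken from the prediction rows rather than from the labels that happen to appear in gold. This is a minimal illustration under that assumption, not the code from the commit; the function name and argument layout are hypothetical.

```python
def brier_score(gold, probs):
    """Multi-class Brier score averaged over documents.

    gold:  list of integer class labels, one per document
    probs: list of per-class probability rows, one per document
    """
    # Take the class count from the predictions, so a subset of documents
    # whose labels skip some classes still yields correctly sized one-hots.
    num_classes = len(probs[0])
    total = 0.0
    for label, row in zip(gold, probs):
        one_hot = [1.0 if k == label else 0.0 for k in range(num_classes)]
        total += sum((p - t) ** 2 for p, t in zip(row, one_hot))
    return total / len(probs)
```

With three classes and gold labels `[0, 1]`, sizing the one-hot from the largest gold label would give two columns instead of three; sizing it from `probs` avoids that mismatch.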


26 Feb, 2024 1 commit

Cont metrics (#1475) · 96d185fa

Lintang Sutawika authored Feb 26, 2024



* add brier_score

* process brier_score

* brier score is working for N-sized class

* fixed brier score

* add TED to BigBench and Brier score to MMLU

* format

* Update metrics.py

* Update task.py

* Update generate_until_template_yaml

* Delete lm_eval/tasks/bigbench/aux_metric.py

* Update generate_until_template_yaml

* Update _default_template_yaml

* Update _generate_configs.py

* Update _generate_configs.py

* Update _generate_configs.py

* fix (format?)

* format?

* format, once more

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>


20 Feb, 2024 1 commit

Group reqs by context (#1425) · 45941c67

Baber Abbasi authored Feb 20, 2024



* add key lookup for same contexts

* nit

* appease pre-commit

* nit

* use `expand` (in-place view) rather than `repeat`

* try mixed grouping

* add docs.

* nit

* nit

* nits

* fix tests

* Move greedy_tokens calculation out of cache loop

* nit

* nits

* add test

* nits

* fix name conflict

* fix name conflict

* chunk tensor

* move Collator

* nits/docstring

* fixup

* fixup

* group contexts only for decoders

* pre-commit

* fix `generate_until` test

* fix `generate_until` test

* Update lm_eval/models/huggingface.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* add docs

* nit

* add docs

* add docs

* add 'logits_cache' arg

* bugfix

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
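
The grouping idea in this commit's title can be sketched as a key lookup on the shared context. The example shows only the grouping step, with requests assumed to be `(context, continuation)` string pairs; how the harness represents requests or reuses model outputs is not shown here.

```python
from collections import defaultdict


def group_by_context(requests):
    """Group (context, continuation) pairs that share the same context."""
    groups = defaultdict(list)
    for context, continuation in requests:
        # Requests with an identical context land under the same key.
        groups[context].append(continuation)
    return dict(groups)
```

Once grouped, each distinct context needs to be handled once, and every continuation in its group can be scored against that single result.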


13 Feb, 2024 1 commit
- Fix: task weighting by subtask size; update Pooled Stderr formula slightly (#1427) · 620d6a15
  Hailey Schoelkopf authored Feb 13, 2024

  * fix weight_by_size condition

  * add tests, update stderr formula slightly

  * apply pre-commit

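
Weighting a group score by subtask size can be sketched as a size-weighted mean, with a plain mean as the fallback. The function name and the `weight_by_size` parameter mirror the wording of the commit message but are otherwise assumptions of this example.

```python
def aggregate_scores(scores, sizes, weight_by_size=True):
    """Combine per-subtask scores into one group score."""
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")
    if not weight_by_size:
        # Every subtask counts equally, regardless of how many documents it has.
        return sum(scores) / len(scores)
    # Each subtask contributes in proportion to its number of documents.
    return sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)
```

For scores `[0.5, 1.0]` on subtasks of 300 and 100 documents, the weighted result is 0.625, while the unweighted mean is 0.75.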
06 Feb, 2024 1 commit

Use Pooled rather than Combined Variance for calculating stderr of task groupings (#1390) · 94cc1850

Hailey Schoelkopf authored Feb 06, 2024

* update formula for stderr aggregation

* hack: see what happens when using stderr_for_metric bootstrapping on a group

* undo bootstrap_for_stderr test

* factor out variance-aggregation formulas into api.metrics

* fix failing tests

* remove stray print

* update comment

* further detail in comment

* add back initialize_tasks() call

* fix format
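
For reference, the textbook pooled-variance formula can be applied to per-subtask standard errors as in the sketch below. It assumes each stderr is the sample standard deviation divided by the square root of the subtask size; the function name is hypothetical, and the exact formula used by the harness may differ.

```python
import math


def pooled_stderr(stderrs, sizes):
    """Standard error of a group mean from per-subtask stderrs via pooled variance."""
    if len(stderrs) != len(sizes):
        raise ValueError("stderrs and sizes must have the same length")
    # Recover each subtask's sample variance: s_i^2 = stderr_i^2 * n_i.
    variances = [(se ** 2) * n for se, n in zip(stderrs, sizes)]
    # Pooled variance: sum((n_i - 1) * s_i^2) / (sum(n_i) - k).
    pooled_var = sum((n - 1) * v for v, n in zip(variances, sizes)) / (
        sum(sizes) - len(sizes)
    )
    # Standard error of the mean over all documents in the group.
    return math.sqrt(pooled_var / sum(sizes))
```

Pooled variance is a weighted average of the within-subtask variances, so it does not account for differences between subtask means.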


31 Jan, 2024 1 commit

add bypass metric (#1156) · f8203de1

Baber Abbasi authored Feb 01, 2024

* add bypass metric

* fixed `bypass` metric.

* add task attributes if predict_only

* add `predict_only` checks

* add docs

* added `override_metric`, `override_config` to `Task`

* nits

* nit

* changed --predict_only to generations; nits

* nits

* nits

* change gen_kwargs warning

* add note about `--predict_only` in README.md

* added `predict_only`

* move table to bottom

* nit

* change null aggregation to bypass (conflict)

* bugfix; default `temp=0.0`

* typo


20 Dec, 2023 1 commit

Switch Linting to `ruff` (#1166) · 65b8761d

Baber Abbasi authored Dec 20, 2023

* add ruff and isort. remove black and flake8

* remove unnecessary dependencies

* remove dependency from table

* change order

* ran ruff

* check 3.9

* exclude evaluator

* update CI workflow

* use ruff config in pyproject.toml

* test

* add isort rules to ruff

* sort imports

* import `make_table`

* try stages for no-commit-to-branch

* turn on mypy for pre-commit

* test

* test

* test

* change no-commit-to-branch to default

* nits

* fixed dependency


02 Nov, 2023 2 commits
- precommit format · 73f3029c
  lintangsutawika authored Nov 02, 2023
  
- eval_logger is not imported from logger.py anymore · f701ba7d
  lintangsutawika authored Nov 02, 2023
  
19 Oct, 2023 3 commits
- don't call evaluate.load() every time · ff148cd8
  haileyschoelkopf authored Oct 19, 2023
  
- fixed registered metric · 8d4d1fa9
  lintangsutawika authored Oct 19, 2023
  
- fix issue with default metrics and aggregation functions · 90d818da
  lintangsutawika authored Oct 19, 2023
  
18 Oct, 2023 1 commit
- replace the rest of the 'greedy_until' occurrences · c28d100d
  haileyschoelkopf authored Oct 18, 2023
  
25 Aug, 2023 1 commit

Add suggestions from autotyping · fc69d84f

Ethan Smith authored Aug 25, 2023

This adds a bunch of simple annotations suggested by https://github.com/JelleZijlstra/autotyping.


14 Aug, 2023 5 commits
- Removed Implementation from metrics.py · 60763408
  Aflah authored Aug 03, 2023
  
- Improved DocString · 57bcb4d6
  Aflah authored Aug 02, 2023
  
- CMD Line Works · b221531a
  Aflah authored Aug 02, 2023
  
- Added Metric and Seems to Run but some Errors · d857e539
  Aflah authored Aug 02, 2023
  
- Base Template · 94ccc429
  Aflah authored Aug 02, 2023
  
12 Aug, 2023 1 commit
- make chrf and ter aggregations · 116c540a
  haileyschoelkopf authored Aug 12, 2023
  
11 Aug, 2023 1 commit
- support bleu score as a metric · 8806eff5
  haileyschoelkopf authored Aug 11, 2023
  
03 Aug, 2023 1 commit
- Removed Implementation from metrics.py · 7b376ae1
  Aflah authored Aug 03, 2023
  
02 Aug, 2023 4 commits
- Improved DocString · c42cb562
  Aflah authored Aug 02, 2023
  
- CMD Line Works · eaa1c766
  Aflah authored Aug 02, 2023
  
- Added Metric and Seems to Run but some Errors · 62b8a6ce
  Aflah authored Aug 02, 2023
  
- Base Template · 6cb8169c
  Aflah authored Aug 02, 2023
  
06 Jul, 2023 1 commit
- bugfixes + add write_out · 4e5a328e
  haileyschoelkopf authored Jul 06, 2023
  
15 Jun, 2023 1 commit
- minor fixes to satisfy pre-commit · 400c0199
  lintangsutawika authored Jun 15, 2023
  
13 Jun, 2023 1 commit
- fix matthews corr. metric · baff2568
  haileyschoelkopf authored Jun 13, 2023
  
12 Jun, 2023 1 commit

[Refactor] [WIP] New YAML advanced docs (#567) · 79b972d6

Hailey Schoelkopf authored Jun 12, 2023



* add wip gsm8k yaml

* cleanup tasks dir

* push gsm8k yaml changes

* rename gpt2.py

* add updated gsm8k , triviaqa baseline

* add new cot yaml

* allow for multiple filter pipelines, new filter types

* updated gsm8k + sampling gen configs

* cleanup self-consistency yaml

* push outline for advanced docs

* push docs checklist

* switch to inheritance for many tasks

* acc_norm and acc_mutual_info fixed

* fix missing newline in error msg

* remove many .py tasks

* updated GSM8k

* added more doc

* Update advanced_task_guide.md

Added list of parameters

* Update advanced_task_guide.md

* Added details on listing metrics

* Update advanced_task_guide.md

* Added more explanation

* modify current default filter name

* add new tags to tasks

* remove a lingering print()

* add rest of param docs, cleanup deprecated fields

* push docs update

* move ALL_TASKS definition location

* confirm write_out.py works if no description dict passed

---------
Co-authored-by: lintangsutawika <lintang@sutawika.com>


07 Jun, 2023 2 commits
- fixed register_model origin and other imports · 424a4280
  lintangsutawika authored Jun 07, 2023
  
- moved METRIC_REGISTRY to registry.py · 9e75a84d
  lintangsutawika authored Jun 07, 2023
  
06 Jun, 2023 1 commit
- metrics are now in a special folder so that registry can work better · b3591562
  lintangsutawika authored Jun 06, 2023
  
19 May, 2023 1 commit
- pre-commit stuff · c4c20ff5
  lintangsutawika authored May 19, 2023
  
10 May, 2023 1 commit
- moved METRICS related dicts · 35730ace
  lintangsutawika authored May 10, 2023
  
08 May, 2023 1 commit
- add mutual info, code cleanup · 95642aa6
  haileyschoelkopf authored May 08, 2023
  