Commits · d83f7eb0c51ea1f4e5aee97df93a1124a0449c29 · gaoqiong / lm-evaluation-harness

21 Jul, 2025 1 commit
- add type hints · d83f7eb0
  Baber authored Jul 21, 2025
  
14 Jul, 2025 1 commit
- Fix for hang due to mp.Pool in bootstrap_stderr (#3135) · cf631de0
  Ankit Gola authored Jul 14, 2025
  
04 Jul, 2025 1 commit

[FIX] Initial code to disable multi-proc for stderr (#3106) · 71d0289d

Neel Gupta authored Jul 04, 2025



* [FIX] Initial code to disable multi-proc for stderr

* add docs; align no-mp bootstrap with mp

---------
Co-authored-by: Baber <baber@hey.com>

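For reference, a bootstrap standard error can be computed in a single process, with no worker pool at all. The sketch below is illustrative only and is not the harness's implementation: the function name `bootstrap_stderr_single_process`, its parameters, and the default seed are assumptions made for this example.

```python
import random
import statistics


def bootstrap_stderr_single_process(metric, values, iters, seed=1234):
    """Estimate the standard error of `metric` by resampling in one process.

    Hypothetical helper for illustration; `iters` must be at least 2.
    """
    rng = random.Random(seed)  # fixed seed keeps repeated runs comparable
    n = len(values)
    # Each resample draws n items with replacement and re-evaluates the metric.
    estimates = [metric(rng.choices(values, k=n)) for _ in range(iters)]
    # The spread of the resampled estimates approximates the standard error.
    return statistics.stdev(estimates)


scores = [0.0, 1.0] * 50
stderr = bootstrap_stderr_single_process(statistics.mean, scores, iters=200)
```

A plain loop like this avoids process start-up and inter-process communication, at the cost of running the resamples sequentially.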

30 Jun, 2025 2 commits
- update type hints · 3ba4e897
  Baber authored Jul 01, 2025
  
- add MetricConfig · 1b5c6f88
  Baber authored Jun 30, 2025
  
11 Mar, 2025 1 commit

New healthcare benchmark: careqa (#2714) · 7c9fbcf8

PabloAgustin authored Mar 11, 2025



* New healthcare benchmark: careqa

* LAUNCH_MN5_ACC <python main.py --config config/mn5.yml --models Llama-3.2-1B-Instruct --tasks careqa_open --num_fewshot 0>

* Add fixes, READMES, and remove task_list.txt

* pre-commit passed, add formatting updates; add nanmean agg_metric

* Fix import error.

* Wrapped imports in try excepts

* Wrapped imports in try excepts; also metrics to catch bert_score import error

* Try except to catch ImportErrors as well

* use np.nan

* pre-commit

---------
Co-authored-by: PabloAgustin <pablo.martin@bsc.es>
Co-authored-by: Baber <baber@hey.com>


21 Feb, 2025 1 commit

Logging (#2203) · 1ba35e62

Lintang Sutawika authored Feb 20, 2025



* changed source of eval_logger

* allow eval_logger to be set from args

* removed verbosity arg from non-main methods

* fix logging

* pre-commit

* set verbosity in eval logger

* replace utils.eval_logger

* fix logging in main

* add logging to docs

* add logging message

* nit

* add logging to docs

* refactor setup_logging to utils

---------
Co-authored-by: Baber <baber@hey.com>


19 Jan, 2025 1 commit
- update pre-commit (#2632) · f724be69
  Baber Abbasi authored Jan 19, 2025
  * update pre-commit
01 Aug, 2024 1 commit

refactor: limit usage of `scipy` and `sklearn` dependencies (#2097) · 7f15cce4

Nathan Weinberg authored Aug 01, 2024



* refactor: move scipy and sklearn module imports to func imports
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

* refactor: consolidate weighted_f1_score func into lm_eval utils
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

* lint: allow for utils file to have unused imports

this allows for shared functions to be defined only
once while allowing for the YAML function importing
to continue working
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

---------
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

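Moving a module-level import into the function that needs it means the dependency is only required when that function is actually called. A minimal sketch of the pattern follows; the function body is an assumption for illustration and is not copied from the harness, although `sklearn.metrics.f1_score` with `average="weighted"` is a real scikit-learn call.

```python
def weighted_f1_score(golds, preds):
    """Weighted F1 over paired gold labels and predictions (illustrative body)."""
    # Imported here rather than at module top, so importing this module
    # does not require scikit-learn to be installed.
    from sklearn.metrics import f1_score

    return f1_score(golds, preds, average="weighted")
```

Defining the function succeeds even without scikit-learn installed; a missing package surfaces as an `ImportError` only when the function is called.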

15 Jul, 2024 1 commit
- formatting (#2104) · 56a4e794
  Lintang Sutawika authored Jul 15, 2024
  
12 Jul, 2024 1 commit

Irokobench: Benchmark Dataset for African languages (#2042) · 383bbd54

Jess authored Jul 12, 2024



* add afrixnli to task

* add chat completion

* remove chat completion -untested

* afrimmlu added

* afrimmlu folder update

* afrimmlu folder update

* updated prompt

* remove print

* add afrimgsm -direct

* add squad metric

* fix bash script

* remove direct util, update common yaml

* remove print

* add few shot; metric fixes

* fix direct path, add bash script for gpt models

* added translate test

* update afrixnli tasks

* update afrixnli tasks

* update metrics for afrixnli

* prompt translations fix

* prompt translations fix

* filter and metric fix -mgsm

* remove squad metric

* remove squad metric

* add f1 score to mgsm

* add f1 score to mgsm

* update native-direct with lin

* change f1 function

* add lin to utils

* add utils

* remove test limit

* remove test configs

* add swahili to mmlu

* change eng to ewe in ewe yaml mmlu

* add squad metric to mgsm, remove whitespace filter

* added translate test

* added afrixnli_translate

* fix exact match valueError

* fix exact match valueError

* restructure mmlu folder

* spacing

* remove afrimmlu_translate folder

* add utility

* format task name, clean ups

* modified mgsm

* update on afrimgsm

* update on afrimgsm

* removed utils

* other mgsm varieties

* other mgsm varieties

* adding translate direct

* Update translate_direct_yaml

* add manual xnli prompt, add multichoice for openai models, and adapt multichoice metric for openai model

* edit for open models

* Update translate_direct_yaml

* add verbalizer for xnli

* change xnli from multiple choice to generate

* add manual accuracy scores

* revert xnli to multiple choice

* change afrimgsm utils

* revert xnli to multiple_choice

* cleanups and readmes

* remove openai fixes and unused regex

* pr review changes

* revert metrics.py, task.py and extraction.py to main version

---------
Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de>
Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>


01 Jul, 2024 1 commit
- ship with exact_match function already used ; don't call evaluate.load() on import (#2045) · a8ac0446
  Hailey Schoelkopf authored Jul 01, 2024
  
24 May, 2024 2 commits

Fix for bootstrap_iters = 0 case (#1715) (#1789) · b043b050
Hailey Schoelkopf authored May 24, 2024

* add handling for bootstrap_iters=0 case

* add more detail to docstring

* run precommit


Fix Brier Score (#1847) · 7d747ea9

Lintang Sutawika authored May 25, 2024

`gold_one_hot` needs to match the dimensions of the predictions so that the metric still works when `--limit` is used and the gold indices in the evaluated subset do not cover every class index.

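The dimension issue described in this fix can be illustrated with a small sketch. It assumes a multi-class Brier score defined as the mean squared distance between each probability vector and the one-hot gold vector; the function name and signature are hypothetical and not taken from the harness.

```python
import numpy as np


def brier_score(gold, predictions):
    """Multi-class Brier score (illustrative sketch, not the harness code)."""
    gold = np.asarray(gold, dtype=int)
    predictions = np.asarray(predictions, dtype=float)
    num_docs, num_classes = predictions.shape
    # Size the one-hot matrix from the predictions, not from max(gold) + 1,
    # so classes absent from a limited sample still get a column.
    gold_one_hot = np.zeros((num_docs, num_classes))
    gold_one_hot[np.arange(num_docs), gold] = 1.0
    return float(np.mean(np.sum((predictions - gold_one_hot) ** 2, axis=1)))


# Only class 0 appears in gold, yet the predictions cover three classes.
score = brier_score([0, 0], [[0.5, 0.5, 0.0], [1.0, 0.0, 0.0]])
```

Had the one-hot matrix been sized from the gold labels alone, it would have had a single column here and could not be subtracted from the three-column predictions.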

26 Feb, 2024 1 commit

Cont metrics (#1475) · 96d185fa

Lintang Sutawika authored Feb 26, 2024



* add brier_score

* process brier_score

* brier score is working for N-sized class

* fixed brier score

* add TED to BigBench and Brier score to MMLU

* format

* Update metrics.py

* Update task.py

* Update generate_until_template_yaml

* Delete lm_eval/tasks/bigbench/aux_metric.py

* Update generate_until_template_yaml

* Update _default_template_yaml

* Update _generate_configs.py

* Update _generate_configs.py

* Update _generate_configs.py

* fix (format?)

* format?

* format, once more

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>


20 Feb, 2024 1 commit

Group reqs by context (#1425) · 45941c67

Baber Abbasi authored Feb 20, 2024



* add key lookup for same contexts

* nit

* appease pre-commit

* nit

* use `expand` (in-place view) rather than `repeat`

* try mixed grouping

* add docs.

* nit

* nit

* nits

* fix tests

* Move greedy_tokens calculation out of cache loop

* nit

* nits

* add test

* nits

* fix name conflict

* fix name conflict

* chunk tensor

* move Collator

* nits/docstring

* fixup

* fixup

* group contexts only for decoders

* pre-commit

* fix `generate_until` test

* fix `generate_until` test

* Update lm_eval/models/huggingface.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* add docs

* nit

* add docs

* add docs

* add 'logits_cache' arg

* bugfix

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>


13 Feb, 2024 1 commit
- Fix: task weighting by subtask size ; update Pooled Stderr formula slightly (#1427) · 620d6a15
  Hailey Schoelkopf authored Feb 13, 2024
  * fix weight_by_size condition

  * add tests, update stderr formula slightly

  * apply pre-commit
06 Feb, 2024 1 commit

Use Pooled rather than Combined Variance for calculating stderr of task groupings (#1390) · 94cc1850

Hailey Schoelkopf authored Feb 06, 2024

* update formula for stderr aggregation

* hack: see what happens when using stderr_for_metric bootstrapping on a group

* undo bootstrap_for_stderr test

* factor out variance-aggregation formulas into api.metrics

* fix failing tests

* remove stray print

* update comment

* further detail in comment

* add back initialize_tasks() call

* fix format

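As background for the change above, the textbook pooled variance weights each subtask's sample variance by its degrees of freedom. The sketch below assumes each subtask reports a standard error of the form s_i / sqrt(n_i) and that the group standard error is taken over the combined sample size; the function name and this mapping are assumptions for illustration, not the harness's exact formula.

```python
import math


def pooled_stderr(stderrs, sizes):
    """Standard error for a task group from per-subtask stderrs and sizes.

    Hypothetical helper; every subtask size must be at least 2.
    """
    # Recover each sample variance s_i^2 from stderr_i = s_i / sqrt(n_i).
    variances = [se**2 * n for se, n in zip(stderrs, sizes)]
    # Pooled variance: sum((n_i - 1) * s_i^2) / (sum(n_i) - k) for k subtasks.
    pooled_var = sum((n - 1) * v for v, n in zip(variances, sizes)) / (
        sum(sizes) - len(sizes)
    )
    # Standard error of the group mean over all documents combined.
    return math.sqrt(pooled_var / sum(sizes))


group_stderr = pooled_stderr([0.05, 0.05], [100, 100])
```

Unlike a combined variance, the pooled form ignores differences between subtask means, so it stays meaningful when subtasks have very different average scores.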

31 Jan, 2024 1 commit

add bypass metric (#1156) · f8203de1

Baber Abbasi authored Feb 01, 2024

* add bypass metric

* fixed `bypass` metric.

* add task attributes if predict_only

* add `predict_only` checks

* add docs

* added `override_metric`, `override_config` to `Task`

* nits

* nit

* changed --predict_only to generations; nits

* nits

* nits

* change gen_kwargs warning

* add note about `--predict_only` in README.md

* added `predict_only`

* move table to bottom

* nit

* change null aggregation to bypass (conflict)

* bugfix; default `temp=0.0`

* typo


20 Dec, 2023 1 commit

Switch Linting to `ruff` (#1166) · 65b8761d

Baber Abbasi authored Dec 20, 2023

* add ruff and isort. remove black and flake8

* remove unnecessary dependencies

* remove dependency from table

* change order

* ran ruff

* check 3.9

* exclude evaluator

* update CI workflow

* use ruff config in pyproject.toml

* test

* add isort rules to ruff

* sort imports

* import `make_table`

* try stages for no-commit-to-branch

* turn on mypy for pre-commit

* test

* test

* test

* change no-commit-to-branch to default

* nits

* fixed dependency


02 Nov, 2023 2 commits
- precommit format · 73f3029c
  lintangsutawika authored Nov 02, 2023
  
- eval_logger is not imported from logger.py anymore · f701ba7d
  lintangsutawika authored Nov 02, 2023
  
19 Oct, 2023 3 commits
- don't call evaluate.load() every time · ff148cd8
  haileyschoelkopf authored Oct 19, 2023
  
- fixed registered metric · 8d4d1fa9
  lintangsutawika authored Oct 19, 2023
  
- fix issue with default metrics and aggregation functions · 90d818da
  lintangsutawika authored Oct 19, 2023
  
18 Oct, 2023 1 commit
- replace the rest of the 'greedy_until' occurrences · c28d100d
  haileyschoelkopf authored Oct 18, 2023
  
25 Aug, 2023 1 commit

Add suggestions from autotyping · fc69d84f

Ethan Smith authored Aug 25, 2023

This adds a bunch of simple annotations suggested by https://github.com/JelleZijlstra/autotyping.


14 Aug, 2023 5 commits
- Removed Implementation from metrics.py · 60763408
  Aflah authored Aug 03, 2023
  
- Improved DocString · 57bcb4d6
  Aflah authored Aug 02, 2023
  
- CMD Line Works · b221531a
  Aflah authored Aug 02, 2023
  
- Added Metric and Seems to Run but some Errors · d857e539
  Aflah authored Aug 02, 2023
  
- Base Template · 94ccc429
  Aflah authored Aug 02, 2023
  
12 Aug, 2023 1 commit
- make chrf and ter aggregations · 116c540a
  haileyschoelkopf authored Aug 12, 2023
  
11 Aug, 2023 1 commit
- support bleu score as a metric · 8806eff5
  haileyschoelkopf authored Aug 11, 2023
  
03 Aug, 2023 1 commit
- Removed Implementation from metrics.py · 7b376ae1
  Aflah authored Aug 03, 2023
  
02 Aug, 2023 4 commits
- Improved DocString · c42cb562
  Aflah authored Aug 02, 2023
  
- CMD Line Works · eaa1c766
  Aflah authored Aug 02, 2023
  
- Added Metric and Seems to Run but some Errors · 62b8a6ce
  Aflah authored Aug 02, 2023
  
- Base Template · 6cb8169c
  Aflah authored Aug 02, 2023
  
06 Jul, 2023 1 commit
- bugfixes + add write_out · 4e5a328e
  haileyschoelkopf authored Jul 06, 2023
  