Commits · b3131364e7ebb69f2fc3ebdd1ffece54396ad62e · gaoqiong / lm-evaluation-harness

23 Jul, 2025 1 commit
- update scrolls · b3131364
  Baber authored Jul 23, 2025
  
  b3131364
21 Jul, 2025 3 commits
- nq_open: move multi_target to `exact_match` · 08c54c63
  Baber authored Jul 21, 2025
  
  08c54c63
- move multi_target to `exact_match` · 8f924e1c
  Baber authored Jul 21, 2025
  
  8f924e1c
- refactor masakhapos · 66e62de7
  Baber authored Jul 21, 2025
  
  66e62de7
19 Jul, 2025 3 commits
- multiblimp (#3162) · 4366fc82
  Baber Abbasi authored Jul 20, 2025
  
  4366fc82
- Add the MultiBLiMP benchmark (#3155) · de4ce482
  James A. Michaelov authored Jul 19, 2025
```
* add multiblimp

* run linter
```
  de4ce482
- Bugfix: update path for GLUE (#3159) · 8c025369
  Avelina Asada Hadji-Kyriacou authored Jul 19, 2025
```
* Update default.yaml
```
  8c025369
18 Jul, 2025 1 commit
- Fix medical benchmarks import (#3151) · 489fbc21
  Idan Tene authored Jul 18, 2025
```
* Update utils.py
```
  489fbc21
16 Jul, 2025 1 commit

`bbh_cot_fewshot`: Removed repeated "Let's think step by step." text from bbh cot prompts (#3140) · c2be7211

philipdoldo authored Jul 16, 2025



* Removed the 'Let's think step by step.' text from the start of the target entry in each sample, so that the phrase is no longer repeated in the few-shot prompts and the behavior matches the original bbh repository. This applied to 26 of the 27 subtasks; the only exception is boolean_expressions.yaml. In boolean_expressions.yaml there is, in my opinion, an error: the prompt does not include the 'Remember that (i) ...' text after the final 'A: Let's think step by step.'. Models like EleutherAI/gpt-neo-125m seem to always begin answers with this string anyway (copying the few-shot prompts), but I think it should have been part of the prompt, much like 'A: Let's think step by step.' is included in the prompt for all of the cot tasks. The original bbh repo has the same issue, so keeping it this way is fine for consistency, but I thought I'd point it out.

* feat: remove extra space from answers; add changelog

---------
Co-authored-by: Baber <baber@hey.com>

c2be7211
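
The duplicated-phrase problem described in this commit can be illustrated with a minimal sketch. It assumes the prompt template already ends each answer slot with "A: Let's think step by step." and that the few-shot targets started with the same sentence. The helper names `strip_cot_prefix` and `build_fewshot_block` are hypothetical and are not taken from the repository.

```python
PREFIX = "Let's think step by step."


def strip_cot_prefix(target: str, prefix: str = PREFIX) -> str:
    """Drop one leading copy of `prefix` (plus following whitespace) from a target."""
    if target.startswith(prefix):
        return target[len(prefix):].lstrip()
    return target


def build_fewshot_block(question: str, target: str) -> str:
    """Render one few-shot example; the template itself supplies the CoT phrase."""
    return f"Q: {question}\nA: {PREFIX} {strip_cot_prefix(target)}"


# Hypothetical sample whose target still carries the phrase.
block = build_fewshot_block(
    "Is 3 greater than 2?",
    "Let's think step by step. 3 is greater than 2. So the answer is yes.",
)
```

With the prefix stripped from the target, the phrase appears once per example instead of twice.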

14 Jul, 2025 1 commit

Adding EgyMMLU and EgyHellaSwag (#3063) · 2ea6114e

Atou Houdaifa authored Jul 14, 2025

* add egy mmlu hellaswag

* add egymmlu egyhellaswag to tasks readme

* fix egymmlu config generation

* fix _generate_configs formatting

2ea6114e

10 Jul, 2025 1 commit

warning for "chat" pretrained; disable buggy evalita configs (#3127) · f3a0b554

Baber Abbasi authored Jul 10, 2025

* check for chat for warning

* add test

* remove yaml extension from some evalita configs

* move unitxt to own test script

* fix CI test

f3a0b554

03 Jul, 2025 2 commits

Humaneval - fix regression (#3102) · 8c1016cb
Baber Abbasi authored Jul 03, 2025
```
* use double quotes
```
8c1016cb

Truthfulqa multi harness (#3062) · e0dc33ae

Blanca Calvo authored Jul 03, 2025



* truthfulqa-multi task

* truthfulqa-multi with chat few-shot

* few shot chat implementation

* changed until so it outputs lists

* changed dataset location

* added MT task

* Create README.md

* do not include MT

* changes for PR

* tag change

* removed yaml extension

* adding task to the table

* fix task configs

* add import exception

---------
Co-authored-by: Baber <baber@hey.com>

e0dc33ae

30 Jun, 2025 1 commit

FixBug: Align the Humaneval with official results for Llama-3.1-70B-Instruct (#3092) · a7ca0435

jinze authored Jul 01, 2025

* Fix: Align the Humaneval dataset with official results

Details: (1) Modified the "doc_to_text" and "gen_prefix" in the "humaneval_instruct.yaml" file to make them the same as the prompt in "meta-llama/Llama-3.1-70B-Instruct-evals".

(2) Change r.rfind("```") to r.find("```"), so it can locate the first "```", not the last one.

Results: Partially reproduced the official results: The result of LLaMA3.1-8B-Instruct is 66.5 (the official result is 72.6), and the result of LLaMA3.1-70B-Instruct is 80.5 (the official result is 80.5).

Ref: PR#2650

* add changelog and version

* add changelog

a7ca0435
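
The difference between `str.find` and `str.rfind` mentioned in point (2) can be seen in a small sketch. The response string and the `cut_at` helper below are hypothetical illustrations, not the harness's actual post-processing code. The assumption is that the model output starts with the code and may contain additional fenced blocks afterwards.

```python
FENCE = "`" * 3  # a Markdown code fence, built this way to avoid nesting fences


def cut_at(response: str, index: int) -> str:
    """Return the text before `index`, or the whole response if no fence was found (-1)."""
    return response if index == -1 else response[:index]


# Hypothetical model output: the code, a closing fence, then an extra fenced example.
response = (
    "    return a + b\n"
    + FENCE + "\n"
    + "Example:\n"
    + FENCE + "python\n"
    + "print(add(1, 2))\n"
    + FENCE
)

first_cut = cut_at(response, response.find(FENCE))   # stops at the first fence
last_cut = cut_at(response, response.rfind(FENCE))   # stops at the last fence
```

Here `first_cut` holds only the function body, while `last_cut` also keeps the trailing explanation and the second fenced block, which would not be valid code to execute.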

25 Jun, 2025 1 commit
- Ensure backwards compatibility in fewshot_context by using kwargs (#3079) · 532909c0
  Kiersten Stokes authored Jun 25, 2025
```
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
```
  532909c0
20 Jun, 2025 1 commit

llama3 task: update README.md (#3074) · 68c3a811

Anna Fontana authored Jun 20, 2025

"arc_chalenge_chat" doesn't exist: I think it should be "arc_challenge_chat", but this task is not implemented here (see arc task folder).

68c3a811

19 Jun, 2025 2 commits
- Update instructions.py (#3060) · 37357004
  Maxim Evtush authored Jun 19, 2025
  
  37357004
- Update README.md (#3070) · 5a15058e
  Anna Fontana authored Jun 19, 2025
```
Wrong task name: mmlu_generation does not exist; mmlu_generative is the correct one
```
  5a15058e
16 Jun, 2025 2 commits
- fix longbench citation (#3061) · 9fbe48c2
  Baber Abbasi authored Jun 16, 2025
```
* fix longbench citation
```
  9fbe48c2
- Fix Typo in README and Comment in utils_mcq.py (#3057) · e20ef72e
  fuder.eth authored Jun 16, 2025
```
* Update README.md

* Update utils_mcq.py
```
  e20ef72e
12 Jun, 2025 1 commit
- Fallback to super impl in fewshot_context for Unitxt tasks (#3023) · d09e03dd
  Kiersten Stokes authored Jun 12, 2025
```
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
```
  d09e03dd
08 Jun, 2025 1 commit

[longbench] fix metric calculation (#2983) · 147e9d61

Baber Abbasi authored Jun 08, 2025

* use all answers

* use middle truncation

* maybe fix classification score

* strip classification preds

* [vllm] remove stop tokens post-hoc

* strip all preds

* pacify pre-commit

* start on truncation utility

* add to readme

* add a footgun doc

* fix newline in yaml templates

* do not strip code_sim preds!

* fix pre-commit config

* fix instruction warning

* add note to longbench readme

147e9d61
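
"Middle truncation" in the list above refers to shortening an over-long input by dropping tokens from the middle while keeping the beginning and the end. The following is a minimal sketch of that idea on a plain list of token ids. The function name `truncate_middle` and the even head/tail split are assumptions for illustration; the harness's own utility may differ.

```python
def truncate_middle(tokens: list[int], max_len: int) -> list[int]:
    """Keep the first and last parts of `tokens` so the result has at most `max_len` items."""
    if max_len <= 0:
        return []
    if len(tokens) <= max_len:
        return list(tokens)
    head = max_len // 2      # tokens kept from the start
    tail = max_len - head    # tokens kept from the end (at least 1 here)
    return tokens[:head] + tokens[-tail:]


# Ten hypothetical token ids reduced to four: two from each end.
shortened = truncate_middle(list(range(10)), 4)
```

Keeping both ends preserves the instruction at the start of a long prompt and the question at its end, which is the usual motivation for truncating the middle.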

03 Jun, 2025 2 commits

remove prints (#3041) · 9f152e0b
Baber Abbasi authored Jun 03, 2025

9f152e0b

add Mbpp instruct (#2995) · 60e85da5

Baber Abbasi authored Jun 03, 2025

* feat: add mbpp_instruct

* fix: update generation_kwargs to use an empty until list

* fix: correct predictions formatting in pass_at_1 function

* fix: improve code block extraction by checking first without opening backticks

* fix mbpp `pass_at_1`

60e85da5

26 May, 2025 1 commit
- add arab_culture task (#3006) · 8bc4afff
  Boda Sadallah authored May 26, 2025
```
* add arab_culture tasks

* add target_delimiter and remove debugging code
```
  8bc4afff
21 May, 2025 1 commit
- add kbl 2025 (#3000) · 8be417a8
  Hongseok Oh authored May 21, 2025
  
  8be417a8
19 May, 2025 2 commits

[SGLANG] Add the SGLANG generate API (#2997) · 53c65300
Baber Abbasi authored May 19, 2025
```
* add `sglang-generate`

* nit

* nit

* nit

* pacify pre-commit
```
53c65300

Adding ACPBench Hard tasks (#2980) · 0daf28fd

Harsha authored May 19, 2025

* adding ACPBench_hard

* adding Clingo

* changing tarski to tarski[clingo]

* denoting the main variants in each paper

0daf28fd

15 May, 2025 4 commits

fix formatting (#2759) · 0126f6d1
Baber Abbasi authored May 15, 2025

0126f6d1

Update utils.py (#2870) · 2bde99e4
tawsif authored May 15, 2025

2bde99e4

Added C4 Support (#2889) · 86a3b270

Yufeng Xu authored May 15, 2025

* added c4 dataset (working)

* fixed bugs in c4

* fixed loading bugs in c4 dataset; using partial loading

* cleaned the code

* added version number for c4

* removed irrelevant files

86a3b270

AfroBench: How Good are Large Language Models on African Languages? (#2825) · 18297993

Jess authored May 15, 2025



* add afrixnli to task

* add chat completion

* remove chat completion -untested

* afrimmlu added

* afrimmlu folder update

* afrimmlu folder update

* updated prompt

* remove print

* add afrimgsm -direct

* add squad metric

* fix bash script

* remove direct util, update common yaml

* remove print

* add few shot. metric fixes

* fix direct path, add bash script for gpt models

* added translate test

* update afrixnli tasks

* update afrixnli tasks

* update metrics for afrixnli

* prompt translations fix

* prompt translations fix

* filter and metric fix -mgsm

* remove squad metric

* remove squad metric

* add f1 score to mgsm

* add f1 score to mgsm

* update native-direct with lin

* change f1 function

* add lin to utils

* add utils

* remove test limit

* remove test configs

* add swahili to mmlu

* change eng to ewe in ewe yaml mmlu

* add squad metric to mgsm, remove whitespace filter

* added translate test

* added afrixnli_translate

* fix exact match valueError

* fix exact match valueError

* restructure mmlu folder

* spacing

* remove afrimmlu_translate folder

* add utility

* format task name, clean ups

* modified mgsm

* update on afrimgsm

* update on afrimgsm

* removed utils

* other mgsm varieties

* other mgsm varieties

* adding translate direct

* Update translate_direct_yaml

* add manual xnli prompt, add multichoice for openai models, and adapt multichoice metric for openai model

* edit for open models

* Update translate_direct_yaml

* add verbalizer for xnli

* change xnli from multiple choice to generate

* add manual accuracy scores

* revert xnli to multiple choice

* change afrimgsm utils

* revert xnli to multiple_choice

* cleanups and readmes

* remove openai fixes and unused regex

* pr review changes

* revert metrics.py, task.py and extraction.py to main version

* add afrisenti

* utilities

* pulled from main

* add afrixnli

* add afrimmlu

* update afrixnli prompts

* missing senti language

* fix afrisenti prompt 2

* fix afrisenti prompts

* fix afrisenti prompts

* configure task grouping

* add multiple prompts to afrixnli for irokobench

* add multiple prompts to afrimmlu for irokobench

* Update afrixnli_yaml

* fixes and moves

* fixes and moves

* afrimmlu multiple prompts configs

* remove validation set from afrimmlu

* remove eng from afrimmlu translate test

* correct dataset path

* multiple prompts for mgsm

* file restructure

* afribench grouping

* repo restructuring

* repo restructuring

* update exact match to hugging face exact match and add new mgsm language

* remove decontamination

* update generation kwargs

* update generation kwargs for all mgsm prompts

* remove lang

* update generation kwargs for afrimgsm translatetest

* add afrimgsm cot for direct and translate

* remove eng from translate-cot

* add masakhaPOS tasks

* remove changes from task script

* add masakhanews tasks

* add uhura arc easy

* add afriqa and belebele files

* add tags for easier run. add naija rc

* add new metrics and transformation scripts

* fix afriqa swa fewshot split

* add naijarc

* add afrobench lite tasks

* update afrobench

* update afrobench

* remove unverified files to avoid bugs

* remove files not needed

* add afrobench tasks

* add afrobench tasks

* change to version 1

* change to version 1

* update afrobench

* update afrobench

* restore metric to original script

* update readme instructions

* add individual dataset readmes

* add link to collections

* correct run script

* align with main

* align with main

* align with main

* align with main

* align with main

* align with main

* align with main

* align with main

* failed run fixes

* failed run fixes

* add afrimgsm cot

* Apply precommit fixes

* update mafand dataset name

* pull request fixes

* remove afrihate due to availability

---------
Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de>
Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>
Co-authored-by: David Adelani <davlanade@gmail.com>
Co-authored-by: theyorubayesian <akin.o.oladipo@gmail.com>

18297993

13 May, 2025 2 commits

mmlu pro generation_kwargs until Q: -> Question: (#2945) · cf51e699

Yoonsoo Kim authored May 13, 2025



* mmlu pro generation_kwargs until Q: -> Question:

* pacify pre-commit

* change stop token

---------
Co-authored-by: Baber <baber@hey.com>

cf51e699
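
The effect of a stop string in `until` can be sketched as follows. This assumes that generation is cut at the earliest occurrence of any stop string and that the prompts introduce each item with "Question:". The helper `truncate_at_stop` and the sample text are hypothetical, not the harness's implementation.

```python
def truncate_at_stop(text: str, until: list[str]) -> str:
    """Cut `text` at the earliest occurrence of any stop string in `until`."""
    cut = len(text)
    for stop in until:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# Hypothetical generation that runs on into a new question.
generated = "The answer is (B).\n\nQuestion: What is 2 + 2?"

with_old_stop = truncate_at_stop(generated, ["Q:"])          # "Q:" never occurs
with_new_stop = truncate_at_stop(generated, ["Question:"])   # cut before the new item
```

Under these assumptions, "Q:" does not match text delimited by "Question:", so the run-on question is only removed once the stop string matches the prompt format.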

Fix import error for deepcopy (#2969) · 24fc1a47
Kiersten Stokes authored May 13, 2025
```
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
```
24fc1a47

06 May, 2025 2 commits

Fix import error for eval_logger in score utils (#2940) · 817a2fe7

Anna Fontana authored May 06, 2025



* Fix import error for eval_logger in score utils

* pacify pre-commit

---------
Co-authored-by: Baber <baber@hey.com>

817a2fe7

Added NorEval, a novel Norwegian benchmark (#2919) · 71f2954b

Vladislav Mikhailov authored May 06, 2025

* added noreval

* added a checklist for noreval

* run pre-commit

* changed imports and added short noreval description

* fixed norsumm path

* refactored multi-folder tasks

* refactored multi-folder tasks

71f2954b

29 Apr, 2025 1 commit
- use np.NaN (#2937) · fc5019ea
  Baber Abbasi authored Apr 29, 2025
  
  fc5019ea
16 Apr, 2025 3 commits

Longbench bugfix (#2895) · 930d8378

Baber Abbasi authored Apr 17, 2025

* add warning for default until

* fix stop tokens; add vcsum

* bugfix: fix doc_to_target to string

* fix lsht, trec

* add task to readme

* add debugging logs for multiple input/output

930d8378

mmlu - switch dataset to cais/mmlu; fix tests (#2918) · cb316a18
Baber Abbasi authored Apr 16, 2025
```
* switch MMLU to cais/mmlu

* switch back to tj-actions/changed-files

* cache HF folder
```
cb316a18

Fix a typo in README for tasks (#2910) · bb90a90c
Eldar Kurtic authored Apr 16, 2025

bb90a90c