Commits · 5454e95d63d259ece1fc7ccfc7d33f9ec98f9efe · gaoqiong / lm-evaluation-harness

11 Jul, 2025 2 commits
- refactor: use Path · 5454e95d
  Baber authored Jul 11, 2025
  
  5454e95d
- refactor: enhance task loading by including yaml_path parameter · 0a184f46
  Baber authored Jul 11, 2025
  
  0a184f46
10 Jul, 2025 3 commits
- refactor: preserve original task name during config inclusion · 15c01f4d
  Baber authored Jul 11, 2025
  
  15c01f4d
- refactor: simplify task and config validation methods · acc634fa
  Baber authored Jul 11, 2025
  
  acc634fa
- warning for "chat" pretrained; disable buggy evalita configs (#3127) · f3a0b554
  Baber Abbasi authored Jul 10, 2025
```
* check for chat for warning

* add test

* remove yaml extension from some evalita configs

* move unitxt to own test script

* fix CI test
```
  f3a0b554
03 Jul, 2025 2 commits

Humaneval - fix regression (#3102) · 8c1016cb
Baber Abbasi authored Jul 03, 2025
```
* use double quotes
```
8c1016cb

Truthfulqa multi harness (#3062) · e0dc33ae

Blanca Calvo authored Jul 03, 2025

* truthfulqa-multi task

* truthfulqa-multi with chat few-shot

* few shot chat implementation

* changed until so it outputs lists

* changed dataset location

* added MT task

* Create README.md

* do not include MT

* changes for PR

* tag change

* removed yaml extension

* adding task to the table

* fix task configs

* add import exception

---------
Co-authored-by: Baber <baber@hey.com>

e0dc33ae

30 Jun, 2025 1 commit

FixBug: Align the Humaneval with official results for Llama-3.1-70B-Instruct (#3092) · a7ca0435

jinze authored Jul 01, 2025

* Fix: Align the Humaneval dataset with official results

Details: (1) Modified "doc_to_text" and "gen_prefix" in "humaneval_instruct.yaml" to match the prompt in "meta-llama/Llama-3.1-70B-Instruct-evals".

(2) Changed r.rfind("```") to r.find("```") so that it locates the first "```" rather than the last one.

Results: Partially reproduced the official results. LLaMA3.1-8B-Instruct scores 66.5 (official: 72.6), and LLaMA3.1-70B-Instruct scores 80.5 (official: 80.5).

Ref: PR#2650

* add changelog and version

* add changelog

a7ca0435
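
The difference between `rfind` and `find` described in the commit above can be illustrated with a minimal sketch. The helper names and the sample response are hypothetical and are not taken from the harness. The assumption is that the model's completion continues an already-open code block, so the first fence marks where the code ends.

```python
# Three backticks, built programmatically so the example stays readable.
FENCE = "`" * 3


def text_before_first_fence(r: str) -> str:
    """Keep everything before the FIRST fence (the str.find behaviour)."""
    idx = r.find(FENCE)
    return r[:idx] if idx != -1 else r


def text_before_last_fence(r: str) -> str:
    """Keep everything before the LAST fence (the str.rfind behaviour)."""
    idx = r.rfind(FENCE)
    return r[:idx] if idx != -1 else r


# Hypothetical completion: code, a closing fence, then extra prose and a
# second fenced block that should not be treated as part of the answer.
response = (
    "    return a + b\n"
    + FENCE
    + "\nExplanation:\n"
    + FENCE
    + "python\nprint(add(1, 2))\n"
    + FENCE
)

first = text_before_first_fence(response)
last = text_before_last_fence(response)
```

In this sketch `first` contains only the function body, while `last` also drags in the trailing explanation and the start of the second block, which is the kind of contamination that cutting at the first fence avoids.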

25 Jun, 2025 1 commit
- Ensure backwards compatibility in fewshot_context by using kwargs (#3079) · 532909c0
  Kiersten Stokes authored Jun 25, 2025
```
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
```
  532909c0
20 Jun, 2025 1 commit

llama3 task: update README.md (#3074) · 68c3a811

Anna Fontana authored Jun 20, 2025

"arc_chalenge_chat" doesn't exist: I think it should be "arc_challenge_chat", but this task is not implemented here (see arc task folder).

68c3a811

19 Jun, 2025 2 commits
- Update instructions.py (#3060) · 37357004
  Maxim Evtush authored Jun 19, 2025
  
  37357004
- Update README.md (#3070) · 5a15058e
  Anna Fontana authored Jun 19, 2025
```
Wrong task name: mmlu_generation doesn't exist -> mmlu_generative is the correct one
```
  5a15058e
16 Jun, 2025 2 commits
- fix longbench citation (#3061) · 9fbe48c2
  Baber Abbasi authored Jun 16, 2025
```
* fix longbench citation
```
  9fbe48c2
- Fix Typo in README and Comment in utils_mcq.py (#3057) · e20ef72e
  fuder.eth authored Jun 16, 2025
```
* Update README.md

* Update utils_mcq.py
```
  e20ef72e
12 Jun, 2025 1 commit
- Fallback to super impl in fewshot_context for Unitxt tasks (#3023) · d09e03dd
  Kiersten Stokes authored Jun 12, 2025
```
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
```
  d09e03dd
08 Jun, 2025 1 commit

[longbench] fix metric calculation (#2983) · 147e9d61

Baber Abbasi authored Jun 08, 2025

* use all answers

* use middle truncation

* maybe fix classification score

* strip classification preds

* [vllm] remove stop tokens post-hoc

* strip all preds

* pacify pre-commit

* start on truncation utility

* add to readme

* add a footgun doc

* fix newline in yaml templates

* do not strip code_sim preds!

* fix pre-commit config

* fix instruction warning

* add note to longbench readme

147e9d61
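
The "use middle truncation" item above refers to shortening an over-long input by dropping tokens from the middle while keeping the beginning and the end. A minimal sketch of that idea follows; the function name and the even head/tail split are assumptions for illustration, not the harness's actual truncation utility.

```python
def truncate_middle(tokens: list, max_len: int) -> list:
    """Keep the head and tail of `tokens`, dropping the middle, so that the
    result has at most `max_len` items. (Hypothetical helper.)"""
    if max_len < 0:
        raise ValueError("max_len must be non-negative")
    if len(tokens) <= max_len:
        return list(tokens)
    head = (max_len + 1) // 2  # the head gets the extra item when max_len is odd
    tail = max_len - head
    # tokens[-0:] would return the whole list, so guard the tail == 0 case.
    return tokens[:head] + (tokens[-tail:] if tail else [])


truncated = truncate_middle(list(range(10)), 4)
```

Keeping both ends is useful for long-context prompts where the instruction sits at the start and the question at the end, so trimming either side alone could remove essential text.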

03 Jun, 2025 2 commits

remove prints (#3041) · 9f152e0b
Baber Abbasi authored Jun 03, 2025

9f152e0b

add Mbpp instruct (#2995) · 60e85da5

Baber Abbasi authored Jun 03, 2025

* feat: add mbpp_instruct

* fix: update generation_kwargs to use an empty until list

* fix: correct predictions formatting in pass_at_1 function

* fix: improve code block extraction by checking first without opening backticks

* fix mbpp `pass_at_1`

60e85da5
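
One plausible reading of "checking first without opening backticks" in the commit above is that the prompt may already have opened the code fence, so the completion contains only the closing one. The sketch below tries that interpretation first and falls back to matching a full fenced block. The function name and regular expression are assumptions for illustration, not the harness's code.

```python
import re

# Three backticks, built programmatically so the example stays readable.
FENCE = "`" * 3
# A full fenced block: opening fence, optional language tag, body, closing fence.
BLOCK = re.compile(FENCE + r"(?:\w+)?\n?(.*?)\n?" + FENCE, re.DOTALL)


def extract_code(completion: str) -> str:
    """Return the code portion of a model completion. (Hypothetical helper.)"""
    if not completion.lstrip().startswith(FENCE):
        # No opening fence: assume the prompt opened it, so the code runs
        # up to the first closing fence (or to the end if there is none).
        end = completion.find(FENCE)
        code = completion if end == -1 else completion[:end]
        return code.rstrip()
    # Otherwise look for a complete fenced block.
    match = BLOCK.search(completion)
    return match.group(1) if match else ""


continued = extract_code("def f():\n    return 1\n" + FENCE + "\nDone.")
fenced = extract_code(FENCE + "python\ndef f():\n    return 1\n" + FENCE)
```

A limitation of this sketch is that a completion which starts with prose and only later opens a fence is handled by the first branch, which returns the prose rather than the code.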

26 May, 2025 1 commit
- add arab_culture task (#3006) · 8bc4afff
  Boda Sadallah authored May 26, 2025
```
* add arab_culture tasks

* add target_delimiter and remove debugging code
```
  8bc4afff
21 May, 2025 1 commit
- add kbl 2025 (#3000) · 8be417a8
  Hongseok Oh authored May 21, 2025
  
  8be417a8
19 May, 2025 2 commits

[SGLANG] Add the SGLANG generate API (#2997) · 53c65300
Baber Abbasi authored May 19, 2025
```
* add `sglang-generate`

* nit

* nit

* nit

* pacify pre-commit
```
53c65300

Adding ACPBench Hard tasks (#2980) · 0daf28fd

Harsha authored May 19, 2025

* adding ACPBench_hard

* adding Clingo

* changing tarski to tarski[clingo]

* denoting the main variants in each paper

0daf28fd

15 May, 2025 4 commits

fix formatting (#2759) · 0126f6d1
Baber Abbasi authored May 15, 2025

0126f6d1

Update utils.py (#2870) · 2bde99e4
tawsif authored May 15, 2025

2bde99e4

Added C4 Support (#2889) · 86a3b270

Yufeng Xu authored May 15, 2025

* added c4 dataset (working)

* fixed bugs in c4

* fixed loading bugs in c4 dataset; using partial loading

* cleaned the code

* added version number for c4

* removed irrelevant files

86a3b270

AfroBench: How Good are Large Language Models on African Languages? (#2825) · 18297993

Jess authored May 15, 2025

* add afrixnli to task

* add chat completion

* remove chat completion -untested

* afrimmlu added

* afrimmlu folder update

* afrimmlu folder update

* updated prompt

* remove print

* add afrimgsm -direct

* add squad metric

* fix bash script

* remove direct util, update common yaml

* remove print

* add few shot, metric fixes

* fix direct path, add bash script for gpt models

* added translate test

* update afrixnli tasks

* update afrixnli tasks

* update metrics for afrixnli

* prompt translations fix

* prompt translations fix

* filter and metric fix -mgsm

* remove squad metric

* remove squad metric

* add f1 score to mgsm

* add f1 score to mgsm

* update native-direct with lin

* change f1 function

* add lin to utils

* add utils

* remove test limit

* remove test configs

* add swahili to mmlu

* change eng to ewe in ewe yaml mmlu

* add squad metric to mgsm, remove whitespace filter

* added translate test

* added afrixnli_translate

* fix exact match ValueError

* fix exact match ValueError

* restructure mmlu folder

* spacing

* remove afrimmlu_translate folder

* add utility

* format task name, clean ups

* modified mgsm

* update on afrimgsm

* update on afrimgsm

* removed utils

* other mgsm varieties

* other mgsm varieties

* adding translate direct

* Update translate_direct_yaml

* add manual xnli prompt, add multichoice for openai models, and adapt multichoice metric for openai model

* edit for open models

* Update translate_direct_yaml

* add verbalizer for xnli

* change xnli from multiple choice to generate

* add manual accuracy scores

* revert xnli to multiple choice

* change afrimgsm utils

* revert xnli to multiple_choice

* cleanups and readmes

* remove openai fixes and unused regex

* pr review changes

* revert metrics.py, task.py and extraction.py to main version

* add afrisenti

* utilities

* pulled from main

* add afrixnli

* add afrimmlu

* update afrixnli prompts

* missing senti language

* fix afrisenti prompt 2

* fix afrisenti prompts

* fix afrisenti prompts

* configure task grouping

* add multiple prompts to afrixnli for irokobench

* add multiple prompts to afrimmlu for irokobench

* Update afrixnli_yaml

* fixes and moves

* fixes and moves

* afrimmlu multiple prompts configs

* remove validation set from afrimmlu

* remove eng from afrimmlu translate test

* correct dataset path

* multiple prompts for mgsm

* file restructure

* afribench grouping

* repo restructuring

* repo restructuring

* update exact match to hugging face exact match and add new mgsm language

* remove decontamination

* update generation kwargs

* update generation kwargs for all mgsm prompts

* remove lang

* update generation kwargs for afrimgsm translatetest

* add afrimgsm cot for direct and translate

* remove eng from translate-cot

* add masakhaPOS tasks

* remove changes from task script

* add masakhanews tasks

* add uhura arc easy

* add afriqa and belebele files

* add tags for easier run. add naija rc

* add new metrics and transformation scripts

* fix afriqa swa fewshot split

* add naijarc

* add afrobench lite tasks

* update afrobench

* update afrobench

* remove unverified files to avoid bugs

* remove files not needed

* add afrobench tasks

* add afrobench tasks

* change to version 1

* change to version 1

* update afrobench

* update afrobench

* restore metric to original script

* update readme instructions

* add individual dataset readmes

* add link to collections

* correct run script

* align with main

* align with main

* align with main

* align with main

* align with main

* align with main

* align with main

* align with main

* failed run fixes

* failed run fixes

* add afrimgsm cot

* Apply precommit fixes

* update mafand dataset name

* pull request fixes

* remove afrihate due to availability

---------
Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de>
Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>
Co-authored-by: David Adelani <davlanade@gmail.com>
Co-authored-by: theyorubayesian <akin.o.oladipo@gmail.com>

18297993

13 May, 2025 2 commits

mmlu pro generation_kwargs until Q: -> Question: (#2945) · cf51e699

Yoonsoo Kim authored May 13, 2025

* mmlu pro generation_kwargs until Q: -> Question:

* pacify pre-commit

* change stop token

---------
Co-authored-by: Baber <baber@hey.com>

cf51e699

Fix import error for deepcopy (#2969) · 24fc1a47
Kiersten Stokes authored May 13, 2025
```
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
```
24fc1a47

06 May, 2025 2 commits

Fix import error for eval_logger in score utils (#2940) · 817a2fe7

Anna Fontana authored May 06, 2025

* Fix import error for eval_logger in score utils

* pacify pre-commit

---------
Co-authored-by: Baber <baber@hey.com>

817a2fe7

Added NorEval, a novel Norwegian benchmark (#2919) · 71f2954b

Vladislav Mikhailov authored May 06, 2025

* added noreval

* added a checklist for noreval

* run pre-commit

* changed imports and added short noreval description

* fixed norsumm path

* refactored multi-folder tasks

* refactored multi-folder tasks

71f2954b

29 Apr, 2025 1 commit
- use np.NaN (#2937) · fc5019ea
  Baber Abbasi authored Apr 29, 2025
  
  fc5019ea
16 Apr, 2025 3 commits

Longbench bugfix (#2895) · 930d8378

Baber Abbasi authored Apr 17, 2025

* add warning for default until

* fix stop tokens; add vcsum

* bugfix: fix doc_to_target to string

* fix lsht, trec

* add task to readme

* add debugging logs for multiple input/output

930d8378

mmlu - switch dataset to cais/mmlu; fix tests (#2918) · cb316a18
Baber Abbasi authored Apr 16, 2025
```
* switch MMLU to cais/mmlu

* switch back to tj-actions/changed-files

* cache HF folder
```
cb316a18

Fix a typo in README for tasks (#2910) · bb90a90c
Eldar Kurtic authored Apr 16, 2025

bb90a90c

14 Apr, 2025 1 commit
- tasks README: fix dead link (#2899) · a9582804
  Daniele authored Apr 14, 2025
  
  a9582804
04 Apr, 2025 2 commits

Add GSM8K Platinum (#2771) · 11ac352d
Qubitium-ModelCloud authored Apr 04, 2025
```
* add gsm8k platinum

* only test splits

* wrong dataset

* link to blog

* format
```
11ac352d

Optimization for evalita-llm rouge computation (#2878) · 22bd2bcb

Michele Resta authored Apr 04, 2025

* feat: initial commit with templates for evalita evaluation

* fix: change rule for generate_until

* feat: modified yaml to use reduced version of NER test datasets

* feat: added templates to use reduced dataset for summarization (fanpage and ilpost)

* Add Six Prompts for Each Multiple-Choice Task

* fix: fastest eval for summarization

* chore: linted with ruff

* chore: linted with ruff

---------
Co-authored-by: rzanoli <zanoli@fbk.eu>

22bd2bcb

02 Apr, 2025 2 commits

leaderboard - add subtask scores (#2867) · ac0bc1df
Baber Abbasi authored Apr 02, 2025
```
* add subtask scores

* pacify pre-commit
```
ac0bc1df

Add JSONSchemaBench: A Benchmark for Evaluating Structured Output from LLMs (#2865) · 6cc41d34

Saibo-creator authored Apr 02, 2025

* Add JSON schema benchmark

* Update lm_eval/tasks/jsonschema_bench/metrics.py

Thanks for catching this
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

* run pre-commit

* add description to task catalogue readme

---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

6cc41d34

01 Apr, 2025 1 commit

[leaderboard] math - sync with repo (#2817) · 61ee8678

Baber Abbasi authored Apr 01, 2025

* sync with leaderboard

* also output old metric

* wrap old extraction in try except

* better log

61ee8678