Commits · qwen_math · gaoqiong / lm-evaluation-harness

20 Feb, 2025 17 commits
- add math_verify to minerva_math and leaderboard_math · adaa79ec
  Baber authored Feb 20, 2025
  
- nit · 47051bd8
  Baber authored Feb 20, 2025
  
- nit · 55020c98
  Baber authored Feb 20, 2025
  
- nit · 6e7c7897
  Baber authored Feb 20, 2025
  
- add math_verify to pyproject · f8e9fa1d
  Baber authored Feb 20, 2025
  
- add math_verify · 9f3655ac
  Baber authored Feb 20, 2025
  
- add ruler · ccf4a58a
  Baber authored Feb 20, 2025
  
- Merge branch 'main' into longcxt · 527a4352
  Baber authored Feb 20, 2025
```
# Conflicts:
#	lm_eval/tasks/README.md
```
- add ruler · 6042f622
  Baber authored Feb 20, 2025
  
- nit · 75a38212
  Baber authored Feb 20, 2025
  
- nit · 7707b29a
  Baber authored Feb 20, 2025
  
- nit · 77356fb2
  Baber authored Feb 20, 2025
  
- fix metadata · a2bc6240
  Baber authored Feb 20, 2025
  
- add readme · 09509b10
  Baber authored Feb 20, 2025
  
- add metadata to TaskManager · dba96127
  Baber authored Feb 20, 2025
  
- nit · fabd0d90
  Baber authored Feb 20, 2025
  
- get essays from HF · 40f9dc14
  Baber authored Feb 20, 2025
  
17 Feb, 2025 1 commit
- fix vllm (#2708) · 52df63b7
  Baber Abbasi authored Feb 17, 2025
```
* fix vllm

* fix data_parallel

* copy to multimodal
```
14 Feb, 2025 4 commits
- `arithmetic`: set target delimiter to empty string (#2701) · 41b952f3
  Baber Abbasi authored Feb 14, 2025
```
* set target delimiter to empty string

* nit

* add warning
```
- fix `construct_requests` kwargs (#2700) · 5a5acc08
  Baber Abbasi authored Feb 14, 2025
  
- Update README.md (#2694) · 157d8c3c
  Irina Proskurina authored Feb 14, 2025
  
- Update remaining references to assistant_prefill to gen_prefix (#2683) · ef6f5243
  Kiersten Stokes authored Feb 14, 2025
  
13 Feb, 2025 1 commit
- set aggregation and higher_is_better (instead of falling back on defaults) (#2692) · c3c05b06
  James A. Michaelov authored Feb 13, 2025
  
12 Feb, 2025 2 commits
- change ensure_ascii to False for JsonChatStr (#2691) · 96f5e58f
  achervyakov authored Feb 13, 2025
  
- Update unitxt task.py to bring in line with recent repo changes (#2684) · 8751fb35
  Kiersten Stokes authored Feb 12, 2025
  
11 Feb, 2025 2 commits

- Delete lm_eval/tasks/evalita_llm/single_prompt.zip (#2687) · 684fd2dd
  Baber Abbasi authored Feb 11, 2025


- Adding the Evalita-LLM benchmark (#2681) · b7fccef5
  Michele Resta authored Feb 11, 2025

* feat: initial commit with templates for evalita evaluation

* fix: change rule for generate_until

* feat: modified yaml to use reduced version of NER test datasets

* feat: added templates to use reduced dataset for summarization (fanpage and ilpost)

* Add Six Prompts for Each Multiple-Choice Task

* feat: modified fewshot split for textual entailment task

* fix: new doc_to_target function for NER tasks

* Update prompt

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Update prompt

* Add partition for few-shot evaluation

* Rename file

Rename file from _evalita-mp_ner_adg_p1 .yaml to _evalita-mp_ner_adg_p1.yaml

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Enhance lexical substitution management

- Improve scorer calculation for better accuracy
- Update model output postprocessing for clearer results
- Add support for few-shot relation extraction task

* Add F1 macro measure for the document dating task

* Add F1-macro measure to evaluate document dating

* Use the whole dataset

* Small changes

* Add the two prompts for the task of lexical substitution

* Add few-shot split configuration

* Add few-shot split configuration

* Add function for handling few-shot learning setup

* Fix prompt

* Remove configuration file

* Update dataset from test_same to test_cross for evaluations

* Remove whitespace at end of prompt

* Fix configuration error: corrected parameter name for the dataset used in few-shot

* Fix: Check if results is not empty before processing in lexical substitution task

* added the prompts and functions for correct NER and RE execution

* Add accuracy measure

* Add tasks for the EVALITA-LLM benchmark evaluation

* Small changes

Add the alias of the task name that will be printed in the final table results.

* Updated the prompts to reflect changes made to the extended dataset for the Admission Test task

* chore: cleaned templates before PR; feat: add configuration to run generation/ppl tasks.

* fix: add information on Evalita-LLM for PR

* fix: rename folders and files

* fix: remove unused imports

* chore: run pre-commit

* chore: add task description

---------
Co-authored-by: rzanoli <zanoli@fbk.eu>
Co-authored-by: Marco Madeddu <marco.madeddu.bra@gmail.com>


07 Feb, 2025 3 commits

- remove cuda device assertion (#2680) · a40fe42a
  Baber Abbasi authored Feb 07, 2025

- Fix typos (#2679) · 8fe3435a
  omahs authored Feb 07, 2025
```
* fix typo

* fix typos

* fix typos
```

- Turkish mmlu Config Update (#2678) · 53504a96
  Arda authored Feb 07, 2025

* Added TurkishMMLU to LM Evaluation Harness

* Fixed COT name

* Fixed COT name

* Updated Readme

* Fixed Test issues

* Completed Scan for changed tasks

* Updated Readme

* Update README.md

* fixup task naming casing + ensure yaml template stubs aren't registered

* Fix Regex Pattern for CoT experiments

* Fixed multiple choice accuracy

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>


06 Feb, 2025 1 commit
- fix early return for multiple dict (#2673) · 144a1e58
  Baber Abbasi authored Feb 06, 2025
  
31 Jan, 2025 1 commit

- MMLU Pro Plus (#2366) · 0bb8406f
  asgsaeid authored Jan 31, 2025

* mmlu-pro-plus is implemented

* README file is updated

* Update README.md with new task: MMLU Pro Plus

* Update README.md with new task: MMLU Pro Plus

* pre-commit

* nit

---------
Co-authored-by: asgsaeid <asgaris@Saeids-MacBook-Pro.local>
Co-authored-by: Baber <baber@hey.com>


29 Jan, 2025 2 commits

- Add Histoires Morales task (#2662) · 1208afd3
  Irina Proskurina authored Jan 29, 2025

* Add Histoires Morales task

* Histoires Morales task: fix mixed line endings

* Histoires Morales task: fix mixed line endings

* Remove tag for a single task

* Add some MT for Histoires Morales


- remove `group` from bigbench task configs (#2663) · fe9c5707
  Baber Abbasi authored Jan 29, 2025
```
* remove group from task configs

* add tags

* update readme
```

28 Jan, 2025 5 commits

- update pre-commit (#2660) · 4b4b0363
  Baber Abbasi authored Jan 28, 2025
```
* nit

* update pre-commit
```
- Add Aggregation for Kobest Benchmark (#2446) · 94344a61
  Seungwoo Ryu authored Jan 29, 2025
```
Co-authored-by: Baber <baber@hey.com>
```

- fix multiple input chat template (#2576) · 96e499ba
  Baber Abbasi authored Jan 28, 2025

* feat: drop Python 3.8 support

* feat: drop Python 3.8 tests

* pre-commit

* handle chat_template for multiple input


- add TransformerLens example (#2651) · 42f79131
  Nicky Pochinkov authored Jan 28, 2025

* add TransformerLens example

Many people use TransformerLens to do interpretability and interventions on models, and then need to test the model.

Here is a simple script that allows one to pass in the TransformerLens model and run evaluations on it.

* Ran pre-commit checks


- Add Moral Stories (#2653) · a0466f01
  Irina Proskurina authored Jan 28, 2025

* Add moral stories task

* Add moral stories task

* Create README.md

* Update README.md

* Update line endings in moral_stories files


24 Jan, 2025 1 commit
- separate category for `global_mmlu` (#2652) · 5c006ed4
  Minho Ryu authored Jan 25, 2025
```
* separate category

* set version 0.0

* apply precommit
```