Commits · gen · gaoqiong / lm-evaluation-harness

11 Feb, 2025 4 commits
- nit · 59ee4652
  Baber authored Feb 11, 2025
  
- nit · b72515a1
  Baber authored Feb 11, 2025
  
- nit · a4ca4d2c
  Baber authored Feb 11, 2025
  
- refactor multichoiceregex · 81f84653
  Baber authored Feb 11, 2025
  
10 Feb, 2025 5 commits
- add fallback regex to regex filter · f8920c74
  Baber authored Feb 10, 2025
  
- add arc_easy and arc_challenge generation variants · 5707ae55
  Baber authored Feb 10, 2025
  
- add piqa generation · bb70131c
  Baber authored Feb 10, 2025
  
- add winogrande_gen · 0dbf0255
  Baber authored Feb 10, 2025
  
- add openbookqa generative · 799359ad
  Baber authored Feb 10, 2025
  
07 Feb, 2025 3 commits

remove cuda device assertion (#2680) · a40fe42a
Baber Abbasi authored Feb 07, 2025

Fix typos (#2679) · 8fe3435a
omahs authored Feb 07, 2025

* fix typo

* fix typos

* fix typos

Turkish mmlu Config Update (#2678) · 53504a96

Arda authored Feb 07, 2025



* Added TurkishMMLU to LM Evaluation Harness

* Fixed COT name

* Fixed COT name

* Updated Readme

* Fixed Test issues

* Completed Scan for changed tasks

* Updated Readme

* Update README.md

* fixup task naming casing + ensure yaml template stubs aren't registered

* Fix Regex Pattern for CoT experiments

* Fixed multiple choice accuracy

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>


06 Feb, 2025 1 commit
- fix early return for multiple dict (#2673) · 144a1e58
  Baber Abbasi authored Feb 06, 2025
  
31 Jan, 2025 1 commit

MMLU Pro Plus (#2366) · 0bb8406f

asgsaeid authored Jan 31, 2025



* mmlu-pro-plus is implemented

* README file is updated

* Update README.md with new task: MMLU Pro Plus

* Update README.md with new task: MMLU Pro Plus

* pre-commit

* nit

---------
Co-authored-by: asgsaeid <asgaris@Saeids-MacBook-Pro.local>
Co-authored-by: Baber <baber@hey.com>


29 Jan, 2025 2 commits

Add Histoires Morales task (#2662) · 1208afd3

Irina Proskurina authored Jan 29, 2025

* Add Histoires Morales task

* Histoires Morales task: fix mixed line endings

* Histoires Morales task: fix mixed line endings

* Remove tag for a single task

* Add some MT for Histoires Morales


remove `group` from bigbench task configs (#2663) · fe9c5707
Baber Abbasi authored Jan 29, 2025

* remove group from task configs

* add tags

* update readme

28 Jan, 2025 5 commits

update pre-commit (#2660) · 4b4b0363
Baber Abbasi authored Jan 28, 2025

* nit

* update pre-commit

Add Aggregation for Kobest Benchmark (#2446) · 94344a61
Seungwoo Ryu authored Jan 29, 2025

Co-authored-by: Baber <baber@hey.com>

fix multiple input chat template (#2576) · 96e499ba

Baber Abbasi authored Jan 28, 2025

* feat: drop Python 3.8 support

* feat: drop Python 3.8 tests

* pre-commit

* handle chat_template for multiple input


add TransformerLens example (#2651) · 42f79131

Nicky Pochinkov authored Jan 28, 2025

* add TransformerLens example

Many people use TransformerLens to do interpretability and interventions on models, and then need to test the model.

Here is a simple script that allows one to pass in the TransformerLens model and run evaluations on it.

* Ran pre-commit checks


Add Moral Stories (#2653) · a0466f01

Irina Proskurina authored Jan 28, 2025

* Add moral stories task

* Add moral stories task

* Create README.md

* Update README.md

* Update line endings in moral_stories files


24 Jan, 2025 1 commit
- separate category for `global_mmlu` (#2652) · 5c006ed4
  Minho Ryu authored Jan 25, 2025

  * separate category

  * set version 0.0

  * apply precommit
21 Jan, 2025 3 commits
- Fix max_tokens handling in vllm_vlms.py (#2637) · 370e2f9e
  Jan Kaniecki authored Jan 21, 2025

  * Update vllm_vlms.py

  * pre-commit

  ---------
  Co-authored-by: Baber <baber@hey.com>
- aggregate by group (total and categories) (#2643) · b2c090cc
  Minho Ryu authored Jan 22, 2025
  
- revise mbpp prompt (#2645) · ed9c6fc8
  Minho Ryu authored Jan 22, 2025
  
20 Jan, 2025 6 commits

fixed mmlu generative response extraction (#2503) · 12b6eeb5

Ramiro R. C. authored Jan 20, 2025



* fixed mmlu generative response extraction

* updated file version | added args to exact_match

* fix

* fix

* pre-commit

* fix groups

---------
Co-authored-by: Baber <baber@hey.com>


fix tmlu tmlu_taiwan_specific_tasks tag (#2420) · 88144079
nike00811 authored Jan 21, 2025


Update KorMedMCQA: ver 2.0 (#2540) · ff2c49ff

Gyouk Chu authored Jan 21, 2025

* Update KorMedMCQA: ver 2.0

* Fix pre-commit formatting issues

* Update KorMedMCQA v2.0

* pre-commit


apply precommit (#2636) · 3a4e4674
Minho Ryu authored Jan 21, 2025


New arabicmmlu (#2541) · 6dac8c69

Boda Sadallah authored Jan 21, 2025

* point to the original ArabicMMLU dataset

* create the new subtasks files

* fix bug when the context field is empty


add hrm8k benchmark for both Korean and English (#2627) · a5c344cf

Minho Ryu authored Jan 21, 2025



* add hrm8k benchmark for both Korean and English

* apply precommit

* revise tasks so that models do not answer directly; use zeroshot_cot if possible

* add README

* Add hrm8k on the task-list

---------
Co-authored-by: Baber <baber@hey.com>


19 Jan, 2025 1 commit
- update pre-commit (#2632) · f724be69
  Baber Abbasi authored Jan 19, 2025

  * update pre-commit
17 Jan, 2025 1 commit
- fix gen_prefix (#2630) · 9dda03d6
  Baber Abbasi authored Jan 17, 2025

  * switch arg
15 Jan, 2025 4 commits

assistant prefill (#2615) · 703fbffd

Baber Abbasi authored Jan 15, 2025

* add assistant prefix

* add arc_challenge from llama

* nit

* nit

* nit

* add assistant prefix

* add mmlu_llama

* nit

* nit

* Revert "nit"

This reverts commit 6a97f8356237305e375212b966b30e8de59dd4bc.

* fix regex bug

* add assistant_prefix to vllm

* add `Question:`

* add mmlu_pro

* add fewshot assistant_prefix

* use `assistant_prefill`

* typehints

* nits

* nits

* add to docs

* add readme


Add MLQA (#2622) · e86cece6

Shivansh Pachnanda authored Jan 16, 2025

* Add MLQA

* add mlqa_common_yaml

* add 49 tests of mlqa family

* update tasks/README.md

---------

* fix: mlqa ast error

* nit: removed .yaml ext from template_yaml

* nit changes: minor modifications generate_tasks.py

* deleted lm_eval/tasks/mlqa/mlqa_common_yaml.yaml

* tests updated

* nit


Add MBPP (#2247) · 5db23e2c

Hojin Lee authored Jan 16, 2025



* add mbpp

* fix some bugs

* add README for mbpp

* update README

* nits

---------
Co-authored-by: Hojin Lee <19949034+hjlee1371@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>


Add HumanEval (#1992) · 4c11206b

Hojin Lee authored Jan 16, 2025



* add custom filter

* fix type casting of references

* add humaneval

* fix a bug in humaneval

* add greedy version of humaneval

* update tasks README

* test humaneval

* return multiple metrics

* nit

* add confirmation to run code tasks

* nit

* nit

---------
Co-authored-by: Hojin Lee <19949034+hjlee1371@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>


07 Jan, 2025 3 commits

Fix the format of mgsm zh and ja. (#2587) · bb098f13

Wenyang LUO authored Jan 07, 2025

* Fix the format of mgsm zh and ja.

* Add change log to mgsm.

* Add newline after changelog.


Fix Zeno visualizer on tasks like GSM8k (#2599) · 6d62a69c

Petr Baudis authored Jan 07, 2025



* fix(zeno): Generate unique ids in case of multiple filters

* fix(zeno): Report even non-aggregable metrics, just not as metrics

* pre-commit

---------
Co-authored-by: Baber <baber@hey.com>


Fix gguf loading via Transformers (#2596) · 16cfe464

CL-ModelCloud authored Jan 07, 2025



* hf support load gguf file

* code review

* code review

* code clean up

* note about use_fast compat with gguf

---------
Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
