Commits · mmlu-pro · gaoqiong / lm-evaluation-harness

05 Aug, 2024 8 commits
- resolved left out merge · 84025a43
  lintangsutawika authored Aug 05, 2024
  
- merged · 17396935
  lintangsutawika authored Aug 05, 2024
  
- format · 458342e2
  lintangsutawika authored Aug 05, 2024
  
- pre-commit · b8122d98
  lintangsutawika authored Aug 05, 2024
  
- pre-commit · 01b129bb
  lintangsutawika authored Aug 05, 2024
  
- add process for each subtask · 89de5103
  lintangsutawika authored Aug 05, 2024
  
- add process for each subtask · 9bee4b4f
  lintangsutawika authored Aug 05, 2024
  
- add custom fewshot doc_to_text, target, and choice · 578f5d48
  lintangsutawika authored Aug 05, 2024
  
26 Jul, 2024 3 commits
- mmlu-pro: fixed yaml formatting · cd8642e7
  Yu Shi Jie authored Jul 26, 2024
  
- mmlu-pro: fixed mislabelling of task (math->chemistry) · cd0983b8
  Yu Shi Jie authored Jul 26, 2024
  
- mmlu-pro: update yaml content in line with mmlu · cd441ab1
  Yu Shi Jie authored Jul 26, 2024
  
24 Jul, 2024 5 commits
- mmlu-pro/default: aligned with mmlu updates · 5bae76d6
  root authored Jul 24, 2024
  
- Merge branch 'upstream' into 'mmlu-pro' · c1e63555
  Yu Shi Jie authored Jul 24, 2024
```
add tokenizer logs info (#1731)

See merge request shijie.yu/lm-evaluation-harness!4
```
- Merge branch 'tplx-staging' into 'mmlu-pro' · e361687c
  Yu Shi Jie authored Jul 24, 2024
```
# Conflicts:
#   README.md
```
- mmlu-pro: minor fixes · a33971ff
  root authored Jul 24, 2024
  
- mmlu-pro: fixed doc_to_text/choice/target configs for all variants · 25bb0c3b
  root authored Jul 24, 2024
  
22 Jul, 2024 1 commit

Refactor API models (#2008) · 42dc2448

Baber Abbasi authored Jul 23, 2024

* refactor pad_token handling to fn

* fix docs

* add pad_token_handling to vllm

* start on API superclass

* don't detokenize the returned logits

* streamline vllm tokenizer

* add type hint

* pre-commit

* seems to be in working order

* add model to init

* refactor api models

* nit

* cleanup

* add pbar

* fix type hints

* change optional dependencies

* json encode chat template

* add type hints

* deal with different prompt input requirements

* nits

* fix

* cache inside async

* fix

* fix

* nits

* nits

* nits

* nit

* fixup

* fixup

* nit

* add dummy retry

* add dummy retry

* handle imports; skip failing test

* add type hint

* add tests

* add dependency to tests

* add package names to exception

* nit

* docs; type hints

* handle api key

* nit

* tokenizer bug

* fix tokenizer

* nit

* nit

* add better error messages

* nit

* remove decorator

* CI...

21 Jul, 2024 1 commit
- fix caching module (hotfix for now) (#2124) · 4a62757d
  Hailey Schoelkopf authored Jul 21, 2024
  
20 Jul, 2024 1 commit
- docs: update truthfulqa tasks (#2119) · feff1b55
  Jennifer Cwagenberg authored Jul 19, 2024
  
18 Jul, 2024 2 commits
- fix: broken discord link in CONTRIBUTING.md (#2114) · 8f8e7f6e
  Nathan Weinberg authored Jul 18, 2024
```
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
```
- [python] fix haerae tasks (#2112) · 9d4a04a0
  Jungwhan Kim authored Jul 18, 2024
  
17 Jul, 2024 1 commit
- Fixed colon in Belebele _default_template_yaml (#2111) · 69502c06
  jab13x authored Jul 17, 2024
  
16 Jul, 2024 2 commits
- ONLY FOR DEFAULT: fixed yaml file to use variable number of choices · 0c81cada
  root authored Jul 16, 2024
  
- changed choices -> options in yaml config to fit dataset schema · 9e9b8a09
  Yu Shi Jie authored Jul 16, 2024
  
15 Jul, 2024 3 commits
- docs: align local test command to match CI (#2100) · 1adab703
  Nathan Weinberg authored Jul 15, 2024
```
Also add 'test_logs/' to .gitignore
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
```
- formatting (#2104) · 56a4e794
  Lintang Sutawika authored Jul 15, 2024
  
- make recurrent_gemma model types included in the force-BOS case (#2105) · 9884ad6e
  Hailey Schoelkopf authored Jul 15, 2024
  
14 Jul, 2024 2 commits

Added MedConceptsQA Benchmark (#2010) · 2b26690f

Ben Shoham Ofir authored Jul 14, 2024



* Added MedConceptsQA Benchmark

* pre-commit factor

* update group name

* update in naming

* changed name

* Changed mcqa to med_concepts_qa prefix

* Added med_concepts_qa to README.md

* Changed config files according to the new format

* Updated README

---------
Co-authored-by: lintangsutawika <lintang@eleuther.ai>

mmlu_pro: · ad01e887

root authored Jul 14, 2024

- changed task 'other' to 'miscellaneous': there is already a group named 'other', and a task and a group with the same alias (e.g. mmlu_pro_other_generative) throw an error

- fixed yaml backslash escape for fewshot cot

13 Jul, 2024 1 commit
- docs: remove trailing sentence from contribution doc (#2098) · a7a2923f
  Nathan Weinberg authored Jul 13, 2024
```
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
```
12 Jul, 2024 4 commits

Irokobench: Benchmark Dataset for African languages (#2042) · 383bbd54

Jess authored Jul 12, 2024



* add afrixnli to task

* add chat completion

* remove chat completion -untested

* afrimmlu added

* afrimmlu folder update

* afrimmlu folder update

* updated prompt

* remove print

* add afrimgsm -direct

* add squad metric

* fix bash script

* remove direct util, update common yaml

* remove print

* add few shot. metric fixes

* fix direct path, add bash script for gpt models

* added translate test

* update afrixnli tasks

* update afrixnli tasks

* update metrics for afrixnli

* prompt translations fix

* prompt translations fix

* filter and metric fix -mgsm

* remove squad metric

* remove squad metric

* add f1 score to mgsm

* add f1 score to mgsm

* update native-direct with lin

* change f1 function

* add lin to utils

* add utils

* remove test limit

* remove test configs

* add swahili to mmlu

* change eng to ewe in ewe yaml mmlu

* add squad metric to mgsm, remove whitespace filter

* added translate test

* added afrixnli_translate

* fix exact match valueError

* fix exact match valueError

* restructure mmlu folder

* spacing

* remove afrimmlu_translate folder

* add utility

* format task name, clean ups

* modified mgsm

* update on afrimgsm

* update on afrimgsm

* removed utils

* other mgsm varieties

* other mgsm varieties

* adding translate direct

* Update translate_direct_yaml

* add manual xnli prompt, add multichoice for openai models, and adapt multichoice metric for openai model

* edit for open models

* Update translate_direct_yaml

* add verbalizer for xnli

* change xnli from multiple choice to generate

* add manual accuracy scores

* revert xnli to multiple choice

* change afrimgsm utils

* revert xnli to multiple_choice

* cleanups and readmes

* remove openai fixes and unused regex

* pr review changes

* revert metrics.py, task.py and extraction.py to main version

---------
Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de>
Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>

Add new dataset MMLU-SR tasks (#2032) · d5f39bf8

SuperCat authored Jul 12, 2024



* add mmlusr tasks

* renamed all tasks names in mmlusr

* edit format and readme

* added mmlu_sr

* mmlu_sr -> mmlusr

* update

---------
Co-authored-by: lintangsutawika <lintang@eleuther.ai>

Update default.yaml (#2092) · cdd954f9
Wonung Kim authored Jul 12, 2024

make RougeScorer only initialized once (#2090) · eeec6dae
Hailey Schoelkopf authored Jul 12, 2024

11 Jul, 2024 1 commit

Prettify lm_eval --tasks list (#1929) · a0243d54

anthony-dipofi authored Jul 11, 2024



* add  and ; move task list newline logic to new TaskManager.list_all_tasks() method

* format table list into markdown table; add config location column

* add Output Type column

* add logic for printing table of tags separately

* merge with main and fix conflicts ; update docstrings

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

10 Jul, 2024 2 commits

batch_size may be str if 'auto' is specified (#2084) · 30273b47
meg authored Jul 10, 2024

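
The commit title above notes that `batch_size` can arrive as a string when `'auto'` is requested. A minimal sketch of how a caller might accept both forms follows; `normalize_batch_size` is a hypothetical helper written for illustration, not a function taken from the harness.

```python
from typing import Union


def normalize_batch_size(batch_size: Union[int, str]) -> Union[int, str]:
    """Hypothetical helper: keep 'auto' as a string, turn anything else into an int."""
    if isinstance(batch_size, int):
        return batch_size
    text = batch_size.strip()
    if text == "auto":
        # Leave the sentinel untouched so later code can pick a size itself.
        return text
    # Numeric strings such as "16" (e.g. from a command line) become integers.
    return int(text)


print(normalize_batch_size(8))
print(normalize_batch_size("16"))
print(normalize_batch_size("auto"))
```

Any other non-numeric string raises `ValueError` from `int()`, which seems a reasonable way to surface a bad argument early.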

Update utils.py (#2085) · 058cfd0e

Lintang Sutawika authored Jul 10, 2024

Group configs with no aggregation print an empty space as the score in the results table.
Example:
```
|    Tasks     |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|--------------|-------|------|-----:|--------|---|-----:|---|-----:|
|group         |    N/A|      |      |        |   |      |   |      |
| - task 0     |Yaml   |none  |     0|acc     |↑  |0.4000|±  |0.0910|
| - task 1     |Yaml   |none  |     0|acc     |↑  |0.3333|±  |0.0875|
| - task 2     |Yaml   |none  |     0|acc     |↑  |0.2667|±  |0.0821|
| - task 3     |Yaml   |none  |     0|acc     |↑  |0.3333|±  |0.0875|
```

So `make_table` needs to check whether the value `v` is a float or a string before formatting it.
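
The check described above can be sketched as follows. This is an illustrative stand-in, not the actual body of `make_table`: the helper name `format_score` and the four-decimal format are assumptions based on the example table.

```python
from typing import Union


def format_score(v: Union[float, str]) -> str:
    """Hypothetical helper: format floats to four decimals, pass strings through."""
    if isinstance(v, float):
        return f"{v:.4f}"
    # Groups without aggregation carry a placeholder string (e.g. " ") instead of a score.
    return str(v)


print(format_score(0.4))
print(repr(format_score(" ")))
```

Without the `isinstance` check, applying a float format to the placeholder string would raise an error for any group that has no aggregated score.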


09 Jul, 2024 1 commit
- fix: utf-8 encoding for logged sample files was missing (#2082) · c12e5bec
  Hailey Schoelkopf authored Jul 08, 2024
  
08 Jul, 2024 2 commits

Minor doc fix: leaderboard README.md missing mmlu-pro group and task (#2075) · be01651c
Pankaj Mathur authored Jul 08, 2024
```
leaderboard README.md missing mmlu-pro group and task
```

Allow gating EvaluationTracker HF Hub results; customizability (#2051) · 563f7971

Nathan Habib authored Jul 08, 2024

* batch commit

* Revert "batch commit"

This reverts commit d859d1ca.

* batch commit

* checkout from main

* checkout from main

* checkout from main

* checkout from main

* checkout from main

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup eval results

* cleanup

* add check for gated repo

* fix jsonline issue

* fix

* add try catch when gating the details repo

* add doc

* adds back hub_repo_name

* readds hub repo name
