- 22 May, 2024 1 commit
-
-
zhabuye authored
-
- 21 May, 2024 1 commit
-
-
Zafir Stojanovski authored
-
- 13 May, 2024 1 commit
-
-
Lucas Weber authored
* Add tinyBenchmarks
* Add acknowledgements
* Add ordering of outputs for data-parallel
* Run pre-commit
* Add few_shot specifications
* Add tinyBenchmarks post-processing
* add conditional import ; fix task names
---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
-
- 09 May, 2024 1 commit
-
-
Edd authored
* add copal
* change name to copal id for clarity and the task name
* remove `copal_id...` to yaml to make it work
* checkmark on README
* change group name to `copal_id`
-
- 08 May, 2024 1 commit
-
-
jonabur authored
* add mmlu arc style evaluation
* rename arc_style to continuation
---------
Co-authored-by: Jonathan Burdge <jburdge@mahti-login11.mahti.csc.fi>
Co-authored-by: Jonathan Burdge <jburdge@mahti-login12.mahti.csc.fi>
-
- 07 May, 2024 3 commits
-
-
Yoav Katz authored
* Initial support for Unitxt datasets in LM Eval Harness
  See https://github.com/IBM/unitxt
  The script 'generate_yamls.py' creates LM Eval Harness yaml files corresponding to Unitxt datasets specified in the 'unitxt_datasets' file. The glue code required to register Unitxt metrics is in 'unitxt_wrapper.py'.
* Added dataset loading check to generate_yaml
  Improved error messages.
* Speed up generate_yaml
  Added printouts and improved error message
* Added output printout
* Simplified integration of unitxt datasets
  Store all the common yaml configuration in a yaml include shared by all datasets of the same task.
* Post code review comments - part 1
  1. Made sure include files don't end with 'yaml' so they won't be marked as tasks
  2. Added more datasets and tasks (NER, GEC)
  3. Added README
* Post code review comments - part 2
  1. Added unitxt install option in pyproject.toml: pip install 'lm_eval[unitxt]'
  2. Added a check that unitxt is installed and print a clear error message if not
* Committed missing pyproject change
* Added documentation on adding datasets
* More doc changes
* add unitxt extra to readme
* run precommit
---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
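The installation check described in the Unitxt commit above ("a check that unitxt is installed and print a clear error message if not") can be sketched as follows. This is a minimal illustration, not the code from the commit: the helper name `require_optional` and the wording of the error message are assumptions, and only the `pip install 'lm_eval[unitxt]'` extra is taken from the commit message.

```python
import importlib.util


def require_optional(module_name: str, extra: str) -> None:
    """Raise a clear ImportError if an optional dependency is missing.

    Hypothetical helper for illustration; the name and the message text
    are assumptions, not the harness's actual implementation.
    """
    # find_spec returns None when a top-level module cannot be located,
    # without importing it.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is not installed. "
            f"Install it with: pip install 'lm_eval[{extra}]'"
        )
```

A caller would run `require_optional("unitxt", "unitxt")` before touching any Unitxt-backed task, so that a missing package surfaces as an actionable message instead of a bare import failure.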
-
Hailey Schoelkopf authored
* add Hendrycks MATH (no sympy checking) variant
* add readmes for MATH tasks
-
Hailey Schoelkopf authored
-
- 01 May, 2024 4 commits
-
-
Simran Arora authored
* upload new tasks
* add readmes
* run linters
---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
-
Zehan Li authored
* Update utils.py
  This is a 4-choice task, option_e is null for all but 3 samples
* Fix options
  Adaptive choices
* add option e
* bump multilingual arc version
---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
Gabriel Mukobi authored
* Add Pile-10k readme
* Add Pile-10k task configuration file
-
Chujie Zheng authored
-
- 26 Apr, 2024 1 commit
-
-
giorgossideris authored
* Support individual scrolls datasets
* Add qmsum context
* Fix formatting
-
- 25 Apr, 2024 2 commits
-
-
Lintang Sutawika authored
* Update task.py
* Update __init__.py
-
Julen Etxaniz authored
* add xnli_eu tasks
* update tasks readme
* update readme
-
- 18 Apr, 2024 1 commit
-
-
sator-labs authored
-
- 05 Apr, 2024 1 commit
-
-
ZoneTwelve authored
* implementation of TMMLU+
* implemented: TMMLU+
  **TMMLU+: large-scale Traditional Chinese Massive Multitask language Understanding**
  - 4 categories
    - STEM
    - Social Science
    - Humanities
    - Other
  The TMMLU+ dataset, encompassing over 67 subjects and 20160 tasks, is six times larger and more balanced than its predecessor, TMMLU, and includes benchmark results from both closed-source and 20 open-weight Chinese large language models with 1.8B to 72B parameters. However, Traditional Chinese variants continue to underperform compared to major Simplified Chinese models.
  ```markdown
  Total number of tasks in the 'test' sets: 20160
  Total number of tasks in the 'validation' sets: 2247
  Total number of tasks in the 'train' sets: 335
  ```
* Remove print from __init__.py
  I mistakenly forgot to remove the debug print from the code.
* update: move TMMLU+ config generation program into default
* fix: we should use the training set for few-shot examples
* update: README for TMMLU+
* update: small changes to the TMMLU+ README file
* pre-commit run through
* Add README for TMMLU+ dataset
* run precommit
* trigger precommit again
* trigger precommit again
* isort is fussy
* isort is fussy
* format, again
* oops
* oops
---------
Co-authored-by: lintang <lintang@eleuther.ai>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
-
- 04 Apr, 2024 1 commit
-
-
Hailey Schoelkopf authored
-
- 01 Apr, 2024 1 commit
-
-
Julen Etxaniz authored
* add basqueglue
* add eus_exams
* add eus_proficiency
* add eus_reading
* add eus_trivia
* run pre-commit
-
- 28 Mar, 2024 1 commit
-
-
Or Sharir authored
-
- 21 Mar, 2024 1 commit
-
-
Haonan Li authored
* Add task ACLUE
* fix minor bug
* fix code style
* fix code style
-
- 18 Mar, 2024 2 commits
-
-
Nouf M. Alotaibi authored
* Fix eval_logger import for mmlu/_generate_configs.py
* linter
---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
Hailey Schoelkopf authored
* Update interface.md
* fix: make caching reqs always work with accelerate launch
* remove stale task migration checklist
* remove deprecation warnings
* make informative TypeErrors for get_task_dict
* bump version metadata
* fix num_fewshot printing bug
* add fewshot value to cache key
-
- 15 Mar, 2024 1 commit
-
-
Rylan Schaeffer authored
-
- 13 Mar, 2024 1 commit
-
-
achervyakov authored
* add manual tqdm disabling management
* add typing to all new args
* apply precommit changes
---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
-
- 11 Mar, 2024 4 commits
-
-
Hailey Schoelkopf authored
* add agieval
* fix typo
* add cloze / math exactmatch agieval tasks, rename
* update exact-match agieval tasks, allow for multiple-correct answers
* add more detail to readme
* don't parse_math_answer twice
---------
Co-authored-by: Alex Bäuerle <alex@a13x.io>
-
khalil authored
* add Arabic EXAMS benchmark
* fixed the linter issue, and add more information on the readme
* Update README.md
---------
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>
-
Hailey Schoelkopf authored
-
Hailey Schoelkopf authored
-
- 09 Mar, 2024 1 commit
-
-
Piyush Thakur authored
* update gen_kwargs in code2-text-go.yaml
* update gen_kwargs in rest code2-text
-
- 06 Mar, 2024 5 commits
-
-
LSinev authored
* Remove unused `decontamination_ngrams_path` and all mentions (still no alternative path provided)
* Fix improper import of LM and usage of evaluator in one of scripts
* update type hints in instance and task api
* raising errors in task.py instead of asserts
* Fix warnings from ruff
* raising errors in __main__.py instead of asserts
* raising errors in tasks/__init__.py instead of asserts
* raising errors in evaluator.py instead of asserts
* evaluator: update type hints and remove unused variables in code
* Update lm_eval/__main__.py
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* Update lm_eval/__main__.py
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* Update lm_eval/api/task.py
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* Update lm_eval/api/task.py
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* Update lm_eval/api/task.py
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* Update lm_eval/evaluator.py
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* pre-commit induced fixes
---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
Hailey Schoelkopf authored
update printed num-fewshot ; prevent fewshots from erroneously being used by cot which hardcodes fewshot prompt (#1502)
-
sean0042 authored
-
Long Phan authored
* init wmdp yaml file
* Add WMDP Multiple-choice
* fix linter issues
* Delete lm_eval/tasks/wmdp/_wmdp.yaml
---------
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>
-
Peter Bevan authored
* Start adding eq-bench
* Start adding to yaml and utils
* Get metric working
* Add README
* Handle cases where answer is not parseable
* Deal with unparseable answers and add percent_parseable metric
* Update README
-
- 05 Mar, 2024 2 commits
-
-
Uanu authored
* Add new tasks of GPQA
* Add README
* Remove unused functions
* Remove unused functions
* Linters
* Add flexible match
* update
* Remove duplicate function
* Linter
* update
* Update lm_eval/filters/extraction.py
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* register multi_choice_regex
* Update
* run precommit
---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
-
Baber Abbasi authored
-
- 04 Mar, 2024 1 commit
-
-
Manuel Faysse authored
* add french-bench
* rename arc easy
* linting
* update datasets for no remote code exec
* fix string delimiter
* add info to readme
* trim trailing whitespace
* add detailed groups
* add info to readme
* remove orangesum title from fbench main
* Force PPL tasks to be 0-shot
---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
- 03 Mar, 2024 1 commit
-
-
Vicki Boykis authored
* setting trust_remote_code
* dataset list no notebooks
* respect trust remote code
* Address changes, move cli options and change datasets
* fix task for tests
* headqa
* remove kobest
* pin datasets and address comments
* clean up space
-
- 01 Mar, 2024 1 commit
-
-
Zehan Li authored
-