Commits · 543617fef9ba885e87f8db8930fbbff1d4e2ca49 · gaoqiong / lm-evaluation-harness

05 Sep, 2024 1 commit
- Bump version to v0.4.4 ; Fixes to TMMLUplus (#2280) · 543617fe
  Hailey Schoelkopf authored Sep 05, 2024
  
  543617fe
28 Aug, 2024 1 commit
- update nltk version to require 3.9.1 (#2259) · 2de3688f
  Hailey Schoelkopf authored Aug 28, 2024
  
  2de3688f
01 Aug, 2024 1 commit

refactor: limit usage of `scipy` and `sklearn` dependencies (#2097) · 7f15cce4

Nathan Weinberg authored Aug 01, 2024



* refactor: move scipy and sklearn module imports to func imports
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

* refactor: consolidate weighted_f1_score func into lm_eval utils
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

* lint: allow for utils file to have unused imports

this allows for shared functions to be defined only
once while allowing for the YAML function importing
to continue working
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

---------
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

7f15cce4
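
As a rough illustration of the "module imports to func imports" refactor named above, the sketch below defers an import until the function that needs it is called. It is a minimal sketch, not the code from the commit: `mean_stderr` is a hypothetical helper, and the standard-library `statistics` module stands in for a heavy dependency such as scipy or sklearn.

```python
def mean_stderr(values):
    """Standard error of the mean (hypothetical helper for illustration)."""
    # Function-level import: the module is loaded on the first call,
    # not when the enclosing file is imported. `statistics` is only a
    # stand-in here for a heavier optional dependency.
    import statistics

    return statistics.stdev(values) / len(values) ** 0.5


print(mean_stderr([1, 2, 3, 4]))
```

With this layout, importing the enclosing module does not require the deferred dependency to be installed; only calling the function does.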

22 Jul, 2024 1 commit

Refactor API models (#2008) · 42dc2448

Baber Abbasi authored Jul 23, 2024



* refactor pad_token handling to fn

* fix docs

* add pad_token_handling to vllm

* start on API superclass

* don't detokenize the returned logits

* streamline vllm tokenizer

* add type hint

* pre-commit

* seems to be in working order

* add model to init

* refactor api models

* nit

* cleanup

* add pbar

* fix type hints

* change optional dependencies

* json encode chat template

* add type hints

* deal with different prompt input requirements

* nits

* fix

* cache inside async

* fix

* fix

* nits

* nits

* nits

* nit

* fixup

* fixup

* nit

* add dummy retry

* add dummy retry

* handle imports; skip failing test

* add type hint

* add tests

* add dependency to tests

* add package names to exception

* nit

* docs; type hints

* handle api key

* nit

* tokenizer bug

* fix tokenizer

* nit

* nit

* add better error messages

* nit

* remove decorator

* CI: install api dep

* revert evaluator.py

* consolidate

* consolidate

* nits

* nit

* fix typealias

* nit

* nit

* nit

* Update lm_eval/models/api_models.py

typo
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/models/openai_completions.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/models/anthropic_llms.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/models/api_models.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix typo

* add news section

* add info for API

* pre-commit

* typo

* fix bug: unpack loglikelihood requests

* fix bug: shared gen_kwargs mutated

* nit: handle copy properly

* Update README.md

* Update README.md

* Update README.md

* Update api_models.py

* Update README.md

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

42dc2448
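
The "shared gen_kwargs mutated" and "handle copy properly" bullets above describe a common pitfall: popping keys from a dict that several requests share. The sketch below shows one way to avoid it, assuming a hypothetical `build_payload` helper and an illustrative default of 256 tokens; it is not the code from `api_models.py`.

```python
import copy


def build_payload(prompt, gen_kwargs):
    """Build a request body without modifying the caller's gen_kwargs.

    Hypothetical helper for illustration only.
    """
    # Work on a private copy so that pop() below cannot alter the dict
    # that other requests still hold a reference to.
    kwargs = copy.deepcopy(gen_kwargs)
    max_tokens = kwargs.pop("max_gen_toks", 256)  # 256 is an assumed default
    return {"prompt": prompt, "max_tokens": max_tokens, **kwargs}


shared = {"max_gen_toks": 64, "temperature": 0.0}
first = build_payload("a", shared)
second = build_payload("b", shared)
```

Without the copy, the first call would remove `max_gen_toks` from `shared`, and the second call would silently fall back to the default.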

08 Jul, 2024 1 commit

Easier unitxt tasks loading and removal of unitxt library dependency (#1933) · ad80f555

Elron Bandel authored Jul 08, 2024



* Updated unitxt loading
Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* Revert change to general Readme
Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* Adjust fda, squadv2, squad_completion and swde to accept config in the constructor
Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* Fix scrolls
Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Update documentation
Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Enforce backward compatibility
Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Format unitxt class
Signed-off-by: elronbandel <elron.bandel@ibm.com>

---------
Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
Signed-off-by: elronbandel <elron.bandel@ibm.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

ad80f555

01 Jul, 2024 1 commit
- update to v0.4.3 (#2046) · 3fa4fd72
  Hailey Schoelkopf authored Jul 01, 2024
  
  3fa4fd72
30 May, 2024 1 commit

[HFLM]Add support for Ascend NPU (#1886) · 8f716817

Huazhong Ji authored May 31, 2024



* [HFLM]Add support for Ascend NPU
Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>

* bump accelerate dependency version to 0.26.0 for NPU compat.

---------
Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

8f716817

23 May, 2024 1 commit
- Unpin vllm in dependencies (#1874) · 5711ab87
  Edward Gan authored May 23, 2024
  
  5711ab87
07 May, 2024 1 commit

Initial integration of the Unitxt to LM eval harness (#1615) · 885f48d6

Yoav Katz authored May 08, 2024

* Initial support for Unitxt datasets in LM Eval Harness

See https://github.com/IBM/unitxt



The script 'generate_yamls.py' creates LM Eval Harness yaml files corresponding to Unitxt datasets specified in the 'unitxt_datasets' file.

The glue code required to register Unitxt metrics is in 'unitxt_wrapper.py'.

* Added dataset loading check to generate_yaml

Improved error messages.

* Speed up generate_yaml

Added printouts and improved error message

* Added output printout

* Simplified integration of unitxt datasets

Store all the common yaml configuration in a yaml include shared by all datasets of the same task.

* Post code review comments - part 1

1. Made sure include files don't end with 'yaml' so they won't be marked as tasks
2. Added more datasets and tasks (NER, GEC)
3. Added README

* Post code review comments - part 2

1. Added unitxt install option in pyproject.toml:
pip install 'lm_eval[unitxt]'
2. Added a check that unitxt is installed and print a clear error message if not

* Committed missing pyproject change

* Added documentation on adding datasets

* More doc changes

* add unitxt extra to readme

* run precommit

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

885f48d6
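
The "check that unitxt is installed and print a clear error message if not" item above can be sketched as follows. This is a minimal sketch under stated assumptions: `require_package` is a hypothetical helper, not the function added in the commit; only the `pip install 'lm_eval[unitxt]'` hint is taken from the commit message.

```python
import importlib.util


def require_package(name, extra):
    """Raise a descriptive error if an optional package is missing.

    Hypothetical helper for illustration only.
    """
    # find_spec returns None when a top-level package cannot be located,
    # without actually importing it.
    if importlib.util.find_spec(name) is None:
        raise ModuleNotFoundError(
            f"Package '{name}' is not installed. "
            f"Install it with: pip install 'lm_eval[{extra}]'"
        )


require_package("json", "unitxt")  # stdlib module, so this passes silently
```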

16 Apr, 2024 1 commit

Add `neuralmagic` models for `sparseml` and `deepsparse` (#1674) · 8b326be7

Michael Goin authored Apr 16, 2024



* Add neuralmagic models for SparseML and DeepSparse

* Update to latest and add test

* Format

* Fix list to List

* Format

* Add deepsparse/sparseml to automated testing

* Update pyproject.toml

* Update pyproject.toml

* Update README

* Fixes for dtype and device

* Format

* Fix test

* Apply suggestions from code review
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Address review comments!

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

8b326be7

18 Mar, 2024 1 commit

Cleanup for v0.4.2 release (#1573) · 5627e819

Hailey Schoelkopf authored Mar 18, 2024

* Update interface.md

* fix: make caching reqs always work with accelerate launch

* remove stale task migration checklist

* remove deprecation warnings

* make informative TypeErrors for get_task_dict

* bump version metadata

* fix num_fewshot printing bug

* add fewshot value to cache key

5627e819

06 Mar, 2024 1 commit

Cleanup and fixes (Task, Instance, and a little bit of *evaluate) (#1533) · 4ee1b386

LSinev authored Mar 06, 2024



* Remove unused `decontamination_ngrams_path` and all mentions (still no alternative path provided)

* Fix improper import of LM and usage of evaluator in one of scripts

* update type hints in instance and task api

* raising errors in task.py instead of asserts

* Fix warnings from ruff

* raising errors in __main__.py instead of asserts

* raising errors in tasks/__init__.py instead of asserts

* raising errors in evaluator.py instead of asserts

* evaluator: update type hints and remove unused variables in code

* Update lm_eval/__main__.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/__main__.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/api/task.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/api/task.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/api/task.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/evaluator.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* pre-commit induced fixes

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

4ee1b386
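
Several bullets above replace `assert` statements with raised errors. A minimal sketch of that pattern, using a hypothetical `check_limit` validator rather than any function from the repository:

```python
def check_limit(limit):
    """Validate an optional positive limit (hypothetical example)."""
    # An `assert limit > 0` would be stripped when Python runs with -O;
    # an explicit raise is always executed and carries a clear message.
    if limit is not None and limit <= 0:
        raise ValueError(f"limit must be positive, got {limit}")
    return limit


check_limit(10)
check_limit(None)
```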

03 Mar, 2024 2 commits

Setting trust_remote_code to True for HuggingFace datasets compatibility (#1487) · 95167926

Vicki Boykis authored Mar 03, 2024

* setting trust_remote_code

* dataset list no notebooks

* respect trust remote code

* Address changes, move cli options and change datasets

* fix task for tests

* headqa

* remove kobest

* pin datasets and address comments

* clean up space

95167926

Vllm update DP+TP (#1508) · e5e35fca

Baber Abbasi authored Mar 03, 2024

* use `@ray.remote` with distributed vLLM

* update versions

* bugfix

* unpin vllm

* fix pre-commit

* added version assertion error

* Revert "added version assertion error"

This reverts commit 8041e9b78e95eea9f4f4d0dc260115ba8698e9cc.

* added version assertion for DP

* expand DP note

* add warning

* nit

* pin vllm

* fix typos

e5e35fca

01 Mar, 2024 1 commit
- Improve data-parallel request partitioning for VLLM (#1477) · 27a3da96
  Hailey Schoelkopf authored Mar 01, 2024

* add undistribute + use more_itertools

* remove divide() util fn

* add more_itertools as dependency

  27a3da96
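
The partitioning idea behind this commit can be sketched in plain Python. The version below assumes a round-robin split, which is how `more_itertools.distribute` divides an iterable; the `distribute` and `undistribute` functions here are illustrative re-implementations, not the code from the repository.

```python
def distribute(n, items):
    """Split items round-robin into n shards (shard i gets items i, i+n, ...)."""
    items = list(items)
    return [items[i::n] for i in range(n)]


def undistribute(shards):
    """Interleave round-robin shards back into the original order."""
    n = len(shards)
    out = [None] * sum(len(shard) for shard in shards)
    for i, shard in enumerate(shards):
        # Shard i holds exactly the elements at positions i, i+n, i+2n, ...
        out[i::n] = shard
    return out


requests = list(range(7))
shards = distribute(3, requests)
restored = undistribute(shards)
```

Round-robin splitting keeps shard sizes within one element of each other, and the inverse step lets results from the workers be returned in the original request order.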
26 Feb, 2024 2 commits

Create a means for caching task registration and request building. Ad… (#1372) · 1e6c9272

Aaron V authored Feb 26, 2024



* Create a means for caching task registration and request building. Add the ability to specify an args dict for simple_evaluate().

* Remove extra S in cache path in caching module
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Rename requests cache args, make model_args polymorphic so that a dict can also be accepted.

* Update docs to reflect new caching behavior, add CLI args for requests caching. Create a function for deleting items in the cache.

* Update documentation, fix minor bug with arg parsing for requests caching where an undefined variable was used.

* Remove line from gitignore, add to cli for caching datasets.

* Add hashing suffix to .pickles. Update test script typo.

* Favor isinstance() over type() in evaluator.py

* Add tests for caching, gets tests working, remove unneeded arg from build_all_requests().

* Update arg description to simple_evaluate.

* Update pyproject.toml

* Fix typehint

* Remove the use of random() for creating default cache pickle hash.

* Check that cache dir exists before clearing it in request cache tests.

* Fix linting problems.

* Fix additional formatting errors.

* Remove trailing whitespace.

* Add new line to the end of .gitignore.

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

1e6c9272
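
The "hashing suffix" and "remove the use of random()" bullets above point at deterministic cache file names. A minimal sketch of that idea follows; `cache_filename`, its parameters, and the file-name layout are all assumptions made for illustration and do not reflect the caching module's actual naming scheme.

```python
import hashlib


def cache_filename(task_name, num_fewshot, rank=0, world_size=1):
    """Derive a stable cache file name from the request settings.

    Hypothetical helper: the key fields and name format are assumptions.
    """
    key = f"{task_name}-{num_fewshot}shot-rank{rank}-world_size{world_size}"
    # A content hash is reproducible across runs, unlike a random suffix,
    # so a later run with the same settings finds the same file.
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return f"requests-{digest}.pickle"


name = cache_filename("hellaswag", 5)
```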

Apply code autoformatting with Ruff to tasks/*.py and *__init__.py (#1469) · d27c0c08
LSinev authored Feb 26, 2024

d27c0c08

22 Feb, 2024 1 commit

feat: Add Weights and Biases support (#1339) · 2683fbbb

Ayush Thakur authored Feb 23, 2024



* add wandb as extra dependency

* wandb metrics logging

* refactor

* log samples as tables

* fix linter

* refactor: put in a class

* change dir

* add panels

* log eval as table

* improve tables logging

* improve reports logging

* precommit run

* ruff check

* handle importing reports api gracefully

* ruff

* compare results

* minor pre-commit fixes

* build comparison report

* ruff check

* log results as artifacts

* remove comparison script

* update dependency

* type annotate and docstring

* add example

* update readme

* fix typo

* teardown

* handle outside wandb run

* gracefully fail reports creation

* precommit checks

* add report url to summary

* use wandb printer for better url stdout

* fix ruff

* handle N/A and groups

* fix eval table

* remove unused var

* update wandb version req + disable reports stdout

* remove reports feature to TODO

* add label to multi-choice question data

* log model predictions

* lints

* loglikelihood_rolling

* log eval result for groups

* log tables by group for better handling

* precommit

* choices column for multi-choice

* graciously fail wandb

* remove reports feature

* track system metrics + total eval time + stdout

---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

2683fbbb

19 Feb, 2024 1 commit

update bbh, gsm8k, mmlu parsing logic and prompts (Orca2 bbh_cot_zeroshot 0% -> 42%) (#1356) · 89deeeaf

thnkinbtfly authored Feb 20, 2024



* update bbh, gsm8k, mmlu parsing logic and prompts

* remove the formatting prompt (bbh) + minor update (mmlu)

* update bbh, gsm8k, mmlu zeroshot, revert fewshots

* update bbh, gsm8k, mmlu version, forward changes to gsm8k-cot

* remove take_last, update to use docs parameters

* add newline

* ruff formatting

* Update pyproject.toml

* fix format

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

89deeeaf

11 Feb, 2024 1 commit

Evaluate (#1385) · 1ff84897

Baber Abbasi authored Feb 11, 2024

* un-exclude `evaluate.py` from linting

* readability

* readability

* add task name to build info message

* fix link

* nit

* add functions for var and mean pooling

* add functions for var and mean pooling

* metadata compatibility with task

* rename `override_config` to `set_config` and move to `Task`

* add unit test

* nit

* nit

* bugfix

* nit

* nit

* nit

* add docstrings

* fix metadata-fewshot

* revert metric refactor

* nit

* type checking

* type hints

* type hints

* move `override_metric` to `Task`

* change metadata

* change name

* pre-commit

* rename

* remove

* remove

* `override_metric` backwards compatible with `Task`

* type hints

* use generic

* type hint

1ff84897

06 Feb, 2024 1 commit

adding hf_transfer (#1400) · 756eeb6f

Michael Feil authored Feb 06, 2024



* add hf_transfer

* update dependencies

* Delete stale `[linting]` extra

* Update README.md with extras table

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

756eeb6f

05 Feb, 2024 1 commit

Support for Inf2 optimum class [WIP] (#1364) · d17dcea0

Michael Feil authored Feb 05, 2024

* initial commit

* remove overwrite bs

* adding neuronx dependencies

* Update README.md

* update neuronx

d17dcea0

31 Jan, 2024 1 commit
- Make dependencies compatible with PyPI (#1378) · a0a2fec8
  Hailey Schoelkopf authored Jan 31, 2024

* make deps not point to github urls

* formatting

* try making PyPI only run on tag pushes

  a0a2fec8
26 Jan, 2024 1 commit

Add causalLM OpenVino models (#1290) · 97a67d27

NoushNabi authored Jan 26, 2024



* added intel optimum

* added intel optimum in readme

* modified intel optimum

* modified intel optimum

* modified intel optimum

* modified install optimum

* modified path of IR file

* added openvino_device

* added openvino_device2

* changed optimum-causal to openvino-causal

* Update README.md

* Update README.md

* remove `lm_eval.base` import

* update openvino-causal -> openvino ; pass device through super().__init__()

* Update README.md

* Add optimum to tests dependencies

* apply pre-commit

* fix so tests pass

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

97a67d27

18 Jan, 2024 2 commits
- Update pyproject.toml (#1314) · b4c6bdb7
  Hailey Schoelkopf authored Jan 18, 2024
  
  b4c6bdb7
- Update pyproject.toml (#1312) · 6f60a924
  Hailey Schoelkopf authored Jan 18, 2024
  
  6f60a924
25 Dec, 2023 1 commit
- pin vllm at < 0.2.6 (#1212) · af74a93d
  Hailey Schoelkopf authored Dec 25, 2023
  
  af74a93d
22 Dec, 2023 1 commit

Upstream Mamba Support (`mamba_ssm`) (#1110) · 5503b274

Hailey Schoelkopf authored Dec 22, 2023

* modularize HFLM code

* pass through extra kwargs to AutoModel.from_pretrained call

* remove explicit model_kwargs

* rename gptq -> autogptq

* fix tokenizer pad token errors

* ensure model always respects device_map and autogptq's selected devices

* add a _get_config helper fn

* add mambaLMWrapper

* add mamba extra

* add mamba extra

* fix conditional import

* Fix botched merge commit

* Remove beginning-of-file comment for consistency

* Add docstring for mambaLM re: supported kwargs

* Alphabetize extras

* Update extras table

* appease precommit

* run precommit on mamba_lm

5503b274

20 Dec, 2023 2 commits

Switch Linting to `ruff` (#1166) · 65b8761d

Baber Abbasi authored Dec 20, 2023

* add ruff and isort. remove black and flake8

* remove unnecessary dependencies

* remove dependency from table

* change order

* ran ruff

* check 3.9

* exclude evaluator

* update CI workflow

* use ruff config in pyproject.toml

* test

* add isort rules to ruff

* sort imports

* import `make_table`

* try stages for no-commit-to-branch

* turn on mypy for pre-commit

* test

* test

* test

* change no-commit-to-branch to default

* nits

* fixed dependency

65b8761d

feat: add option to upload results to Zeno (#990) · 21d4ae98

Alex Bäuerle authored Dec 20, 2023



* feat: add option to upload results to Zeno

* config-based upload supporting different task types and metrics

* upload tasks as individual projects

* wording

* readme

* add example notebook

* Update documentation for Zeno integration

* Make zeno deps an extra

* Update README.md

* Document extra deps installation

* Update zeno_visualize.py

* fix: balance parens

* fix typo

* fix merge commit I botched

* Update zeno_visualize.py

* Update logger warning stmt

* fix whitespace

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

21d4ae98

17 Dec, 2023 1 commit

[WIP] Add IFEval / Instruction-Following Eval (#1087) · aa61f940

Wis Kojohnjaratkul authored Dec 17, 2023

* Add IFEval task

* Check and download nltk punkt if not already downloaded

* Update max_gen_toks to 2048 to support "900 words+" instructions

* Resolve pre-commit linting issues

* Reduce max_gen_toks to 1280 to conserve token usage

* Add warning message in `process_results` call for non chat-finetuned models

aa61f940

15 Dec, 2023 1 commit
- Enabling OpenAI completions via gooseai (#1141) · bd0f2414
  Vicki Boykis authored Dec 15, 2023

* enabling OpenAI completions via gooseai

* openai-completions and pin openai

  bd0f2414
04 Dec, 2023 1 commit
- Update pyproject.toml · 4efa4a54
  Hailey Schoelkopf authored Dec 04, 2023
  
  4efa4a54
27 Nov, 2023 1 commit
- Update pyproject.toml · 00b48ef8
  Lintang Sutawika authored Nov 27, 2023
  
  00b48ef8
22 Nov, 2023 1 commit
- add dependency · b570fce7
  baberabb authored Nov 23, 2023
  
  b570fce7
24 Oct, 2023 5 commits
- Update pyproject.toml · 84db6359
  Hailey Schoelkopf authored Oct 24, 2023
  
  84db6359
- Update pyproject.toml · a1b589df
  Hailey Schoelkopf authored Oct 24, 2023
  
  a1b589df
- Update pyproject.toml · fa1f3571
  Hailey Schoelkopf authored Oct 24, 2023
  
  fa1f3571
- Update pyproject.toml · e7e3adeb
  Hailey Schoelkopf authored Oct 24, 2023
  
  e7e3adeb
- Update pyproject.toml · b3a2af33
  Hailey Schoelkopf authored Oct 24, 2023
  
  b3a2af33