Commits · 3bdf25ec00ba3be6fbce88361080cb31aff069f4 · gaoqiong / lm-evaluation-harness

10 Mar, 2024 1 commit

Support jinja templating for task descriptions (#1553) · 3bdf25ec

Hisham Alyahya authored Mar 10, 2024



* Support jinja templating for "description"

* Update task_guide.md

* Update lm_eval/api/task.py

* fix format?

* whitespace errors

* fix whitespace

* fix bad variable reference

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

3bdf25ec

06 Mar, 2024 2 commits

Cleanup and fixes (Task, Instance, and a little bit of *evaluate) (#1533) · 4ee1b386

LSinev authored Mar 06, 2024



* Remove unused `decontamination_ngrams_path` and all mentions (still no alternative path provided)

* Fix improper import of LM and usage of evaluator in one of scripts

* update type hints in instance and task api

* raising errors in task.py instead of asserts

* Fix warnings from ruff

* raising errors in __main__.py instead of asserts

* raising errors in tasks/__init__.py instead of asserts

* raising errors in evaluator.py instead of asserts

* evaluator: update type hints and remove unused variables in code

* Update lm_eval/__main__.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/__main__.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/api/task.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/api/task.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/api/task.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/evaluator.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* pre-commit induced fixes

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

4ee1b386
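
The "raising errors instead of asserts" items in the commit above can be illustrated with a minimal sketch. The function names and the `batch_size` check are hypothetical and are not taken from the harness. The point is only that `assert` statements are skipped when Python runs with `-O`, while an explicit `raise` always executes.

```python
def check_batch_size_assert(batch_size: int) -> int:
    # Hypothetical check. The assert is removed when Python runs with -O,
    # so this validation silently disappears in optimized mode.
    assert batch_size > 0, "batch_size must be positive"
    return batch_size


def check_batch_size_raise(batch_size: int) -> int:
    # Hypothetical check. An explicit raise always runs and gives callers
    # a specific exception type to catch.
    if batch_size <= 0:
        raise ValueError(f"batch_size must be positive, got {batch_size}")
    return batch_size
```

A caller can then wrap `check_batch_size_raise` in `try`/`except ValueError` instead of relying on `AssertionError`.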

Update docs on LM.loglikelihood_rolling abstract method (#1532) · 525b8f5d
Hailey Schoelkopf authored Mar 06, 2024

525b8f5d

27 Feb, 2024 1 commit

Refactor `evaluator.evaluate` (#1441) · 5ccd65d4

Baber Abbasi authored Feb 27, 2024



* change `all_gather` to `gather`

* add TaskOutput utility class

* Add FilterResults class and refactor task handling.

* Rename `key` to `filter_key` for clarity

* Add `print_writeout` function in utils.py

* Add function to calculate limit size.

* Add doc_iterator method to Task class

* Refactor `doc_iterator` and cleanup in Task class

* remove superfluous bits

* change `all_gather` to `gather`

* bugfix

* bugfix

* fix `gather`

* Refactor `gather` loop

* Refactor aggregate metrics calculation

* Refactor and simplify aggregate metrics calculation
Removed unused code

* Simplify metrics calculation and remove unused code.

* simplify the metrics calculation in `utils.py` and `evaluator.py`.

* Fix group metric

* change evaluate to hf_evaluate

* change evaluate to hf_evaluate

* add docs

* add docs

* nits

* make isslice keyword only

* nit

* add todo

* nit

* nit

* nit: swap order samples_metrics tuple

* move instance sorting outside loop

* nit

* nit

* Add __repr__ for ConfigurableTask

* nit

* nit

* Revert "nit"

This reverts commit dab8d9977a643752a17f840fd8cf7e4b107df28f.

* fix some logging

* nit

* fix `predict_only` bug. thanks to `@LSinev`!

* change `print_tasks` to `prepare_print_tasks`

* nits

* move eval utils

* move eval utils

* nit

* add comment

* added tqdm descriptions

* Update lm_eval/evaluator_utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix mgsm bug

* nit

* fix `build_all_requests`

* pre-commit

* add ceil to limit

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

5ccd65d4
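
The "Add function to calculate limit size" and "add ceil to limit" items above suggest a helper along the following lines. This is a sketch under assumptions: the name `sample_size`, its signature, and the rule that a limit below 1.0 is a fraction of the dataset are illustrative and are not confirmed by the log.

```python
import math
from typing import Optional


def sample_size(num_docs: int, limit: Optional[float]) -> Optional[int]:
    """Hypothetical helper: turn a `limit` argument into a document count."""
    if limit is None:
        return None  # no limit: evaluate every document
    if limit < 1.0:
        # Assumed convention: a fractional limit is a share of the dataset.
        # Rounding up keeps at least one document for small datasets.
        return int(math.ceil(num_docs * limit))
    # Otherwise the limit is treated as an absolute number of documents.
    return int(limit)
```

Rounding up rather than truncating avoids a limit of zero documents when a small fraction is applied to a small dataset.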

26 Feb, 2024 3 commits

Cont metrics (#1475) · 96d185fa

Lintang Sutawika authored Feb 26, 2024



* add brier_score

* process brier_score

* brier score is working for N-sized class

* fixed brier score

* add TED to BigBench and Brier score to MMLU

* format

* Update metrics.py

* Update task.py

* Update generate_until_template_yaml

* Delete lm_eval/tasks/bigbench/aux_metric.py

* Update generate_until_template_yaml

* Update _default_template_yaml

* Update _generate_configs.py

* Update _generate_configs.py

* Update _generate_configs.py

* fix (format?)

* format?

* format, once more

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

96d185fa
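
For readers unfamiliar with the metric named in "brier score is working for N-sized class", here is a minimal pure-Python sketch of the multiclass Brier score. It is not the harness's implementation. The function name and argument layout are assumptions chosen for illustration.

```python
from typing import Sequence


def brier_score(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Multiclass Brier score: mean squared distance to the one-hot label."""
    if len(probs) != len(labels) or len(labels) == 0:
        raise ValueError("probs and labels must be non-empty and equally long")
    total = 0.0
    for row, gold in zip(probs, labels):
        # Squared error of each class probability against the one-hot target.
        total += sum(
            (p - (1.0 if k == gold else 0.0)) ** 2 for k, p in enumerate(row)
        )
    return total / len(labels)
```

Lower is better: a perfectly confident, correct prediction scores 0.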

Create a means for caching task registration and request building. Ad… (#1372) · 1e6c9272

Aaron V authored Feb 26, 2024



* Create a means for caching task registration and request building. Add the ability to specify an args dict for simple_evaluate().

* Remove extra S in cache path in caching module
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Rename requests cache args, make model_args polymorphic so that a dict can also be accepted.

* Update docs to reflect new caching behavior, add CLI args for requests caching. Create a function for deleting items in the cache.

* Update documentation, fix minor bug with arg parsing for requests caching where an undefined variable was used.

* Remove line from gitignore, add to cli for caching datasets.

* Add hashing suffix to .pickles. Update test script typo.

* Favor isinstance() over type() in evaluator.py

* Add tests for caching, gets tests working, remove unneeded arg from build_all_requests().

* Update arg description to simple_evaluate.

* Update pyproject.toml

* Fix typehint

* Remove the use of random() for creating default cache pickle hash.

* Check that cache dir exists before clearing it in request cache tests.

* Fix linting problems.

* Fix additional formatting errors.

* Remove trailing whitespace.

* Add new line to the end of .gitignore.

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

1e6c9272
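
The items "Add hashing suffix to .pickles" and "Remove the use of random() for creating default cache pickle hash" point at deterministic cache file names. The sketch below shows one way to derive such a suffix. The helper name, the use of SHA-256 over JSON, and the file name pattern are all assumptions, since the log does not show the actual scheme.

```python
import hashlib
import json
from typing import Any, Dict


def cache_file_name(task_name: str, args: Dict[str, Any]) -> str:
    """Hypothetical helper: build a stable pickle name from the arguments."""
    # sort_keys makes the serialization independent of dict insertion order,
    # so the same arguments always map to the same file name.
    payload = json.dumps(args, sort_keys=True, default=str)
    suffix = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return f"{task_name}-{suffix}.pickle"
```

Unlike a random suffix, a hash of the arguments lets a later run with the same configuration find the file written by an earlier run.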

Add Gemma support (Add flag to control BOS token usage) (#1465) · 4c51111c

Hailey Schoelkopf authored Feb 26, 2024



* add add_bos_token to HFLM

* add BOS token flag to other local model classes

---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

4c51111c

22 Feb, 2024 2 commits

Fixed generation args issue affecting OpenAI completion model (#1458) · 75ac1f47

Amine Elhattami authored Feb 23, 2024



* Fixed generation args issue affecting openai completion model

* Fixed hf unit test; removed pop attributes in OpenAi completion.

* fix format

* fix format

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

75ac1f47

Add TemplateLM boilerplate LM class (#1279) · ba5cdf0f

Anjor Kanekar authored Feb 22, 2024

* loglikelihood refactor using template lm

* linter

* fix whitespace in target + prompt for CoT gsm8k (#1275)

* Make `parallelize=True` vs. `accelerate launch` distinction clearer in docs (#1261)

* Make parallelize=True distinction clearer in documentation.

* run linter

* Allow parameter edits for registered tasks when listed in a benchmark (#1273)

* benchmark yamls allow minor edits of already registered tasks

* add documentation

* removed print

* Fix data-parallel evaluation with quantized models (#1270)

* add WIP device_map overrides

* update handling outside of accelerate launcher

* change .to(device) log to debug level

* run linter

* Rework documentation for explaining local dataset (#1284)

* rework documentation for explaining local dataset

* fix typo

* Update new_task_guide.md

* Re-add citation

It looks like Google Scholar has [already noticed](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C9&authuser=2&q=%22A+framework+for+few-shot+language+model+evaluation%2C+12+2023%22&btnG=) the updated citation block so let's add it back in.

* Update CITATION.bib (#1285)

Bumping CITATION.bib to match re-adding the citation in readme. 

cc @StellaAthena

* Update nq_open.yaml (#1289)

* Update README.md with custom integration doc (#1298)

* Update README.md

* punctuation

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update nq_open.yaml (#1305)

* Update nq_open.yaml

change regex

* Bump NQ version

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update task_guide.md (#1306)

* Update pyproject.toml (#1312)

* Fix polemo2_in.yaml config name (#1313)

* Update pyproject.toml (#1314)

* Fix group register (#1315)

* tuple should be considered as well

* set option to keep callable as callable

* Update task_guide.md (#1316)

* Update polemo2_in.yaml (#1318)

* don't pass extra kwargs to mamba any more (#1328)

* Fix Issue regarding stderr (#1327)

* add fix for deciding if stderr is N/A or not

* process N/A

* Add `local-completions` support using OpenAI interface (#1277)

* Add `local-completions` support using OpenAI interface

* Refactor oa_completion

* Address tokenizer comments and change request chunks to batch size

* Add warning message for tiktoken backend

* fix formatting

* fix whitespace

* Update README.md

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fallback to classname when LM doesn't have config (#1334)

* fix a trailing whitespace that breaks a lint job (#1335)

* skip "benchmarks" in changed_tasks (#1336)

* Update migrated HF dataset paths (#1332)

* Update arc_easy.yaml

* Update flan_cot.yaml

* update HF dataset path

* Update freeform.yaml

* Update flan_cot.yaml

---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Don't use `get_task_dict()` in task registration / initialization (#1331)

* don't use get_task_dict() as a helper, it will download the dataset!

* pre-commit

* Update README.md

---------
Co-authored-by: lintangsutawika <lintang@eleuther.ai>

* manage default (greedy) gen_kwargs in vllm (#1341)

* manage default (greedy) gen_kwargs in vllm better

* mirror HF `do_sample`

* just need to set temp=0 for greedy

* modified default gen_kwargs to work better with CLI; changed prompt_logprobs=1 (#1345)

* update links to task_guide.md (#1348)

* `Filter` docs not offset by `doc_id`  (#1349)

* get `doc` from instance

* accelerate bugfix: get ground doc from instance

* convert filter to `process_result`

* get docs from instances in `FilterEnsemble`

* rename

* nit

* better looping

* fix typehint

* Add FAQ on `lm_eval.tasks.initialize_tasks()` to README (#1330)

* Update README.md

* [!Tip]

* Refix issue regarding stderr (#1357)

* Add causalLM OpenVino models (#1290)

* added intel optimum

* added intel optimum in readme

* modified intel optimum

* modified intel optimum

* modified intel optimum

* modified install optimum

* modified path of IR file

* added openvino_device

* added openvino_device2

* changed optimum-causal to openvino-causal

* Update README.md

* Update README.md

* remove `lm_eval.base` import

* update openvino-causal -> openvino ; pass device through super().__init__()

* Update README.md

* Add optimum to tests dependencies

* apply pre-commit

* fix so tests pass

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Apply some best practices and guideline recommendations to code (#1363)

* raise Exception, not a string

Additional info https://peps.python.org/pep-0352/#exception-hierarchy-changes
https://docs.python.org/3.8/tutorial/errors.html#raising-exceptions

* Apply PEP8 recommendation to prefer isinstance

"Object type comparisons should always use isinstance() instead of comparing types directly"
https://peps.python.org/pep-0008/

* Remove dangerous default mutable values in arguments

https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/dangerous-default-value.html

* Format logging messages with fstring (not with format)

Additional info
https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/logging-format-interpolation.html
There are also discussions about the speed of formatting while logging or some unintended code executions
https://github.com/pylint-dev/pylint/issues/2395
https://stackoverflow.com/a/54368109
but at least one format (fstring one) will be used throughout the project

* Specify utf-8 encoding for `open` explicitly

If not specified, the default encoding may differ across environments, OSes, and Python versions. See
https://peps.python.org/pep-0597/
https://docs.python.org/3.11/library/locale.html#locale.getencoding
https://docs.python.org/3.10/library/os.html#utf8-mode
https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/unspecified-encoding.html

Helps also if some code from English language tasks is taken as inspiration for tasks in non-English languages.

* Use inline-ignoring comments to pass pre-commit instead of identity process

https://flake8.pycqa.org/en/3.0.1/user/ignoring-errors.html#in-line-ignoring-errors
https://www.flake8rules.com/rules/F841.html

flake8 comments are supported by ruff: https://docs.astral.sh/ruff/linter/#error-suppression



* serialize callable functions in config (#1367)

* delay filter init; remove `*args` (#1369)

* delay filter init; remove `*args`

* bugfix

* optimize

* type hint

* Fix unintuitive `--gen_kwargs` behavior (#1329)

* don't override do_sample if no value for it is passed

* Update gen_kwargs override condition

* Update huggingface.py

* Update huggingface.py

* run linters

* silence an erroneous warning

* Publish to pypi (#1194)

* publish to pypi

* lint

* Update publish.yml

* minor

* Make dependencies compatible with PyPI (#1378)

* make deps not point to github urls

* formatting

* try making PyPI only run on tag pushes

* Add support for RWKV models with World tokenizer (#1374)

* Add support for RWKV models with World tokenizer

The RWKV line of models with the World tokenizer does not allow the padding token to be configured; its value is preset to 0.

This, however, fails all the "if set" checks and would cause the tokenizer to crash.

A tokenizer class name check was added, in addition to a model type check, as there exist RWKV models that use the neox tokenizers.

* Update huggingface.py

Genericized so that this supports any RWKVWorld tokenizer, and added a fall-back for if the HF implementation name changes.

* Comply with formatting guidelines

* fix format

---------
Co-authored-by: Stella Biderman <stellabiderman@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* add bypass metric (#1156)

* add bypass metric

* fixed `bypass` metric.

* add task attributes if predict_only

* add `predict_only` checks

* add docs

* added `override_metric`, `override_config` to `Task`

* nits

* nit

* changed --predict_only to generations; nits

* nits

* nits

* change gen_kwargs warning

* add note about `--predict_only` in README.md

* added `predict_only`

* move table to bottom

* nit

* change null aggregation to bypass (conflict)

* bugfix; default `temp=0.0`

* typo

* loglikelihood refactor using template lm

* lint

* code review

* neuron optimum

* Mention TemplateLM in model_guide.md

* Update lm_eval/api/model.py

* fix linter

* fix format

* fix format

* fix format

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
Co-authored-by: Stella Biderman <stellabiderman@gmail.com>
Co-authored-by: Mark Saroufim <marksaroufim@meta.com>
Co-authored-by: Hannibal046 <38466901+Hannibal046@users.noreply.github.com>
Co-authored-by: Danielle Pintz <38207072+daniellepintz@users.noreply.github.com>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: kwrobel.eth <djstrong@gmail.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Brian Vaughan <nairbv@users.noreply.github.com>
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: thnkinbtfly <70014488+thnkinbtfly@users.noreply.github.com>
Co-authored-by: NoushNabi <33136068+NoushNabi@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: LSinev <LSinev@users.noreply.github.com>
Co-authored-by: Eugene Cheah <PicoCreator@users.noreply.github.com>

ba5cdf0f
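
The RWKV note in the commit message above says a padding token id preset to 0 fails the "if set" checks. The sketch below only illustrates why a truthiness test misfires on 0. The function names are hypothetical, and the commit itself describes a different remedy (a tokenizer class name check), not the `is not None` comparison shown here.

```python
from typing import Optional


def has_pad_token_truthy(pad_token_id: Optional[int]) -> bool:
    # Problematic pattern: 0 is falsy, so a valid id of 0 looks "unset".
    return bool(pad_token_id)


def has_pad_token(pad_token_id: Optional[int]) -> bool:
    # Alternative pattern: only None means "not configured".
    return pad_token_id is not None
```

With `pad_token_id = 0`, the first check reports the token as missing while the second reports it as present.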

20 Feb, 2024 1 commit

Group reqs by context (#1425) · 45941c67

Baber Abbasi authored Feb 20, 2024



* add key lookup for same contexts

* nit

* appease pre-commit

* nit

* use `expand` (in-place view) rather than `repeat`

* try mixed grouping

* add docs.

* nit

* nit

* nits

* fix tests

* Move greedy_tokens calculation out of cache loop

* nit

* nits

* add test

* nits

* fix name conflict

* fix name conflict

* chunk tensor

* move Collator

* nits/docstring

* fixup

* fixup

* group contexts only for decoders

* pre-commit

* fix `generate_until` test

* fix `generate_until` test

* Update lm_eval/models/huggingface.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* add docs

* nit

* add docs

* add docs

* add 'logits_cache' arg

* bugfix

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

45941c67
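
The "add key lookup for same contexts" item above describes grouping requests that share a context. A minimal sketch of that idea follows, assuming requests are `(context, continuation)` string pairs. The helper name and data layout are illustrative and do not reflect the harness's `Collator` interface.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_context(
    requests: List[Tuple[str, str]],
) -> Dict[str, List[Tuple[int, str]]]:
    """Hypothetical helper: bucket (context, continuation) pairs by context."""
    groups: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for index, (context, continuation) in enumerate(requests):
        # Keep the original index so results can be restored to input order.
        groups[context].append((index, continuation))
    return dict(groups)
```

Once requests are bucketed this way, work that depends only on the context can be done once per bucket instead of once per request.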

13 Feb, 2024 1 commit
- Fix: task weighting by subtask size; update Pooled Stderr formula slightly (#1427) · 620d6a15
  Hailey Schoelkopf authored Feb 13, 2024
```
* fix weight_by_size condition

* add tests, update stderr formula slightly

* apply pre-commit
```
  620d6a15
11 Feb, 2024 1 commit

Evaluate (#1385) · 1ff84897

Baber Abbasi authored Feb 11, 2024

* un-exclude `evaluate.py` from linting

* readability

* readability

* add task name to build info message

* fix link

* nit

* add functions for var and mean pooling

* add functions for var and mean pooling

* metadata compatibility with task

* rename `override_config` to `set_config` and move to `Task`

* add unit test

* nit

* nit

* bugfix

* nit

* nit

* nit

* add docstrings

* fix metadata-fewshot

* revert metric refactor

* nit

* type checking

* type hints

* type hints

* move `override_metric` to `Task`

* change metadata

* change name

* pre-commit

* rename

* remove

* remove

* `override_metric` backwards compatible with `Task`

* type hints

* use generic

* type hint

1ff84897

06 Feb, 2024 1 commit

Use Pooled rather than Combined Variance for calculating stderr of task groupings (#1390) · 94cc1850

Hailey Schoelkopf authored Feb 06, 2024

* update formula for stderr aggregation

* hack: see what happens when using stderr_for_metric bootstrapping on a group

* undo bootstrap_for_stderr test

* factor out variance-aggregation formulas into api.metrics

* fix failing tests

* remove stray print

* update comment

* further detail in comment

* add back initialize_tasks() call

* fix format

94cc1850
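
The commit above switches group stderr aggregation to pooled variance. The textbook pooled sample variance is the sum of `(n_i - 1) * s_i**2` divided by `sum(n_i) - k` for `k` subtasks. The sketch below applies it under the assumption that each subtask reports `stderr_i = s_i / sqrt(n_i)`. The function name is hypothetical, and the exact formula used in `api.metrics` may differ (a later commit in this log adjusts it).

```python
import math
from typing import Sequence


def pooled_stderr(stderrs: Sequence[float], sizes: Sequence[int]) -> float:
    """Sketch: combine per-subtask standard errors via pooled sample variance."""
    if len(stderrs) != len(sizes) or len(sizes) == 0:
        raise ValueError("stderrs and sizes must be non-empty and equally long")
    total = sum(sizes)
    dof = total - len(sizes)  # pooled degrees of freedom
    if dof <= 0:
        raise ValueError("need more samples than subtasks")
    # stderr_i**2 * n_i recovers each subtask's sample variance s_i**2.
    pooled_var = sum((n - 1) * (se ** 2) * n for se, n in zip(stderrs, sizes)) / dof
    # Standard error of the mean over all pooled samples.
    return math.sqrt(pooled_var / total)
```

Pooling assumes the subtasks share a common variance and differs from a combined variance, which would also account for differences between subtask means.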

01 Feb, 2024 2 commits

Faster Task and Group Loading, Allow Recursive Groups (#1321) · d714fc95

Lintang Sutawika authored Feb 01, 2024



* add trust_remote_code as default

* task for testing recursive

* changed source of ALL_TASKS

* tasks should only accept TaskObjects

* initialize_tasks returns list of tasks and groups

* remove trust_remote_code for now

* moved constructor process to inside load_yaml_config

* more comprehensive way to index tasks and groups

* pre-commit format

* add exit after error

* adjust how task objects are called

* no need to use get_task_dict

* load_task_or_group works but only for tasks

* pre-commit format

* half working for nested groups

* changed variable names

* allow groups and tasks to work

* temp save

* indexing and loading are part of a task_manager object

* adapted initialize_tasks

* iron out bugs

* fixed typo

* fixed typo

* simplified code

* further tidy up

* remove lines for testing

* removed test lines

* removed unused code

* remove unused import

* fixed bug

* removed comments

* a group in a list of groups can accept parameter changes like `num_fewshot`

* check if config is task update

* add GroupConfig object

* edit test yaml

* remove args

* testing returning to python task list

* add weight_by_size config

* describe weight_by_size in docs

* fix weight by size potential error

* can load individual custom python class task

* moved import_function into the config loading file

* remove print lines

* add squadv2 yaml

* temporary scroll implementation

* revert back to use load_yaml_config but with modes

* fix group being loaded with a None

* reformat

* can load unregistered tasks from a group

* update scrolls

* edit scrolls multiplechoice task

* adjust class initialization

* fix initialization

* changed how to identify group and python tasks, fix logger

* allow loading "include" that is nested in a group config

* reworked flan benchmark

* allow duplicate task in the same group to co-exist

* process group_alias

* removed group_alias

* allow parameters set in group_config to apply to all tasks in tasklist

* add function, but comment for now

* reworked processing dict-base config

* fixed how configs in group are processed

* update to allow root group to have its alias used

* remove unused classes

* remove unused classes

* revert some parts to original

* forgot to change one variable

* adapt the new process to use get_task_dict

* fix for singular group call

* fix variable names

* add TaskManager into the evaluator

* format

* changed how dict tasks are loaded

* add docs

* Update docs/new_task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update evaluator.py

* Update evaluator.py

* remove groupconfig for now

* changed _config to config

* update interface.md to explain TaskManager

* added property functions

* adjusted logger

* update write_out.py

* updated tests

* added documentation and some modifications

* added docstring documentation

* precommit format

* updated task loading for tests

* updates tests

* changed arg order for load_yaml_config

* update to handle scrolls and edit log message

* remove unused lines

* return a list of task classes and not a dict

* Update __init__.py

* Delete lm_eval/tasks/benchmarks/test.yaml

* Update task.py

* Update lm_eval/utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update utils.py

* re-added old functions with new log message

* Update docs/new_task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update new_task_guide.md

* added info regarding `get_task_dict` and documentation

* add get_config for Task

* pre-commit formatting

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

d714fc95

Enable override of printed `n-shot` in table (#1379) · 17191063
Hailey Schoelkopf authored Feb 01, 2024
```
* allow tasks to specify printed fewshot val

* fix to belebele

* update metadata field's documentation
```
17191063

31 Jan, 2024 1 commit

add bypass metric (#1156) · f8203de1

Baber Abbasi authored Feb 01, 2024

* add bypass metric

* fixed `bypass` metric.

* add task attributes if predict_only

* add `predict_only` checks

* add docs

* added `override_metric`, `override_config` to `Task`

* nits

* nit

* changed --predict_only to generations; nits

* nits

* nits

* change gen_kwargs warning

* add note about `--predict_only` in README.md

* added `predict_only`

* move table to bottom

* nit

* change null aggregation to bypass (conflict)

* bugfix; default `temp=0.0`

* typo

f8203de1

30 Jan, 2024 1 commit
- delay filter init; remove `*args` (#1369) · 1554066c
  Baber Abbasi authored Jan 30, 2024
```
* delay filter init; remove `*args`

* bugfix

* optimize

* type hint
```
  1554066c
29 Jan, 2024 1 commit
- serialize callable functions in config (#1367) · 7fc43656
  Baber Abbasi authored Jan 29, 2024
  
  7fc43656
28 Jan, 2024 1 commit

Apply some best practices and guideline recommendations to code (#1363) · 488759d2

LSinev authored Jan 28, 2024

* raise Exception, not a string

Additional info https://peps.python.org/pep-0352/#exception-hierarchy-changes
https://docs.python.org/3.8/tutorial/errors.html#raising-exceptions

* Apply PEP8 recommendation to prefer isinstance

"Object type comparisons should always use isinstance() instead of comparing types directly"
https://peps.python.org/pep-0008/

* Remove dangerous default mutable values in arguments

https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/dangerous-default-value.html

* Format logging messages with fstring (not with format)

Additional info
https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/logging-format-interpolation.html
There are also discussions about the speed of formatting while logging or some unintended code executions
https://github.com/pylint-dev/pylint/issues/2395
https://stackoverflow.com/a/54368109
but at least one format (fstring one) will be used throughout the project

* Specify utf-8 encoding for `open` explicitly

If not specified, the default encoding may differ across environments, OSes, and Python versions. See
https://peps.python.org/pep-0597/
https://docs.python.org/3.11/library/locale.html#locale.getencoding
https://docs.python.org/3.10/library/os.html#utf8-mode
https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/unspecified-encoding.html

Helps also if some code from English language tasks is taken as inspiration for tasks in non-English languages.

* Use inline-ignoring comments to pass pre-commit instead of identity process

https://flake8.pycqa.org/en/3.0.1/user/ignoring-errors.html#in-line-ignoring-errors
https://www.flake8rules.com/rules/F841.html

flake8 comments are supported by ruff: https://docs.astral.sh/ruff/linter/#error-suppression

488759d2
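
Two of the recommendations in the commit above, avoiding mutable default arguments and preferring `isinstance()` over direct type comparison, can be shown in a short sketch. All names here (`append_bad`, `append_good`, `TaskName`) are made up for illustration and do not come from the harness.

```python
from typing import List, Optional


def append_bad(item: str, bucket: List[str] = []) -> List[str]:
    # The default list is created once at definition time and is shared
    # by every call that omits `bucket`.
    bucket.append(item)
    return bucket


def append_good(item: str, bucket: Optional[List[str]] = None) -> List[str]:
    # Use None as the sentinel and build a fresh list on each call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


class TaskName(str):
    """A str subclass, used to contrast type() comparison with isinstance()."""


name = TaskName("arc_easy")
exact_type_match = type(name) == str    # False: subclasses are rejected
instance_match = isinstance(name, str)  # True: subclasses are accepted
```

Calling `append_bad` twice without a `bucket` returns the same growing list, whereas `append_good` returns an independent list each time.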

25 Jan, 2024 1 commit

`Filter` docs not offset by `doc_id` (#1349) · a0f1cacd

Baber Abbasi authored Jan 26, 2024

* get `doc` from instance

* accelerate bugfix: get ground doc from instance

* convert filter to `process_result`

* get docs from instances in `FilterEnsemble`

* rename

* nit

* better looping

* fix typehint

a0f1cacd

18 Jan, 2024 1 commit
- Fix group register (#1315) · 72ea626e
  Lintang Sutawika authored Jan 19, 2024
```
* tuple should be considered as well

* set option to keep callable as callable
```
  72ea626e
12 Jan, 2024 2 commits
- apply process_docs() to fewshot_split too (#1276) · 89618bf8
  Hailey Schoelkopf authored Jan 12, 2024
  
  89618bf8
- update versioning logging (#1271) · 75dc2b87
  Hailey Schoelkopf authored Jan 11, 2024
  
  75dc2b87
10 Jan, 2024 1 commit

Call "exact_match" once for each multiple-target sample (#1266) · 692e0f83

Baber Abbasi authored Jan 10, 2024

* Refine scoring logic for multiple_target "exact_match" metric

* skip old tests from master

* skip old tests from master

* delete tests from master

692e0f83

08 Jan, 2024 1 commit
- fixed fewshot loading for multiple input tasks (#1255) · cf6a8321
  Lintang Sutawika authored Jan 08, 2024
  
  cf6a8321
05 Jan, 2024 1 commit

Add multilingual HellaSwag task (#1228) · 28bb45fb

JorgeDeCorte authored Jan 05, 2024



* add hellaswag_nl

* add other languages and update readme to hellaswag

* refactor as new task

* update readme

* add endline to yaml files and readme.md

* add group, change folder location and update yaml file

* rename default hellaswag yaml file

* fix whitespace error in some labels

* downgrade log level of whitespace checking

---------
Co-authored-by: JorgeDeCorte <jorge.decorte@ravago.be>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

28bb45fb

04 Jan, 2024 1 commit
- Remove self.dataset_path post_init process (#1243) · e7c03d0c
  Lintang Sutawika authored Jan 04, 2024
```
* Remove self.dataset_path post_init process

* Update task.py

* Update task.py
```
  e7c03d0c
20 Dec, 2023 1 commit

Switch Linting to `ruff` (#1166) · 65b8761d

Baber Abbasi authored Dec 20, 2023

* add ruff and isort. remove black and flake8

* remove unnecessary dependencies

* remove dependency from table

* change order

* ran ruff

* check 3.9

* exclude evaluator

* update CI workflow

* use ruff config in pyproject.toml

* test

* add isort rules to ruff

* sort imports

* import `make_table`

* try stages for no-commit-to-branch

* turn on mypy for pre-commit

* test

* test

* test

* change no-commit-to-branch to default

* nits

* fixed dependency

65b8761d

15 Dec, 2023 1 commit
- fixed how to check if dataset_path is a local directory or not (#1127) · 04707a2d
  Lintang Sutawika authored Dec 15, 2023
  
  04707a2d
14 Dec, 2023 2 commits

doc_to_decontamination_query can use function (#1082) · fcb39a5a

Lintang Sutawika authored Dec 14, 2023

* doc_to_decontamination_query can use function

* add option for doc_to_decontamination_query to follow doc_to_text

* added documentation for doc_to_decontamination_query

* adjust description

* format

fcb39a5a

Additional process for doc_to_choice (#1093) · a2ed953f
Lintang Sutawika authored Dec 14, 2023
```
* Additional process for doc_to_choice

* doc_to_choice can also parse a string
```
a2ed953f

04 Dec, 2023 1 commit
- Update samplers.py · 3512f742
  Hailey Schoelkopf authored Dec 04, 2023
  
  3512f742
29 Nov, 2023 2 commits
- fixed sampler issue with new default num_fewshot value; n_shot value changed · 38ac7b2a
  baberabb authored Nov 29, 2023
  
  38ac7b2a
- change torch req for mps · 7bb147b5
  baberabb authored Nov 29, 2023
  
  7bb147b5
28 Nov, 2023 3 commits
- reformat · 37a46351
  lintangsutawika authored Nov 28, 2023
  
  37a46351
- add versions · 0d03a9f3
  lintangsutawika authored Nov 28, 2023
  
  0d03a9f3
- fixes how num_fewshot from config is read · 9e65d0ce
  lintangsutawika authored Nov 28, 2023
  
  9e65d0ce
17 Nov, 2023 1 commit
- removed provide_description arg · 0d209e2e
  lintangsutawika authored Nov 17, 2023
  
  0d209e2e
16 Nov, 2023 1 commit
- remove provide_description flag · a745d589
  haileyschoelkopf authored Nov 16, 2023
  
  a745d589
09 Nov, 2023 1 commit
- reformat · 9a64e642
  lintangsutawika authored Nov 09, 2023
  
  9a64e642