  1. 28 Jan, 2025 1 commit
  2. 17 Jan, 2025 1 commit
  3. 15 Jan, 2025 2 commits
    • assistant prefill (#2615) · 703fbffd
      Baber Abbasi authored
      * add assistant prefix
      
      * add arc_challenge from llama
      
      * nit
      
      * nit
      
      * nit
      
      * add assistant prefix
      
      * add mmlu_llama
      
      * nit
      
      * nit
      
      * Revert "nit"
      
      This reverts commit 6a97f8356237305e375212b966b30e8de59dd4bc.
      
      * fix regex bug
      
      * add assistant_prefix to vllm
      
      * add `Question:`
      
      * add mmlu_pro
      
      * add fewshot assistant_prefix
      
      * use `assistant_prefill`
      
      * typehints
      
      * nits
      
      * nits
      
      * add to docs
      
      * add readme
    • Add HumanEval (#1992) · 4c11206b
      Hojin Lee authored
      
      
      * add custom filter
      
      * fix type casting of references
      
      * add humaneval
      
      * fix a bug in humaneval
      
      * add greedy version of humaneval
      
      * update tasks README
      
      * test humaneval
      
      * return multiple metrics
      
      * nit
      
      * add confirmation to run code tasks
      
      * nit
      
      * nit
      
      ---------
      Co-authored-by: Hojin Lee <19949034+hjlee1371@users.noreply.github.com>
      Co-authored-by: Baber <baber@hey.com>
  4. 04 Jan, 2025 1 commit
    • some minor logging nits (#2609) · 888ac292
      Baber Abbasi authored
      * remove yaml extension from phrases_va_common
      
      * remove yaml extension from winogenerated
      
      * remove yaml extension from phrases_es
      
      * no cache debug logging when not used
  5. 29 Nov, 2024 1 commit
  6. 11 Nov, 2024 1 commit
  7. 07 Nov, 2024 1 commit
  8. 08 Oct, 2024 1 commit
  9. 07 Oct, 2024 1 commit
  10. 04 Oct, 2024 1 commit
  11. 17 Sep, 2024 1 commit
  12. 13 Sep, 2024 1 commit
    • Multimodal prototyping (#2243) · fb963f0f
      Lintang Sutawika authored
      
      
      * add WIP hf vlm class
      
      * add doc_to_image
      
      * add mmmu tasks
      
      * fix merge conflicts
      
      * add lintang's changes to hf_vlms.py
      
      * fix doc_to_image
      
      * added yaml_path for config-loading
      
      * revert
      
      * add line to process str type v
      
      * update
      
      * modeling cleanup
      
      * add aggregation for mmmu
      
      * rewrite MMMU processing code based on only MMMU authors' repo (doc_to_image still WIP)
      
      * implemented doc_to_image
      
      * update doc_to_image to accept list of features
      
      * update functions
      
      * readd image processed
      
      * update args process
      
      * bugfix for repeated images fed to model
      
      * push WIP loglikelihood code
      
      * commit most recent code (generative ; qwen2-vl testing)
      
      * preliminary image_token_id handling
      
      * small mmmu update: some qs have >4 mcqa options
      
      * push updated modeling code
      
      * use processor.apply_chat_template
      
      * add mathvista draft
      
      * nit
      
      * nit
      
      * ensure no footguns in text<>multimodal LM<>task incompatibility
      
      * add notification to readme regarding launch of prototype!
      
      * fix compatibility check
      
      * reorganize mmmu configs
      
      * chat_template=None
      
      * add interleave chat_template
      
      * add condition
      
      * add max_images; interleave=true
      
      * nit
      
      * testmini_mcq
      
      * nit
      
      * pass image string; convert img
      
      * add vllm
      
      * add init
      
      * vlm add multi attr
      
      * fixup
      
      * pass max images to vllm model init
      
      * nit
      
      * encoding to device
      
      * fix HFMultimodalLM.chat_template ?
      
      * add mmmu readme
      
      * remove erroneous prints
      
      * use HFMultimodalLM.chat_template ; restore tasks/__init__.py
      
      * add docstring for replace_placeholders in utils
      
      * fix `replace_placeholders`; set image_string=None
      
      * fix typo
      
      * cleanup + fix merge conflicts
      
      * update MMMU readme
      
      * del mathvista
      
      * add some sample scores
      
      * Update README.md
      
      * add log msg for image_string value
      
      ---------
      Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
      Co-authored-by: Baber Abbasi <baber@eleuther.ai>
      Co-authored-by: Baber <baber@hey.com>
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
  13. 05 Aug, 2024 1 commit
    • Mmlu Pro (#1961) · 69d56f45
      Yu Shi Jie authored
      
      
      * initialized mmlu_pro task
      
      * added generative mmlu-pro
      
      * added cot fewshot for mmlu-pro
      
      * Initial commit
      
      * updated mmlu-pro to take on 3 splits: test, val, dev
      
      * mmlu-pro: added continuation and flan_cot_zeroshot
      
      * added README.md for mmlu_pro
      
      * removed
      
      * update files
      
      * moved files out, and removed unused versions
      
      * updated
      
      * mmlu_pro:
      
      -changed task 'other' to 'miscellaneous'
      there is already a group named 'other'
      task and group with the same alias (e.g. mmlu_pro_other_generative) throws an error
      
      -fixed yaml backslash escape for fewshot cot
      
      * changed choices -> options in yaml config to fit dataset schema
      
      * ONLY FOR DEFAULT: fixed yaml file to use variable number of choices
      
      * mmlu-pro: fixed doc_to_text/choice/target configs for all variants
      
      * mmlu-pro: minor fixes
      
      * mmlu-pro/default: aligned with mmlu updates
      
      * mmlu-pro: update yaml content in line with mmlu
      
      * mmlu-pro: fixed mislabelling of task (math->chemistry)
      
      * mmlu-pro: fixed yaml formatting
      
      * add custom fewshot doc_to_text, target, and choice
      
      * add process for each subtask
      
      * add process for each subtask
      
      * pre-commit
      
      * pre-commit
      
      * format
      
      * resolved left out merge
      
      * deleted folders + updated readme
      
      * Update evaluator.py
      
      * Update evaluator.py
      
      ---------
      Co-authored-by: Yu Shi Jie <shijie@tensorplex.ai>
      Co-authored-by: lintangsutawika <lintang@eleuther.ai>
      Co-authored-by: root <root@455bdd73-01.cloud.together.ai>
      Co-authored-by: Lintang Sutawika <lintang@sutawika.com>
  14. 15 Jul, 2024 1 commit
  15. 12 Jul, 2024 1 commit
    • Irokobench: Benchmark Dataset for African languages (#2042) · 383bbd54
      Jess authored
      
      
      * add afrixnli to task
      
      * add chat completion
      
      * remove chat completion -untested
      
      * afrimmlu added
      
      * afrimmlu folder update
      
      * afrimmlu folder update
      
      * updated prompt
      
      * remove print
      
      * add afrimgsm -direct
      
      * add squad metric
      
      * fix bash script
      
      * remove direct util, update common yaml
      
      * remove print
      
      * add few shot. metric fixes
      
      * fix direct path, add bash script for gpt models
      
      * added translate test
      
      * update afrixnli tasks
      
      * update afrixnli tasks
      
      * update metrics for afrixnli
      
      * prompt translations fix
      
      * prompt translations fix
      
      * filter and metric fix -mgsm
      
      * remove squad metric
      
      * remove squad metric
      
      * add f1 score to mgsm
      
      * add f1 score to mgsm
      
      * update native-direct with lin
      
      * change f1 function
      
      * add lin to utils
      
      * add utils
      
      * remove test limit
      
      * remove test configs
      
      * add swahili to mmlu
      
      * change eng to ewe in ewe yaml mmlu
      
      * add squad metric to mgsm, remove whitespace filter
      
      * added translate test
      
      * added afrixnli_translate
      
      * fix exact match valueError
      
      * fix exact match valueError
      
      * restructure mmlu folder
      
      * spacing
      
      * remove afrimmlu_translate folder
      
      * add utility
      
      * format task name, clean ups
      
      * modified mgsm
      
      * update on afrimgsm
      
      * update on afrimgsm
      
      * removed utils
      
      * other mgsm varieties
      
      * other mgsm varieties
      
      * adding translate direct
      
      * Update translate_direct_yaml
      
      * add manual xnli prompt, add multichoice for openai models, and adapt multichoice metric for openai model
      
      * edit for open models
      
      * Update translate_direct_yaml
      
      * add verbalizer for xnli
      
      * change xnli from multiple choice to generate
      
      * add manual accuracy scores
      
      * revert xnli to multiple choice
      
      * change afrimgsm utils
      
      * revert xnli to multiple_choice
      
      * cleanups and readmes
      
      * remove openai fixes and unused regex
      
      * pr review changes
      
      * revert metrics.py, task.py and extraction.py to main version
      
      ---------
      Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de>
      Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>
  16. 08 Jul, 2024 1 commit
    • Group agg rework (#1741) · 517aadc4
      Lintang Sutawika authored
      
      
      * add group_config arg
      
      * add a group config that allows disabling table for group score and group aggregate in general
      
      * fixed size configuration
      
      * adjust config
      
      * add group config
      
      * adjust mmlu to use group_config
      
      * fixed args input in aggregate_subtask_metrics
      
      * fixed issues related to printing alias of group and updated yaml
      
      * update all mmlu variants to include group_config
      
      * edit format
      
      * modify mmlu tasks
      
      * adjust group to also be a configurable group
      
      * add configurable group
      
      * simplify get_task_list
      
      * adjust group scoring with using ConfigurableGroup
      
      * adjust args
      
      * update mmlu
      
      * update mmlu
      
      * update to work with new group and task configuration
      
      * readd group_agg
      
      * readd files
      
      * move prepare_print_tasks to evaluator_utils
      
      * sort set to False by default, fix predict_only arg
      
      * add version for groups
      
      * reversed task list
      
      * update additional condition when loading a group in a group yaml
      
      * update truthfulqa
      
      * add description regarding tags replacing group
      
      * replace group to tag
      
      * fixed conditional statement
      
      * remove warning
      
      * update loading of task group and newly added tags
      
      * reformat with pre-commit
      
      * fixed info log
      
      * update
      
      * fix bug
      
      * fix bug
      
      * use task id to differentiate tasks
      
      * convert all groups to configurable groups
      
      * use task_id
      
      * reformat
      
      * add task_id for python tasks as well
      
      * add task_id for python tasks as well
      
      * add task_id for python tasks as well
      
      * revert truthfulqa
      
      * revert mmlu tasks
      
      * new mmlu config
      
      * new group config parameter `tag_to_task`
      
      * Update truthfulqa_mc2.yaml
      
      * reformat
      
      * add _process_group_config
      
      * adjust task_id
      
      * add get_subtask_list function to get proper subtask list
      
      * group config to_dict update
      
      * remove tag check
      
      * update mmlu
      
      * fix config passing issues
      
      * add test yaml
      
      * format fix
      
      * add documentation
      
      * corner case for single tag being called
      
      * fix indentation
      
      * formatting
      
      * update all mmlu variants
      
      * Update docs/task_guide.md
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * remove group_alias
      
      * Update docs/task_guide.md
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * remove version for metadata
      
      * Update docs/task_guide.md
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * update mmlu/
      
      * removed " " in make_table
      
      * change how aggregate_metric is loaded
      
      * change how aggregate_metric is loaded
      
      * update aggregate_metric arg
      
      * update format
      
      * update format
      
      * some docs fixes
      
      * add groups for agieval, aexams, aclue
      
      * add more explicit aggregation groups
      
      * add more groupings / tags distinctions
      
      * add more groupings
      
      * more groupings
      
      * add many explicit group configs
      
      * add many explicit group configs
      
      * add more explicit group configs
      
      * add more explicit group configs
      
      * add more error msgs, agg_metric -> agg_metric_list
      
      * some docs updates
      
      * update task_id to be updateable and uses group:task format
      
      * make KMMLU a tag for now
      
      * update docs
      
      * don't duplicate task names
      
      * fix merge conflicts?
      
      * giving this a try
      
      * clean up diff
      
      * switch mmlu variants over to using
      
      * don't use to-be-deprecated group: config field in overview notebook
      
      * Python tasks which subclass ConfigurableTask now run
      
      * update mmlu
      
      * pre-commit format
      
      * fixed sorting for multi-level printing
      
      * move group api to separate file
      
      * fix bbh aggregation filter usage
      
      * track api/group.py
      
      * adjust group and tags loading
      
      * make explicit group configs for leaderboard and other newer tasks
      
      * fix arabicmmlu
      
      * update
      
      * change arabicmmlu template name???
      
      * update group alias
      
      * fix printing bugs
      
      * check table printing is correct ; update tests
      
      * use mmlu_stem to have a group included in print tests
      
      ---------
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
  17. 25 Jun, 2024 1 commit
  18. 13 Jun, 2024 1 commit
  19. 03 Jun, 2024 1 commit
  20. 31 May, 2024 1 commit
  21. 06 May, 2024 1 commit
    • Provide ability for custom sampler for ConfigurableTask (#1616) · ae72cebc
      LSinev authored
      * Added fewshot sampling seeds to evaluator.simple_evaluate signature
      
      Way to control seed of fewshot sampling
      may help with #1591
      
      * Added ability for custom sampler for ConfigurableTask
      
      May be set in config like
      ```
      fewshot_config:
        sampler: !function utils.MyFewshotSampler
      ```
      
      * explicitly set fewshot random generator seed for HFLM generate_until_task test
      
      * add backward compatibility for three args seed setup
      
      * save seeds info to logs/reports
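The `sampler: !function utils.MyFewshotSampler` entry above points at a user-defined class. A minimal sketch of what such a class could look like follows. The constructor arguments (`docs`, `rnd`) and the `sample(n)` method are assumptions for illustration only; the commit message does not state the interface the harness actually requires, so check the harness's sampler base class before relying on this shape.

```python
import random


class MyFewshotSampler:
    """Hypothetical few-shot sampler (assumed interface, not the harness's API)."""

    def __init__(self, docs, rnd=None):
        # `docs`: any sequence of candidate few-shot documents (assumed shape).
        self.docs = list(docs)
        # Use the caller's seeded RNG if given; otherwise fall back to a fixed seed
        # so repeated runs draw the same examples.
        self.rnd = rnd if rnd is not None else random.Random(1234)

    def sample(self, n):
        # Draw n distinct documents without replacement.
        return self.rnd.sample(self.docs, n)


# Illustrative data only.
docs = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(10)]
first = MyFewshotSampler(docs, rnd=random.Random(0)).sample(3)
second = MyFewshotSampler(docs, rnd=random.Random(0)).sample(3)
```

Passing an explicitly seeded `random.Random` is what makes the two draws above identical, which is the kind of control over few-shot sampling seeds this commit describes.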
  22. 25 Apr, 2024 1 commit
  23. 18 Mar, 2024 1 commit
    • Cleanup for v0.4.2 release (#1573) · 5627e819
      Hailey Schoelkopf authored
      * Update interface.md
      
      * fix: make caching reqs always work with accelerate launch
      
      * remove stale task migration checklist
      
      * remove deprecation warnings
      
      * make informative TypeErrors for get_task_dict
      
      * bump version metadata
      
      * fix num_fewshot printing bug
      
      * add fewshot value to cache key
  24. 10 Mar, 2024 1 commit
  25. 06 Mar, 2024 1 commit
  26. 27 Feb, 2024 1 commit
    • Refactor `evaluater.evaluate` (#1441) · 5ccd65d4
      Baber Abbasi authored
      
      
      * change `all_gather` to `gather`
      
      * add TaskOutput utility class
      
      * Add FilterResults class and refactor task handling.
      
      * Rename `key` to `filter_key` for clarity
      
      * Add `print_writeout` function in utils.py
      
      * Add function to calculate limit size.
      
      * Add doc_iterator method to Task class
      
      * Refactor `doc_iterator` and cleanup in Task class
      
      * remove superfluous bits
      
      * change `all_gather` to `gather`
      
      * bugfix
      
      * bugfix
      
      * fix `gather`
      
      * Refactor `gather` loop
      
      * Refactor aggregate metrics calculation
      
      * Refactor and simplify aggregate metrics calculation
      Removed unused code
      
      * Simplify metrics calculation and remove unused code.
      
      * simplify the metrics calculation in `utils.py` and `evaluator.py`.
      
      * Fix group metric
      
      * change evaluate to hf_evaluate
      
      * change evaluate to hf_evaluate
      
      * add docs
      
      * add docs
      
      * nits
      
      * make isslice keyword only
      
      * nit
      
      * add todo
      
      * nit
      
      * nit
      
      * nit: swap order samples_metrics tuple
      
      * move instance sorting outside loop
      
      * nit
      
      * nit
      
      * Add __repr__ for ConfigurableTask
      
      * nit
      
      * nit
      
      * Revert "nit"
      
      This reverts commit dab8d9977a643752a17f840fd8cf7e4b107df28f.
      
      * fix some logging
      
      * nit
      
      * fix `predict_only` bug. thanks to `@LSinev`!
      
      * change `print_tasks` to `prepare_print_tasks`
      
      * nits
      
      * move eval utils
      
      * move eval utils
      
      * nit
      
      * add comment
      
      * added tqdm descriptions
      
      * Update lm_eval/evaluator_utils.py
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * fix mgsm bug
      
      * nit
      
      * fix `build_all_requests`
      
      * pre-commit
      
      * add ceil to limit
      
      ---------
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
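The "Add function to calculate limit size" and "add ceil to limit" entries above suggest that a fractional limit is converted to a document count by rounding up. A rough sketch of that idea follows. The function name `limit_to_sample_size` and the rule that values below 1.0 are treated as a fraction of the dataset are assumptions made for illustration, not the harness's confirmed behaviour.

```python
import math


def limit_to_sample_size(num_docs, limit):
    """Hypothetical helper: turn a `limit` setting into a number of documents."""
    if limit is None:
        # No limit requested: the caller evaluates every document.
        return None
    if limit < 1.0:
        # Assumed: a value below 1.0 is a fraction of the dataset.
        # Rounding up keeps at least one document for any positive fraction.
        return int(math.ceil(num_docs * limit))
    # Assumed: a value of 1.0 or more is an absolute document count.
    return int(limit)


small = limit_to_sample_size(3, 0.1)      # fraction of a tiny dataset
tenth = limit_to_sample_size(1319, 0.1)   # fraction of a larger dataset
fixed = limit_to_sample_size(100, 10)     # absolute count
```

Rounding up instead of truncating matters mostly for small datasets, where truncation could reduce a fractional limit to zero documents.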
  27. 26 Feb, 2024 2 commits
    • Cont metrics (#1475) · 96d185fa
      Lintang Sutawika authored
      
      
      * add brier_score
      
      * process brier_score
      
      * brier score is working for N-sized class
      
      * fixed brier score
      
      * add TED to BigBench and Brier score to MMLU
      
      * format
      
      * Update metrics.py
      
      * Update task.py
      
      * Update generate_until_template_yaml
      
      * Delete lm_eval/tasks/bigbench/aux_metric.py
      
      * Update generate_until_template_yaml
      
      * Update _default_template_yaml
      
      * Update _generate_configs.py
      
      * Update _generate_configs.py
      
      * Update _generate_configs.py
      
      * fix (format?)
      
      * format?
      
      * format, once more
      
      ---------
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
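For readers unfamiliar with the metric named above, the multi-class ("N-sized class") Brier score is the mean, over examples, of the squared distance between the predicted probability vector and the one-hot gold label. The sketch below shows that textbook definition only. The function name and input layout are illustrative assumptions, and the harness's own implementation may normalise or aggregate differently.

```python
def brier_score(probs, labels):
    """Textbook multi-class Brier score (illustrative, not the harness's code).

    probs:  list of per-example probability lists, one entry per class.
    labels: list of gold class indices, aligned with `probs`.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        # Squared error against the one-hot encoding of the gold class.
        total += sum((p_k - (1.0 if k == y else 0.0)) ** 2 for k, p_k in enumerate(p))
    return total / len(labels)


# Confident and correct predictions score 0.
perfect = brier_score([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0, 1])
# A uniform guess over 4 classes scores 3 * 0.25**2 + 0.75**2.
uniform = brier_score([[0.25, 0.25, 0.25, 0.25]], [2])
```

Lower is better: unlike accuracy, the score penalises a model for being confidently wrong and rewards well-calibrated probabilities.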
    • Create a means for caching task registration and request building. Ad… (#1372) · 1e6c9272
      Aaron V authored
      
      
      * Create a means for caching task registration and request building. Add the ability to specify an args dict for simple_evaluate().
      
      * Remove extra S in cache path in caching module
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * Rename requests cache args, make model_args polymorphic so that a dict can also be accepted.
      
      * Update docs to reflect new caching behavior, add CLI args for requests caching. Create a function for deleting items in the cache.
      
      * Update documentation, fix minor bug with arg parsing for requests caching where an undefined variable was used.
      
      * Remove line from gitignore, add to cli for caching datasets.
      
      * Add hashing suffix to .pickles. Update test script typo.
      
      * Favor isinstance() over type() in evaluator.py
      
      * Add tests for caching, gets tests working, remove unneeded arg from build_all_requests().
      
      * Update arg description to simple_evaluate.
      
      * Update pyproject.toml
      
      * Fix typehint
      
      * Remove the use of random() for creating default cache pickle hash.
      
      * Check that cache dir exists before clearing it in request cache tests.
      
      * Fix linting problems.
      
      * Fix additional formatting errors.
      
      * Remove trailing whitespace.
      
      * Add new line to the end of .gitignore.
      
      ---------
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
  28. 22 Feb, 2024 1 commit
  29. 11 Feb, 2024 1 commit
    • Evaluate (#1385) · 1ff84897
      Baber Abbasi authored
      * un-exclude `evaluate.py` from linting
      
      * readability
      
      * readability
      
      * add task name to build info message
      
      * fix link
      
      * nit
      
      * add functions for var and mean pooling
      
      * add functions for var and mean pooling
      
      * metadata compatibility with task
      
      * rename `override_config` to `set_config` and move to `Task`
      
      * add unit test
      
      * nit
      
      * nit
      
      * bugfix
      
      * nit
      
      * nit
      
      * nit
      
      * add docstrings
      
      * fix metadata-fewshot
      
      * revert metric refactor
      
      * nit
      
      * type checking
      
      * type hints
      
      * type hints
      
      * move `override_metric` to `Task`
      
      * change metadata
      
      * change name
      
      * pre-commit
      
      * rename
      
      * remove
      
      * remove
      
      * `override_metric` backwards compatible with `Task`
      
      * type hints
      
      * use generic
      
      * type hint
  30. 01 Feb, 2024 2 commits
    • Faster Task and Group Loading, Allow Recursive Groups (#1321) · d714fc95
      Lintang Sutawika authored
      
      
      * add trust_remote_code as default
      
      * task for testing recursive
      
      * changed source of ALL_TASKS
      
      * tasks should only accept TaskObjects
      
      * initialize_tasks returns list of tasks and groups
      
      * remove trust_remote_code for now
      
      * moved constructor process to inside load_yaml_config
      
      * more comprehensive way to index tasks and groups
      
      * pre-commit format
      
      * add exit after error
      
      * adjust how task objects are called
      
      * no need to use get_task_dict
      
      * load_task_or_group works but only for tasks
      
      * pre-commit format
      
      * half working for nested groups
      
      * changed variable names
      
      * allow groups and tasks to work
      
      * temp save
      
      * indexing and loading are part of a task_manager object
      
      * adapted initialize_tasks
      
      * iron out bugs
      
      * fixed typo
      
      * fixed typo
      
      * simplified code
      
      * further tidy up
      
      * remove lines for testing
      
      * removed test lines
      
      * removed unused code
      
      * remove unused import
      
      * fixed bug
      
      * removed comments
      
      * group in a list of group can accept parameter changes like `num_fewshot`
      
      * add trust_remote_code as default
      
      * task for testing recursive
      
      * changed source of ALL_TASKS
      
      * tasks should only accept TaskObjects
      
      * initialize_tasks returns list of tasks and groups
      
      * remove trust_remote_code for now
      
      * moved constructor process to inside load_yaml_config
      
      * more comprehensive way to index tasks and groups
      
      * pre-commit format
      
      * add exit after error
      
      * adjust how task objects are called
      
      * no need to use get_task_dict
      
      * load_task_or_group works but only for tasks
      
      * pre-commit format
      
      * half working for nested groups
      
      * changed variable names
      
      * allow groups and tasks to work
      
      * temp save
      
      * indexing and loading are part of a task_manager object
      
      * adapted initialize_tasks
      
      * iron out bugs
      
      * fixed typo
      
      * fixed typo
      
      * simplified code
      
      * further tidy up
      
      * remove lines for testing
      
      * removed test lines
      
      * removed unused code
      
      * remove unused import
      
      * fixed bug
      
      * removed comments
      
      * group in a list of group can accept parameter changes like `num_fewshot`
      
      * check if config is task update
      
      * add GroupConfig object
      
      * edit test yaml
      
      * remove args
      
      * testing returning to python task list
      
      * add weight_by_size config
      
      * describe weight_by_size in docs
      
      * fix weight by size potential error
      
      * can load individual custom python class task
      
      * moved import_function into the config loading file
      
      * remove print lines
      
      * add squadv2 yaml
      
      * temporary scroll implementation
      
      * revert back to use load_yaml_config but with modes
      
      * fix group being loaded with a None
      
      * reformat
      
      * can load unregistered tasks from a group
      
      * update scrolls
      
      * edit scrolls multiplechoice task
      
      * adjust class initialization
      
      * fix initialization
      
      * changed how to identify group and python tasks, fix logger
      
      * allow loading "include" that is nested in a group config
      
      * reworked flan benchmark
      
      * allow duplicate task in the same group to co-exist
      
      * process group_alias
      
      * removed group_alias
      
      * allow parameters set in group_config to apply to all tasks in tasklist
      
      * add function, but comment for now
      
      * reworked processing dict-base config
      
      * fixed how configs in group are processed
      
      * update to allow root group to have its alias used
      
      * remove unused classes
      
      * remove unused classes
      
      * revert some parts to original
      
      * forgot to change one variable
      
      * adapt the new process to use get_task_dict
      
      * fix for singular group call
      
      * fix variable names
      
      * add TaskManager into the evaluator
      
      * format
      
      * changed how dict tasks are loaded
      
      * add docs
      
      * Update docs/new_task_guide.md
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * Update evaluator.py
      
      * Update evaluator.py
      
      * remove groupconfig for now
      
      * changed _config to config
      
      * update interface.md to explain TaskManager
      
      * added property functions
      
      * adjusted logger
      
      * update write_out.py
      
      * updated tests
      
      * added documentation and some modifications
      
      * added docstring documentation
      
      * precommit format
      
      * updated task loading for tests
      
      * updates tests
      
      * changed arg order for load_yaml_config
      
      * update to handle scrolls and edit log message
      
      * remove unused lines
      
      * return a list of task classes and not a dict
      
      * Update __init__.py
      
      * Delete lm_eval/tasks/benchmarks/test.yaml
      
      * Update task.py
      
      * Update lm_eval/utils.py
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * Update lm_eval/utils.py
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * Update utils.py
      
      * re-added old functions with new log message
      
      * Update docs/new_task_guide.md
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * Update new_task_guide.md
      
      * added info regarding `get_task_dict` and documentation
      
      * add get_config for Task
      
      * pre-commit formatting
      
      ---------
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
    • Enable override of printed `n-shot` in table (#1379) · 17191063
      Hailey Schoelkopf authored
      * allow tasks to specify printed fewshot val
      
      * fix to belebele
      
      * update metadata field's documentation
  31. 31 Jan, 2024 1 commit
    • add bypass metric (#1156) · f8203de1
      Baber Abbasi authored
      * add bypass metric
      
      * fixed `bypass` metric.
      
      * add task attributes if predict_only
      
      * add `predict_only` checks
      
      * add docs
      
      * added `override_metric`, `override_config` to `Task`
      
      * nits
      
      * nit
      
      * changed --predict_only to generations; nits
      
      * nits
      
      * nits
      
      * change gen_kwargs warning
      
      * add note about `--predict_only` in README.md
      
      * added `predict_only`
      
      * move table to bottom
      
      * nit
      
      * change null aggregation to bypass (conflict)
      
      * bugfix; default `temp=0.0`
      
      * typo
  32. 30 Jan, 2024 1 commit
  33. 29 Jan, 2024 1 commit
  34. 25 Jan, 2024 1 commit
    • `Filter` docs not offset by `doc_id` (#1349) · a0f1cacd
      Baber Abbasi authored
      * get `doc` from instance
      
      * accelerate bugfix: get ground doc from instance
      
      * convert filter to `process_result`
      
      * get docs from instances in `FilterEnsemble`
      
      * rename
      
      * nit
      
      * better looping
      
      * fix typehint
  35. 18 Jan, 2024 1 commit
  36. 12 Jan, 2024 2 commits