1. 22 Sep, 2025 1 commit
    • Add eqbench tasks in Spanish and Catalan (#3168) · de496b80
      priverabsc authored
      * Add eqbench tasks in Spanish and Catalan
      
      * Incremented catalan_bench and spanish_bench versions. Added a 'multilingual' folder inside 'eq_bench' and moved the eqbench_ca and eqbench_es .yaml files to that folder. Updated the tasks README with eqbench_es and eqbench_ca, stating explicitly in each description both the Hugging Face link and the translation method.
      
      * Fixed tasks table.
      
      * remove test_task.sh and results folder
      
      * Add utils.py to multilingual folder
  2. 21 Sep, 2025 4 commits
  3. 08 Sep, 2025 1 commit
  4. 02 Sep, 2025 2 commits
  5. 25 Aug, 2025 3 commits
  6. 21 Aug, 2025 4 commits
  7. 22 Jul, 2025 1 commit
  8. 19 Jul, 2025 1 commit
  9. 14 Jul, 2025 1 commit
  10. 03 Jul, 2025 1 commit
    • Truthfulqa multi harness (#3062) · e0dc33ae
      Blanca Calvo authored
      
      
      * truthfulqa-multi task
      
      * truthfulqa-multi with chat few-shot
      
      * few shot chat implementation
      
      * changed until so it outputs lists
      
      * changed dataset location
      
      * added MT task
      
      * Create README.md
      
      * do not include MT
      
      * changes for PR
      
      * tag change
      
      * removed yaml extension
      
      * adding task to the table
      
      * fix task configs
      
      * add import exception
      
      ---------
      Co-authored-by: Baber <baber@hey.com>
  11. 26 May, 2025 1 commit
  12. 19 May, 2025 1 commit
    • Adding ACPBench Hard tasks (#2980) · 0daf28fd
      Harsha authored
      * adding ACPBench_hard
      
      * adding Clingo
      
      * changing tarski to tarski[clingo]
      
      * denoting the main variants in each paper
  13. 15 May, 2025 1 commit
    • Added C4 Support (#2889) · 86a3b270
      Yufeng Xu authored
      * added c4 dataset (working)
      
      * fixed bugs in c4
      
      * fixed loading bugs in c4 dataset; using partial loading
      
      * cleaned the code
      
      * added version number for c4
      
      * removed irrelevant files
  14. 06 May, 2025 1 commit
  15. 16 Apr, 2025 2 commits
  16. 14 Apr, 2025 1 commit
  17. 02 Apr, 2025 1 commit
  18. 29 Mar, 2025 1 commit
  19. 28 Mar, 2025 1 commit
  20. 27 Mar, 2025 1 commit
  21. 21 Mar, 2025 1 commit
    • Add MMLU-ProX task (#2811) · 8aeff141
      heli-qi authored
      * update mmlu_prox configs
      
      * update tasks/README
      
      * correct hyphen to underscore in task/README
      
      * update pre-commit codes
  22. 18 Mar, 2025 3 commits
  23. 14 Mar, 2025 1 commit
    • Add various social bias tasks (#1185) · 150a1852
      Oskar van der Wal authored
      
      
      * Implementation of Winogender
      
      * Minor fixes README.md
      
      * Add winogender
      
      * Clean winogender utils.py
      
      * Change dataset to one containing All subsets
      
      * Flesh out README for BBQ task
      
      * Add missing tasks for BBQ
      
      * Add simple cooccurrence bias task
      
      * Fix wrong mask for ambiguated context; rename metrics
      
      * Made generate_until evaluation (following the PaLM paper) the default
      
      Also moved separate config files per category to separate metrics using custom function.
      Created config file for multiple_choice way of evaluating BBQ.
      
      * Add missing version metadata
      
      * Add missing version metadata for bbq multiple choice
      
      * Fix metrics and address edge cases
      
      * Made BBQ multiple choice the default version
      
      * Added settings following winogrande
      
      * Add num_fewshot to simple_cooccurrence_bias
      
      * Fixes for bbq (multiple choice)
      
      * Fix wrong dataset
      
      * CrowS-Pairs: make it easier to use another dataset by removing dataset_name from the subsets.
      
      * Use simplest prompt possible without description
      
      * Merge
      
      * BBQ: Fix np.NaN related bug
      
      * BBQ: Fix wrong aggregation method for disamb accuracy
      
      * BBQ: Make it possible to only evaluate on (dis)ambiguous subset (needed for few shot eval)
      
      * BBQ: fix showing one target in case of few-shot evals
      
      * BBQ: Fix few-shot example for bbq_generate
      
      * BBQ: simplify subtasks
      
      * BBQ: Minimize number of UNK variations to reduce inference time
      
      * BBQ: Add extra UNK keywords for the generate task
      
      * Add a generate_until version of simple_cooccurrence_bias
      
      * Change system/description prompt to include few-shot examples
      
      * Group agg rework
      
      * Run pre-commit
      
      * add tasks to readme table
      
      * remove trailing space from simple_cooccurrence_bias_gen.yaml `doc_to_text`
      
      * fix
      
      * fix
      
      * fix version
      
      ---------
      Co-authored-by: Baber <baber@hey.com>
  24. 11 Mar, 2025 1 commit
    • New healthcare benchmark: careqa (#2714) · 7c9fbcf8
      PabloAgustin authored
      
      
      * New healthcare benchmark: careqa
      
      * LAUNCH_MN5_ACC <python main.py --config config/mn5.yml --models Llama-3.2-1B-Instruct --tasks careqa_open --num_fewshot 0>
      
      * Add fixes and READMEs, and remove task_list.txt
      
      * pre-commit passed, add formatting updates; add nanmean agg_metric
      
      * Fix import error.
      
      * Wrapped imports in try excepts
      
      * Wrapped imports in try excepts; also metrics to catch bert_score import error
      
      * Try except to catch ImportErrors as well
      
      * use np.nan
      
      * pre-commit
      
      ---------
      Co-authored-by: PabloAgustin <pablo.martin@bsc.es>
      Co-authored-by: Baber <baber@hey.com>
  25. 03 Mar, 2025 1 commit
  26. 11 Feb, 2025 1 commit
    • Adding the Evalita-LLM benchmark (#2681) · b7fccef5
      Michele Resta authored
      
      
      * feat: initial commit with templates for evalita evaluation
      
      * fix: change rule for generate_until
      
      * feat: modified yaml to use reduced version of NER test datasets
      
      * feat: added templates to use reduced dataset for summarization (fanpage and ilpost)
      
      * Add Six Prompts for Each Multiple-Choice Task
      
      * feat: modified fewshot split for textual entailment task
      
      * fix: new doc_to_target function for NER tasks
      
      * Update prompt
      
      * Add partition for few-shot evaluation
      
      * Add partition for few-shot evaluation
      
      * Add partition for few-shot evaluation
      
      * Add partition for few-shot evaluation
      
      * Update prompt
      
      * Add partition for few-shot evaluation
      
      * Rename file
      
      Rename file from _evalita-mp_ner_adg_p1 .yaml to _evalita-mp_ner_adg_p1.yaml
      
      * Add partition for few-shot evaluation
      
      * Add partition for few-shot evaluation
      
      * Enhance lexical substitution management
      
      - Improve scorer calculation for better accuracy
      - Update model output postprocessing for clearer results
      - Add support for few-shot relation extraction task
      
      * Add F1 macro measure for the document dating task
      
      * Add F1-macro measure to evaluate document dating
      
      * Use the whole dataset
      
      * Small changes
      
      * Add the two prompts for the task of lexical substitution
      
      * Add few-shot split configuration
      
      * Add few-shot split configuration
      
      * Add function for handling few-shot learning setup
      
      * Fix prompt
      
      * Remove configuration file
      
      * Update dataset from test_same to test_cross for evaluations
      
      * Remove whitespace at end of prompt
      
      * Fix configuration error: corrected parameter name for the dataset used in few-shot
      
      * Fix: Check if results is not empty before processing in lexical substitution task
      
      * added the prompts and functions for correct NER and RE execution
      
      * Add accuracy measure
      
      * Add tasks for the EVALITA-LLM benchmark evaluation
      
      * Small changes
      
      Add the alias of the task name that will be printed in the final table results.
      
      * Updated the prompts to reflect changes made to the extended dataset for the Admission Test task
      
      * chore: cleaned templates before PR; feat: add configuration to run generation/ppl tasks.
      
      * fix: add information on Evalita-LLM for PR
      
      * fix: rename folders and files
      
      * fix: remove unused imports
      
      * chore: run pre-commit
      
      * chore: add task description
      
      ---------
      Co-authored-by: rzanoli <zanoli@fbk.eu>
      Co-authored-by: Marco Madeddu <marco.madeddu.bra@gmail.com>
  27. 31 Jan, 2025 1 commit
  28. 29 Jan, 2025 1 commit
    • Add Histoires Morales task (#2662) · 1208afd3
      Irina Proskurina authored
      * Add Histoires Morales task
      
      * Histoires Morales task: fix mixed line endings
      
      * Histoires Morales task: fix mixed line endings
      
      * Remove tag for a single task
      
      * Add some MT for Histoires Morales