1. 16 Apr, 2025 2 commits
  2. 15 Apr, 2025 1 commit
      Add support for quantization_config (#2842) · 758c5ed8
      Jerry Zhang authored
      * Add support for quantization_config
      
      Summary:
      Previously, quantization_config was ignored, so torchao-quantized models were not supported.
      This PR adds that support.
      
      Test Plan:
      lm_eval --model hf --model_args pretrained=jerryzh168/gemma3-int4wo --tasks hellaswag --device cuda:0 --batch_size 8
      
      * quantization_config is optional
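
A minimal sketch of what "quantization_config is optional" could look like in practice. The helper name `build_model_kwargs` and the keyword layout are hypothetical and chosen for illustration; this is not the harness's actual code.

```python
from typing import Any, Dict, Optional


def build_model_kwargs(
    pretrained: str,
    quantization_config: Optional[Dict[str, Any]] = None,
    **extra: Any,
) -> Dict[str, Any]:
    """Collect keyword arguments for a model loader (hypothetical helper)."""
    kwargs: Dict[str, Any] = {"pretrained_model_name_or_path": pretrained, **extra}
    # Forward quantization_config only when the caller supplied one, so
    # unquantized checkpoints are loaded with the same arguments as before.
    if quantization_config is not None:
        kwargs["quantization_config"] = quantization_config
    return kwargs
```

Omitting the key entirely, rather than passing `None`, keeps the call identical to the pre-change behaviour for models that carry no quantization settings.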
  3. 14 Apr, 2025 2 commits
  4. 07 Apr, 2025 1 commit
      Add `--samples` Argument for Fine-Grained Task Evaluation in... · d693dcd2
      Felipe Maia Polo authored
      
       Add `--samples` Argument for Fine-Grained Task Evaluation in `lm-evaluation-harness`. This feature is the first step towards efficient multi-prompt evaluation with PromptEval [1,2] (#2520)
      
      * added option --examples
      
      * specifying examples in dictionary
      
      * run pre-commit - fix arg type
      
      Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>
      
      * fixing bug when examples==None
      
      * fixing bug when examples==None
      
      * limit or examples must be None in simple_evaluate.py and in evaluator.py
      
      * run pre-commit (fix formatting)
      
      Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>
      
      * merge main and run pre-commit (fix formatting)
      
      Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>
      
      * Update __main__.py
      
      undefined "limit" and "examples"
      
      * update branch, fix conflicts, run pre-commit
      
      * nits
      
      * nits
      
      * change 'examples' to 'samples'
      
      ---------
      
      Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>
      Co-authored-by: mirianfrsilva <mirianfrsilva@ibm.com>
      Co-authored-by: Stella Biderman <stellabiderman@gmail.com>
      Co-authored-by: Baber <baber@hey.com>
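
The commit notes above say that samples are specified as a dictionary and that `limit` and `samples` cannot both be set. A rough sketch of that validation follows. The function names and the JSON shape (task name mapped to a list of document indices) are assumptions for illustration and may not match the harness's real argument handling.

```python
import json
from typing import Dict, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def parse_samples(
    samples: Optional[str], limit: Optional[float]
) -> Optional[Dict[str, List[int]]]:
    """Parse a JSON string such as '{"hellaswag": [0, 2]}' (assumed format)."""
    if samples is None:
        return None
    # The two options select documents in conflicting ways, so reject both.
    if limit is not None:
        raise ValueError("Either limit or samples must be None")
    parsed = json.loads(samples)
    if not isinstance(parsed, dict):
        raise ValueError("samples must be a dictionary of task -> indices")
    for task, indices in parsed.items():
        ok = isinstance(indices, list) and all(
            isinstance(i, int) and not isinstance(i, bool) and i >= 0
            for i in indices
        )
        if not ok:
            raise ValueError(f"indices for {task!r} must be non-negative integers")
    return parsed


def select_docs(docs: Sequence[T], indices: List[int]) -> List[T]:
    """Keep only the documents at the requested positions, in the given order."""
    return [docs[i] for i in indices]
```

For example, `select_docs(["a", "b", "c"], [0, 2])` keeps the first and third documents of a task.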
  5. 04 Apr, 2025 3 commits
  6. 02 Apr, 2025 2 commits
  7. 01 Apr, 2025 1 commit
  8. 30 Mar, 2025 1 commit
      Adds MMLU CoT, gsm8k and arc_challenge for llama instruct (#2829) · 3816796e
      Alexandre Marques authored
      * llama-style MMLU CoT
      
      * Refactor MMLU CoT template YAML to simplify 'until' structure
      
      * Add GSM8K task configuration for LLaMA3 with few-shot examples
      
      * Fix missing newline at end of MMLU CoT YAML file
      
      * Add ARC-Challenge task configuration and processing utility
      
      * Add additional MMLU and ARC-Challenge task variants to README
      
      * Update README with notes on arc_challenge_llama dataset preprocessing
  9. 29 Mar, 2025 1 commit
  10. 28 Mar, 2025 2 commits
  11. 27 Mar, 2025 3 commits
  12. 26 Mar, 2025 1 commit
  13. 25 Mar, 2025 1 commit
  14. 21 Mar, 2025 2 commits
  15. 20 Mar, 2025 5 commits
  16. 18 Mar, 2025 8 commits
  17. 17 Mar, 2025 3 commits
  18. 14 Mar, 2025 1 commit
      Add various social bias tasks (#1185) · 150a1852
      Oskar van der Wal authored
      
      
      * Implementation of Winogender
      
      * Minor fixes README.md
      
      * Add winogender
      
      * Clean winogender utils.py
      
      * Change dataset to one containing All subsets
      
      * Flesh out README for BBQ task
      
      * Add missing tasks for BBQ
      
      * Add simple cooccurrence bias task
      
      * Fix wrong mask for ambiguated context+rename metrics
      
      * Made generate_until evaluation (following PaLM paper) default
      
      Also moved separate config files per category to separate metrics using custom function.
      Created config file for multiple_choice way of evaluating BBQ.
      
      * Add missing version metadata
      
      * Add missing version metadata for bbq multiple choice
      
      * Fix metrics and address edge cases
      
      * Made BBQ multiple choice the default version
      
      * Added settings following winogrande
      
      * Add num_fewshot to simple_cooccurrence_bias
      
      * Fixes for bbq (multiple choice)
      
      * Fix wrong dataset
      
      * CrowS-Pairs: make it easier to use another dataset by removing dataset_name from the subsets.
      
      * Use simplest prompt possible without description
      
      * Merge
      
      * BBQ: Fix np.NaN related bug
      
      * BBQ: Fix wrong aggregation method for disamb accuracy
      
      * BBQ: Make it possible to only evaluate on (dis)ambiguous subset (needed for few shot eval)
      
      * BBQ: fix showing one target in case of few-shot evals
      
      * BBQ: Fix few-shot example for bbq_generate
      
      * BBQ: simplify subtasks
      
      * BBQ: Minimize number of UNK variations to reduce inference time
      
      * BBQ: Add extra UNK keywords for the generate task
      
      * Add a generate_until version of simple_cooccurrence_bias
      
      * Change system/description prompt to include few-shot examples
      
      * Group agg rework
      
      * Run pre-commit
      
      * add tasks to readme table
      
      * remove trailing space from simple_cooccurrence_bias_gen.yaml `doc_to_text`
      
      * fix
      
      * fix
      
      * fix version
      
      ---------
      Co-authored-by: Baber <baber@hey.com>
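
Several of the BBQ notes above concern NaN handling and aggregating accuracy over only the disambiguated subset. The sketch below shows one common way to do this: mark out-of-subset examples as NaN and ignore them when averaging. The function names are hypothetical, and this is an illustrative pattern, not the task's actual metric code.

```python
import math
from typing import Iterable


def disamb_score(is_correct: bool, is_disambiguated: bool) -> float:
    """Per-example score; NaN marks examples outside the disambiguated subset."""
    if not is_disambiguated:
        return float("nan")
    return 1.0 if is_correct else 0.0


def nan_safe_mean(scores: Iterable[float]) -> float:
    """Average the scores while skipping NaN placeholders."""
    valid = [s for s in scores if not math.isnan(s)]
    # With no in-subset examples there is nothing to average, so report NaN
    # instead of dividing by zero.
    if not valid:
        return float("nan")
    return sum(valid) / len(valid)
```

A plain mean over these scores would return NaN as soon as one ambiguous example appears, which is why the aggregation filters first.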