Commit abd17276 authored by Baber's avatar Baber
Browse files

Merge branch 'smolrefact' into tasklist

# Conflicts:
#	lm_eval/__main__.py
#	lm_eval/api/group.py
#	lm_eval/api/task.py
#	lm_eval/evaluator_utils.py
#	lm_eval/tasks/__init__.py
#	lm_eval/utils.py
#	pyproject.toml
parents 00afd536 70314843
......@@ -32,10 +32,8 @@ repos:
rev: v0.12.5
hooks:
# Run the linter.
- id: ruff
args:
- --fix
# Run the formatter.
- id: ruff-check
args: [--fix]
- id: ruff-format
- repo: https://github.com/codespell-project/codespell
rev: v2.4.1
......
......@@ -8,71 +8,160 @@ A majority of users run the library by cloning it from Github, installing the pa
Equivalently, running the library can be done via the `lm-eval` entrypoint at the command line.
This mode supports a number of command-line arguments, the details of which can also be seen via running with `-h` or `--help`:
### Subcommand Structure
- `--model` : Selects which model type or provider is evaluated. Must be a string corresponding to the name of the model type/provider being used. See [the main README](https://github.com/EleutherAI/lm-evaluation-harness/tree/main#model-apis-and-inference-servers) for a full list of enabled model names and supported libraries or APIs.
The CLI now uses a subcommand structure for better organization:
- `--model_args` : Controls parameters passed to the model constructor. Accepts a string containing comma-separated keyword arguments to the model class of the format `"arg1=val1,arg2=val2,..."`, such as, for example `--model_args pretrained=EleutherAI/pythia-160m,dtype=float32`. For a full list of what keyword arguments, see the initialization of the `lm_eval.api.model.LM` subclass, e.g. [`HFLM`](https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/models/huggingface.py#L66)
- `lm-eval run` - Execute evaluations (default behavior)
- `lm-eval ls` - List available tasks, models, etc.
- `lm-eval validate` - Validate task configurations
- `--tasks` : Determines which tasks or task groups are evaluated. Accepts a comma-separated list of task names or task group names. Must be solely comprised of valid tasks/groups. A list of supported tasks can be viewed with `--tasks list`.
For backward compatibility, if no subcommand is specified, `run` is automatically inserted. So `lm-eval --model hf --tasks hellaswag` is equivalent to `lm-eval run --model hf --tasks hellaswag`.
- `--num_fewshot` : Sets the number of few-shot examples to place in context. Must be an integer.
### Run Command Arguments
- `--gen_kwargs` : takes an arg string in same format as `--model_args` and creates a dictionary of keyword arguments. These will be passed to the models for all called `generate_until` (free-form or greedy generation task) tasks, to set options such as the sampling temperature or `top_p` / `top_k`. For a list of what args are supported for each model type, reference the respective library's documentation (for example, the documentation for `transformers.AutoModelForCausalLM.generate()`.) These kwargs will be applied to all `generate_until` tasks called--we do not currently support unique gen_kwargs or batch_size values per task in a single run of the library. To control these on a per-task level, set them in that task's YAML file.
The `run` command supports a number of command-line arguments. Details can also be seen via running with `-h` or `--help`:
- `--batch_size` : Sets the batch size used for evaluation. Can be a positive integer or `"auto"` to automatically select the largest batch size that will fit in memory, speeding up evaluation. One can pass `--batch_size auto:N` to re-select the maximum batch size `N` times during evaluation. This can help accelerate evaluation further, since `lm-eval` sorts documents in descending order of context length.
#### Configuration
- `--max_batch_size` : Sets the maximum batch size to try to fit in memory, if `--batch_size auto` is passed.
- `--config` **[path: str]** : Set initial arguments from a YAML configuration file. Takes a path to a YAML file that contains argument values. This allows you to specify complex configurations in a file rather than on the command line. Further CLI arguments can override values from the configuration file.
- `--device` : Sets which device to place the model onto. Must be a string, for example, `"cuda", "cuda:0", "cpu", "mps"`. Defaults to "cuda", and can be ignored if running multi-GPU or running a non-local model type.
For the complete list of available configuration fields and their types, see [`EvaluatorConfig` in the source code](../lm_eval/config/evaluate_config.py).
- `--output_path` : A string of the form `dir/file.jsonl` or `dir/`. Provides a path where high-level results will be saved, either into the file named or into the directory named. If `--log_samples` is passed as well, then per-document outputs and metrics will be saved into the directory as well.
#### Model and Tasks
- `--log_samples` : If this flag is passed, then the model's outputs, and the text fed into the model, will be saved at per-document granularity. Must be used with `--output_path`.
- `--model` **[str, default: "hf"]** : Selects which model type or provider is evaluated. Must be a string corresponding to the name of the model type/provider being used. See [the main README](https://github.com/EleutherAI/lm-evaluation-harness/tree/main#model-apis-and-inference-servers) for a full list of enabled model names and supported libraries or APIs.
- `--limit` : Accepts an integer, or a float between 0.0 and 1.0 . If passed, will limit the number of documents to evaluate to the first X documents (if an integer) per task or first X% of documents per task. Useful for debugging, especially on costly API models.
- `--model_args` **[comma-sep str | json str → dict]** : Controls parameters passed to the model constructor. Can be provided as:
- Comma-separated string: `pretrained=EleutherAI/pythia-160m,dtype=float32`
- JSON string: `'{"pretrained": "EleutherAI/pythia-160m", "dtype": "float32"}'`
- `--use_cache` : Should be a path where a sqlite db file can be written to. Takes a string of format `/path/to/sqlite_cache_` in order to create a cache db at `/path/to/sqlite_cache_rank{i}.db` for each process (0-NUM_GPUS). This allows results of prior runs to be cached, so that there is no need to re-run results in order to re-score or re-run a given (model, task) pair again.
For a full list of supported arguments, see the initialization of the `lm_eval.api.model.LM` subclass, e.g. [`HFLM`](https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/models/huggingface.py#L66)
- `--cache_requests` : Can be "true", "refresh", or "delete". "true" means that the cache should be used. "refresh" means that you wish to regenerate the cache, which you should run if you change your dataset configuration for a given task. "delete" will delete the cache. Cached files are stored under lm_eval/cache/.cache unless you specify a different path via the environment variable: `LM_HARNESS_CACHE_PATH`. e.g. `LM_HARNESS_CACHE_PATH=~/Documents/cache_for_lm_harness`.
- `--tasks` **[comma-sep str → list[str]]** : Determines which tasks or task groups are evaluated. Accepts a comma-separated list of task names or task group names. Must be solely comprised of valid tasks/groups. A list of supported tasks can be viewed with `lm-eval list tasks`.
- `--check_integrity` : If this flag is used, the library tests for each task selected are run to confirm task integrity.
#### Evaluation Settings
- `--write_out` : Used for diagnostic purposes to observe the format of task documents passed to a model. If this flag is used, then prints the prompt and gold target string for the first document of each task.
- `--num_fewshot` **[int]** : Sets the number of few-shot examples to place in context. Must be an integer.
- `--show_config` : If used, prints the full `lm_eval.api.task.TaskConfig` contents (non-default settings the task YAML file) for each task which was run, at the completion of an evaluation. Useful for when one is modifying a task's configuration YAML locally to transmit the exact configurations used for debugging or for reproducibility purposes.
- `--batch_size` **[int | "auto" | "auto:N", default: 1]** : Sets the batch size used for evaluation. Options:
- Integer: Fixed batch size (e.g., `8`)
- `"auto"`: Automatically select the largest batch size that fits in memory
- `"auto:N"`: Re-select maximum batch size N times during evaluation
- `--include_path` : Accepts a path to a folder. If passed, then all YAML files containing `lm-eval` compatible task configurations will be added to the task registry as available tasks. Used for when one is writing config files for their own task in a folder other than `lm_eval/tasks/`.
Auto mode is useful since `lm-eval` sorts documents in descending order of context length.
- `--system_instruction`: Specifies a system instruction string to prepend to the prompt.
- `--max_batch_size` **[int]** : Sets the maximum batch size to try when using `--batch_size auto`.
- `--apply_chat_template` : This flag specifies whether to apply a chat template to the prompt. It can be used in the following ways:
- `--apply_chat_template` : When used without an argument, applies the only available chat template to the prompt. For Hugging Face models, if no dedicated chat template exists, the default chat template will be applied.
- `--apply_chat_template template_name` : If the model has multiple chat templates, apply the specified template to the prompt.
- `--device` **[str]** : Sets which device to place the model onto. Examples: `"cuda"`, `"cuda:0"`, `"cpu"`, `"mps"`. Can be ignored if running multi-GPU or non-local model types.
For Hugging Face models, the default chat template can be found in the [`default_chat_template`](https://github.com/huggingface/transformers/blob/fc35907f95459d7a6c5281dfadd680b6f7b620e3/src/transformers/tokenization_utils_base.py#L1912) property of the Transformers Tokenizer.
- `--gen_kwargs` **[comma-sep str | json str → dict]** : Generation arguments for `generate_until` tasks. Same format as `--model_args`:
- Comma-separated: `temperature=0.8,top_p=0.95`
- JSON: `'{"temperature": 0.8, "top_p": 0.95}'`
- `--fewshot_as_multiturn` : If this flag is on, the Fewshot examples are treated as a multi-turn conversation. Questions are provided as user content and answers are provided as assistant responses. Requires `--num_fewshot` to be set to be greater than 0, and `--apply_chat_template` to be on.
See model documentation (e.g., `transformers.AutoModelForCausalLM.generate()`) for supported arguments. Applied to all generation tasks - use task YAML files for per-task control.
- `--predict_only`: Generates the model outputs without computing metrics. Use with `--log_samples` to retrieve decoded results.
#### Data and Output
- `--seed`: Set seed for python's random, numpy and torch. Accepts a comma-separated list of 3 values for python's random, numpy, and torch seeds, respectively, or a single integer to set the same seed for all three. The values are either an integer or 'None' to not set the seed. Default is `0,1234,1234` (for backward compatibility). E.g. `--seed 0,None,8` sets `random.seed(0)` and `torch.manual_seed(8)`. Here numpy's seed is not set since the second value is `None`. E.g, `--seed 42` sets all three seeds to 42.
- `--output_path` **[path: str]** : Output location for results. Format options:
- Directory: `results/` - saves as `results/<model_name>_<timestamp>.json`
- File: `results/output.jsonl` - saves to specific file
- `--wandb_args`: Tracks logging to Weights and Biases for evaluation runs and includes args passed to `wandb.init`, such as `project` and `job_type`. Full list [here](https://docs.wandb.ai/ref/python/init). e.g., ```--wandb_args project=test-project,name=test-run```. Also allows for the passing of the step to log things at (passed to `wandb.run.log`), e.g., `--wandb_args step=123`.
When used with `--log_samples`, per-document outputs are saved in the directory.
- `--hf_hub_log_args` : Logs evaluation results to Hugging Face Hub. Accepts a string with the arguments separated by commas. Available arguments:
- `hub_results_org` - organization name on Hugging Face Hub, e.g., `EleutherAI`. If not provided, the results will be pushed to the owner of the Hugging Face token,
- `hub_repo_name` - repository name on Hugging Face Hub (deprecated, `details_repo_name` and `results_repo_name` should be used instead), e.g., `lm-eval-results`,
- `details_repo_name` - repository name on Hugging Face Hub to store details, e.g., `lm-eval-results`,
- `results_repo_name` - repository name on Hugging Face Hub to store results, e.g., `lm-eval-results`,
- `push_results_to_hub` - whether to push results to Hugging Face Hub, can be `True` or `False`,
- `push_samples_to_hub` - whether to push samples results to Hugging Face Hub, can be `True` or `False`. Requires `--log_samples` to be set,
- `public_repo` - whether the repository is public, can be `True` or `False`,
- `leaderboard_url` - URL to the leaderboard, e.g., `https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard`.
- `point_of_contact` - Point of contact for the results dataset, e.g., `yourname@example.com`.
- `gated` - whether to gate the details dataset, can be `True` or `False`.
- `--log_samples` **[flag, default: False]** : Save model outputs and inputs at per-document granularity. Requires `--output_path`. Automatically enabled when using `--predict_only`.
- `--metadata`: JSON string to pass to TaskConfig. Used for some tasks which require additional metadata to be passed for processing. E.g., `--metadata '{"key": "value"}'`.
- `--limit` **[int | float]** : Limit evaluation examples per task. **WARNING: Only for testing!**
- Integer: First N documents (e.g., `100`)
- Float (0.0-1.0): Percentage of documents (e.g., `0.1` for 10%)
- `--samples` **[path | json str | dict → dict]** : Evaluate specific sample indices only. Input formats:
- JSON file path: `samples.json`
- JSON string: `'{"hellaswag": [0, 1, 2], "arc_easy": [10, 20]}'`
- Dictionary (programmatic use)
Format: `{"task_name": [indices], ...}`. Incompatible with `--limit`.
#### Caching and Performance
- `--use_cache` **[path: str]** : SQLite cache database path prefix. Creates per-process cache files:
- Single GPU: `/path/to/cache.db`
- Multi-GPU: `/path/to/cache_rank0.db`, `/path/to/cache_rank1.db`, etc.
Caches model outputs to avoid re-running the same (model, task) evaluations.
- `--cache_requests` **["true" | "refresh" | "delete"]** : Dataset request caching control:
- `"true"`: Use existing cache
- `"refresh"`: Regenerate cache (use after changing task configs)
- `"delete"`: Delete cache
Cache location: `lm_eval/cache/.cache` or `$LM_HARNESS_CACHE_PATH` if set.
- `--check_integrity` **[flag, default: False]** : Run task integrity tests to validate configurations.
#### Instruct Formatting
- `--system_instruction` **[str]** : Custom system instruction to prepend to prompts. Used with instruction-following models.
- `--apply_chat_template` **[bool | str, default: False]** : Apply chat template formatting. Usage:
- No argument: Apply default/only available template
- Template name: Apply specific template (e.g., `"chatml"`)
For HuggingFace models, uses the tokenizer's chat template. Default template defined in [`transformers` documentation](https://github.com/huggingface/transformers/blob/fc35907f95459d7a6c5281dfadd680b6f7b620e3/src/transformers/tokenization_utils_base.py#L1912).
- `--fewshot_as_multiturn` **[flag, default: False]** : Format few-shot examples as multi-turn conversation:
- Questions → User messages
- Answers → Assistant responses
Requires: `--num_fewshot > 0` and `--apply_chat_template` enabled.
#### Task Management
- `--include_path` **[path: str]** : Directory containing custom task YAML files. All `.yaml` files in this directory will be registered as available tasks. Use for custom tasks outside of `lm_eval/tasks/`.
#### Logging and Tracking
- `--verbosity` **[str]** : **DEPRECATED** - Use `LOGLEVEL` environment variable instead.
- `--write_out` **[flag, default: False]** : Print first document's prompt and target for each task. Useful for debugging prompt formatting.
- `--show_config` **[flag, default: False]** : Display full task configurations after evaluation. Shows all non-default settings from task YAML files.
- `--wandb_args` **[comma-sep str → dict]** : Weights & Biases integration. Arguments for `wandb.init()`:
- Example: `project=my-project,name=run-1,tags=test`
- Special: `step=123` sets logging step
- See [W&B docs](https://docs.wandb.ai/ref/python/init) for all options
- `--wandb_config_args` **[comma-sep str → dict]** : Additional W&B config arguments, same format as `--wandb_args`.
- `--hf_hub_log_args` **[comma-sep str → dict]** : Hugging Face Hub logging configuration. Format: `key1=value1,key2=value2`. Options:
- `hub_results_org`: Organization name (default: token owner)
- `details_repo_name`: Repository for detailed results
- `results_repo_name`: Repository for aggregated results
- `push_results_to_hub`: Enable pushing (`True`/`False`)
- `push_samples_to_hub`: Push samples (`True`/`False`, requires `--log_samples`)
- `public_repo`: Make repo public (`True`/`False`)
- `leaderboard_url`: Associated leaderboard URL
- `point_of_contact`: Contact email
- `gated`: Gate the dataset (`True`/`False`)
- ~~`hub_repo_name`~~: Deprecated, use `details_repo_name` and `results_repo_name`
#### Advanced Options
- `--predict_only` **[flag, default: False]** : Generate outputs without computing metrics. Automatically enables `--log_samples`. Use to get raw model outputs.
- `--seed` **[int | comma-sep str → list[int], default: [0,1234,1234,1234]]** : Set random seeds for reproducibility:
- Single integer: Same seed for all (e.g., `42`)
- Four values: `python,numpy,torch,fewshot` seeds (e.g., `0,1234,8,52`)
- Use `None` to skip setting a seed (e.g., `0,None,8,52`)
Default preserves backward compatibility.
- `--trust_remote_code` **[flag, default: False]** : Allow executing remote code from Hugging Face Hub. **Security Risk**: Required for some models with custom code.
- `--confirm_run_unsafe_code` **[flag, default: False]** : Acknowledge risks when running tasks that execute arbitrary Python code (e.g., code generation tasks).
- `--metadata` **[json str → dict]** : Additional metadata for specific tasks. Format: `'{"key": "value"}'`. Required by tasks like RULER that need extra configuration.
## External Library Usage
......
import logging
import os
from .api import metrics, model, registry # initializes the registries
from .filters import *
__version__ = "0.4.9"
__version__ = "0.4.9.1"
# Lazy-load .evaluator module to improve CLI startup
......
This diff is collapsed.
"""
CLI subcommands to run from terminal.
"""
import argparse
import sys
import textwrap
from lm_eval._cli.ls import List
from lm_eval._cli.run import Run
from lm_eval._cli.validate import Validate
class HarnessCLI:
"""Main CLI parser that manages all subcommands."""
def __init__(self):
self._parser = argparse.ArgumentParser(
prog="lm-eval",
description="Language Model Evaluation Harness",
epilog=textwrap.dedent("""
quick start:
# Basic evaluation
lm-eval run --model hf --model_args pretrained=gpt2 --tasks hellaswag
# List available tasks
lm-eval ls tasks
# Validate task configurations
lm-eval validate --tasks hellaswag,arc_easy
legacy compatibility:
The harness maintains backward compatibility with the original interface.
If no command is specified, 'run' is automatically inserted:
lm-eval --model hf --tasks hellaswag # Equivalent to 'lm-eval run --model hf --tasks hellaswag'
For documentation, visit: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md
"""),
formatter_class=argparse.RawDescriptionHelpFormatter,
)
self._parser.set_defaults(func=lambda args: self._parser.print_help())
self._subparsers = self._parser.add_subparsers(
dest="command", help="Available commands", metavar="COMMAND"
)
Run.create(self._subparsers)
List.create(self._subparsers)
Validate.create(self._subparsers)
def parse_args(self) -> argparse.Namespace:
"""Parse arguments using the main parser."""
if len(sys.argv) > 2 and sys.argv[1] not in self._subparsers.choices:
# Backward compatibility: arguments provided but no valid subcommand - insert 'run'
# TODO: add warning
sys.argv.insert(1, "run")
elif len(sys.argv) == 2 and "run" in sys.argv:
# if only 'run' is specified, ensure it is treated as a subcommand
self._subparsers.choices["run"].print_help()
sys.exit(0)
return self._parser.parse_args()
def execute(self, args: argparse.Namespace) -> None:
"""Main execution method that handles subcommands and legacy support."""
args.func(args)
import argparse
import textwrap
from lm_eval._cli.subcommand import SubCommand
class List(SubCommand):
"""Command for listing available tasks."""
def __init__(self, subparsers: argparse._SubParsersAction, *args, **kwargs):
# Create and configure the parser
super().__init__(*args, **kwargs)
self._parser = subparsers.add_parser(
"ls",
help="List available tasks, groups, subtasks, or tags",
description="List available tasks, groups, subtasks, or tags from the evaluation harness.",
usage="lm-eval list [tasks|groups|subtasks|tags] [--include_path DIR]",
epilog=textwrap.dedent("""
examples:
# List all available tasks (includes groups, subtasks, and tags)
$ lm-eval ls tasks
# List only task groups (like 'mmlu', 'glue', 'superglue')
$ lm-eval ls groups
# List only individual subtasks (like 'mmlu_abstract_algebra')
$ lm-eval ls subtasks
# Include external task definitions
$ lm-eval ls tasks --include_path /path/to/external/tasks
# List tasks from multiple external paths
$ lm-eval ls tasks --include_path "/path/to/tasks1:/path/to/tasks2"
organization:
• Groups: Collections of tasks with aggregated metric across subtasks (e.g., 'mmlu')
• Subtasks: Individual evaluation tasks (e.g., 'mmlu_anatomy', 'hellaswag')
• Tags: Similar to groups but no aggregate metric (e.g., 'reasoning', 'knowledge', 'language')
• External Tasks: Custom tasks defined in external directories
evaluation usage:
After listing tasks, use them with the run command!
For more information tasks configs are defined in https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks
"""),
formatter_class=argparse.RawDescriptionHelpFormatter,
)
self._add_args()
self._parser.set_defaults(func=self._execute)
def _add_args(self) -> None:
self._parser.add_argument(
"what",
choices=["tasks", "groups", "subtasks", "tags"],
nargs="?",
help="What to list: tasks (all), groups, subtasks, or tags",
)
self._parser.add_argument(
"--include_path",
type=str,
default=None,
metavar="DIR",
help="Additional path to include if there are external tasks.",
)
def _execute(self, args: argparse.Namespace) -> None:
"""Execute the list command."""
from lm_eval.tasks import TaskManager
task_manager = TaskManager(include_path=args.include_path)
if args.what == "tasks":
print(task_manager.list_all_tasks())
elif args.what == "groups":
print(task_manager.list_all_tasks(list_subtasks=False, list_tags=False))
elif args.what == "subtasks":
print(task_manager.list_all_tasks(list_groups=False, list_tags=False))
elif args.what == "tags":
print(task_manager.list_all_tasks(list_groups=False, list_subtasks=False))
elif args.what is None:
self._parser.print_help()
import argparse
import json
import logging
import os
import textwrap
from functools import partial
from lm_eval._cli.subcommand import SubCommand
from lm_eval._cli.utils import (
_int_or_none_list_arg_type,
key_val_to_dict,
merge_dicts,
request_caching_arg_to_dict,
try_parse_json,
)
class Run(SubCommand):
"""Command for running language model evaluation."""
def __init__(self, subparsers: argparse._SubParsersAction, *args, **kwargs):
super().__init__(*args, **kwargs)
self._parser = subparsers.add_parser(
"run",
help="Run the evaluation harness on specified tasks",
description="Evaluate language models on various benchmarks and tasks.",
usage="lm-eval run --model <model> --tasks <task> <task> --model_args <arg=value> <arg=value> [options]",
epilog=textwrap.dedent("""
examples:
# Basic evaluation with HuggingFace model
$ lm-eval run --model hf --model_args pretrained=gpt2 dtype=float32 --tasks hellaswag
# Evaluate on multiple tasks with few-shot examples
$ lm-eval run --model vllm --model_args pretrained=EleutherAI/gpt-j-6B --tasks arc_easy arc_challenge --num_fewshot 5
# Evaluation with custom generation parameters
$ lm-eval run --model hf --model_args pretrained=gpt2 --tasks lambada --gen_kwargs temperature=0.8 top_p=0.95 'stop=["\\n\\n"]'
# Use configuration file
$ lm-eval run --config my_config.yaml --tasks mmlu
For more information, see: https://github.com/EleutherAI/lm-evaluation-harness
"""),
formatter_class=argparse.RawDescriptionHelpFormatter,
)
self._add_args()
self._parser.set_defaults(func=self._execute)
def _add_args(self) -> None:
self._parser = self._parser
# Defaults are set in config/evaluate_config.py
config_group = self._parser.add_argument_group("configuration")
config_group.add_argument(
"--config",
"-C",
default=None,
type=str,
metavar="YAML_PATH",
help="Set initial arguments from YAML config",
)
# Model and Tasks
model_group = self._parser.add_argument_group("model and tasks")
model_group.add_argument(
"--model",
"-m",
type=str,
default=None,
metavar="MODEL_NAME",
help="Model name (default: hf)",
)
model_group.add_argument(
"--tasks",
"-t",
default=None,
type=str,
nargs="*",
metavar="TASK1 TASK2",
help=textwrap.dedent("""
Space or Comma-separated list of task names or groupings.
Use 'lm-eval list tasks' to see all available tasks.
""").strip(),
)
model_group.add_argument(
"--model_args",
"-a",
default=None,
nargs="*",
type=key_val_to_dict,
metavar="ARGS",
help="Model arguments as 'key=val,key2=val2' or `key=val` `key2=val2`",
)
# Evaluation Settings
eval_group = self._parser.add_argument_group("evaluation settings")
eval_group.add_argument(
"--num_fewshot",
"-f",
type=int,
default=None,
metavar="N",
help="Number of examples in few-shot context",
)
eval_group.add_argument(
"--batch_size",
"-b",
type=str,
default=argparse.SUPPRESS,
metavar="auto|auto:N|N",
help=textwrap.dedent(
"Batch size: 'auto', 'auto:N' (auto-tune N times), or integer (default: 1)"
),
)
eval_group.add_argument(
"--max_batch_size",
type=int,
default=None,
metavar="N",
help="Maximum batch size when using --batch_size auto",
)
eval_group.add_argument(
"--device",
type=str,
default=None,
metavar="DEVICE",
help="Device to use (e.g. cuda, cuda:0, cpu, mps)",
)
eval_group.add_argument(
"--gen_kwargs",
type=key_val_to_dict,
default=None,
nargs="*",
metavar="KWARGS",
help=textwrap.dedent(
'Generation arguments as `temperature=0,stop=["stop"]` or `key=val` `key2=val2`.'
"Values should be parsable with ast.literal_eval."
),
)
# Data and Output
data_group = self._parser.add_argument_group("data and output")
data_group.add_argument(
"--output_path",
"-o",
default=None,
type=str,
metavar="OUTPUT_PATH",
help="Output dir or json file for results (and samples)",
)
data_group.add_argument(
"--log_samples",
"-s",
action="store_true",
default=argparse.SUPPRESS,
help="Save all model outputs and documents for post-hoc analysis",
)
data_group.add_argument(
"--limit",
"-L",
type=float,
default=None,
metavar="N|0.0-1.0",
help="Limit examples per task (integer count or fraction)",
)
data_group.add_argument(
"--samples",
"-E",
default=None,
type=try_parse_json,
metavar='"task1": [1,2,3,4,...]"',
help=textwrap.dedent(
"`...` `...` Sample indices for inputs. Incompatible with --limit."
" Values be parsable with ast.literal_eval."
),
)
# Caching and Performance
cache_group = self._parser.add_argument_group("caching and performance")
cache_group.add_argument(
"--use_cache",
"-c",
type=str,
default=None,
metavar="CACHE_DIR",
help="SQLite database path for caching model outputs.",
)
cache_group.add_argument(
"--cache_requests",
type=request_caching_arg_to_dict,
default=None,
choices=["true", "refresh", "delete"],
help="Cache dataset request building (true|refresh|delete)",
)
cache_group.add_argument(
"--check_integrity",
action="store_true",
default=argparse.SUPPRESS,
help="Run task test suite validation",
)
# Prompt Formatting
template_group = self._parser.add_argument_group("instruct formatting")
template_group.add_argument(
"--system_instruction",
type=str,
default=None,
metavar="INSTRUCTION",
help="Add custom system instruction.",
)
template_group.add_argument(
"--apply_chat_template",
type=str,
nargs="?",
const=True,
default=argparse.SUPPRESS,
metavar="TEMPLATE",
help="Apply chat template to prompts (optional template name)",
)
template_group.add_argument(
"--fewshot_as_multiturn",
action="store_true",
default=argparse.SUPPRESS,
help="Use fewshot examples as multi-turn conversation",
)
# Task Management
task_group = self._parser.add_argument_group("task management")
task_group.add_argument(
"--include_path",
type=str,
default=None,
metavar="TASK_DIR",
help="Additional directory for external tasks",
)
# Logging and Tracking
logging_group = self._parser.add_argument_group("logging and tracking")
logging_group.add_argument(
"--verbosity",
"-v",
type=str.upper,
default=None,
metavar="LEVEL",
help="(Deprecated) Log level. Use LOGLEVEL env var instead",
)
logging_group.add_argument(
"--write_out",
"-w",
action="store_true",
default=argparse.SUPPRESS,
help="Print prompts for first few documents",
)
logging_group.add_argument(
"--show_config",
action="store_true",
default=argparse.SUPPRESS,
help="Display full task configuration after evaluation",
)
logging_group.add_argument(
"--wandb_args",
type=key_val_to_dict,
default=argparse.SUPPRESS,
metavar="ARGS",
help="Weights & Biases init arguments key=val key2=val2",
)
logging_group.add_argument(
"--wandb_config_args",
type=key_val_to_dict,
default=argparse.SUPPRESS,
metavar="ARGS",
help="Weights & Biases config arguments key=val key2=val2",
)
logging_group.add_argument(
"--hf_hub_log_args",
type=key_val_to_dict,
default=argparse.SUPPRESS,
metavar="ARGS",
help="Hugging Face Hub logging arguments key=val key2=val2",
)
# Advanced Options
advanced_group = self._parser.add_argument_group("advanced options")
advanced_group.add_argument(
"--predict_only",
"-x",
action="store_true",
default=argparse.SUPPRESS,
help="Save predictions only, skip metric computation",
)
default_seed_string = "0,1234,1234,1234"
advanced_group.add_argument(
"--seed",
type=partial(_int_or_none_list_arg_type, 3, 4, default_seed_string),
default=None,
metavar="SEED|S1,S2,S3,S4",
help=textwrap.dedent(f"""
Random seeds for python,numpy,torch,fewshot (default: {default_seed_string}).
Use single integer for all, or comma-separated list of 4 values.
Use 'None' to skip setting a seed. Example: --seed 42 or --seed 0,None,8,52
""").strip(),
)
advanced_group.add_argument(
"--trust_remote_code",
action="store_true",
default=argparse.SUPPRESS,
help="Allow executing remote code from Hugging Face Hub",
)
advanced_group.add_argument(
"--confirm_run_unsafe_code",
action="store_true",
default=argparse.SUPPRESS,
help="Confirm understanding of unsafe code execution risks",
)
advanced_group.add_argument(
"--metadata",
type=json.loads,
default=None,
metavar="`key=val` `key2=val2`",
help=textwrap.dedent(
"""`key=val` `key2=val` args parsable by ast.literal_eval (merged with model_args),
required for some tasks such as RULER"""
),
)
@staticmethod
def _execute(args: argparse.Namespace) -> None:
"""Runs the evaluation harness with the provided arguments."""
os.environ["TOKENIZERS_PARALLELISM"] = "false"
MERGE_ARGS_DICTS = [
"model_args",
"gen_kwargs",
"wandb_args",
"wandb_config_args",
"hf_hub_log_args",
]
for arg_name in MERGE_ARGS_DICTS:
if current_value := getattr(args, arg_name, None):
setattr(args, arg_name, merge_dicts(*current_value))
from lm_eval.config.evaluate_config import EvaluatorConfig
eval_logger = logging.getLogger(__name__)
# Create and validate config (most validation now occurs in EvaluationConfig)
cfg = EvaluatorConfig.from_cli(args)
from lm_eval import simple_evaluate
from lm_eval.loggers import EvaluationTracker, WandbLogger
from lm_eval.utils import handle_non_serializable, make_table
# Set up logging
if cfg.wandb_args:
wandb_logger = WandbLogger(cfg.wandb_args, cfg.wandb_config_args)
# Set up evaluation tracker
if cfg.output_path:
cfg.hf_hub_log_args["output_path"] = cfg.output_path
if os.environ.get("HF_TOKEN", None):
cfg.hf_hub_log_args["token"] = os.environ.get("HF_TOKEN")
evaluation_tracker = EvaluationTracker(**cfg.hf_hub_log_args)
# Create task manager (metadata already set up in config validation)
task_manager = cfg.process_tasks(cfg.metadata)
# Validation warnings (keep these in CLI as they're logging-specific)
if "push_samples_to_hub" in cfg.hf_hub_log_args and not cfg.log_samples:
eval_logger.warning(
"Pushing samples to the Hub requires --log_samples to be set."
)
# Log task selection (tasks already processed in config)
if cfg.include_path is not None:
eval_logger.info(f"Including path: {cfg.include_path}")
eval_logger.info(f"Selected Tasks: {cfg.tasks}")
# Run evaluation
results = simple_evaluate(
model=cfg.model,
model_args=cfg.model_args,
tasks=cfg.tasks,
num_fewshot=cfg.num_fewshot,
batch_size=cfg.batch_size,
max_batch_size=cfg.max_batch_size,
device=cfg.device,
use_cache=cfg.use_cache,
cache_requests=cfg.cache_requests.get("cache_requests", False),
rewrite_requests_cache=cfg.cache_requests.get(
"rewrite_requests_cache", False
),
delete_requests_cache=cfg.cache_requests.get(
"delete_requests_cache", False
),
limit=cfg.limit,
samples=cfg.samples,
check_integrity=cfg.check_integrity,
write_out=cfg.write_out,
log_samples=cfg.log_samples,
evaluation_tracker=evaluation_tracker,
system_instruction=cfg.system_instruction,
apply_chat_template=cfg.apply_chat_template,
fewshot_as_multiturn=cfg.fewshot_as_multiturn,
gen_kwargs=cfg.gen_kwargs,
task_manager=task_manager,
verbosity=cfg.verbosity,
predict_only=cfg.predict_only,
random_seed=cfg.seed[0] if cfg.seed else None,
numpy_random_seed=cfg.seed[1] if cfg.seed else None,
torch_random_seed=cfg.seed[2] if cfg.seed else None,
fewshot_random_seed=cfg.seed[3] if cfg.seed else None,
confirm_run_unsafe_code=cfg.confirm_run_unsafe_code,
metadata=cfg.metadata,
)
# Process results
if results is not None:
if cfg.log_samples:
samples = results.pop("samples")
dumped = json.dumps(
results, indent=2, default=handle_non_serializable, ensure_ascii=False
)
if cfg.show_config:
print(dumped)
batch_sizes = ",".join(map(str, results["config"]["batch_sizes"]))
# W&B logging
if cfg.wandb_args:
try:
wandb_logger.post_init(results)
wandb_logger.log_eval_result()
if cfg.log_samples:
wandb_logger.log_eval_samples(samples)
except Exception as e:
eval_logger.info(f"Logging to W&B failed: {e}")
# Save results
evaluation_tracker.save_results_aggregated(
results=results, samples=samples if cfg.log_samples else None
)
if cfg.log_samples:
for task_name, _ in results["configs"].items():
evaluation_tracker.save_results_samples(
task_name=task_name, samples=samples[task_name]
)
if (
evaluation_tracker.push_results_to_hub
or evaluation_tracker.push_samples_to_hub
):
evaluation_tracker.recreate_metadata_card()
# Print results
cfg.model_args.pop("trust_remote_code", None)
print(
f"{cfg.model} ({cfg.model_args}), gen_kwargs: ({cfg.gen_kwargs}), "
f"limit: {cfg.limit}, num_fewshot: {cfg.num_fewshot}, "
f"batch_size: {cfg.batch_size}{f' ({batch_sizes})' if batch_sizes else ''}"
)
print(make_table(results))
if "groups" in results:
print(make_table(results, "groups"))
if cfg.wandb_args:
wandb_logger.run.finish()
import argparse
from abc import ABC, abstractmethod
class SubCommand(ABC):
"""Base class for all subcommands."""
def __init__(self, *args, **kwargs):
pass
@classmethod
def create(cls, subparsers: argparse._SubParsersAction):
"""Factory method to create and register a command instance."""
return cls(subparsers)
@abstractmethod
def _add_args(self) -> None:
"""Add arguments specific to this subcommand."""
pass
import argparse
import ast
import json
import logging
from typing import Any, Optional, Union
def try_parse_json(value: Union[str, dict, None]) -> Union[str, dict, None]:
"""Try to parse a string as JSON. If it fails, return the original string."""
if value is None:
return None
if isinstance(value, dict):
return value
try:
return json.loads(value)
except json.JSONDecodeError:
if "{" in value:
raise ValueError(
f"Invalid JSON: {value}. Hint: Use double quotes for JSON strings."
)
return value
def _int_or_none_list_arg_type(
min_len: int, max_len: int, defaults: str, value: str, split_char: str = ","
) -> list[Union[int, None]]:
"""Parses a string of integers or 'None' values separated by a specified character into a list.
Validates the number of items against specified minimum and maximum lengths and fills missing values with defaults."""
def parse_value(item):
"""Parses an individual item, converting it to an integer or `None`."""
item = item.strip().lower()
if item == "none":
return None
try:
return int(item)
except ValueError:
raise ValueError(f"{item} is not an integer or None")
items = [parse_value(v) for v in value.split(split_char)]
num_items = len(items)
if num_items == 1:
items = items * max_len
elif num_items < min_len or num_items > max_len:
raise ValueError(
f"Argument requires {max_len} integers or None, separated by '{split_char}'"
)
elif num_items != max_len:
logging.warning(
f"Argument requires {max_len} integers or None, separated by '{split_char}'. "
"Missing values will be filled with defaults."
)
default_items = [parse_value(v) for v in defaults.split(split_char)]
items.extend(default_items[num_items:])
return items
def request_caching_arg_to_dict(cache_requests: Optional[str]) -> dict[str, bool]:
"""Convert a request caching argument to a dictionary."""
if cache_requests is None:
return {}
request_caching_args = {
"cache_requests": cache_requests in {"true", "refresh"},
"rewrite_requests_cache": cache_requests == "refresh",
"delete_requests_cache": cache_requests == "delete",
}
return request_caching_args
def check_argument_types(parser: argparse.ArgumentParser) -> None:
"""
Check to make sure all CLI args are typed, raises error if not
"""
for action in parser._actions:
# Skip help, subcommands, and const actions
if action.dest in ["help", "command"] or action.const is not None:
continue
if action.type is None:
raise ValueError(f"Argument '{action.dest}' doesn't have a type specified.")
else:
continue
def handle_cli_value_string(arg: str) -> Any:
if arg.lower() == "true":
return True
elif arg.lower() == "false":
return False
elif arg.isnumeric():
return int(arg)
try:
return float(arg)
except ValueError:
try:
return ast.literal_eval(arg)
except (ValueError, SyntaxError):
return arg
def key_val_to_dict(args: str) -> dict:
"""Parse model arguments from a string into a dictionary."""
return (
{
k: handle_cli_value_string(v)
for k, v in (item.split("=") for item in args.split(","))
}
if args
else {}
)
def merge_dicts(*dicts):
return {k: v for d in dicts for k, v in d.items()}
import argparse
import sys
import textwrap
from lm_eval._cli.subcommand import SubCommand
class Validate(SubCommand):
"""Command for validating tasks."""
def __init__(self, subparsers: argparse._SubParsersAction, *args, **kwargs):
# Create and configure the self._parser
super().__init__(*args, **kwargs)
self._parser = subparsers.add_parser(
"validate",
help="Validate task configurations",
description="Validate task configurations and check for errors.",
usage="lm-eval validate --tasks <task1,task2> [--include_path DIR]",
epilog=textwrap.dedent("""
examples:
# Validate a single task
lm-eval validate --tasks hellaswag
# Validate multiple tasks
lm-eval validate --tasks arc_easy,arc_challenge,hellaswag
# Validate a task group
lm-eval validate --tasks mmlu
# Validate tasks with external definitions
lm-eval validate --tasks my_custom_task --include_path ./custom_tasks
# Validate tasks from multiple external paths
lm-eval validate --tasks custom_task1,custom_task2 --include_path "/path/to/tasks1:/path/to/tasks2"
validation check:
The validate command performs several checks:
• Task existence: Verifies all specified tasks are available
• Configuration syntax: Checks YAML/JSON configuration files
• Dataset access: Validates dataset paths and configurations
• Required fields: Ensures all mandatory task parameters are present
• Metric definitions: Verifies metric functions and aggregation methods
• Filter pipelines: Validates filter chains and their parameters
• Template rendering: Tests prompt templates with sample data
task config files:
Tasks are defined using YAML configuration files with these key sections:
• task: Task name and metadata
• dataset_path: HuggingFace dataset identifier
• doc_to_text: Template for converting documents to prompts
• doc_to_target: Template for extracting target answers
• metric_list: List of evaluation metrics to compute
• output_type: Type of model output (loglikelihood, generate_until, etc.)
• filter_list: Post-processing filters for model outputs
common errors:
• Missing required fields in YAML configuration
• Invalid dataset paths or missing dataset splits
• Malformed Jinja2 templates in doc_to_text/doc_to_target
• Undefined metrics or aggregation functions
• Invalid filter names or parameters
• Circular dependencies in task inheritance
• Missing external task files when using --include_path
debugging tips:
• Use --include_path to test external task definitions
• Check task configuration files for syntax errors
• Verify dataset access and authentication if needed
• Use 'lm-eval list tasks' to see available tasks
For task configuration guide, see: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md
"""),
formatter_class=argparse.RawDescriptionHelpFormatter,
)
self._add_args()
self._parser.set_defaults(func=self._execute)
def _add_args(self) -> None:
self._parser.add_argument(
"--tasks",
"-t",
required=True,
type=str,
metavar="TASK1,TASK2",
help="Comma-separated list of task names to validate",
)
self._parser.add_argument(
"--include_path",
type=str,
default=None,
metavar="DIR",
help="Additional path to include if there are external tasks.",
)
def _execute(self, args: argparse.Namespace) -> None:
"""Execute the validate command."""
from lm_eval.tasks import TaskManager
task_manager = TaskManager(include_path=args.include_path)
task_list = args.tasks.split(",")
print(f"Validating tasks: {task_list}")
# For now, just validate that tasks exist
task_names = task_manager.match_tasks(task_list)
task_missing = [task for task in task_list if task not in task_names]
if task_missing:
missing = ", ".join(task_missing)
print(f"Tasks not found: {missing}")
sys.exit(1)
else:
print("All tasks found and valid")
from abc import ABC, abstractmethod
from collections.abc import Iterable
from dataclasses import dataclass
from typing import Callable, Iterable, List, Union
from typing import Protocol, runtime_checkable
from lm_eval.api.instance import Instance
class Filter(ABC):
@runtime_checkable
class Filter(Protocol):
"""
Filter classes operate on a per-task level.
They take all model outputs (`instance.resps` for all `task.instances`)
......@@ -19,8 +20,9 @@ class Filter(ABC):
Can define custom behavior here, if an individual instantiation of a Filter class should have state.
"""
@abstractmethod
def apply(self, resps: Union[List, Iterable], docs: List[dict]) -> Iterable:
def apply(
self, resps: Iterable[list[str]], docs: Iterable[dict]
) -> Iterable[list[str]]:
"""
Defines the operation to perform on a list of the `inst.resps` properties of `Instance` objects.
Should return the list of (filtered) response lists *in the same order as they were input*, e.g.
......@@ -40,9 +42,9 @@ class FilterEnsemble:
"""
name: str
filters: List[Callable[[], Filter]]
filters: list[type[Filter]]
def apply(self, instances: List[Instance]) -> None:
def apply(self, instances: list[Instance]) -> None:
resps, docs = zip(*((inst.resps, inst.doc) for inst in instances))
resps, docs = list(resps), list(docs)
......
from dataclasses import asdict, dataclass
from dataclasses import asdict, dataclass, field
from inspect import getsource
from typing import Callable, Optional, Union
from datasets.features.pdf import field
@dataclass
class AggMetricConfig(dict):
metric: Optional[str] = None
aggregation: Optional[str] = "mean"
weight_by_size: Optional[str] = False
weight_by_size: bool = False
# list of filter names which should be incorporated into the aggregated metric.
filter_list: Optional[Union[str, list]] = "none"
......@@ -31,6 +29,7 @@ class GroupConfig:
aggregate_metric_list: Optional[
Union[list[AggMetricConfig], AggMetricConfig, dict]
] = None
version: Optional[str] = None
metadata: Optional[dict] = (
None # by default, not used in the code. allows for users to pass arbitrary info to tasks
)
......@@ -68,6 +67,11 @@ class GroupConfig:
AggMetricConfig(**item) if isinstance(item, dict) else item
for item in self.aggregate_metric_list
]
self.version = (
self.version or self.metadata.get("version", "1.0")
if self.metadata
else "1.0"
)
def to_dict(self, keep_callable: bool = False) -> dict:
"""dumps the current config as a dictionary object, as a printable format.
......
......@@ -14,10 +14,23 @@ class Instance:
arguments: tuple
idx: int
metadata: Tuple[Optional[str], Optional[int], Optional[int]] = field(
default_factory=lambda: (None, None, None)
default_factory=lambda: (None, None, None),
metadata=dict(
description="Metadata tuple containing task name, document ID, and number of repeats."
),
)
resps: list = field(
default_factory=list,
metadata=dict(
description="List of responses from the model for this instance."
),
)
filtered_resps: dict = field(
default_factory=dict,
metadata=dict(
description="List of filtered responses for this instance, keyed by filter name."
),
)
resps: list = field(default_factory=list)
filtered_resps: dict = field(default_factory=dict)
# initialized after init
task_name: Optional[str] = None
......@@ -29,7 +42,7 @@ class Instance:
self.task_name, self.doc_id, self.repeats = self.metadata
@property
def args(self):
def args(self) -> tuple:
"""
Returns (string,) where `string` is the string to calculate loglikelihood over
"""
......
from __future__ import annotations
import logging
import math
import os
import random
import re
import string
from collections.abc import Iterable
from typing import Callable, List, Optional, Sequence, TypeVar
from collections.abc import Callable, Iterable, Sequence
from typing import Generic, TypeVar
import numpy as np
import sacrebleu
from lm_eval.api.registry import register_aggregation, register_metric
......@@ -25,36 +26,36 @@ def bypass_agg(arr):
@register_aggregation("nanmean")
def nanmean(arr):
def nanmean(arr: list[float]) -> float:
if len(arr) == 0 or all(np.isnan(arr)):
return np.nan
return np.nanmean(arr)
@register_aggregation("mean")
def mean(arr):
def mean(arr: Sequence[float]) -> float:
return sum(arr) / len(arr)
@register_aggregation("median")
def median(arr):
def median(arr: list[float]) -> float:
return arr[len(arr) // 2]
# Certain metrics must be calculated across all documents in a benchmark.
# We use them as aggregation metrics, paired with no-op passthrough metric fns.
@register_aggregation("perplexity")
def perplexity(items):
def perplexity(items: list[float]) -> float:
return math.exp(-mean(items))
@register_aggregation("weighted_perplexity")
def weighted_perplexity(items):
def weighted_perplexity(items: list[tuple[float, float]]) -> float:
return math.exp(-weighted_mean(items))
@register_aggregation("bits_per_byte")
def bits_per_byte(items):
def bits_per_byte(items: list[tuple[float, float]]) -> float:
return -weighted_mean(items) / math.log(2)
......@@ -71,7 +72,7 @@ def f1_score(items):
@register_aggregation("matthews_corrcoef")
def matthews_corrcoef(items):
def matthews_corrcoef(items: Iterable[tuple[int, int] | tuple[str, str]]) -> float:
from sklearn.metrics import matthews_corrcoef
unzipped_list = list(zip(*items))
......@@ -81,7 +82,7 @@ def matthews_corrcoef(items):
@register_aggregation("bleu")
def bleu(items):
def bleu(items: Iterable[tuple[str, str]]):
"""The Bilingual Evaluation Understudy Score, or BLEU for short, is a metric
for evaluating a generated sentence to a reference sentence. It counts matching
n-grams in the candidate translation to n-grams in the reference text, where
......@@ -92,6 +93,8 @@ def bleu(items):
Higher is better
"""
import sacrebleu
refs = list(zip(*items))[0]
preds = list(zip(*items))[1]
refs, preds = _sacreformat(refs, preds)
......@@ -107,6 +110,8 @@ def chrf(items):
Higher is better # TODO I think
"""
import sacrebleu
refs = list(zip(*items))[0]
preds = list(zip(*items))[1]
refs, preds = _sacreformat(refs, preds)
......@@ -114,7 +119,7 @@ def chrf(items):
@register_aggregation("ter")
def ter(items):
def ter(items: Iterable[tuple[str, str]]):
"""Translation Error Rate is an error metric for machine translation that
measures the number of edits required to change a system output into one
of the references
......@@ -123,6 +128,8 @@ def ter(items):
Lower is better
"""
import sacrebleu
refs = list(zip(*items))[0]
preds = list(zip(*items))[1]
refs, preds = _sacreformat(refs, preds)
......@@ -130,7 +137,9 @@ def ter(items):
@register_aggregation("brier_score")
def brier_score(items): # This is a passthrough function
def brier_score(
items: Iterable[tuple[str, float]],
): # This is a passthrough function
gold, predictions = list(zip(*items))
bs, num_class = np.array(predictions).shape
......@@ -198,13 +207,48 @@ def acc_mutual_info_fn(items): # This is a passthrough function
# See the License for the specific language governing permissions and
# limitations under the License.
def exact_match_hf_evaluate(
predictions,
references,
regexes_to_ignore=None,
ignore_case=False,
ignore_punctuation=False,
ignore_numbers=False,
predictions: Iterable[str] | str,
references: Iterable[str] | str,
regexes_to_ignore: list[str] | None = None,
ignore_case: bool = False,
ignore_punctuation: bool = False,
ignore_numbers: bool = False,
multi_target: bool = False,
):
"""
Compute exact match scores between predictions and references.
This function computes the exact match score by comparing predictions
and references. It supports optional preprocessing steps such as ignoring
case, punctuation, numbers, and specific regex patterns.
Note:
predictions and references can have different lengths.
numpy broadcasting rule applies
Args:
predictions (Iterable[str] | str): The predicted strings to evaluate.
references (Iterable[str] | str): The reference strings to compare against.
regexes_to_ignore (list[str], optional): A list of regex patterns to remove
from both predictions and references before comparison. Defaults to None.
ignore_case (bool, optional): If True, ignores case differences during comparison.
Defaults to False.
ignore_punctuation (bool, optional): If True, removes punctuation from strings
before comparison. Defaults to False.
ignore_numbers (bool, optional): If True, removes numeric characters from strings
before comparison. Defaults to False.
multi_target (bool, optional): If True, returns 1.0 if any prediction matches any
reference, otherwise 0.0. Defaults to False.
Returns:
dict: A dictionary containing the exact match score:
- "exact_match" (float): The mean exact match score or 1.0/0.0 if `multi_target` is True.
"""
predictions, references = list(predictions), list(references)
assert len(predictions) == len(references) if not multi_target else True, (
"predictions and references must have the same length unless `multi_target` is True"
)
if regexes_to_ignore is not None:
for s in regexes_to_ignore:
predictions = np.array([re.sub(s, "", x) for x in predictions])
......@@ -229,7 +273,11 @@ def exact_match_hf_evaluate(
score_list = predictions == references
return {"exact_match": np.mean(score_list)}
return {
"exact_match": np.mean(score_list)
if not multi_target
else float(np.any(score_list))
}
###
......@@ -241,8 +289,8 @@ def exact_match_hf_evaluate(
output_type="generate_until",
aggregation="mean",
)
def exact_match_fn(**kwargs):
return exact_match_hf_evaluate(**kwargs)
def exact_match_fn(references: list[str], predictions: list[str], **kwargs):
return exact_match_hf_evaluate(predictions, references, **kwargs)
@register_metric(
......@@ -261,7 +309,7 @@ def perplexity_fn(items): # This is a passthrough function
output_type="loglikelihood_rolling",
aggregation="weighted_perplexity",
)
def word_perplexity_fn(items): # This is a passthrough function
def word_perplexity_fn(items: T) -> T: # This is a passthrough function
return items
......@@ -271,7 +319,7 @@ def word_perplexity_fn(items): # This is a passthrough function
output_type="loglikelihood_rolling",
aggregation="weighted_perplexity",
)
def byte_perplexity_fn(items): # This is a passthrough function
def byte_perplexity_fn(items: T) -> T: # This is a passthrough function
return items
......@@ -281,7 +329,7 @@ def byte_perplexity_fn(items): # This is a passthrough function
output_type="loglikelihood_rolling",
aggregation="bits_per_byte",
)
def bits_per_byte_fn(items): # This is a passthrough function
def bits_per_byte_fn(items: T) -> T: # This is a passthrough function
return items
......@@ -290,7 +338,7 @@ def pop_stddev(arr):
return math.sqrt(sum([(x - mu) ** 2 for x in arr]) / len(arr))
def sample_stddev(arr: Sequence[T]) -> float:
def sample_stddev(arr: Sequence[float]) -> float:
mu = mean(arr)
return math.sqrt(sum([(x - mu) ** 2 for x in arr]) / (len(arr) - 1))
......@@ -411,7 +459,7 @@ def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
return max(scores_for_ground_truths)
def weighted_mean(items):
def weighted_mean(items: list[tuple[float, float]]) -> float:
a, b = zip(*items)
return sum(a) / sum(b)
......@@ -422,15 +470,15 @@ def is_non_str_iterable(obj):
def _sacreformat(refs, preds):
"""Format refs and preds for sacrebleu corpus calculation. It is very particular"""
# Sacrebleu expects (List[str], List[List[str])
# Sacrebleu expects (list[str], list[list[str])
# e.g. sacrebleu.corpus_bleu([pred_t], [[ref1_stream], [ref2_stream], ...])
# Note [ref1_stream] is the first reference for each pred.
# So lists are size N and (M, N) for N preds and M possible refs for each pred
# This is a different order of dimensions that I would expect
# We expect refs to be List[str] or List[List[str]], the outer list corresponding to preds
# Must become List[List[str]] with the inner list corresponding to preds
# We expect refs to be list[str] or list[list[str]], the outer list corresponding to preds
# Must become list[list[str]] with the inner list corresponding to preds
if not is_non_str_iterable(refs):
refs = list(refs)
if not is_non_str_iterable(refs[0]):
......@@ -438,7 +486,7 @@ def _sacreformat(refs, preds):
refs = list(zip(*refs))
# Note the number of refs in each ref list much match the number of preds
# We expect preds to be List[str] or List[List[str]]. Must become List[str]
# We expect preds to be list[str] or list[list[str]]. Must become list[str]
if not is_non_str_iterable(preds):
preds = list(preds)
if is_non_str_iterable(preds[0]):
......@@ -451,7 +499,7 @@ def _sacreformat(refs, preds):
# stderr stuff
class _bootstrap_internal:
class _bootstrap_internal(Generic[T]):
"""
Pool worker: `(i, xs)` → `n` bootstrap replicates
of `f(xs)`using a RNG seeded with `i`.
......@@ -534,7 +582,7 @@ def bootstrap_stderr(
def stderr_for_metric(
metric: Callable[[Sequence[T]], float], bootstrap_iters: int
) -> Optional[Callable[[Sequence[T]], float]]:
) -> Callable[[Sequence[T]], float] | None:
"""
Return a function that estimates the standard error of `metric(xs)`.
......@@ -564,10 +612,10 @@ def stderr_for_metric(
stderr = {mean: mean_stderr, acc_all: acc_all_stderr}
return stderr.get(metric, None)
return stderr.get(metric)
def pooled_sample_stderr(stderrs: List[float], sizes: List[int]):
def pooled_sample_stderr(stderrs: list[float], sizes: list[int]):
# Used to aggregate bootstrapped stderrs across subtasks in a group,
# when we are weighting by the size of each subtask.
#
......@@ -585,7 +633,7 @@ def pooled_sample_stderr(stderrs: List[float], sizes: List[int]):
return np.sqrt(pooled_sample_var / sum(sizes))
def combined_sample_stderr(stderrs: List[float], sizes: List[int], metrics=None):
def combined_sample_stderr(stderrs: list[float], sizes: list[int], metrics=None):
assert metrics is not None, (
"Need to pass a list of each subtask's metric for this stderr aggregation"
)
......@@ -617,7 +665,9 @@ def combined_sample_stderr(stderrs: List[float], sizes: List[int], metrics=None)
return np.sqrt(variance)
def aggregate_subtask_metrics(metrics, sizes, weight_by_size=True):
def aggregate_subtask_metrics(
metrics: list[float], sizes: list[float], weight_by_size: bool = True
):
# A helper function that is used to aggregate
# subtask scores cross-task.
# TODO: does not hold for non-mean aggregations
......@@ -626,4 +676,4 @@ def aggregate_subtask_metrics(metrics, sizes, weight_by_size=True):
assert len(metrics) == len(sizes)
return sum([metric * size for metric, size in zip(metrics, sizes)]) / sum(sizes)
return sum(metric * size for metric, size in zip(metrics, sizes)) / sum(sizes)
from __future__ import annotations
import abc
import hashlib
import json
import logging
import os
from typing import TYPE_CHECKING, Any, Iterable, Optional, Type, TypeVar, Union
from collections.abc import Iterable
from typing import TYPE_CHECKING, Any, TypeVar
from tqdm import tqdm
......@@ -24,17 +27,17 @@ T = TypeVar("T", bound="LM")
class LM(abc.ABC):
def __init__(self) -> None:
"""Defines the interface that should be implemented by all LM subclasses.
LMs are assumed to take text (strings) as input and yield strings as output
LMs are assumed to take text (strings) as input and yield strings or logprobabilities as output
(inputs/outputs should be tokenization-agnostic.)
"""
# set rank and world size to a single process, by default.
self._rank = 0
self._world_size = 1
self.cache_hook: "CacheHook" = CacheHook(None)
self.cache_hook: CacheHook = CacheHook(None)
@abc.abstractmethod
def loglikelihood(self, requests) -> list[tuple[float, bool]]:
def loglikelihood(self, requests: list[Instance]) -> list[tuple[float, bool]]:
"""Compute log-likelihood of generating a continuation from a context.
Downstream tasks should attempt to use loglikelihood instead of other
LM calls whenever possible.
......@@ -59,7 +62,7 @@ class LM(abc.ABC):
pass
@abc.abstractmethod
def loglikelihood_rolling(self, requests) -> list[float]:
def loglikelihood_rolling(self, requests: list[Instance]) -> list[float]:
"""Compute full log-likelihood of a string, with no truncation, for perplexity computation
- We will use the full max context length of the model.
- For inputs that exceed the max context length, we divide the tokenized string into chunks of up to
......@@ -67,7 +70,7 @@ class LM(abc.ABC):
- IMPORTANT: Each document's loglikelihood/perplexity is computed *separately*, unlike other implementations
which may simply concatenate multiple documents together.
- IMPORTANT: We maximize the amount of context for each prediction. Specifically, for inputs that we break into
multiple chunks, the last input will still a full-sized context.
multiple chunks, the last input will still have full-sized context.
Example:
Input tokens: [ 0 1 2 3 4 5 6 7 8 9 ]
Prefix: BOS/EOS
......@@ -101,7 +104,7 @@ class LM(abc.ABC):
# TODO: Add an optional max length
@abc.abstractmethod
def generate_until(self, requests) -> list[str]:
def generate_until(self, requests: list[Instance]) -> list[str]:
"""Generate greedily until a stopping sequence
:param requests: list[Instance]
......@@ -118,7 +121,7 @@ class LM(abc.ABC):
pass
def apply_chat_template(
self, chat_history: list[dict[str, str]], add_generation_prompt=True
self, chat_history: list[dict], add_generation_prompt=True
) -> str:
"""
Defines how to transform few-shot examples provided as chat history into a format that can be used as input to the LM.
......@@ -137,7 +140,7 @@ class LM(abc.ABC):
@classmethod
def create_from_arg_string(
cls: Type[T], arg_string: str, additional_config: Optional[dict] = None
cls: type[T], arg_string: str, additional_config: dict | None = None
) -> T:
"""
Creates an instance of the LM class using the given argument string and additional config.
......@@ -156,7 +159,7 @@ class LM(abc.ABC):
@classmethod
def create_from_arg_obj(
cls: Type[T], arg_dict: dict, additional_config: Optional[dict] = None
cls: type[T], arg_dict: dict, additional_config: dict | None = None
) -> T:
"""
Creates an instance of the LM class using the given arg_obj
......@@ -176,14 +179,16 @@ class LM(abc.ABC):
return cls(**arg_dict, **additional_config)
@property
def rank(self):
def rank(self) -> int:
"""Returns the rank of the current process in a distributed setting."""
# used in the case of parallelism. Hardcoded to
# ensure no errors arise using API models which do
# not support multi-device parallelism nor expect it.
return self._rank
@property
def world_size(self):
def world_size(self) -> int:
"""Returns the total number of processes in a distributed setting."""
# used in the case of parallelism. Hardcoded to
# ensure no errors arise using API models which do
# not support multi-device parallelism nor expect it.
......@@ -199,7 +204,7 @@ class LM(abc.ABC):
"To use this model with chat templates, please implement the 'tokenizer_name' property."
)
def chat_template(self, chat_template: Union[bool, str] = False) -> Optional[str]:
def chat_template(self, chat_template: bool | str = False) -> str | None:
"""Returns the chat template structure for user/assistant messages if a template is provided.
This method is intended to be overridden in a subclass to define a specific chat template format.
For models that do not support chat templates, this method returns None by default.
......@@ -207,7 +212,8 @@ class LM(abc.ABC):
return ""
def set_cache_hook(self, cache_hook: "CacheHook") -> None:
def set_cache_hook(self, cache_hook: CacheHook) -> None:
"""Sets the cache hook for the LM, which is used to cache responses from the LM."""
self.cache_hook = cache_hook
......@@ -218,14 +224,16 @@ def hash_args(attr: str, args: Iterable[Any]) -> str:
class CacheHook:
def __init__(self, cachinglm: Optional["CachingLM"]) -> None:
def __init__(self, cachinglm: CachingLM | None) -> None:
"""CacheHook is used to cache responses from the LM."""
if cachinglm is None:
self.dbdict: Optional["SqliteDict"] = None
self.dbdict: SqliteDict | None = None
return
self.dbdict = cachinglm.dbdict
def add_partial(self, attr: str, req: Iterable[Any], res: Any) -> None:
"""Adds a partial result to the cache."""
if self.dbdict is None:
return
hsh = hash_args(attr, req)
......@@ -258,7 +266,7 @@ class CachingLM:
eval_logger.debug(f"Passing through attribute '{attr}' to underlying LM")
return lm_attr
def _fn(requests: list["Instance"]) -> list["Instance"]:
def _fn(requests: list[Instance]) -> list[Instance]:
res = []
remaining_reqs = []
warned = False
......@@ -290,11 +298,8 @@ class CachingLM:
eval_logger.info(
f"Cached requests: {len(requests) - len(remaining_reqs)}, Requests remaining: {len(remaining_reqs)}"
)
if remaining_reqs:
# actually run the LM on the requests that do not have cached results
rem_res = getattr(self.lm, attr)(remaining_reqs)
else:
rem_res = []
rem_res = getattr(self.lm, attr)(remaining_reqs) if remaining_reqs else []
# stick the new ones back into the list and also cache any of the new ones
resptr = 0
......@@ -313,7 +318,7 @@ class CachingLM:
return _fn
def get_cache_hook(self) -> "CacheHook":
def get_cache_hook(self) -> CacheHook:
return CacheHook(self)
......@@ -327,12 +332,13 @@ class TemplateLM(LM):
@property
@abc.abstractmethod
def eot_token_id(self):
def eot_token_id(self) -> int:
"""Returns the token ID for the end-of-text token (e.g., EOS)."""
pass
@property
def prefix_token_id(self):
# it is used as prefix for loglikelihood
def prefix_token_id(self) -> int:
"""Returns the token ID for the prefix token (e.g., BOS or EOS)."""
return self.eot_token_id
@abc.abstractmethod
......@@ -344,13 +350,33 @@ class TemplateLM(LM):
@abc.abstractmethod
def _loglikelihood_tokens(
self, requests: list["Instance"], **kwargs
self, requests: list[tuple[tuple[str, str], list[int], list[int]]], **kwargs
) -> list[tuple[float, bool]]:
"""Called by loglikelihood to compute log likelihoods for a list of requests.
Args:
requests: list[tuple[tuple[str, str], list[int], list[int]]]
A list of tuples where each tuple contains:
- (context, continuation) as a tuple of strings
- context_enc: list of token IDs for the context
- continuation_enc: list of token IDs for the continuation
Returns:
list[tuple[float, bool]]
A list of tuples where each tuple contains:
- logprob: float, the (summed) log probability of the continuation given the context
- isgreedy: bool, whether the continuation would be generated by greedy sampling from the context
See LM.loglikelihood for more details.
"""
pass
def _encode_pair(
self, context: str, continuation: str
) -> tuple[list[int], list[int]]:
"""Encodes a pair of context and continuation strings into token IDs.
We encode using encode(context+continuation) and then split into context and continuation.
"""
import transformers
n_spaces = len(context) - len(context.rstrip())
......@@ -373,8 +399,12 @@ class TemplateLM(LM):
return context_enc, continuation_enc
def loglikelihood(
self, requests: list["Instance"], disable_tqdm: bool = False
self, requests: list[Instance], disable_tqdm: bool = False
) -> list[tuple[float, bool]]:
"""Compute log-likelihood of generating a continuation from a context.
This calls `_loglikelihood_tokens` to compute the log likelihoods for a list of requests, after encoding.
"""
new_reqs = []
for context, continuation in [req.args for req in requests]:
if context == "":
......@@ -394,14 +424,38 @@ class TemplateLM(LM):
def loglikelihood_rolling(
self, requests, disable_tqdm: bool = False
) -> list[float]:
"""Compute rolling log-likelihood of a sequence using non-overlapping windows.
See LM.loglikelihood_rolling for more details.
"""
pass
@abc.abstractmethod
def generate_until(self, requests, disable_tqdm: bool = False) -> list[str]:
def generate_until(
self, requests: list[Instance], disable_tqdm: bool = False
) -> list[str]:
"""Generate until a stopping sequence.
Args:
requests: list[Instance]
A list of Instance objects with property `args` which returns a tuple (context, gen_kwargs).
context: str
Context string
gen_kwargs: dict
A dictionary of keyword arguments to pass to the generation function e.g. top_k, until, etc.
Returns:
list[continuation, ...]
A list of model generated continuations.
continuation: str
The generated continuation.
See LM.generate_until for more details.
"""
pass
def chat_template(self, chat_template: Union[bool, str] = False) -> Optional[str]:
def chat_template(self, chat_template: bool | str = False) -> str | None:
"""
Assumes tokenizer has a chat_template attribute (self.tokenizer.chat_template: dict | str)
Set and get the appropriate chat template for the model.
This method sets the tokenizer's chat_template and returns the template string for reproducibility.
......
This diff is collapsed.
from __future__ import annotations
import logging
import warnings
from collections.abc import Iterable, Sequence
from functools import partial
from typing import TYPE_CHECKING, Iterable, Optional, Union
from typing import TYPE_CHECKING, Any
import datasets
......@@ -18,9 +21,9 @@ class ContextSampler:
def __init__(
self,
docs: list[dict],
task: Union["Task", "ConfigurableTask"],
fewshot_indices: Optional[Iterable] = None,
rnd: Optional["Random"] = None,
task: Task | ConfigurableTask,
fewshot_indices: Iterable | None = None,
rnd: Random | None = None,
) -> None:
self.rnd = rnd
if not self.rnd:
......@@ -75,7 +78,7 @@ class ContextSampler:
)
self.docs = self.docs.select(fewshot_indices)
def get_context(self, doc: dict, num_fewshot: int, gen_prefix: str = None):
def get_context(self, doc: dict, num_fewshot: int, gen_prefix: str | None = None):
# draw an extra fewshot sample if using same split as evaluating on
prefix = gen_prefix + " " if gen_prefix else ""
n_samples = (
......@@ -95,10 +98,13 @@ class ContextSampler:
for doc in selected_docs:
doc_content = self.doc_to_text(doc)
doc_target = self.doc_to_target(doc)
if self.config.doc_to_choice is None or isinstance(doc_content, str):
if (
self.config.doc_to_choice is None and isinstance(doc_content, str)
) or isinstance(doc_content, str):
labeled_examples += doc_content
else:
labeled_examples += self.doc_to_choice(doc)[doc_content]
if isinstance(doc_content, int):
labeled_examples += self.doc_to_choice(doc)[doc_content]
if doc_target != "":
if self.target_delimiter.isspace() and str(doc_target)[0].isspace():
......@@ -126,7 +132,7 @@ class ContextSampler:
doc: dict,
num_fewshot: int,
fewshot_as_multiturn: bool = False,
gen_prefix: Optional[str] = None,
gen_prefix: str | None = None,
):
# TODO: Do we need any other delimiter
prefix = gen_prefix + " " if gen_prefix else ""
......@@ -181,16 +187,22 @@ class ContextSampler:
return chat_history
def sample(self, n: int):
# @classmethod
# def from_fewshot_dfg(cls, cfg: FewshotConfig):
# if not
def sample(self, n: int) -> Sequence[dict]:
"""
Draw `n` samples from our fewshot docs. This method should be overridden by subclasses.
"""
assert self.rnd is not None, (
"Error: `rnd` must be set to a random.Random instance before sampling."
)
return self.rnd.sample(self.docs, n)
class FirstNSampler(ContextSampler):
def sample(self, n: int) -> None:
def sample(self, n: int) -> Sequence[dict[str, Any]]:
"""
Draw the first `n` samples in order from the specified split.
Used for tasks with "canonical" ordered fewshot examples, such as MMLU and CMMLU.
......@@ -202,22 +214,22 @@ class FirstNSampler(ContextSampler):
class BalancedSampler(ContextSampler):
def sample(self, n: int) -> None:
def sample(self, n: int):
"""
TODO: this should return approximately class-balanced samples from our fewshot examples.
TODO: what order should they be in? maybe random?
"""
pass
raise NotImplementedError
class ManualSampler(ContextSampler):
def sample(self, n: int) -> None:
def sample(self, n: int):
""" """
pass
raise NotImplementedError
SAMPLER_REGISTRY = {
SAMPLER_REGISTRY: dict[str, type[ContextSampler]] = {
"default": ContextSampler,
"first_n": FirstNSampler,
}
......@@ -226,7 +238,7 @@ SAMPLER_REGISTRY = {
def get_sampler(name: str):
try:
return SAMPLER_REGISTRY[name]
except KeyError:
raise ValueError(
except KeyError as e:
raise KeyError(
f"Attempted to use contextsampler '{name}', but no sampling strategy for this name found! Supported model names: {', '.join(SAMPLER_REGISTRY.keys())}"
)
) from e
This diff is collapsed.
from __future__ import annotations
def check_gold_index_error(
choices: list[int] | list[str], gold: list[int] | int | str
) -> tuple[int | list[int], bool]:
gold_index_error = False
if isinstance(gold, list):
gold = [i if i < len(choices) else -100 for i in gold]
if -100 in gold:
gold_index_error = True
return gold, gold_index_error
else:
if isinstance(gold, int):
gold = gold if gold < len(choices) else -100
elif isinstance(gold, str):
gold = choices.index(gold) if gold in choices else -100
if gold == -100:
gold_index_error = True
return gold, gold_index_error
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment