Commit 51f27158 authored by lintangsutawika

update with merge

parents 924c9790 f5408b6b
name: Publish Python distribution to PyPI

on:
  push:
    tags:
      - '*'

jobs:
  build:
    name: Build distribution
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.x"
      - name: Install pypa/build
        run: >-
          python3 -m
          pip install
          build
          --user
      - name: Build a binary wheel and a source tarball
        run: python3 -m build
      - name: Store the distribution packages
        uses: actions/upload-artifact@v3
        with:
          name: python-package-distributions
          path: dist/

  publish-to-pypi:
    name: >-
      Publish Python distribution to PyPI
    if: startsWith(github.ref, 'refs/tags/')  # only publish to PyPI on tag pushes
    needs:
      - build
    runs-on: ubuntu-latest
    environment:
      name: pypi
      url: https://pypi.org/p/lm_eval
    permissions:
      id-token: write  # IMPORTANT: mandatory for trusted publishing
    steps:
      - name: Download all the dists
        uses: actions/download-artifact@v3
        with:
          name: python-package-distributions
          path: dist/
      - name: Publish distribution to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1

  publish-to-testpypi:
    name: Publish Python distribution to TestPyPI
    needs:
      - build
    runs-on: ubuntu-latest
    environment:
      name: testpypi
      url: https://test.pypi.org/p/lm_eval
    permissions:
      id-token: write  # IMPORTANT: mandatory for trusted publishing
    steps:
      - name: Download all the dists
        uses: actions/download-artifact@v3
        with:
          name: python-package-distributions
          path: dist/
      - name: Publish distribution to TestPyPI
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          repository-url: https://test.pypi.org/legacy/
@@ -56,7 +56,7 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -e '.[dev,anthropic,sentencepiece,optimum]' --extra-index-url https://download.pytorch.org/whl/cpu
# Install optional git dependencies
# pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt
# if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
...
@@ -45,26 +45,7 @@ git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
```
We also provide a number of optional dependencies for extended functionality. A detailed table is available at the end of this document.
We also provide a number of optional dependencies for extended functionality. Extras can be installed via `pip install -e ".[NAME]"`
| Name | Use |
|---------------|---------------------------------------|
| anthropic | For using Anthropic's models |
| dev | For linting PRs and contributions |
| gptq | For loading models with GPTQ |
| ifeval | For running the IFEval task |
| mamba | For loading Mamba SSM models |
| math | For running math task answer checking |
| multilingual | For multilingual tokenizers |
| openai | For using OpenAI's models |
| promptsource | For using PromptSource prompts |
| sentencepiece | For using the sentencepiece tokenizer |
| testing | For running library test suite |
| vllm | For loading models with vLLM |
| zeno | For visualizing results with Zeno |
|---------------|---------------------------------------|
| all | Loads all extras (not recommended) |
## Basic Usage
@@ -145,6 +126,9 @@ For more advanced users or even larger models, we allow for the following arguments:
These two options (`accelerate launch` and `parallelize=True`) are mutually exclusive.
**Note: we do not currently support multi-node evaluations natively, and advise using either an externally hosted server to run inference requests against, or creating a custom integration with your distributed framework [as is done for the GPT-NeoX library](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py).**
### Tensor + Data Parallel and Optimized Inference with `vLLM`
We also support vLLM for faster inference on [supported model types](https://docs.vllm.ai/en/latest/models/supported_models.html), especially faster when splitting a model across multiple GPUs. For single-GPU or multi-GPU — tensor parallel, data parallel, or a combination of both — inference, for example:
@@ -189,10 +173,10 @@ Note that for externally hosted models, configs such as `--device` and `--batch_size`
| [Llama.cpp](https://github.com/ggerganov/llama.cpp) (via [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)) | :heavy_check_mark: | `gguf`, `ggml` | [All models supported by llama.cpp](https://github.com/ggerganov/llama.cpp) | `generate_until`, `loglikelihood`, (perplexity evaluation not yet implemented) |
| vLLM | :heavy_check_mark: | `vllm` | [Most HF Causal Language Models](https://docs.vllm.ai/en/latest/models/supported_models.html) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Mamba | :heavy_check_mark: | `mamba_ssm` | [Mamba architecture Language Models via the `mamba_ssm` package](https://huggingface.co/state-spaces) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Huggingface Optimum (Causal LMs) | ✔️ | `openvino` | Any decoder-only AutoModelForCausalLM converted with Huggingface Optimum into OpenVINO™ Intermediate Representation (IR) format | `generate_until`, `loglikelihood`, `loglikelihood_rolling` | ... |
| Your local inference server! | :heavy_check_mark: | `local-completions` or `local-chat-completions` (using `openai-chat-completions` model type) | Any server address that accepts GET requests using HF models and mirrors OpenAI's Completions or ChatCompletions interface | `generate_until` | | ... |
Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.
For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface).
@@ -203,6 +187,8 @@ A number of other libraries contain scripts for calling the eval harness through
To create your own custom integration you can follow instructions from [this tutorial](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage).
### Additional Features
> [!Note]
> For tasks unsuitable for direct evaluation — either due to risks associated with executing untrusted code or complexities in the evaluation process — the `--predict_only` flag is available to obtain decoded generations for post-hoc evaluation.
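The same mode is reachable from the Python API added in this commit; the sketch below is illustrative only (the model and task names are placeholders, and `predict_only=True` forces `log_samples` internally):

```python
# Hedged sketch: collect decoded generations without computing metrics.
# "EleutherAI/pythia-160m" and "gsm8k" are placeholder choices.
import lm_eval
from lm_eval.tasks import initialize_tasks

initialize_tasks()  # register the built-in tasks first
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["gsm8k"],
    predict_only=True,  # metrics are bypassed; log_samples is forced on
)
# decoded outputs for post-hoc scoring are kept under results["samples"]
```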
If you have a Metal compatible Mac, you can run the eval harness using the MPS back-end by replacing `--device cuda:0` with `--device mps` (requires PyTorch version 2.1 or higher).
@@ -252,6 +238,9 @@ Additionally, one can provide a directory with `--use_cache` to cache the results
For a full list of supported arguments, check out the [interface](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md) guide in our documentation!
> [!Tip]
> Running lm-evaluation-harness as an external library and can't find (almost) any tasks available? Run `lm_eval.tasks.initialize_tasks()` to load the library's stock tasks before calling `lm_eval.evaluate()` or `lm_eval.simple_evaluate()` !
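For instance, a minimal sketch of the required ordering (hedged; the printed task listing is only illustrative):

```python
# Hedged sketch: stock tasks must be registered before evaluate()/simple_evaluate().
import lm_eval
from lm_eval.tasks import initialize_tasks

initialize_tasks()                          # populate the task registry
print(sorted(lm_eval.tasks.ALL_TASKS)[:5])  # the built-in tasks are now visible
```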
## Visualizing Results
You can use [Zeno](https://zenoml.com) to visualize the results of your eval harness runs.
@@ -315,6 +304,28 @@ We try to prioritize agreement with the procedures used by other groups to decrease
The best way to get support is to open an issue on this repo or join the [EleutherAI Discord server](https://discord.gg/eleutherai). The `#lm-thunderdome` channel is dedicated to developing this project and the `#release-discussion` channel is for receiving support for our releases. If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!
## Optional Extras
Extra dependencies can be installed via `pip install -e ".[NAME]"`
| Name | Use |
|---------------|---------------------------------------|
| anthropic | For using Anthropic's models |
| dev | For linting PRs and contributions |
| gptq | For loading models with GPTQ |
| ifeval | For running the IFEval task |
| mamba | For loading Mamba SSM models |
| math | For running math task answer checking |
| multilingual | For multilingual tokenizers |
| openai | For using OpenAI's models |
| optimum | For running Intel OpenVINO models |
| promptsource | For using PromptSource prompts |
| sentencepiece | For using the sentencepiece tokenizer |
| testing | For running library test suite |
| vllm | For loading models with vLLM |
| zeno | For visualizing results with Zeno |
|---------------|---------------------------------------|
| all | Loads all extras (not recommended) |
## Cite as
```
...
# Contributing to LM Evaluation Harness
Welcome, and thank you for your interest in the LM Evaluation Harness! We welcome contributions and feedback, appreciate the time you spend with our library, and hope you find it useful!
We intend LM Evaluation Harness to be a broadly useful and extensible tool for evaluating language models, and community contributions are a large part of how it gets there.
## Important Resources
There are several places information about LM Evaluation Harness is located:
- Our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs)
- We occasionally use [GitHub Milestones](https://github.com/EleutherAI/lm-evaluation-harness/milestones) to track progress toward specific near-term version releases.
- We maintain a [Project Board](https://github.com/orgs/EleutherAI/projects/25) for tracking current work items and PRs, and for future roadmap items or feature requests.
- Further discussion and support conversations are located in the #lm-thunderdome channel of the [EleutherAI discord](https://discord.gg/eleutherai).
## Code Style
LM Evaluation Harness uses [ruff](https://github.com/astral-sh/ruff) for linting via [pre-commit](https://pre-commit.com/).
You can install linters and dev tools via
```pip install lm_eval[dev]```
Then, run
```pre-commit install```
in order to ensure linters and other checks will be run upon committing.
## Testing
We use [pytest](https://docs.pytest.org/en/latest/) for running unit tests. All library unit tests can be run via:
```
python -m pytest --ignore=tests/tests_master --ignore=tests/extra
```
## Contributor License Agreement
We ask that new contributors agree to a Contributor License Agreement affirming that EleutherAI has the rights to use your contribution to our library.
First-time pull requests will have a reply added by @CLAassistant containing instructions for how to confirm this, and we require it before merging your PR.
## Contribution Best Practices
We recommend a few best practices to make your contributions or reported errors easier to assist with.
**For Pull Requests:**
- PRs should be titled descriptively, and be opened with a brief description of the scope and intent of the new contribution.
- New features should have appropriate documentation added alongside them.
- Aim for code maintainability, and minimize code copying.
- If opening a task, try to share test results on the task using a publicly-available model, and if any public results are available on the task, compare to them.
**For Feature Requests:**
- Provide a short paragraph's worth of description. What is the feature you are requesting? What is its motivation, and an example use case of it? How does this differ from what is currently supported?
**For Bug Reports**:
- Provide a short description of the bug.
- Provide a *reproducible example*--what is the command you run with our library that results in this error? Have you tried any other steps to resolve it?
- Provide a *full error traceback* of the error that occurs, if applicable. A one-line error message or small screenshot snippet is unhelpful without the surrounding context.
- Note what version of the codebase you are using, and any specifics of your environment and setup that may be relevant.
**For Requesting New Tasks**:
- Provide a 1-2 sentence description of what the task is and what it evaluates.
- Provide a link to the paper introducing the task.
- Provide a link to where the dataset can be found.
- Provide a link to a paper containing results on an open-source model on the task, for use in comparisons and implementation validation.
- If applicable, link to any codebase that has implemented the task (especially the original publication's codebase, if existent).
## How Can I Get Involved?
To quickly get started, we maintain a list of good first issues, which can be found [on our project board](https://github.com/orgs/EleutherAI/projects/25/views/8) or by [filtering GH Issues](https://github.com/EleutherAI/lm-evaluation-harness/issues?q=is%3Aopen+label%3A%22good+first+issue%22+label%3A%22help+wanted%22). These are typically smaller code changes or self-contained features which can be added without extensive familiarity with library internals, and we recommend new contributors consider taking a stab at one of these first if they are feeling uncertain where to begin.
There are a number of distinct ways to contribute to LM Evaluation Harness, and all are extremely helpful! A sampling of ways to contribute include:
- **Implementing and verifying new evaluation tasks**: Is there a task you'd like to see LM Evaluation Harness support? Consider opening an issue requesting it, or helping add it! Verifying and cross-checking task implementations with their original versions is also a very valuable form of assistance in ensuring standardized evaluation.
- **Improving documentation** - Improvements to the documentation, or noting pain points / gaps in documentation, are helpful in order for us to improve the user experience of the library and clarity + coverage of documentation.
- **Testing and devops** - We are very grateful for any assistance in adding tests for the library that can be run for new PRs, and other devops workflows.
- **Adding new modeling / inference library integrations** - We hope to support a broad range of commonly-used inference libraries popular among the community, and welcome PRs for new integrations, so long as they are documented properly and maintainable.
- **Proposing or Contributing New Features** - We want LM Evaluation Harness to support a broad range of evaluation usecases. If you have a feature that is not currently supported but desired, feel free to open an issue describing the feature and, if applicable, how you intend to implement it. We would be happy to give feedback on the cleanest way to implement new functionalities and are happy to coordinate with interested contributors via GH discussions or via discord.
We hope that this has been helpful, and appreciate your interest in contributing! Further questions can be directed to [our Discord](https://discord.gg/eleutherai).
@@ -44,6 +44,8 @@ This mode supports a number of command-line arguments, the details of which can
* `--include_path` : Accepts a path to a folder. If passed, then all YAML files containing `lm-eval` compatible task configurations will be added to the task registry as available tasks. Used for when one is writing config files for their own task in a folder other than `lm_eval/tasks/`
* `--predict_only`: Generates the model outputs without computing metrics. Use with `--log_samples` to retrieve decoded results.
## External Library Usage
We also support using the library's external API for use within model training loops or other scripts.
...
@@ -256,7 +256,7 @@ metric_list:
```
`aggregation` and `higher_is_better` can optionally be left out to default to the manually-set defaults if using a natively supported metric, otherwise it must be defined explicitly (for example, when using a custom metric implemented as a function).
For a full list of natively supported metrics and aggregation functions see [`docs/task_guide.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md). All metrics supported in [HuggingFace Evaluate](https://github.com/huggingface/evaluate/tree/main/metrics) can also be used, and will be loaded if a given metric name is not one natively supported in `lm-eval` or `hf_evaluate` is set to `true`.
### Optional, More Advanced Setup
@@ -269,7 +269,7 @@ As a heuristic check:
* Do you expect to compute metrics after applying multiple such processing steps on your model outputs?
* Does your task rely on metrics that need a custom implementation?
For more detail on the task system and advanced features, see [`docs/task_guide.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md). If none of the above sound like they apply to your task, it's time to continue onto checking your task performance!
### Task name + groups (registering a task)
...
@@ -143,6 +143,13 @@ def parse_eval_args() -> argparse.Namespace:
metavar="CRITICAL|ERROR|WARNING|INFO|DEBUG",
help="Controls the reported logging error level. Set to DEBUG when testing + adding new task configurations for comprehensive log output.",
)
parser.add_argument(
"--predict_only",
"-x",
action="store_true",
default=False,
help="Use with --log_samples. Only model outputs will be saved and metrics will not be evaluated.",
)
return parser.parse_args()
@@ -156,6 +163,11 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
eval_logger.info(f"Verbosity set to {args.verbosity}")
os.environ["TOKENIZERS_PARALLELISM"] = "false"
if args.predict_only:
args.log_samples = True
if (args.log_samples or args.predict_only) and not args.output_path:
assert args.output_path, "Specify --output_path"
initialize_tasks(args.verbosity)
if args.limit:
@@ -171,7 +183,7 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
task_names = ALL_TASKS
elif args.tasks == "list":
eval_logger.info(
f"Available Tasks:\n - {(os.linesep + ' - ').join(sorted(ALL_TASKS))}"
)
sys.exit()
else:
@@ -223,8 +235,6 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
else:
path.mkdir(parents=True, exist_ok=True)
output_path_file = path.joinpath("results.json")
elif args.log_samples and not args.output_path:
assert args.output_path, "Specify --output_path"
eval_logger.info(f"Selected Tasks: {task_names}") eval_logger.info(f"Selected Tasks: {task_names}")
...@@ -243,6 +253,7 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None: ...@@ -243,6 +253,7 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
write_out=args.write_out, write_out=args.write_out,
log_samples=args.log_samples, log_samples=args.log_samples,
gen_kwargs=args.gen_kwargs, gen_kwargs=args.gen_kwargs,
predict_only=args.predict_only,
)
if results is not None:
@@ -257,7 +268,7 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
batch_sizes = ",".join(map(str, results["config"]["batch_sizes"]))
if args.output_path:
output_path_file.open("w", encoding="utf-8").write(dumped)
if args.log_samples:
for task_name, config in results["configs"].items():
...
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Iterable, List, Union
from datasets import Dataset
from lm_eval.api.instance import Instance
class Filter(ABC):
"""
Filter classes operate on a per-task level.
They take all model outputs (`instance.resps` for all `task.instances`)
@@ -15,12 +14,13 @@ class Filter:
"""
def __init__(self, **kwargs) -> None:
"""
Can define custom behavior here, if an individual instantiation of a Filter class should have state.
"""
@abstractmethod
def apply(self, resps: Union[List, Iterable], docs: List[dict]) -> Iterable:
""" """
Defines the operation to perform on a list of the `inst.resps` properties of `Instance` objects. Defines the operation to perform on a list of the `inst.resps` properties of `Instance` objects.
Should return the list of (filtered) response lists *in the same order as they were input*, e.g. Should return the list of (filtered) response lists *in the same order as they were input*, e.g.
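As a concrete illustration of this interface, a hypothetical subclass (not part of this commit) only needs to preserve the ordering and nesting of `resps`:

```python
# Hypothetical Filter subclass sketching the abstract interface above.
from lm_eval.api.filter import Filter


class StripWhitespaceFilter(Filter):
    """Strip leading/trailing whitespace from every model response."""

    def apply(self, resps, docs):
        # resps holds one list of responses per doc; keep order and nesting intact
        return [[r.strip() for r in doc_resps] for doc_resps in resps]
```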
@@ -40,15 +40,15 @@ class FilterEnsemble:
"""
name: str
filters: List[Callable[[], Filter]]
def apply(self, instances: List[Instance]) -> None:
resps, docs = zip(*((inst.resps, inst.doc) for inst in instances))
resps, docs = list(resps), list(docs)
def apply(self, instances: List[Instance], docs: List[Dataset]) -> None:
resps = [
inst.resps for inst in instances
] # operate just on the model responses
for f in self.filters:
# apply filters in sequence
resps = f().apply(resps, docs)
# add the end results after filtering to filtered_requests of their respective source instances.
# has key `self.name`: each FilterEnsemble applied in a given run should use a different name.
...
@@ -4,7 +4,12 @@ from typing import Literal, Tuple
@dataclass
class Instance:
request_type: Literal[
"loglikelihood",
"loglikelihood_rolling",
"generate_until",
"multiple_choice",
]
doc: dict
arguments: tuple
idx: int
...
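A hedged construction sketch for the dataclass above (field values are invented; fields elided from the hunk keep their defaults):

```python
# Illustrative Instance for a loglikelihood request; doc and arguments are made up.
from lm_eval.api.instance import Instance

inst = Instance(
    request_type="loglikelihood",
    doc={"question": "2+2=", "choices": ["3", "4"], "gold": 1},
    arguments=("2+2=", " 4"),  # (context, continuation) pair scored by the LM
    idx=0,
)
```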
@@ -15,6 +15,11 @@ eval_logger = logging.getLogger("lm-eval")
# Register Aggregations First
@register_aggregation("bypass")
def bypass_agg(arr):
return 999
@register_aggregation("mean") @register_aggregation("mean")
def mean(arr): def mean(arr):
return sum(arr) / len(arr) return sum(arr) / len(arr)
...@@ -207,6 +212,16 @@ def mean_stderr(arr): ...@@ -207,6 +212,16 @@ def mean_stderr(arr):
return sample_stddev(arr) / math.sqrt(len(arr)) return sample_stddev(arr) / math.sqrt(len(arr))
@register_metric(
metric="bypass",
higher_is_better=True,
output_type=["loglikelihood", "multiple_choice", "generate_until"],
aggregation="bypass",
)
def bypass(items):
return None
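The same decorator pattern registers ordinary scoring metrics as well. The sketch below is purely illustrative: the metric and aggregation names are invented, and the per-sample `(gold, prediction)` calling convention is an assumption about how non-HF metrics are invoked.

```python
# Illustrative sketch only; not part of this commit.
from lm_eval.api.registry import register_aggregation, register_metric


@register_aggregation("fraction_true")
def fraction_true(arr):
    # aggregate per-sample 0/1 scores into a single value
    return sum(arr) / len(arr)


@register_metric(
    metric="starts_with_gold",   # invented metric name
    higher_is_better=True,
    output_type=["generate_until"],
    aggregation="fraction_true",
)
def starts_with_gold(items):
    gold, prediction = items     # assumed per-sample calling convention
    return float(str(prediction).strip().startswith(str(gold).strip()))
```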
@register_metric(
metric="mcc",
higher_is_better=True,
...
@@ -152,18 +152,14 @@ def get_aggregation(name):
try:
return AGGREGATION_REGISTRY[name]
except KeyError:
eval_logger.warning(f"{name} not a registered aggregation metric!")
"{} not a registered aggregation metric!".format(name),
)
def get_metric_aggregation(name):
try:
return METRIC_AGGREGATION_REGISTRY[name]
except KeyError:
eval_logger.warning(f"{name} metric is not assigned a default aggregation!")
"{} metric is not assigned a default aggregation!".format(name),
)
def is_higher_better(metric_name):
...
@@ -5,6 +5,7 @@ import random
import re
from collections.abc import Callable
from dataclasses import asdict, dataclass
from inspect import getsource
from typing import Any, List, Literal, Tuple, Union
import datasets
@@ -37,7 +38,6 @@ ALL_OUTPUT_TYPES = [
"generate_until",
]
eval_logger = logging.getLogger("lm-eval")
@@ -74,7 +74,12 @@ class TaskConfig(dict):
num_fewshot: int = None
# scoring options
metric_list: list = None
output_type: Literal[
"loglikelihood",
"loglikelihood_rolling",
"generate_until",
"multiple_choice",
] = "generate_until"
generation_kwargs: dict = None
repeats: int = 1
filter_list: Union[str, list] = None
@@ -110,15 +115,13 @@ class TaskConfig(dict):
"do_sample": False,
}
# TODO: how to make TaskConfigs be de- and re-serializable, even when using the !function constructor?
def __getitem__(self, item):
return getattr(self, item)
def __setitem__(self, item, value):
return setattr(self, item, value)
def to_dict(self, keep_callable: bool = False) -> dict:
"""dumps the current config as a dictionary object, as a printable format.
null fields will not be printed.
Used for dumping results alongside full task configuration
@@ -133,14 +136,34 @@ class TaskConfig(dict):
for k, v in list(cfg_dict.items()):
if v is None:
cfg_dict.pop(k)
elif k == "metric_list":
for metric_dict in v:
for metric_key, metric_value in metric_dict.items():
if callable(metric_value):
metric_dict[metric_key] = self.serialize_function(
metric_value, keep_callable=keep_callable
)
cfg_dict[k] = v
elif callable(v):
cfg_dict[k] = self.serialize_function(v, keep_callable=keep_callable)
return cfg_dict
def serialize_function(
self, value: Union[Callable, str], keep_callable=False
) -> Union[Callable, str]:
"""Serializes a given function or string.
If 'keep_callable' is True, the original callable is returned.
Otherwise, attempts to return the source code of the callable using 'getsource'.
"""
if keep_callable:
return value
else:
try:
return getsource(value)
except (TypeError, OSError):
return str(value)
class Task(abc.ABC):
"""A task represents an entire benchmark including its dataset, problems,
@@ -490,7 +513,7 @@ class Task(abc.ABC):
def apply_filters(self):
if hasattr(self, "_filters"):
for f in self._filters:
f.apply(self._instances)
else:
eval_logger.warning("No filter defined, passing through instances")
return self._instances
@@ -626,16 +649,15 @@ class ConfigurableTask(Task):
if self.config.filter_list is not None:
self._filters = []
for filter_config in self.config.filter_list:
filter_name = filter_config["name"]
filter_functions = filter_config["filter"]
components = []
for function in filter_functions:
kwargs = {
key: function[key] for key in function if key != "function"
}
components.append([function["function"], kwargs])
filter_pipeline = build_filter_ensemble(filter_name, components)
self._filters.append(filter_pipeline)
else:
self._filters = [build_filter_ensemble("none", [["take_first", None]])]
@@ -813,7 +835,7 @@ class ConfigurableTask(Task):
def apply_filters(self):
if hasattr(self, "_filters"):
for f in self._filters:
f.apply(self._instances)
else:
eval_logger.warning("No filter defined, passing through instances")
return self._instances
@@ -1191,12 +1213,46 @@ class ConfigurableTask(Task):
return result_dict
def aggregation(self) -> dict:
return self._aggregation_list
def higher_is_better(self) -> dict:
return self._higher_is_better
def get_config(self, key: str) -> Any:
return getattr(self._config, key, None)
def override_metric(self, metric_name: str) -> None:
"""
Override the default metrics used for evaluation with custom metrics.
Parameters:
- metric_name (str): The name of the custom metric to override. Should be registered in api.metrics.
"""
(
self._metric_fn_list,
self._aggregation_list,
self._metric_fn_kwargs,
self._higher_is_better,
) = ({}, {}, {}, {})
self._metric_fn_list[metric_name] = get_metric(metric_name)
self._aggregation_list[metric_name] = get_metric_aggregation(metric_name)
self._higher_is_better[metric_name] = is_higher_better(metric_name)
self._metric_fn_kwargs[metric_name] = {}
setattr(self._config, "metric_list", [{"metric": metric_name}])
setattr(self._config, "process_results", None)
def override_config(
self, key: str = None, value: Any = None, update: bool = False
) -> None:
if update:
current_value = getattr(self._config, key)
assert isinstance(current_value, dict)
current_value.update(value)
setattr(self._config, key, current_value)
else:
setattr(self._config, key, value)
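A hedged usage sketch of these new hooks, mirroring how the evaluator changes further below apply them (the task name is a placeholder, and grouped tasks may be returned as `(group, task)` tuples that need unpacking first):

```python
# Hedged sketch: fetch a registered task and adjust it via the override hooks.
import lm_eval.tasks as tasks

tasks.initialize_tasks()
task_obj = tasks.get_task_dict(["gsm8k"])["gsm8k"]  # placeholder task name

task_obj.override_metric(metric_name="bypass")      # keep outputs, score nothing
task_obj.override_config(key="num_fewshot", value=5)
task_obj.override_config(
    key="generation_kwargs",
    value={"do_sample": False},
    update=True,  # merge into the existing generation_kwargs dict
)
```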
class MultipleChoiceTask(Task):
OUTPUT_TYPE: str = "loglikelihood"
...
@@ -30,7 +30,9 @@ class Archive:
self.cctx = zstandard.ZstdCompressor(level=compression_level)
self.compressor = self.cctx.stream_writer(self.fh)
def add_data(self, data, meta=None) -> None:
if meta is None:
meta = {}
self.compressor.write(
json.dumps({"text": data, "meta": meta}, default=json_serial).encode(
"UTF-8"
@@ -108,7 +110,7 @@ class TextReader:
def read_tqdm(self, update_frequency: int = 10000):
current_file_position = 0
line_counter = 0
with open(self.file_path, "r", encoding="utf-8") as fh, tqdm.tqdm(
total=os.path.getsize(self.file_path),
dynamic_ncols=True,
unit="byte",
...
@@ -38,7 +38,7 @@ def get_train_overlap(docs_by_task_set: dict, ngrams_path: str, limit: int) -> dict:
# return get_train_overlap_stub(docs, ngrams_path, ngrams_n_size)
info_dict_path = os.path.join(ngrams_path, "info.json")
info_dict = json.load(open(info_dict_path, "r", encoding="utf-8"))
ngrams_n_size = info_dict["ngram_size"]
janitor = Janitor()
...
@@ -25,7 +25,7 @@ from lm_eval.utils import (
def simple_evaluate(
model,
model_args=None,
tasks=None,
num_fewshot=None,
batch_size=None,
max_batch_size=None,
@@ -38,6 +38,7 @@ def simple_evaluate(
write_out: bool = False,
log_samples: bool = True,
gen_kwargs: str = None,
predict_only: bool = False,
):
"""Instantiate and evaluate a model on a list of tasks.
@@ -71,6 +72,9 @@ def simple_evaluate(
:param gen_kwargs: str
String arguments for model generation
Ignored for all tasks with loglikelihood output_type
:param predict_only: bool
If true only model outputs will be generated and returned. Metrics will not be evaluated
:return
Dictionary of results
"""
@@ -80,6 +84,8 @@ def simple_evaluate(
1234
) # TODO: this may affect training runs that are run with evaluation mid-run.
if tasks is None:
tasks = []
assert (
tasks != []
), "No tasks specified, or no tasks found. Please verify the task names."
@@ -87,7 +93,7 @@ def simple_evaluate(
if gen_kwargs is not None:
gen_kwargs = simple_parse_args_string(gen_kwargs)
eval_logger.warning(
"generation_kwargs specified through cli, these settings will update set parameters in yaml tasks. Ensure 'do_sample=True' for non-greedy decoding!"
)
if gen_kwargs == "":
gen_kwargs = None
@@ -122,27 +128,35 @@ def simple_evaluate(
task_dict = lm_eval.tasks.get_task_dict(tasks)
for task_name in task_dict.keys():
task_obj = task_dict[task_name]
if isinstance(task_obj, tuple):
_, task_obj = task_obj
if task_obj is None:
continue
if task_obj.get_config("output_type") == "generate_until":
if gen_kwargs is not None:
task_obj.override_config(
key="generation_kwargs", value=gen_kwargs, update=True
)
if predict_only:
log_samples = True
eval_logger.info(
f"Processing {task_name} in output-only mode. Metrics will not be calculated!"
)
# we have to change the class properties post-hoc. This is pretty hacky.
task_obj.override_metric(metric_name="bypass")
if num_fewshot is not None:
if (default_num_fewshot := task_obj.get_config("num_fewshot")) == 0:
eval_logger.info(
f"num_fewshot has been set to 0 for {task_name} in its config. Manual configuration will be ignored."
)
else:
default_num_fewshot = config["num_fewshot"]
eval_logger.warning(
f"Overwriting default num_fewshot of {task_name} from {default_num_fewshot} to {num_fewshot}"
)
task_obj.override_config(key="num_fewshot", value=num_fewshot)
task_obj._config["num_fewshot"] = num_fewshot
if check_integrity:
run_task_tests(task_list=tasks)
@@ -218,6 +232,14 @@ def evaluate(
# decontaminate = decontamination_ngrams_path is not None
for task_name, task in task_dict.items():
if isinstance(task, tuple):
_, task = task
if not log_samples:
assert (
"bypass" not in getattr(task, "_metric_fn_list", {}).keys()
), f"log_samples must be True for 'bypass' only tasks: {task_name}"
# stores the final result for each task, for each metric/filter pair.
results = collections.defaultdict(dict)
# Tracks each task's version.
@@ -242,7 +264,7 @@ def evaluate(
# get lists of each type of request
for task_name, task in task_dict.items():
if isinstance(task, tuple):
group_name, task = task
task_hierarchy[group_name].append(task_name)
versions[group_name] = "N/A"
@@ -316,7 +338,7 @@ def evaluate(
### Run LM on inputs, get all outputs ###
# execute each type of request
for reqtype, reqs in requests.items():
eval_logger.info(f"Running {reqtype} requests")
# create `K` copies of each request `req` based off `K = req.repeats`
cloned_reqs = []
for req in reqs:
@@ -339,7 +361,7 @@ def evaluate(
### Postprocess outputs ###
# TODO: del model here, maybe (idea: allow user to specify device of e.g. reward model separately)
for task_name, task in task_dict.items():
if isinstance(task, tuple):
group, task = task
if task is None:
continue
@@ -350,7 +372,7 @@ def evaluate(
# unpack results and sort back in order and return control to Task
for task_name, task in task_dict.items():
if isinstance(task, tuple):
group, task = task
if task is None:
continue
@@ -401,7 +423,7 @@ def evaluate(
vals_torch = collections.defaultdict(list)
for (task_name, key, metric), items in vals.items():
numitem = 0
if isinstance(items[0], tuple):
numitem = len(items[0])
if isinstance(items[0], (str, list, tuple)):
@@ -447,7 +469,7 @@ def evaluate(
task = task_dict[task_name]
metric_key = metric + "," + key
if isinstance(task, tuple):
group_name, task = task
else:
group_name = None
@@ -474,7 +496,8 @@ def evaluate(
if bool(results):
for group, task_list in reversed(task_hierarchy.items()):
if task_list == []:
# TODO: No samples when bypass
total_size = results[group].get("samples", 999)
else:
total_size = 0
@@ -510,7 +533,7 @@ def evaluate(
+ metric_score * current_size
) / (total_size + current_size)
# $$s_z^2 = \frac{(n-1) s_x^2 + (m-1) s_y^2}{n+m-1} + \frac{nm(\bar x - \bar y)^2}{(n+m)(n+m-1)}.$$
if var_score == "N/A" or results[group][stderr] == "N/A":
results[group][stderr] = "N/A"
else:
results[group][stderr] = (
...
from typing import List, Union
from functools import partial
from lm_eval.api.filter import FilterEnsemble
from . import selection
from . import extraction
@@ -20,24 +23,25 @@ FILTER_REGISTRY = {
}
def get_filter(filter_name: str) -> Union[type, str]:
if filter_name in FILTER_REGISTRY:
return FILTER_REGISTRY[filter_name]
else:
return filter_name
def build_filter_ensemble(
filter_name: str, components: List[List[str]]
) -> FilterEnsemble:
""" """
Create a filtering pipeline. Create a filtering pipeline.
""" """
filters = [] filters = []
for function, kwargs in components: for function, kwargs in components:
if kwargs is None: if kwargs is None:
f = get_filter(function)() kwargs = {}
else: # create a filter given its name in the registry
# create a filter given its name in the registry f = partial(get_filter(function), **kwargs)
f = get_filter(function)(**kwargs) # TODO: pass kwargs to filters properly
# add the filter as a pipeline step # add the filter as a pipeline step
filters.append(f) filters.append(f)
......
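A hedged usage sketch of the refactored helper: component kwargs of `None` are treated as an empty dict, and each filter is now built lazily via `functools.partial` and instantiated inside `FilterEnsemble.apply`.

```python
# Sketch: build a named filter pipeline from registry names plus kwargs.
from lm_eval.filters import build_filter_ensemble

ensemble = build_filter_ensemble("take_first_resp", [["take_first", None]])
# ensemble.apply(task_instances) would store the filtered responses on each
# instance under instance.filtered_resps["take_first_resp"].
```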
@@ -17,12 +17,14 @@ class TakeFirstFilter(Filter):
class TakeKFilter(Filter):
def __init__(self, **kwargs) -> None:
self.k = kwargs.pop("k")
super().__init__(**kwargs)
def apply(self, resps, docs):
# need resp to be subscriptable to check below
resps = list(resps)
# check we have at least k responses per doc, else we can't take the first k
assert (
len(resps[0]) >= self.k
...
@@ -24,7 +24,7 @@ class UppercaseFilter(Filter):
class MapFilter(Filter):
def __init__(self, mapping_dict: dict = None, default_value=None) -> None:
"""
Initializes the MapFilter with a given mapping dictionary and default value.
@@ -37,6 +37,8 @@ class MapFilter(Filter):
Example:
mapper = MapFilter({'A': 1, 'B': 2}, default_value=0)
"""
if mapping_dict is None:
mapping_dict = {}
assert isinstance(
mapping_dict, dict
), "Provided mapping_dict is not a dictionary"
...
@@ -6,5 +6,6 @@ from . import anthropic_llms
from . import gguf
from . import vllm_causallms
from . import mamba_lm
from . import optimum_lm
# TODO: implement __all__ # TODO: implement __all__
@@ -200,8 +200,9 @@ class HFLM(LM):
)
# access self._model through self.model property outside this method
if isinstance(self.model, torch.nn.Module):
self.model.eval()
self.model.tie_weights()
if isinstance(pretrained, str) and (gpus >= 1 or str(self.device) == "mps"):
# TODO: can remove this whole snippet except in the mps case, perhaps?
@@ -238,6 +239,16 @@ class HFLM(LM):
if self.config.model_type == "qwen":
# Qwen's trust_remote_code tokenizer does not allow for adding special tokens
self.tokenizer.pad_token = "<|endoftext|>"
elif (
self.tokenizer.__class__.__name__ == "RWKVWorldTokenizer"
or self.tokenizer.__class__.__name__ == "Rwkv5Tokenizer"
):
# The RWKV world tokenizer does not allow for adding special tokens / setting the pad token (which is set as 0)
# The additional tokenizer name check is needed, as there exists rwkv4 models with neox tokenizer
# ---
# Note that the world tokenizer class name might change in the future for the final huggingface merge
# https://github.com/huggingface/transformers/pull/26963
assert self.tokenizer.pad_token_id == 0
else:
self.tokenizer.add_special_tokens({"pad_token": "<|pad|>"})
@@ -602,8 +613,7 @@ class HFLM(LM):
(batch_size, max_length), device=self.device
).long()
for _ in range(5):
out = F.log_softmax(self._model_call(test_batch, **call_kwargs), dim=-1)  # noqa: F841
out = out # Identity process so that it passes pre-commit
return batch_size
@@ -705,10 +715,14 @@ class HFLM(LM):
return self.model(inps).logits
def _model_generate(self, context, max_length, stop, **generation_kwargs):
# temperature = 0.0 if not set
# if do_sample is false and temp==0.0:
# remove temperature, as do_sample=False takes care of this
# and we don't want a warning from HF
generation_kwargs["temperature"] = generation_kwargs.get("temperature", 0.0)
do_sample = generation_kwargs.get("do_sample", None)
if do_sample is False and generation_kwargs.get("temperature") == 0.0:
generation_kwargs.pop("temperature")
# build stopping criteria
stopping_criteria = stop_sequences_criteria(
self.tokenizer, stop, context.shape[1], context.shape[0]
...