Commit 51f27158 authored by lintangsutawika

update with merge

parents 924c9790 f5408b6b
name: Publish Python distribution to PyPI

on:
  push:
    tags:
      - '*'

jobs:
  build:
    name: Build distribution
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.x"
      - name: Install pypa/build
        run: >-
          python3 -m
          pip install
          build
          --user
      - name: Build a binary wheel and a source tarball
        run: python3 -m build
      - name: Store the distribution packages
        uses: actions/upload-artifact@v3
        with:
          name: python-package-distributions
          path: dist/

  publish-to-pypi:
    name: >-
      Publish Python distribution to PyPI
    if: startsWith(github.ref, 'refs/tags/')  # only publish to PyPI on tag pushes
    needs:
      - build
    runs-on: ubuntu-latest
    environment:
      name: pypi
      url: https://pypi.org/p/lm_eval
    permissions:
      id-token: write  # IMPORTANT: mandatory for trusted publishing
    steps:
      - name: Download all the dists
        uses: actions/download-artifact@v3
        with:
          name: python-package-distributions
          path: dist/
      - name: Publish distribution to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1

  publish-to-testpypi:
    name: Publish Python distribution to TestPyPI
    needs:
      - build
    runs-on: ubuntu-latest
    environment:
      name: testpypi
      url: https://test.pypi.org/p/lm_eval
    permissions:
      id-token: write  # IMPORTANT: mandatory for trusted publishing
    steps:
      - name: Download all the dists
        uses: actions/download-artifact@v3
        with:
          name: python-package-distributions
          path: dist/
      - name: Publish distribution to TestPyPI
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          repository-url: https://test.pypi.org/legacy/
@@ -56,7 +56,7 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -e '.[dev,anthropic,sentencepiece,optimum]' --extra-index-url https://download.pytorch.org/whl/cpu
# Install optional git dependencies
# pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt
# if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
...
@@ -45,26 +45,7 @@ git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
```
We also provide a number of optional dependencies for extended functionality. A detailed table is available at the end of this document.
We also provide a number of optional dependencies for extended functionality. Extras can be installed via `pip install -e ".[NAME]"`
| Name | Use |
|---------------|---------------------------------------|
| anthropic | For using Anthropic's models |
| dev | For linting PRs and contributions |
| gptq | For loading models with GPTQ |
| ifeval | For running the IFEval task |
| mamba | For loading Mamba SSM models |
| math | For running math task answer checking |
| multilingual | For multilingual tokenizers |
| openai | For using OpenAI's models |
| promptsource | For using PromptSource prompts |
| sentencepiece | For using the sentencepiece tokenizer |
| testing | For running library test suite |
| vllm | For loading models with vLLM |
| zeno | For visualizing results with Zeno |
|---------------|---------------------------------------|
| all | Loads all extras (not recommended) |
## Basic Usage
@@ -145,6 +126,9 @@ For more advanced users or even larger models, we allow for the following arguments:
These two options (`accelerate launch` and `parallelize=True`) are mutually exclusive.
**Note: we do not currently support multi-node evaluations natively, and advise using either an externally hosted server to run inference requests against, or creating a custom integration with your distributed framework [as is done for the GPT-NeoX library](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py).**
### Tensor + Data Parallel and Optimized Inference with `vLLM`
We also support vLLM for faster inference on [supported model types](https://docs.vllm.ai/en/latest/models/supported_models.html), especially faster when splitting a model across multiple GPUs. For single-GPU or multi-GPU — tensor parallel, data parallel, or a combination of both — inference, for example:
@@ -189,10 +173,10 @@ Note that for externally hosted models, configs such as `--device` and `--batch_size`
| [Llama.cpp](https://github.com/ggerganov/llama.cpp) (via [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)) | :heavy_check_mark: | `gguf`, `ggml` | [All models supported by llama.cpp](https://github.com/ggerganov/llama.cpp) | `generate_until`, `loglikelihood`, (perplexity evaluation not yet implemented) |
| vLLM | :heavy_check_mark: | `vllm` | [Most HF Causal Language Models](https://docs.vllm.ai/en/latest/models/supported_models.html) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Mamba | :heavy_check_mark: | `mamba_ssm` | [Mamba architecture Language Models via the `mamba_ssm` package](https://huggingface.co/state-spaces) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Huggingface Optimum (Causal LMs) | ✔️ | `openvino` | Any decoder-only AutoModelForCausalLM converted with Huggingface Optimum into OpenVINO™ Intermediate Representation (IR) format | `generate_until`, `loglikelihood`, `loglikelihood_rolling` | ... |
| Your local inference server! | :heavy_check_mark: | `local-completions` or `local-chat-completions` (using `openai-chat-completions` model type) | Any server address that accepts GET requests using HF models and mirrors OpenAI's Completions or ChatCompletions interface | `generate_until` | | ... |
Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.
For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface).
@@ -203,6 +187,8 @@ A number of other libraries contain scripts for calling the eval harness through
To create your own custom integration you can follow instructions from [this tutorial](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage).
### Additional Features
> [!Note]
> For tasks unsuitable for direct evaluation — either due to risks associated with executing untrusted code or complexities in the evaluation process — the `--predict_only` flag is available to obtain decoded generations for post-hoc evaluation.
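The same mode is reachable from the Python API added in this commit; the sketch below is illustrative only (the model and task names are placeholders, and `predict_only=True` forces `log_samples` internally):

```python
# Hedged sketch: collect decoded generations without computing metrics.
# "EleutherAI/pythia-160m" and "gsm8k" are placeholder choices.
import lm_eval
from lm_eval.tasks import initialize_tasks

initialize_tasks()  # register the built-in tasks first
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["gsm8k"],
    predict_only=True,  # metrics are bypassed; log_samples is forced on
)
# decoded outputs for post-hoc scoring are kept under results["samples"]
```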
If you have a Metal compatible Mac, you can run the eval harness using the MPS back-end by replacing `--device cuda:0` with `--device mps` (requires PyTorch version 2.1 or higher).
@@ -252,6 +238,9 @@ Additionally, one can provide a directory with `--use_cache` to cache the results
For a full list of supported arguments, check out the [interface](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md) guide in our documentation!
> [!Tip]
> Running lm-evaluation-harness as an external library and can't find (almost) any tasks available? Run `lm_eval.tasks.initialize_tasks()` to load the library's stock tasks before calling `lm_eval.evaluate()` or `lm_eval.simple_evaluate()` !
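For instance, a minimal sketch of the required ordering (hedged; the printed task listing is only illustrative):

```python
# Hedged sketch: stock tasks must be registered before evaluate()/simple_evaluate().
import lm_eval
from lm_eval.tasks import initialize_tasks

initialize_tasks()                          # populate the task registry
print(sorted(lm_eval.tasks.ALL_TASKS)[:5])  # the built-in tasks are now visible
```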
## Visualizing Results
You can use [Zeno](https://zenoml.com) to visualize the results of your eval harness runs.
@@ -315,6 +304,28 @@ We try to prioritize agreement with the procedures used by other groups to decrease
The best way to get support is to open an issue on this repo or join the [EleutherAI Discord server](https://discord.gg/eleutherai). The `#lm-thunderdome` channel is dedicated to developing this project and the `#release-discussion` channel is for receiving support for our releases. If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!
## Optional Extras
Extra dependencies can be installed via `pip install -e ".[NAME]"`
| Name | Use |
|---------------|---------------------------------------|
| anthropic | For using Anthropic's models |
| dev | For linting PRs and contributions |
| gptq | For loading models with GPTQ |
| ifeval | For running the IFEval task |
| mamba | For loading Mamba SSM models |
| math | For running math task answer checking |
| multilingual | For multilingual tokenizers |
| openai | For using OpenAI's models |
| optimum | For running Intel OpenVINO models |
| promptsource | For using PromptSource prompts |
| sentencepiece | For using the sentencepiece tokenizer |
| testing | For running library test suite |
| vllm | For loading models with vLLM |
| zeno | For visualizing results with Zeno |
|---------------|---------------------------------------|
| all | Loads all extras (not recommended) |
## Cite as
```
...
# Contributing to LM Evaluation Harness
Welcome, and thank you for your interest in the LM Evaluation Harness! We welcome contributions and feedback, appreciate the time you spend with our library, and hope you find it useful!
We intend LM Evaluation Harness to be a broadly useful and extensible tool for evaluating language models, and community contributions are a large part of how it gets there.
## Important Resources
There are several places information about LM Evaluation Harness is located:
- Our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs)
- We occasionally use [GitHub Milestones](https://github.com/EleutherAI/lm-evaluation-harness/milestones) to track progress toward specific near-term version releases.
- We maintain a [Project Board](https://github.com/orgs/EleutherAI/projects/25) for tracking current work items and PRs, and for future roadmap items or feature requests.
- Further discussion and support conversations are located in the #lm-thunderdome channel of the [EleutherAI discord](https://discord.gg/eleutherai).
## Code Style
LM Evaluation Harness uses [ruff](https://github.com/astral-sh/ruff) for linting via [pre-commit](https://pre-commit.com/).
You can install linters and dev tools via
```pip install lm_eval[dev]```
Then, run
```pre-commit install```
in order to ensure linters and other checks will be run upon committing.
## Testing
We use [pytest](https://docs.pytest.org/en/latest/) for running unit tests. All library unit tests can be run via:
```
python -m pytest --ignore=tests/tests_master --ignore=tests/extra
```
## Contributor License Agreement
We ask that new contributors agree to a Contributor License Agreement affirming that EleutherAI has the rights to use your contribution to our library.
First-time pull requests will have a reply added by @CLAassistant containing instructions for how to confirm this, and we require it before merging your PR.
## Contribution Best Practices
We recommend a few best practices to make your contributions or reported errors easier to assist with.
**For Pull Requests:**
- PRs should be titled descriptively, and be opened with a brief description of the scope and intent of the new contribution.
- New features should have appropriate documentation added alongside them.
- Aim for code maintainability, and minimize code copying.
- If opening a task, try to share test results on the task using a publicly-available model, and if any public results are available on the task, compare to them.
**For Feature Requests:**
- Provide a short paragraph's worth of description. What is the feature you are requesting? What is its motivation, and an example use case of it? How does this differ from what is currently supported?
**For Bug Reports**:
- Provide a short description of the bug.
- Provide a *reproducible example*--what is the command you run with our library that results in this error? Have you tried any other steps to resolve it?
- Provide a *full error traceback* of the error that occurs, if applicable. A one-line error message or small screenshot snippet is unhelpful without the surrounding context.
- Note what version of the codebase you are using, and any specifics of your environment and setup that may be relevant.
**For Requesting New Tasks**:
- Provide a 1-2 sentence description of what the task is and what it evaluates.
- Provide a link to the paper introducing the task.
- Provide a link to where the dataset can be found.
- Provide a link to a paper containing results on an open-source model on the task, for use in comparisons and implementation validation.
- If applicable, link to any codebase that has implemented the task (especially the original publication's codebase, if existent).
## How Can I Get Involved?
To quickly get started, we maintain a list of good first issues, which can be found [on our project board](https://github.com/orgs/EleutherAI/projects/25/views/8) or by [filtering GH Issues](https://github.com/EleutherAI/lm-evaluation-harness/issues?q=is%3Aopen+label%3A%22good+first+issue%22+label%3A%22help+wanted%22). These are typically smaller code changes or self-contained features which can be added without extensive familiarity with library internals, and we recommend new contributors consider taking a stab at one of these first if they are feeling uncertain where to begin.
There are a number of distinct ways to contribute to LM Evaluation Harness, and all are extremely helpful! A sampling of ways to contribute include:
- **Implementing and verifying new evaluation tasks**: Is there a task you'd like to see LM Evaluation Harness support? Consider opening an issue requesting it, or helping add it! Verifying and cross-checking task implementations with their original versions is also a very valuable form of assistance in ensuring standardized evaluation.
- **Improving documentation** - Improvements to the documentation, or noting pain points / gaps in documentation, are helpful in order for us to improve the user experience of the library and clarity + coverage of documentation.
- **Testing and devops** - We are very grateful for any assistance in adding tests for the library that can be run for new PRs, and other devops workflows.
- **Adding new modeling / inference library integrations** - We hope to support a broad range of commonly-used inference libraries popular among the community, and welcome PRs for new integrations, so long as they are documented properly and maintainable.
- **Proposing or Contributing New Features** - We want LM Evaluation Harness to support a broad range of evaluation usecases. If you have a feature that is not currently supported but desired, feel free to open an issue describing the feature and, if applicable, how you intend to implement it. We would be happy to give feedback on the cleanest way to implement new functionalities and are happy to coordinate with interested contributors via GH discussions or via discord.
We hope that this has been helpful, and appreciate your interest in contributing! Further questions can be directed to [our Discord](https://discord.gg/eleutherai).
@@ -44,6 +44,8 @@ This mode supports a number of command-line arguments, the details of which can
* `--include_path` : Accepts a path to a folder. If passed, then all YAML files containing `lm-eval` compatible task configurations will be added to the task registry as available tasks. Used for when one is writing config files for their own task in a folder other than `lm_eval/tasks/`
* `--predict_only`: Generates the model outputs without computing metrics. Use with `--log_samples` to retrieve decoded results.
## External Library Usage
We also support using the library's external API for use within model training loops or other scripts.
...
@@ -256,7 +256,7 @@ metric_list:
```
`aggregation` and `higher_is_better` can optionally be left out to default to the manually-set defaults if using a natively supported metric, otherwise it must be defined explicitly (for example, when using a custom metric implemented as a function).
For a full list of natively supported metrics and aggregation functions see [`docs/task_guide.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md). All metrics supported in [HuggingFace Evaluate](https://github.com/huggingface/evaluate/tree/main/metrics) can also be used, and will be loaded if a given metric name is not one natively supported in `lm-eval` or `hf_evaluate` is set to `true`.
### Optional, More Advanced Setup
@@ -269,7 +269,7 @@ As a heuristic check:
* Do you expect to compute metrics after applying multiple such processing steps on your model outputs?
* Does your task rely on metrics that need a custom implementation?
For more detail on the task system and advanced features, see [`docs/task_guide.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md). If none of the above sound like they apply to your task, it's time to continue onto checking your task performance!
### Task name + groups (registering a task)
...
@@ -143,6 +143,13 @@ def parse_eval_args() -> argparse.Namespace:
metavar="CRITICAL|ERROR|WARNING|INFO|DEBUG",
help="Controls the reported logging error level. Set to DEBUG when testing + adding new task configurations for comprehensive log output.",
)
parser.add_argument(
"--predict_only",
"-x",
action="store_true",
default=False,
help="Use with --log_samples. Only model outputs will be saved and metrics will not be evaluated.",
)
return parser.parse_args()
@@ -156,6 +163,11 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
eval_logger.info(f"Verbosity set to {args.verbosity}")
os.environ["TOKENIZERS_PARALLELISM"] = "false"
if args.predict_only:
args.log_samples = True
if (args.log_samples or args.predict_only) and not args.output_path:
assert args.output_path, "Specify --output_path"
initialize_tasks(args.verbosity)
if args.limit:
@@ -171,7 +183,7 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
task_names = ALL_TASKS
elif args.tasks == "list":
eval_logger.info(
f"Available Tasks:\n - {(os.linesep + ' - ').join(sorted(ALL_TASKS))}"
)
sys.exit()
else:
@@ -223,8 +235,6 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
else:
path.mkdir(parents=True, exist_ok=True)
output_path_file = path.joinpath("results.json")
elif args.log_samples and not args.output_path:
assert args.output_path, "Specify --output_path"
eval_logger.info(f"Selected Tasks: {task_names}") eval_logger.info(f"Selected Tasks: {task_names}")
...@@ -243,6 +253,7 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None: ...@@ -243,6 +253,7 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
write_out=args.write_out, write_out=args.write_out,
log_samples=args.log_samples, log_samples=args.log_samples,
gen_kwargs=args.gen_kwargs, gen_kwargs=args.gen_kwargs,
predict_only=args.predict_only,
)
if results is not None:
@@ -257,7 +268,7 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
batch_sizes = ",".join(map(str, results["config"]["batch_sizes"]))
if args.output_path:
output_path_file.open("w", encoding="utf-8").write(dumped)
if args.log_samples:
for task_name, config in results["configs"].items():
...
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Iterable, List, Union
from datasets import Dataset
from lm_eval.api.instance import Instance
class Filter(ABC):
"""
Filter classes operate on a per-task level.
They take all model outputs (`instance.resps` for all `task.instances`)
@@ -15,12 +14,13 @@ class Filter:
"""
def __init__(self, **kwargs) -> None:
"""
Can define custom behavior here, if an individual instantiation of a Filter class should have state.
"""
@abstractmethod
def apply(self, resps: Union[List, Iterable], docs: List[dict]) -> Iterable:
""" """
Defines the operation to perform on a list of the `inst.resps` properties of `Instance` objects. Defines the operation to perform on a list of the `inst.resps` properties of `Instance` objects.
Should return the list of (filtered) response lists *in the same order as they were input*, e.g. Should return the list of (filtered) response lists *in the same order as they were input*, e.g.
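As a concrete illustration of this interface, a hypothetical subclass (not part of this commit) only needs to preserve the ordering and nesting of `resps`:

```python
# Hypothetical Filter subclass sketching the abstract interface above.
from lm_eval.api.filter import Filter


class StripWhitespaceFilter(Filter):
    """Strip leading/trailing whitespace from every model response."""

    def apply(self, resps, docs):
        # resps holds one list of responses per doc; keep order and nesting intact
        return [[r.strip() for r in doc_resps] for doc_resps in resps]
```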
@@ -40,15 +40,15 @@ class FilterEnsemble:
"""
name: str
filters: List[Callable[[], Filter]]
def apply(self, instances: List[Instance]) -> None:
resps, docs = zip(*((inst.resps, inst.doc) for inst in instances))
resps, docs = list(resps), list(docs)
def apply(self, instances: List[Instance], docs: List[Dataset]) -> None:
resps = [
inst.resps for inst in instances
] # operate just on the model responses
for f in self.filters:
# apply filters in sequence
resps = f().apply(resps, docs)
# add the end results after filtering to filtered_requests of their respective source instances.
# has key `self.name`: each FilterEnsemble applied in a given run should use a different name.
...
@@ -4,7 +4,12 @@ from typing import Literal, Tuple
@dataclass
class Instance:
request_type: Literal[
"loglikelihood",
"loglikelihood_rolling",
"generate_until",
"multiple_choice",
]
doc: dict
arguments: tuple
idx: int
...
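A hedged construction sketch for the dataclass above (field values are invented; fields elided from the hunk keep their defaults):

```python
# Illustrative Instance for a loglikelihood request; doc and arguments are made up.
from lm_eval.api.instance import Instance

inst = Instance(
    request_type="loglikelihood",
    doc={"question": "2+2=", "choices": ["3", "4"], "gold": 1},
    arguments=("2+2=", " 4"),  # (context, continuation) pair scored by the LM
    idx=0,
)
```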
@@ -15,6 +15,11 @@ eval_logger = logging.getLogger("lm-eval")
# Register Aggregations First
@register_aggregation("bypass")
def bypass_agg(arr):
return 999
@register_aggregation("mean") @register_aggregation("mean")
def mean(arr): def mean(arr):
return sum(arr) / len(arr) return sum(arr) / len(arr)
...@@ -207,6 +212,16 @@ def mean_stderr(arr): ...@@ -207,6 +212,16 @@ def mean_stderr(arr):
return sample_stddev(arr) / math.sqrt(len(arr)) return sample_stddev(arr) / math.sqrt(len(arr))
@register_metric(
metric="bypass",
higher_is_better=True,
output_type=["loglikelihood", "multiple_choice", "generate_until"],
aggregation="bypass",
)
def bypass(items):
return None
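The same decorator pattern registers ordinary scoring metrics as well. The sketch below is purely illustrative: the metric and aggregation names are invented, and the per-sample `(gold, prediction)` calling convention is an assumption about how non-HF metrics are invoked.

```python
# Illustrative sketch only; not part of this commit.
from lm_eval.api.registry import register_aggregation, register_metric


@register_aggregation("fraction_true")
def fraction_true(arr):
    # aggregate per-sample 0/1 scores into a single value
    return sum(arr) / len(arr)


@register_metric(
    metric="starts_with_gold",   # invented metric name
    higher_is_better=True,
    output_type=["generate_until"],
    aggregation="fraction_true",
)
def starts_with_gold(items):
    gold, prediction = items     # assumed per-sample calling convention
    return float(str(prediction).strip().startswith(str(gold).strip()))
```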
@register_metric(
metric="mcc",
higher_is_better=True,
...
@@ -152,18 +152,14 @@ def get_aggregation(name):
try:
return AGGREGATION_REGISTRY[name]
except KeyError:
eval_logger.warning(f"{name} not a registered aggregation metric!")
"{} not a registered aggregation metric!".format(name),
)
def get_metric_aggregation(name):
try:
return METRIC_AGGREGATION_REGISTRY[name]
except KeyError:
eval_logger.warning(f"{name} metric is not assigned a default aggregation!")
"{} metric is not assigned a default aggregation!".format(name),
)
def is_higher_better(metric_name):
...
@@ -5,6 +5,7 @@ import random
import re
from collections.abc import Callable
from dataclasses import asdict, dataclass
from inspect import getsource
from typing import Any, List, Literal, Tuple, Union
import datasets
@@ -37,7 +38,6 @@ ALL_OUTPUT_TYPES = [
"generate_until",
]
eval_logger = logging.getLogger("lm-eval")
@@ -74,7 +74,12 @@ class TaskConfig(dict):
num_fewshot: int = None
# scoring options
metric_list: list = None
output_type: Literal[
"loglikelihood",
"loglikelihood_rolling",
"generate_until",
"multiple_choice",
] = "generate_until"
generation_kwargs: dict = None
repeats: int = 1
filter_list: Union[str, list] = None
@@ -110,15 +115,13 @@ class TaskConfig(dict):
"do_sample": False,
}
# TODO: how to make TaskConfigs be de- and re-serializable, even when using the !function constructor?
def __getitem__(self, item):
return getattr(self, item)
def __setitem__(self, item, value):
return setattr(self, item, value)
def to_dict(self, keep_callable: bool = False) -> dict:
"""dumps the current config as a dictionary object, as a printable format.
null fields will not be printed.
Used for dumping results alongside full task configuration
@@ -133,14 +136,34 @@ class TaskConfig(dict):
for k, v in list(cfg_dict.items()):
if v is None:
cfg_dict.pop(k)
elif k == "metric_list":
for metric_dict in v:
for metric_key, metric_value in metric_dict.items():
if callable(metric_value):
metric_dict[metric_key] = self.serialize_function(
metric_value, keep_callable=keep_callable
)
cfg_dict[k] = v
elif callable(v):
cfg_dict[k] = self.serialize_function(v, keep_callable=keep_callable)
return cfg_dict
def serialize_function(
self, value: Union[Callable, str], keep_callable=False
) -> Union[Callable, str]:
"""Serializes a given function or string.
If 'keep_callable' is True, the original callable is returned.
Otherwise, attempts to return the source code of the callable using 'getsource'.
"""
if keep_callable:
return value
else:
try:
return getsource(value)
except (TypeError, OSError):
return str(value)
class Task(abc.ABC):
"""A task represents an entire benchmark including its dataset, problems,
@@ -490,7 +513,7 @@ class Task(abc.ABC):
def apply_filters(self):
if hasattr(self, "_filters"):
for f in self._filters:
f.apply(self._instances)
else:
eval_logger.warning("No filter defined, passing through instances")
return self._instances
@@ -626,16 +649,15 @@ class ConfigurableTask(Task):
if self.config.filter_list is not None:
self._filters = []
for filter_config in self.config.filter_list:
filter_name = filter_config["name"]
filter_functions = filter_config["filter"]
components = []
for function in filter_functions:
kwargs = {
key: function[key] for key in function if key != "function"
}
components.append([function["function"], kwargs])
filter_pipeline = build_filter_ensemble(filter_name, components)
self._filters.append(filter_pipeline)
else:
self._filters = [build_filter_ensemble("none", [["take_first", None]])]
@@ -813,7 +835,7 @@ class ConfigurableTask(Task):
def apply_filters(self):
if hasattr(self, "_filters"):
for f in self._filters:
f.apply(self._instances)
else:
eval_logger.warning("No filter defined, passing through instances")
return self._instances
@@ -1191,12 +1213,46 @@ class ConfigurableTask(Task):
return result_dict
def aggregation(self) -> dict:
return self._aggregation_list
def higher_is_better(self) -> dict:
return self._higher_is_better
def get_config(self, key: str) -> Any:
return getattr(self._config, key, None)
def override_metric(self, metric_name: str) -> None:
"""
Override the default metrics used for evaluation with custom metrics.
Parameters:
- metric_name (str): The name of the custom metric to override. Should be registered in api.metrics.
"""
(
self._metric_fn_list,
self._aggregation_list,
self._metric_fn_kwargs,
self._higher_is_better,
) = ({}, {}, {}, {})
self._metric_fn_list[metric_name] = get_metric(metric_name)
self._aggregation_list[metric_name] = get_metric_aggregation(metric_name)
self._higher_is_better[metric_name] = is_higher_better(metric_name)
self._metric_fn_kwargs[metric_name] = {}
setattr(self._config, "metric_list", [{"metric": metric_name}])
setattr(self._config, "process_results", None)
def override_config(
self, key: str = None, value: Any = None, update: bool = False
) -> None:
if update:
current_value = getattr(self._config, key)
assert isinstance(current_value, dict)
current_value.update(value)
setattr(self._config, key, current_value)
else:
setattr(self._config, key, value)
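A hedged usage sketch of these new hooks, mirroring how the evaluator changes further below apply them (the task name is a placeholder, and grouped tasks may be returned as `(group, task)` tuples that need unpacking first):

```python
# Hedged sketch: fetch a registered task and adjust it via the override hooks.
import lm_eval.tasks as tasks

tasks.initialize_tasks()
task_obj = tasks.get_task_dict(["gsm8k"])["gsm8k"]  # placeholder task name

task_obj.override_metric(metric_name="bypass")      # keep outputs, score nothing
task_obj.override_config(key="num_fewshot", value=5)
task_obj.override_config(
    key="generation_kwargs",
    value={"do_sample": False},
    update=True,  # merge into the existing generation_kwargs dict
)
```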
class MultipleChoiceTask(Task):
OUTPUT_TYPE: str = "loglikelihood"
...
@@ -30,7 +30,9 @@ class Archive:
self.cctx = zstandard.ZstdCompressor(level=compression_level)
self.compressor = self.cctx.stream_writer(self.fh)
def add_data(self, data, meta=None) -> None:
if meta is None:
meta = {}
self.compressor.write(
json.dumps({"text": data, "meta": meta}, default=json_serial).encode(
"UTF-8"
@@ -108,7 +110,7 @@ class TextReader:
def read_tqdm(self, update_frequency: int = 10000):
current_file_position = 0
line_counter = 0
with open(self.file_path, "r", encoding="utf-8") as fh, tqdm.tqdm(
total=os.path.getsize(self.file_path),
dynamic_ncols=True,
unit="byte",
...
@@ -38,7 +38,7 @@ def get_train_overlap(docs_by_task_set: dict, ngrams_path: str, limit: int) -> dict:
# return get_train_overlap_stub(docs, ngrams_path, ngrams_n_size)
info_dict_path = os.path.join(ngrams_path, "info.json")
info_dict = json.load(open(info_dict_path, "r", encoding="utf-8"))
ngrams_n_size = info_dict["ngram_size"]
janitor = Janitor()
...
@@ -25,7 +25,7 @@ from lm_eval.utils import (
def simple_evaluate(
model,
model_args=None,
tasks=None,
num_fewshot=None,
batch_size=None,
max_batch_size=None,
@@ -38,6 +38,7 @@ def simple_evaluate(
write_out: bool = False,
log_samples: bool = True,
gen_kwargs: str = None,
predict_only: bool = False,
):
"""Instantiate and evaluate a model on a list of tasks.
@@ -71,6 +72,9 @@ def simple_evaluate(
:param gen_kwargs: str
String arguments for model generation
Ignored for all tasks with loglikelihood output_type
:param predict_only: bool
If true only model outputs will be generated and returned. Metrics will not be evaluated
:return
Dictionary of results
"""
@@ -80,6 +84,8 @@ def simple_evaluate(
1234
) # TODO: this may affect training runs that are run with evaluation mid-run.
if tasks is None:
tasks = []
assert (
tasks != []
), "No tasks specified, or no tasks found. Please verify the task names."
@@ -87,7 +93,7 @@ def simple_evaluate(
if gen_kwargs is not None:
gen_kwargs = simple_parse_args_string(gen_kwargs)
eval_logger.warning(
"generation_kwargs specified through cli, these settings will update set parameters in yaml tasks. Ensure 'do_sample=True' for non-greedy decoding!"
)
if gen_kwargs == "":
gen_kwargs = None
@@ -122,27 +128,35 @@ def simple_evaluate(
task_dict = lm_eval.tasks.get_task_dict(tasks)
for task_name in task_dict.keys():
task_obj = task_dict[task_name]
if isinstance(task_obj, tuple):
_, task_obj = task_obj
if task_obj is None:
continue
if task_obj.get_config("output_type") == "generate_until":
if gen_kwargs is not None:
task_obj.override_config(
key="generation_kwargs", value=gen_kwargs, update=True
)
if predict_only:
log_samples = True
eval_logger.info(
f"Processing {task_name} in output-only mode. Metrics will not be calculated!"
)
# we have to change the class properties post-hoc. This is pretty hacky.
task_obj.override_metric(metric_name="bypass")
if num_fewshot is not None:
if (default_num_fewshot := task_obj.get_config("num_fewshot")) == 0:
eval_logger.info(
f"num_fewshot has been set to 0 for {task_name} in its config. Manual configuration will be ignored."
)
else:
default_num_fewshot = config["num_fewshot"]
eval_logger.warning(
f"Overwriting default num_fewshot of {task_name} from {default_num_fewshot} to {num_fewshot}"
)
task_obj.override_config(key="num_fewshot", value=num_fewshot)
task_obj._config["num_fewshot"] = num_fewshot
if check_integrity:
run_task_tests(task_list=tasks)
@@ -218,6 +232,14 @@ def evaluate(
# decontaminate = decontamination_ngrams_path is not None
for task_name, task in task_dict.items():
if isinstance(task, tuple):
_, task = task
if not log_samples:
assert (
"bypass" not in getattr(task, "_metric_fn_list", {}).keys()
), f"log_samples must be True for 'bypass' only tasks: {task_name}"
# stores the final result for each task, for each metric/filter pair.
results = collections.defaultdict(dict)
# Tracks each task's version.
@@ -242,7 +264,7 @@ def evaluate(
# get lists of each type of request
for task_name, task in task_dict.items():
if isinstance(task, tuple):
group_name, task = task
task_hierarchy[group_name].append(task_name)
versions[group_name] = "N/A"
@@ -316,7 +338,7 @@ def evaluate(
### Run LM on inputs, get all outputs ###
# execute each type of request
for reqtype, reqs in requests.items():
eval_logger.info(f"Running {reqtype} requests")
# create `K` copies of each request `req` based off `K = req.repeats`
cloned_reqs = []
for req in reqs:
@@ -339,7 +361,7 @@ def evaluate(
### Postprocess outputs ###
# TODO: del model here, maybe (idea: allow user to specify device of e.g. reward model separately)
for task_name, task in task_dict.items():
if isinstance(task, tuple):
group, task = task
if task is None:
continue
@@ -350,7 +372,7 @@ def evaluate(
# unpack results and sort back in order and return control to Task
for task_name, task in task_dict.items():
if isinstance(task, tuple):
group, task = task
if task is None:
continue
@@ -401,7 +423,7 @@ def evaluate(
vals_torch = collections.defaultdict(list)
for (task_name, key, metric), items in vals.items():
numitem = 0
if isinstance(items[0], tuple):
numitem = len(items[0])
if isinstance(items[0], (str, list, tuple)):
@@ -447,7 +469,7 @@ def evaluate(
task = task_dict[task_name]
metric_key = metric + "," + key
if isinstance(task, tuple):
group_name, task = task
else:
group_name = None
@@ -474,7 +496,8 @@ def evaluate(
if bool(results):
for group, task_list in reversed(task_hierarchy.items()):
if task_list == []:
# TODO: No samples when bypass
total_size = results[group].get("samples", 999)
else:
total_size = 0
@@ -510,7 +533,7 @@ def evaluate(
+ metric_score * current_size
) / (total_size + current_size)
# $$s_z^2 = \frac{(n-1) s_x^2 + (m-1) s_y^2}{n+m-1} + \frac{nm(\bar x - \bar y)^2}{(n+m)(n+m-1)}.$$
if var_score == "N/A" or results[group][stderr] == "N/A":
results[group][stderr] = "N/A"
else:
results[group][stderr] = (
...
from typing import List, Union
from functools import partial
from lm_eval.api.filter import FilterEnsemble
from . import selection
from . import extraction
@@ -20,24 +23,25 @@ FILTER_REGISTRY = {
}
def get_filter(filter_name: str) -> Union[type, str]:
if filter_name in FILTER_REGISTRY:
return FILTER_REGISTRY[filter_name]
else:
return filter_name
def build_filter_ensemble(
filter_name: str, components: List[List[str]]
) -> FilterEnsemble:
""" """
Create a filtering pipeline. Create a filtering pipeline.
""" """
filters = [] filters = []
for function, kwargs in components: for function, kwargs in components:
if kwargs is None: if kwargs is None:
f = get_filter(function)() kwargs = {}
else: # create a filter given its name in the registry
# create a filter given its name in the registry f = partial(get_filter(function), **kwargs)
f = get_filter(function)(**kwargs) # TODO: pass kwargs to filters properly
# add the filter as a pipeline step # add the filter as a pipeline step
filters.append(f) filters.append(f)
......
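A hedged usage sketch of the refactored helper: component kwargs of `None` are treated as an empty dict, and each filter is now built lazily via `functools.partial` and instantiated inside `FilterEnsemble.apply`.

```python
# Sketch: build a named filter pipeline from registry names plus kwargs.
from lm_eval.filters import build_filter_ensemble

ensemble = build_filter_ensemble("take_first_resp", [["take_first", None]])
# ensemble.apply(task_instances) would store the filtered responses on each
# instance under instance.filtered_resps["take_first_resp"].
```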
@@ -17,12 +17,14 @@ class TakeFirstFilter(Filter):
class TakeKFilter(Filter):
def __init__(self, **kwargs) -> None:
self.k = kwargs.pop("k")
super().__init__(**kwargs)
def apply(self, resps, docs):
# need resp to be subscriptable to check below
resps = list(resps)
# check we have at least k responses per doc, else we can't take the first k
assert (
len(resps[0]) >= self.k
...
@@ -24,7 +24,7 @@ class UppercaseFilter(Filter):
class MapFilter(Filter):
def __init__(self, mapping_dict: dict = None, default_value=None) -> None:
"""
Initializes the MapFilter with a given mapping dictionary and default value.
@@ -37,6 +37,8 @@ class MapFilter(Filter):
Example:
mapper = MapFilter({'A': 1, 'B': 2}, default_value=0)
"""
if mapping_dict is None:
mapping_dict = {}
assert isinstance(
mapping_dict, dict
), "Provided mapping_dict is not a dictionary"
...
@@ -6,5 +6,6 @@ from . import anthropic_llms
from . import gguf
from . import vllm_causallms
from . import mamba_lm
from . import optimum_lm
# TODO: implement __all__ # TODO: implement __all__
@@ -200,8 +200,9 @@ class HFLM(LM):
)
# access self._model through self.model property outside this method
if isinstance(self.model, torch.nn.Module):
self.model.eval()
self.model.tie_weights()
if isinstance(pretrained, str) and (gpus >= 1 or str(self.device) == "mps"):
# TODO: can remove this whole snippet except in the mps case, perhaps?
@@ -238,6 +239,16 @@ class HFLM(LM):
if self.config.model_type == "qwen":
# Qwen's trust_remote_code tokenizer does not allow for adding special tokens
self.tokenizer.pad_token = "<|endoftext|>"
elif (
self.tokenizer.__class__.__name__ == "RWKVWorldTokenizer"
or self.tokenizer.__class__.__name__ == "Rwkv5Tokenizer"
):
# The RWKV world tokenizer does not allow for adding special tokens / setting the pad token (which is set as 0)
# The additional tokenizer name check is needed, as there exists rwkv4 models with neox tokenizer
# ---
# Note that the world tokenizer class name might change in the future for the final huggingface merge
# https://github.com/huggingface/transformers/pull/26963
assert self.tokenizer.pad_token_id == 0
else:
self.tokenizer.add_special_tokens({"pad_token": "<|pad|>"})
@@ -602,8 +613,7 @@ class HFLM(LM):
(batch_size, max_length), device=self.device
).long()
for _ in range(5):
out = F.log_softmax(self._model_call(test_batch, **call_kwargs), dim=-1)  # noqa: F841
out = out # Identity process so that it passes pre-commit
return batch_size
@@ -705,10 +715,14 @@ class HFLM(LM):
return self.model(inps).logits
def _model_generate(self, context, max_length, stop, **generation_kwargs):
# temperature = 0.0 if not set
# if do_sample is false and temp==0.0:
# remove temperature, as do_sample=False takes care of this
# and we don't want a warning from HF
generation_kwargs["temperature"] = generation_kwargs.get("temperature", 0.0)
do_sample = generation_kwargs.get("do_sample", None)
if do_sample is False and generation_kwargs.get("temperature") == 0.0:
generation_kwargs.pop("temperature")
# build stopping criteria
stopping_criteria = stop_sequences_criteria(
self.tokenizer, stop, context.shape[1], context.shape[0]
...