Unverified commit cda25fef authored by Lintang Sutawika, committed by GitHub

Merge branch 'main' into standardize_metrics

parents dfb41835 4d10ad56
......@@ -56,7 +56,7 @@ jobs:
if: steps.changed-tasks.outputs.tasks_any_modified == 'true' || steps.changed-tasks.outputs.api_any_modified == 'true'
run: |
python -m pip install --upgrade pip
pip install -e '.[testing]' --extra-index-url https://download.pytorch.org/whl/cpu
pip install -e '.[dev]' --extra-index-url https://download.pytorch.org/whl/cpu
# Install optional git dependencies
# pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt
# if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
......
......@@ -17,29 +17,22 @@ jobs:
linter:
name: Linters
runs-on: ubuntu-latest
timeout-minutes: 20
timeout-minutes: 5
steps:
- name: Checkout Code
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Set up Python 3.8
uses: actions/setup-python@v4
uses: actions/setup-python@v5
with:
python-version: 3.8
cache: pip
cache-dependency-path: setup.py
- name: Install dependencies
run: pip install -e '.[linting,testing]' --extra-index-url https://download.pytorch.org/whl/cpu ; export SKIP=no-commit-to-branch # env var deactivates --no-commit-to-branch
cache-dependency-path: pyproject.toml
- name: Pre-Commit
env:
SKIP: "no-commit-to-branch,mypy"
uses: pre-commit/action@v3.0.0
- name: Lint with pylint
run: python -m pylint --disable=all -e W0311 --jobs=0 --indent-string=' ' **/*.py
- name: Lint with flake8
run: |
# stop the build if there are Python syntax errors or undefined names
flake8 . --count --select=F,E9,E71,E72,E501,E112,E113,W6 --extend-ignore=F541 --show-source --statistics --exit-zero
# exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
# # mypy turned off for now
# - name: Lint with mypy
# run: mypy . --ignore-missing-imports --check-untyped-defs --explicit-package-bases --warn-unreachable
......@@ -53,17 +46,17 @@ jobs:
timeout-minutes: 30
steps:
- name: Checkout Code
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
cache: pip
cache-dependency-path: setup.py
cache-dependency-path: pyproject.toml
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -e '.[testing,anthropic,sentencepiece]' --extra-index-url https://download.pytorch.org/whl/cpu
pip install -e '.[dev,anthropic,sentencepiece]' --extra-index-url https://download.pytorch.org/whl/cpu
# Install optional git dependencies
# pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt
# if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
......
......@@ -27,14 +27,16 @@ repos:
args: [--remove]
- id: mixed-line-ending
args: [--fix=lf]
- repo: https://github.com/pycqa/flake8
rev: 3.7.9
- repo: https://github.com/astral-sh/ruff-pre-commit
# Ruff version.
rev: v0.1.8
hooks:
- id: flake8
- repo: https://github.com/psf/black
rev: 22.3.0
hooks:
- id: black
# Run the linter.
- id: ruff
args:
- --fix
# Run the formatter.
- id: ruff-format
- repo: https://github.com/codespell-project/codespell
rev: v2.1.0
hooks:
......
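The pre-commit hunk above replaces the flake8 and black hooks with Ruff. As a hedged sketch (the command names come from Ruff's standard CLI, not from this diff), contributors could reproduce the same checks locally with:

```bash
# Install the Ruff version pinned in .pre-commit-config.yaml
pip install ruff==0.1.8

# Lint with autofixes, mirroring the `ruff` pre-commit hook
ruff check --fix .

# Format the codebase, mirroring the `ruff-format` hook
ruff format .
```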
......@@ -18,7 +18,7 @@ New updates and features include:
Please see our updated documentation pages in `docs/` for more details.
Development will be continuing on the `main` branch, and we encourage you to give us feedback on what features are desired and how to improve the library further, or ask questions, either in issues or PRs on GitHub, or in the [EleutherAI discord](discord.gg/eleutherai)!
Development will be continuing on the `main` branch, and we encourage you to give us feedback on what features are desired and how to improve the library further, or ask questions, either in issues or PRs on GitHub, or in the [EleutherAI discord](https://discord.gg/eleutherai)!
## Overview
......@@ -49,23 +49,28 @@ pip install -e .
We also provide a number of optional dependencies for extended functionality. Extras can be installed via `pip install -e ".[NAME]"`
| Name | Use |
| ------------- | ------------------------------------- |
|---------------|---------------------------------------|
| anthropic | For using Anthropic's models |
| dev | You probably don't want to use this |
| dev | For linting PRs and contributions |
| gptq | For loading models with GPTQ |
| testing | You probably don't want to use this |
| ifeval | For running the IFEval task |
| mamba | For loading Mamba SSM models |
| math | For running math task answer checking |
| multilingual | For multilingual tokenizers |
| openai | For using OpenAI's models |
| promptsource | For using PromtSource prompts |
| promptsource | For using PromptSource prompts |
| sentencepiece | For using the sentencepiece tokenizer |
| testing | For running library test suite |
| vllm | For loading models with vLLM |
| all | Loads all extras |
| zeno | For visualizing results with Zeno |
|---------------|---------------------------------------|
| all | Loads all extras (not recommended) |
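For example, to pick up the dependencies for the vLLM backend together with the test suite (extra names taken from the table above, using standard pip extras syntax):

```bash
# Editable install with the vllm and testing extras
pip install -e ".[vllm,testing]"
```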
## Basic Usage
### Hugging Face `transformers`
To evaluate a model hosted on the [HuggingFace Hub](https://huggingface.co/models) (e.g. GPT-J-6B) on `hellaswag` you can use the following command:
To evaluate a model hosted on the [HuggingFace Hub](https://huggingface.co/models) (e.g. GPT-J-6B) on `hellaswag` you can use the following command (this assumes you are using a CUDA-compatible GPU):
```bash
lm_eval --model hf \
......@@ -151,23 +156,28 @@ To call a hosted model, use:
```bash
export OPENAI_API_KEY=YOUR_KEY_HERE
lm_eval --model openai-completions \
--model_args engine=davinci \
--model_args model=davinci \
--tasks lambada_openai,hellaswag
```
Note that for externally hosted models, configs such as `--device` and `--batch_size` should not be used and do not function. Just like you can use `--model_args` to pass arbitrary arguments to the model constructor for local models, you can use it to pass arbitrary arguments to the model API for hosted models. See the documentation of the hosting service for information on what arguments they support.
We also support using your own local inference server that implements the OpenAI ChatCompletions endpoint, and passing trained HuggingFace artifacts and tokenizers.
```bash
lm_eval --model local-chat-completions --tasks gsm8k --model_args model=facebook/opt-125m,base_url=http://{yourip}:8000/v1
```
Note that for externally hosted models, configs such as `--device` and `--batch_size` should not be used and do not function. Just like you can use `--model_args` to pass arbitrary arguments to the model constructor for local models, you can use it to pass arbitrary arguments to the model API for hosted models. See the documentation of the hosting service for information on what arguments they support.
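Hedged sketch only: the `temperature` argument below is a hypothetical pass-through shown for illustration; check the hosting service's documentation for which parameters it actually accepts.

```bash
# Pass an extra API parameter alongside the model name via --model_args
lm_eval --model openai-completions \
    --model_args model=davinci,temperature=0 \
    --tasks lambada_openai
```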
| API or Inference Server | Implemented? | `--model <xxx>` name | Models supported: | Request Types: |
|-----------------------------|---------------------------------|--------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|----------------------------------------------------------|
|---------------------------------------------------------------------------------------------------------------------------|---------------------------------|---------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|------------------------------------------------------------|
| OpenAI Completions | :heavy_check_mark: | `openai-completions` | up to `code-davinci-002` | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| OpenAI ChatCompletions | :x: Not yet - needs testing! | N/A | [All ChatCompletions API models](https://platform.openai.com/docs/guides/gpt) | `generate_until` (no logprobs) |
| OpenAI ChatCompletions | :heavy_check_mark: | `openai-chat-completions`, `local-chat-completions` | [All ChatCompletions API models](https://platform.openai.com/docs/guides/gpt) | `generate_until` (no logprobs) |
| Anthropic | :heavy_check_mark: | `anthropic` | [Supported Anthropic Engines](https://docs.anthropic.com/claude/reference/selecting-a-model) | `generate_until` (no logprobs) |
| Textsynth | :heavy_check_mark: | `textsynth` | [All supported engines](https://textsynth.com/documentation.html#engines) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Cohere | [:hourglass: - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| [Llama.cpp](https://github.com/ggerganov/llama.cpp) (via [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)) | :heavy_check_mark: | `gguf`, `ggml` | [All models supported by llama.cpp](https://github.com/ggerganov/llama.cpp) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| [Llama.cpp](https://github.com/ggerganov/llama.cpp) (via [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)) | :heavy_check_mark: | `gguf`, `ggml` | [All models supported by llama.cpp](https://github.com/ggerganov/llama.cpp) | `generate_until`, `loglikelihood`, (perplexity evaluation not yet implemented) |
| vLLM | :heavy_check_mark: | `vllm` | [Most HF Causal Language Models](https://docs.vllm.ai/en/latest/models/supported_models.html) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Your inference server here! | ... | ... | ... | ... |
| Mamba | :heavy_check_mark: | `mamba_ssm` | [Mamba architecture Language Models via the `mamba_ssm` package](https://huggingface.co/state-spaces) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Your local inference server! | :heavy_check_mark: | `local-chat-completions` (using `openai-chat-completions` model type) | Any server address that accepts POST requests using HF models and mirrors OpenAI's ChatCompletions interface | `generate_until` |
It is on our roadmap to create task variants designed to enable comparing models that do not serve logprobs/loglikelihoods against the generation performance of open-source models.
......@@ -225,9 +235,50 @@ Additionally, one can provide a directory with `--use_cache` to cache the result
For a full list of supported arguments, check out the [interface](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md) guide in our documentation!
## Visualizing Results
You can use [Zeno](https://zenoml.com) to visualize the results of your eval harness runs.
First, head to [hub.zenoml.com](https://hub.zenoml.com) to create an account and get an API key [on your account page](https://hub.zenoml.com/account).
Add this key as an environment variable:
```bash
export ZENO_API_KEY=[your api key]
```
You'll also need to install the `lm_eval[zeno]` package extra.
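A minimal install command for that extra (run from the repository root, assuming an editable checkout as in the installation section above):

```bash
pip install -e ".[zeno]"
```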
To visualize the results, run the eval harness with the `log_samples` and `output_path` flags.
We expect `output_path` to contain multiple folders that represent individual model names.
You can thus run your evaluation on any number of tasks and models and upload all of the results as projects on Zeno.
```bash
lm_eval \
--model hf \
--model_args pretrained=EleutherAI/gpt-j-6B \
--tasks hellaswag \
--device cuda:0 \
--batch_size 8 \
--log_samples \
--output_path output/gpt-j-6B
```
Then, you can upload the resulting data using the `zeno_visualize` script:
```bash
python scripts/zeno_visualize.py \
--data_path output \
--project_name "Eleuther Project"
```
This will use all subfolders in `data_path` as different models and upload all tasks within these model folders to Zeno.
If you run the eval harness on multiple tasks, the `project_name` will be used as a prefix and one project will be created per task.
You can find an example of this workflow in [examples/visualize-zeno.ipynb](examples/visualize-zeno.ipynb).
## How to Contribute or Learn More?
For more information on the library and how everything fits together, check out all of our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor/docs)! We plan to post a larger roadmap of desired + planned library improvements soon, with more information on how contributors can help.
For more information on the library and how everything fits together, check out all of our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs)! We plan to post a larger roadmap of desired + planned library improvements soon, with more information on how contributors can help.
### Implementing new tasks
......
......@@ -4,7 +4,7 @@ Welcome to the docs for the LM Evaluation Harness!
## Table of Contents
* To learn about the public interface of the library, as well as how to evaluate via the commandline or as integrated into an external library, see the [Interface](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/user_guide.md)
* To learn about the public interface of the library, as well as how to evaluate via the commandline or as integrated into an external library, see the [Interface](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/interface.md)
* To learn how to add a new library, API, or model type to the library, as well as a quick explainer on the types of ways to evaluate an LM, see the [Model Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/model_guide.md).
* For a crash course on adding new tasks to the library, see our [New Task Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/new_task_guide.md).
* To learn more about pushing the limits of task configuration that the Eval Harness supports, see the [Task Configuration Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/task_guide.md).
......@@ -315,6 +315,25 @@ python -m scripts.write_out \
Open the file specified at the `--output_base_path <path>` and ensure it passes
a simple eye test.
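As a hedged sketch of the full command (only `--output_base_path` is confirmed by the text above; the remaining flags are assumptions about `scripts/write_out.py`):

```bash
# Hypothetical invocation; only --output_base_path is referenced above.
python -m scripts.write_out \
    --output_base_path /tmp/write_out_check \
    --tasks <your_task_name> \
    --num_examples 5
```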
## Versioning
One key feature in LM Evaluation Harness is the ability to version tasks--that is, mark them with a specific version number that can be bumped whenever a breaking change is made.
This version info can be provided by adding the following to your new task config file:
```yaml
metadata:
version: 0
```
Now, whenever a change needs to be made to your task in the future, please increment the version number by 1 so that users can distinguish between task iterations.
If you are incrementing a task's version, please also consider adding a changelog to the task's README.md noting the date, PR number, the version you have updated to, and a one-line description of the change.
For example,
* \[Dec 25, 2023\] (PR #999) Version 0.0 -> 1.0: Fixed a bug with answer extraction that led to underestimated performance.
## Checking performance + equivalence
It's now time to check models' performance on your task! In the evaluation harness, we intend to support a wide range of evaluation tasks and setups, but prioritize the inclusion of already-proven benchmarks following the precise evaluation setups in the literature where possible.
......@@ -340,4 +359,4 @@ It is recommended to include a filled-out copy of this checklist in the README.m
## Submitting your task
You're all set! Now push your work and make a pull request to the `big-refactor` branch! Thanks for the contribution :). If there are any questions, please leave a message in the `#lm-thunderdome` channel on the EAI discord!
You're all set! Now push your work and make a pull request to the `main` branch! Thanks for the contribution :). If there are any questions, please leave a message in the `#lm-thunderdome` channel on the EAI discord!
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Visualizing Results in Zeno\n",
"\n",
"Benchmarking your models is the first step towards making sure your model performs well.\n",
"However, looking at the data behind the benchmark, slicing the data into subsets, and comparing models on individual instances can help you even more in evaluating and quantifying the behavior of your AI system.\n",
"\n",
"All of this can be done in [Zeno](https://zenoml.com)!\n",
"Zeno is super easy to use with the eval harness, let's explore how you can easily upload and visualize your eval results.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install this project if you did not already do that. This is all that needs to be installed for you to be able to visualize your data in Zeno!\n",
"!pip install -e ..\n",
"!pip install -e ..[zeno]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Run the Eval Harness\n",
"\n",
"To visualize the results, run the eval harness with the `log_samples` and `output_path` flags. We expect `output_path` to contain multiple folders that represent individual model names. You can thus run your evaluation on any number of tasks and models and upload all of the results as projects on Zeno.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/gpt-neo-2.7B \\\n",
" --tasks hellaswag,wikitext \\\n",
" --batch_size 8 \\\n",
" --device mps \\\n",
" --log_samples \\\n",
" --output_path output/gpt-neo-2.7B \\\n",
" --limit 10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Set your API Key\n",
"\n",
"This is so you can be authenticated with Zeno.\n",
"If you don't already have a Zeno account, first create an account on [Zeno Hub](https://hub.zenoml.com).\n",
"After logging in to Zeno Hub, generate your API key by clicking on your profile at the bottom left to navigate to your account page.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%env ZENO_API_KEY=YOUR_API_KEY"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Visualize Eval Results\n",
"\n",
"You can now use the `zeno_visualize` script to upload the results to Zeno.\n",
"\n",
"This will use all subfolders in `data_path` as different models and upload all tasks within these model folders to Zeno. If you run the eval harness on multiple tasks, the `project_name` will be used as a prefix and one project will be created per task.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python ../scripts/zeno_visualize.py --data_path output --project_name \"Zeno Upload Test\""
]
}
],
"metadata": {
"kernelspec": {
"display_name": "zeno_projects",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
import argparse
import json
import logging
import os
import re
import sys
import json
import logging
import argparse
import numpy as np
from pathlib import Path
from typing import Union
import numpy as np
from lm_eval import evaluator, utils
from lm_eval.tasks import initialize_tasks, include_path
from lm_eval.api.registry import ALL_TASKS
from lm_eval.tasks import include_path, initialize_tasks
from lm_eval.utils import make_table
def _handle_non_serializable(o):
......@@ -170,7 +171,7 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
task_names = ALL_TASKS
elif args.tasks == "list":
eval_logger.info(
"Available Tasks:\n - {}".format(f"\n - ".join(sorted(ALL_TASKS)))
"Available Tasks:\n - {}".format("\n - ".join(sorted(ALL_TASKS)))
)
sys.exit()
else:
......@@ -271,9 +272,9 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
f"{args.model} ({args.model_args}), gen_kwargs: ({args.gen_kwargs}), limit: {args.limit}, num_fewshot: {args.num_fewshot}, "
f"batch_size: {args.batch_size}{f' ({batch_sizes})' if batch_sizes else ''}"
)
print(evaluator.make_table(results))
print(make_table(results))
if "groups" in results:
print(evaluator.make_table(results, "groups"))
print(make_table(results, "groups"))
if __name__ == "__main__":
......
from dataclasses import dataclass
from typing import List
from lm_eval.api.instance import Instance
from datasets import Dataset
from lm_eval.api.instance import Instance
class Filter:
"""
......@@ -42,7 +43,6 @@ class FilterEnsemble:
filters: List[Filter]
def apply(self, instances: List[Instance], docs: List[Dataset]) -> None:
resps = [
inst.resps for inst in instances
] # operate just on the model responses
......
......@@ -8,12 +8,13 @@ import numpy as np
import sacrebleu
import sklearn.metrics
from lm_eval.api.registry import register_metric, register_aggregation
from lm_eval.api.registry import register_aggregation, register_metric
eval_logger = logging.getLogger("lm-eval")
# Register Aggregations First
@register_aggregation("mean")
def mean(arr):
return sum(arr) / len(arr)
......
import abc
import hashlib
import json
import logging
import os
from typing import List, Optional, Tuple, Type, TypeVar
import torch
from typing import Union, List, Tuple, Optional, Type, TypeVar
from sqlitedict import SqliteDict
import json
import hashlib
from tqdm import tqdm
from lm_eval import utils
import logging
eval_logger = logging.getLogger("lm-eval")
......
......@@ -6,6 +6,7 @@ from functools import partial
from lm_eval.api.model import LM
eval_logger = logging.getLogger("lm-eval")
MODEL_REGISTRY = {}
......@@ -104,6 +105,7 @@ def register_metric(
if aggregation is not None:
METRIC_REGISTRY[_metric]["aggregation"] = aggregation
if higher_is_better is not None:
METRIC_REGISTRY[_metric]["higher_is_better"] = higher_is_better
......
......@@ -40,18 +40,18 @@ class ContextSampler:
self.doc_to_text(doc)
if (
self.config.doc_to_choice is None
or type(self.doc_to_text(doc)) is str
or isinstance(self.doc_to_text(doc), str)
)
else self.doc_to_choice(doc)[self.doc_to_text(doc)]
)
+ self.target_delimiter
+ (
str(self.doc_to_target(doc)[0])
if type(self.doc_to_target(doc)) is list
if isinstance(self.doc_to_target(doc), list)
else self.doc_to_target(doc)
if (
self.config.doc_to_choice is None
or type(self.doc_to_target(doc)) is str
or isinstance(self.doc_to_target(doc), str)
)
else str(self.doc_to_choice(doc)[self.doc_to_target(doc)])
)
......@@ -77,8 +77,8 @@ class FirstNSampler(ContextSampler):
Draw the first `n` samples in order from the specified split.
Used for tasks with "canonical" ordered fewshot examples, such as MMLU and CMMLU.
"""
assert n <= len(
self.docs
assert (
n <= len(self.docs)
), f"Error: number of fewshot samples requested exceeds the {len(self.docs)} that are available."
return self.docs[:n]
......
import abc
from dataclasses import dataclass, field, asdict
import os
import re
import ast
import yaml
import logging
import evaluate
import os
import random
import itertools
import functools
from tqdm import tqdm
import re
from collections.abc import Callable
from dataclasses import asdict, dataclass
from typing import Any, List, Literal, Tuple, Union
import datasets
import numpy as np
from typing import Union, List, Any, Tuple, Literal
from collections.abc import Callable
from lm_eval import utils
from lm_eval.api import samplers
from lm_eval.api.instance import Instance
from lm_eval.api.filter import FilterEnsemble
from lm_eval.prompts import get_prompt
from lm_eval.filters import build_filter_ensemble
from lm_eval.api.metrics import (
bits_per_byte,
mean,
weighted_perplexity,
bits_per_byte,
......@@ -37,6 +27,9 @@ from lm_eval.api.registry import (
METRIC_REGISTRY,
DEFAULT_METRIC_REGISTRY,
)
from lm_eval.filters import build_filter_ensemble
from lm_eval.prompts import get_prompt
ALL_OUTPUT_TYPES = [
"loglikelihood",
......@@ -346,9 +339,7 @@ class Task(abc.ABC):
elif self.has_validation_docs():
docs = self.validation_docs()
else:
assert (
False
), f"Task dataset (path={self.DATASET_PATH}, name={self.DATASET_NAME}) must have valid or test docs!"
assert False, f"Task dataset (path={self.DATASET_PATH}, name={self.DATASET_NAME}) must have valid or test docs!"
eval_logger.info(f"Building contexts for task on rank {rank}...")
......@@ -676,9 +667,7 @@ class ConfigurableTask(Task):
elif self.has_validation_docs():
self.task_docs = self.validation_docs()
else:
assert (
False
), f"Task dataset (path={self.DATASET_PATH}, name={self.DATASET_NAME}) must have valid or test docs!"
assert False, f"Task dataset (path={self.DATASET_PATH}, name={self.DATASET_NAME}) must have valid or test docs!"
# Test One Doc
self.features = list(self.task_docs.features.keys())
......@@ -690,20 +679,20 @@ class ConfigurableTask(Task):
if self.config.doc_to_choice is not None:
test_choice = self.doc_to_choice(test_doc)
if type(test_choice) is not list:
if not isinstance(test_choice, list):
eval_logger.error("doc_to_choice must return list")
else:
num_choice = len(test_choice)
if type(test_text) is int:
if isinstance(test_text, int):
self.multiple_input = num_choice
else:
test_choice = None
if type(test_target) is list:
if isinstance(test_target, list):
self.multiple_target = len(test_target)
else:
if (type(test_target) is int) and (test_choice is not None):
if (isinstance(test_target, int)) and (test_choice is not None):
test_target = test_choice[test_target]
else:
test_target = str(test_target)
......@@ -812,11 +801,11 @@ class ConfigurableTask(Task):
)
example = self.doc_to_text(doc)
if type(example) == str:
if isinstance(example, str):
return labeled_examples + example
elif type(example) == list:
elif isinstance(example, list):
return [labeled_examples + ex for ex in example]
elif type(example) == int:
elif isinstance(example, int):
if self.config.doc_to_choice is not None:
choices = self.doc_to_choice(doc)
return labeled_examples + choices[example]
......@@ -868,9 +857,9 @@ class ConfigurableTask(Task):
else:
doc_to_text = self.config.doc_to_text
if type(doc_to_text) == int:
if isinstance(doc_to_text, int):
return doc_to_text
elif type(doc_to_text) == str:
elif isinstance(doc_to_text, str):
if doc_to_text in self.features:
# if self.config.doc_to_choice is not None:
# return self.doc_to_choice(doc)[doc[doc_to_text]]
......@@ -902,9 +891,9 @@ class ConfigurableTask(Task):
else:
doc_to_target = self.config.doc_to_target
if type(doc_to_target) == int:
if isinstance(doc_to_target, int):
return doc_to_target
elif type(doc_to_target) == str:
elif isinstance(doc_to_target, str):
if doc_to_target in self.features:
# if self.config.doc_to_choice is not None:
# return self.doc_to_choice(doc)[doc[doc_to_target]]
......@@ -925,7 +914,7 @@ class ConfigurableTask(Task):
return target_string
else:
return target_string
elif type(doc_to_target) == list:
elif isinstance(doc_to_target, list):
return doc_to_target
elif callable(doc_to_target):
return doc_to_target(doc)
......@@ -948,14 +937,14 @@ class ConfigurableTask(Task):
else:
doc_to_choice = self.config.doc_to_choice
if type(doc_to_choice) == str:
if isinstance(doc_to_choice, str):
if doc_to_choice in self.features:
return doc[doc_to_choice]
else:
return ast.literal_eval(utils.apply_template(doc_to_choice, doc))
elif type(doc_to_choice) == list:
elif isinstance(doc_to_choice, list):
return doc_to_choice
elif type(doc_to_choice) == dict:
elif isinstance(doc_to_choice, dict):
return list(doc_to_choice.values())
elif callable(doc_to_choice):
return doc_to_choice(doc)
......@@ -1186,9 +1175,7 @@ class ConfigurableTask(Task):
predictions=[result],
**self._metric_fn_kwargs[metric],
)
except (
TypeError
): # needed for now in order to use a different interface between our own metrics and HF Evaluate metrics
except TypeError: # needed for now in order to use a different interface between our own metrics and HF Evaluate metrics
result_score = self._metric_fn_list[metric]([gold, result])
if isinstance(result_score, dict):
# TODO: this handles the case where HF evaluate returns a dict.
......
import datetime
import io
import json
import mmap
import os
from pathlib import Path
from typing import Any
import zstandard
import json
import jsonlines
import io
import datetime
import mmap
import tqdm
from pathlib import Path
import zstandard
def json_serial(obj: Any) -> str:
......
import time
import random
import pickle
import json
import collections
import glob
import json
import os
import collections
import pickle
import random
import time
from .janitor import Janitor, word_ngrams
from .archiver import ZStdTextReader
from .janitor import Janitor, word_ngrams
# Was used for testing the evaluator decoupled from the full logic below
......@@ -109,7 +109,7 @@ def get_train_overlap(docs_by_task_set: dict, ngrams_path: str, limit: int) -> d
print(f"Merging lookups took {elapsed:0.5f} seconds.")
print(f"{ngrams_n_size} grams files found in {ngrams_path}:")
files = glob.glob(os.path.join(ngrams_path, f"*.sorted.zst"))
files = glob.glob(os.path.join(ngrams_path, "*.sorted.zst"))
print(files)
for file in files:
......@@ -135,11 +135,7 @@ def get_train_overlap(docs_by_task_set: dict, ngrams_path: str, limit: int) -> d
matching_unique += 1
for task_name, task_set, doc_ids in merged_lookup[ngram]:
task_doc_set = duplicates[(task_name, task_set)]
for (
doc_id
) in (
doc_ids
): # Record contamination across all relevant task/set combos
for doc_id in doc_ids: # Record contamination across all relevant task/set combos
task_doc_set.add(doc_id)
del merged_lookup[ngram] # No point matching again
else:
......
import pickle
import re
import string
import pickle
import traceback
from pprint import pprint
from typing import Iterator, Sequence, TypeVar, List, Tuple
from typing import Iterator, List, Sequence, Tuple, TypeVar
# This is a cpp module. Compile janitor_util.cpp with:
# c++ -O3 -Wall -shared -std=c++11 -fPIC $(python3 -m pybind11 --includes) janitor_util.cpp -o janitor_util$(python3-config --extension-suffix) -undefined dynamic_lookup
......
import random
import itertools
import json
import collections
import sys
import torch
......@@ -17,8 +15,6 @@ import lm_eval.api.registry
from lm_eval.utils import (
positional_deprecated,
run_task_tests,
make_table,
create_iterator,
get_git_commit_hash,
simple_parse_args_string,
eval_logger,
......@@ -91,7 +87,7 @@ def simple_evaluate(
if gen_kwargs is not None:
gen_kwargs = simple_parse_args_string(gen_kwargs)
eval_logger.warning(
f"generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks."
"generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks."
)
if gen_kwargs == "":
gen_kwargs = None
......@@ -118,7 +114,9 @@ def simple_evaluate(
use_cache
# each rank receives a different cache db.
# necessary to avoid multiple writes to cache at once
+ "_rank" + str(lm.rank) + ".db",
+ "_rank"
+ str(lm.rank)
+ ".db",
)
task_dict = lm_eval.tasks.get_task_dict(tasks)
......@@ -513,9 +511,7 @@ def evaluate(
) + total_size * current_size / (
(total_size + current_size)
* (total_size + current_size - 1)
) * (
results[group][metric] - metric_score
) ** 2
) * (results[group][metric] - metric_score) ** 2
else:
results[group][metric] = metric_score
results[group][stderr] = var_score
......
......@@ -32,7 +32,7 @@ def build_filter_ensemble(filter_name, components):
Create a filtering pipeline.
"""
filters = []
for (function, kwargs) in components:
for function, kwargs in components:
if kwargs is None:
f = get_filter(function)()
else:
......
......@@ -5,5 +5,6 @@ from . import dummy
from . import anthropic_llms
from . import gguf
from . import vllm_causallms
from . import mamba_lm
# TODO: implement __all__