Commit e58b8182 authored by lintangsutawika

resolved merge conflict

parents d213a533 0571eeb1
@@ -56,7 +56,7 @@ jobs:
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
-          pip install -e '.[dev,anthropic,sentencepiece]' --extra-index-url https://download.pytorch.org/whl/cpu
+          pip install -e '.[dev,sentencepiece,api]' --extra-index-url https://download.pytorch.org/whl/cpu
           # Install optional git dependencies
           # pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt
           # if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
@@ -84,7 +84,7 @@ jobs:
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
-         pip install -e '.[dev,optimum,deepsparse,sparseml]' --extra-index-url https://download.pytorch.org/whl/cpu
+         pip install -e '.[dev,optimum,deepsparse,sparseml,api]' --extra-index-url https://download.pytorch.org/whl/cpu
      - name: Test with pytest
        run: python -m pytest tests/models --showlocals -s -vv
      - name: Archive artifacts
......
@@ -13,6 +13,7 @@ temp
 __pycache__
 .ipynb_checkpoints
 temp
+test_logs/
 # IPython
 profile_default/
 ipython_config.py
......
This diff is collapsed.
# TemplateAPI Usage Guide
The `TemplateAPI` class is a versatile superclass designed to facilitate the integration of various API-based language models into the lm-evaluation-harness framework. This guide explains how to use and extend the `TemplateAPI` class to implement your own API models. If your API implements the OpenAI API, you can use the `local-completions` or `local-chat-completions` model types (defined [here](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/models/openai_completions.py)), which can also serve as examples of how to effectively subclass this template.
## Overview
The `TemplateAPI` class provides a template for creating API-based model implementations. It handles common functionalities such as:
- Tokenization (optional)
- Batch processing
- Caching
- Retrying failed requests
- Parsing API responses
To use this class, you typically need to subclass it and implement specific methods for your API.
## Key Methods to Implement
When subclassing `TemplateAPI`, you need to implement the following methods:
1. `_create_payload`: Creates the JSON payload for API requests.
2. `parse_logprobs`: Parses log probabilities from API responses.
3. `parse_generations`: Parses generated text from API responses.
4. `header`: Returns the headers for the API request.
You may also need to override other methods or properties depending on your API's specific requirements.
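A minimal sketch of such a subclass is shown below. The provider name, endpoint, environment variable, and response schema (`example-api`, `api.example.com`, `EXAMPLE_API_KEY`, an OpenAI-style `choices` list) are hypothetical placeholders, and the import path assumes `TemplateAPI` lives in `lm_eval.models.api_models`:
```python
import os
from typing import Dict, List, Optional, Tuple, Union

from lm_eval.api.registry import register_model
from lm_eval.models.api_models import TemplateAPI


@register_model("example-api")
class ExampleAPI(TemplateAPI):
    """Illustrative wrapper for a hypothetical completions-style API."""

    def __init__(self, base_url="https://api.example.com/v1/completions", **kwargs):
        super().__init__(base_url=base_url, **kwargs)

    @property
    def header(self) -> dict:
        # Headers attached to every request.
        return {"Authorization": f"Bearer {os.environ.get('EXAMPLE_API_KEY', '')}"}

    def _create_payload(
        self,
        messages: Union[List[List[int]], List[dict], List[str], str],
        generate: bool = False,
        gen_kwargs: Optional[dict] = None,
        **kwargs,
    ) -> dict:
        if generate:
            gen_kwargs = gen_kwargs or {}
            return {
                "model": self.model,
                "prompt": messages,
                "max_tokens": gen_kwargs.pop("max_gen_toks", self._max_gen_toks),
                "stop": gen_kwargs.pop("until", None),
                **gen_kwargs,
            }
        # Loglikelihood requests: echo the prompt back with per-token logprobs.
        return {
            "model": self.model,
            "prompt": messages,
            "max_tokens": 0,
            "logprobs": 1,
            "echo": True,
        }

    @staticmethod
    def parse_generations(outputs: Union[Dict, List[Dict]], **kwargs) -> List[str]:
        # Assumes an OpenAI-like {"choices": [{"text": ...}]} response shape.
        outputs = outputs if isinstance(outputs, list) else [outputs]
        return [choice["text"] for out in outputs for choice in out["choices"]]

    @staticmethod
    def parse_logprobs(
        outputs: Union[Dict, List[Dict]],
        tokens: List[List[int]] = None,
        ctxlens: List[int] = None,
        **kwargs,
    ) -> List[Tuple[float, bool]]:
        # Sum the logprobs of the continuation tokens (everything after ctxlen)
        # and report whether the continuation matched the greedy decode; see
        # openai_completions.py for a complete reference implementation.
        ...
```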
> [!NOTE]
> Currently, loglikelihood and MCQ-based tasks (such as MMLU) are only supported for completion endpoints, not for chat-completion endpoints (those that expect a list of dicts). Completion APIs that serve instruct-tuned models can be evaluated with the `--apply_chat_template` option, which formats prompts with the chat template while still exposing the model logits needed for loglikelihood-based tasks.
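For example, an OpenAI-compatible completions server can often be evaluated without writing any new code via the built-in `local-completions` model type. This is only a sketch: the model name and the `localhost` URL are placeholder assumptions, and the arguments are explained in the next section.
```
lm_eval --model local-completions \
    --tasks gsm8k \
    --apply_chat_template \
    --model_args model=meta-llama/Llama-2-7b-chat-hf,base_url=http://localhost:8000/v1/completions,num_concurrent=4,tokenizer_backend=huggingface
```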
## TemplateAPI Arguments
When initializing a `TemplateAPI` instance or a subclass, you can provide several arguments to customize its behavior. Here's a detailed explanation of some important arguments:
- `model` or `pretrained` (str):
- The name or identifier of the model to use.
- `model` takes precedence over `pretrained` when both are provided.
- `base_url` (str):
- The base URL for the API endpoint.
- `tokenizer` (str, optional):
- The name or path of the tokenizer to use.
- If not provided, it defaults to using the same tokenizer name as the model.
- `num_concurrent` (int):
- Number of concurrent requests to make to the API.
- Useful for APIs that support parallel processing.
- Default is 1 (sequential processing).
- `tokenized_requests` (bool):
- Determines whether the input is pre-tokenized. Defaults to `True`.
- Requests can be sent in either tokenized form (`list[list[int]]`) or as text (`list[str]`, or `str` for batch_size=1).
- For loglikelihood-based tasks, prompts require tokenization to calculate the context length. If `False` prompts are decoded back to text before being sent to the API.
- Not as important for `generate_until` tasks.
- Ignored for chat formatted inputs (list[dict...]) or if tokenizer_backend is None.
- `tokenizer_backend` (str, optional):
- Required for loglikelihood-based or MCQ tasks.
- Specifies the tokenizer library to use. Options are "tiktoken", "huggingface", or None.
- Default is "huggingface".
- `max_length` (int, optional):
- Maximum length of input + output.
- Default is 2048.
- `max_retries` (int, optional):
- Maximum number of retries for failed API requests.
- Default is 3.
- `max_gen_toks` (int, optional):
- Maximum number of tokens to generate in completion tasks.
- Default is 256, unless overridden in the task's YAML config.
- `batch_size` (int or str, optional):
- Number of requests to batch together (if the API supports batching).
- Can be an integer or "auto" (which defaults to 1 for API models).
- Default is 1.
- `seed` (int, optional):
- Random seed for reproducibility.
- Default is 1234.
- `add_bos_token` (bool, optional):
- Whether to add the beginning-of-sequence token to inputs (when tokenizing).
- Default is False.
- `custom_prefix_token_id` (int, optional):
- Custom token ID to use as a prefix for inputs.
- If not provided, the model's default BOS token (if `add_bos_token` is True) or EOS token is used instead.
Example usage:
```python
class MyAPIModel(TemplateAPI):
def __init__(self, **kwargs):
super().__init__(
model="my-model",
base_url="https://api.mymodel.com/v1/completions",
tokenizer_backend="huggingface",
num_concurrent=5,
max_retries=5,
batch_size=10,
**kwargs
)
# Implement other required methods...
```
When subclassing `TemplateAPI`, you can override these arguments in your `__init__` method to set default values specific to your API. You can also add additional (potentially user-specified) arguments as needed for your specific implementation.
## Example Implementation: OpenAI API
The `OpenAICompletionsAPI` and `OpenAIChatCompletion` classes (defined [here](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/models/openai_completions.py)) demonstrate how to implement API models using the `TemplateAPI` class. Here's a breakdown of the key components:
### 1. Subclassing and Initialization
```python
@register_model("openai-completions")
class OpenAICompletionsAPI(LocalCompletionsAPI):
def __init__(
self,
base_url="https://api.openai.com/v1/completions",
tokenizer_backend="tiktoken",
**kwargs,
):
super().__init__(
base_url=base_url, tokenizer_backend=tokenizer_backend, **kwargs
)
```
### 2. Implementing API Key Retrieval
```python
@cached_property
def api_key(self):
key = os.environ.get("OPENAI_API_KEY", None)
if key is None:
raise ValueError(
"API key not found. Please set the OPENAI_API_KEY environment variable."
)
return key
```
### 3. Creating the Payload
```python
def _create_payload(
self,
messages: Union[List[List[int]], List[dict], List[str], str],
generate=False,
gen_kwargs: Optional[dict] = None,
**kwargs,
) -> dict:
if generate:
# ... (implementation for generation)
else:
# ... (implementation for log likelihood)
```
### 4. Parsing API Responses
```python
@staticmethod
def parse_logprobs(
outputs: Union[Dict, List[Dict]],
tokens: List[List[int]] = None,
ctxlens: List[int] = None,
**kwargs,
) -> List[Tuple[float, bool]]:
# ... (implementation)
@staticmethod
def parse_generations(outputs: Union[Dict, List[Dict]], **kwargs) -> List[str]:
# ... (implementation)
```
The requests are initiated in the `model_call` or the `amodel_call` methods.
## Implementing Your Own API Model
To implement your own API model:
1. Subclass `TemplateAPI` or one of its subclasses (e.g., `LocalCompletionsAPI`).
2. Override the `__init__` method if you need to set specific parameters.
3. Implement the `_create_payload` method and the `header` property to build the appropriate request payload and headers for your API.
4. Implement the `parse_logprobs` and `parse_generations` methods to parse your API's responses.
5. Override the `api_key` property if your API requires authentication.
6. Override any other methods as necessary to match your API's behavior.
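Once registered and imported in `lm_eval/models/__init__.py`, the model can be selected by name from the command line like any other model type. A sketch, assuming the `MyAPIModel` example above was registered with `@register_model("my-api")`; the overrides shown are the arguments documented earlier:
```
lm_eval --model my-api \
    --tasks lambada_openai \
    --model_args num_concurrent=8,max_retries=5
```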
## Best Practices
1. Use the `@register_model` decorator to register your model with the framework (and import it in `lm_eval/models/__init__.py`!).
2. Use environment variables for sensitive information like API keys.
3. Properly handle batching and concurrent requests if supported by your API.
@@ -2,8 +2,6 @@
 Welcome and thank you for your interest in the LM Evaluation Harness! We welcome contributions and feedback and appreciate your time spent with our library, and hope you find it useful!
 
-We intend LM Evaluation Harness to be a broadly useful and
 
 ## Important Resources
 
 There are several places information about LM Evaluation Harness is located:
@@ -11,7 +9,7 @@ There are several places information about LM Evaluation Harness is located:
 - Our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs)
 - We occasionally use [GitHub Milestones](https://github.com/EleutherAI/lm-evaluation-harness/milestones) to track progress toward specific near-term version releases.
 - We maintain a [Project Board](https://github.com/orgs/EleutherAI/projects/25) for tracking current work items and PRs, and for future roadmap items or feature requests.
-- Further discussion and support conversations are located in the #lm-thunderdome channel of the [EleutherAI discord](discord.gg/eleutherai).
+- Further discussion and support conversations are located in the #lm-thunderdome channel of the [EleutherAI discord](https://discord.gg/eleutherai).
 
## Code Style
@@ -32,7 +30,7 @@ in order to ensure linters and other checks will be run upon committing.
 We use [pytest](https://docs.pytest.org/en/latest/) for running unit tests. All library unit tests can be run via:
 ```
-python -m pytest --ignore=tests/tests_master --ignore=tests/extra
+python -m pytest --showlocals -s -vv -n=auto --ignore=tests/models/test_neuralmagic.py --ignore=tests/models/test_openvino.py
 ```
 
## Contributor License Agreement
......
@@ -4,7 +4,8 @@ Welcome to the docs for the LM Evaluation Harness!
 
 ## Table of Contents
 
-* To learn about the public interface of the library, as well as how to evaluate via the commandline or as integrated into an external library, see the [Interface](./interface.md)
+* To learn about the public interface of the library, as well as how to evaluate via the command line or as integrated into an external library, see the [Interface](./interface.md).
 * To learn how to add a new library, API, or model type to the library, as well as a quick explainer on the types of ways to evaluate an LM, see the [Model Guide](./model_guide.md).
+* For an extended description of how to extend the library to new model classes served over an API, see the [API Guide](./API_guide.md).
 * For a crash course on adding new tasks to the library, see our [New Task Guide](./new_task_guide.md).
 * To learn more about pushing the limits of task configuration that the Eval Harness supports, see the [Task Configuration Guide](./task_guide.md).
@@ -210,7 +210,7 @@
    ],
    "source": [
     "# Install LM-Eval\n",
-    "!pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@big-refactor"
+    "!pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git"
    ]
   },
   {
......
@@ -8,7 +8,6 @@ from typing import List
 import numpy as np
 import sacrebleu
-import sklearn.metrics
 
 from lm_eval.api.registry import register_aggregation, register_metric
@@ -51,21 +50,24 @@ def bits_per_byte(items):
 @register_aggregation("f1")
 def f1_score(items):
+    from sklearn.metrics import f1_score
+
     unzipped_list = list(zip(*items))
     golds = unzipped_list[0]
     preds = unzipped_list[1]
-    fscore = sklearn.metrics.f1_score(golds, preds)
+    fscore = f1_score(golds, preds)
 
     return np.max(fscore)
 
 
 @register_aggregation("matthews_corrcoef")
 def matthews_corrcoef(items):
+    from sklearn.metrics import matthews_corrcoef
+
     unzipped_list = list(zip(*items))
     golds = unzipped_list[0]
     preds = unzipped_list[1]
-    # print(preds)
-    return sklearn.metrics.matthews_corrcoef(golds, preds)
+    return matthews_corrcoef(golds, preds)
 
 
 @register_aggregation("bleu")
......
@@ -55,7 +55,7 @@ class LM(abc.ABC):
         pass
 
     @abc.abstractmethod
-    def loglikelihood_rolling(self, requests) -> List[Tuple[float]]:
+    def loglikelihood_rolling(self, requests) -> List[float]:
         """Compute full log-likelihood of a string, with no truncation, for perplexity computation
         - We will use the full max context length of the model.
         - For inputs that exceed the max context length, we divide the tokenized string into chunks of up to
@@ -101,14 +101,13 @@
         """Generate greedily until a stopping sequence
 
         :param requests: list[Instance]
-            A list of Instance objects with property `args` which returns a tuple (context, until).
+            A list of Instance objects with property `args` which returns a tuple (context, gen_kwargs).
             context: str
                 Context string
-            until: [str]
-                The string sequences to generate until. These string sequences
-                may each span across multiple tokens, or may be part of one token.
+            gen_kwargs: dict
+                A dictionary of keyword arguments to pass to the generation function e.g. top_k, until, etc.
         :return: list[str]
-            A list of strings continuation
+            A list of model generated continuations.
             continuation: str
                 The generated continuation.
         """
@@ -325,14 +324,19 @@ class TemplateLM(LM):
         return self.eot_token_id
 
     @abc.abstractmethod
-    def tok_encode(self, string: str, **kwargs):
+    def tok_encode(self, string: str, **kwargs) -> List[int]:
+        """
+        Tokenize a string using the model's tokenizer and return a list of token IDs.
+        """
         pass
 
     @abc.abstractmethod
-    def _loglikelihood_tokens(self, requests, **kwargs):
+    def _loglikelihood_tokens(self, requests, **kwargs) -> List[Tuple[float, bool]]:
         pass
 
-    def _encode_pair(self, context, continuation):
+    def _encode_pair(
+        self, context: str, continuation: str
+    ) -> Tuple[List[int], List[int]]:
         n_spaces = len(context) - len(context.rstrip())
         if n_spaces > 0:
             continuation = context[-n_spaces:] + continuation
@@ -373,7 +377,7 @@ class TemplateLM(LM):
     @abc.abstractmethod
     def loglikelihood_rolling(
         self, requests, disable_tqdm: bool = False
-    ) -> List[Tuple[float, bool]]:
+    ) -> List[float]:
         pass
 
     @abc.abstractmethod
......
+from functools import partial
+
 import datasets
@@ -15,9 +17,38 @@ class ContextSampler:
         self.target_delimiter = self.config.target_delimiter
         self.fewshot_delimiter = self.config.fewshot_delimiter
 
-        self.doc_to_text = self.task.doc_to_text
-        self.doc_to_target = self.task.doc_to_target
-        self.doc_to_choice = self.task.doc_to_choice
+        if (
+            self.config.fewshot_config is not None
+            and self.config.fewshot_config.get("doc_to_text", None) is not None
+        ):
+            self.doc_to_text = partial(
+                self.task.doc_to_text,
+                doc_to_text=self.config.fewshot_config.get("doc_to_text", None),
+            )
+        else:
+            self.doc_to_text = self.task.doc_to_text
+
+        if (
+            self.config.fewshot_config is not None
+            and self.config.fewshot_config.get("doc_to_target", None) is not None
+        ):
+            self.doc_to_target = partial(
+                self.task.doc_to_target,
+                doc_to_target=self.config.fewshot_config.get("doc_to_target", None),
+            )
+        else:
+            self.doc_to_target = self.task.doc_to_target
+
+        if (
+            self.config.fewshot_config is not None
+            and self.config.fewshot_config.get("doc_to_choice", None) is not None
+        ):
+            self.doc_to_choice = partial(
+                self.task.doc_to_choice,
+                doc_to_choice=self.config.fewshot_config.get("doc_to_choice", None),
+            )
+        else:
+            self.doc_to_choice = self.task.doc_to_choice
 
         self.docs = docs  # HF dataset split, provided by task._fewshot_docs()
         if fewshot_indices:  # subset few-shot docs from
@@ -52,14 +83,15 @@ class ContextSampler:
                 else self.doc_to_choice(doc)[doc_content]
             )
             labeled_examples += self.target_delimiter
-            labeled_examples += (
-                str(doc_target[0])
-                if isinstance(doc_target, list)
-                else str(doc_target)
-                if self.config.doc_to_choice is None or isinstance(doc_target, str)
-                else str(self.doc_to_choice(doc)[doc_target])
-            )
-            labeled_examples += self.fewshot_delimiter
+            if doc_target != "":
+                labeled_examples += (
+                    str(doc_target[0])
+                    if isinstance(doc_target, list)
+                    else doc_target
+                    if self.config.doc_to_choice is None or isinstance(doc_target, str)
+                    else str(self.doc_to_choice(doc)[doc_target])
+                )
+                labeled_examples += self.fewshot_delimiter
 
         return labeled_examples
......
@@ -1171,9 +1171,11 @@ class ConfigurableTask(Task):
         """
         return doc
 
-    def doc_to_text(self, doc):
+    def doc_to_text(self, doc, doc_to_text=None):
         if self.prompt is not None:
             doc_to_text = self.prompt
+        elif doc_to_text is not None:
+            doc_to_text = doc_to_text
         else:
             doc_to_text = self.config.doc_to_text
@@ -1205,9 +1207,11 @@ class ConfigurableTask(Task):
             print(type(doc_to_text))
             raise TypeError
 
-    def doc_to_target(self, doc: Mapping) -> Union[int, str, list]:
+    def doc_to_target(self, doc: Mapping, doc_to_target=None) -> Union[int, str, list]:
         if self.prompt is not None:
             doc_to_target = self.prompt
+        elif doc_to_target is not None:
+            doc_to_target = doc_to_target
         else:
             doc_to_target = self.config.doc_to_target
@@ -1249,9 +1253,11 @@ class ConfigurableTask(Task):
         else:
             raise TypeError
 
-    def doc_to_choice(self, doc: Any) -> List[str]:
+    def doc_to_choice(self, doc: Any, doc_to_choice=None) -> List[str]:
         if self.prompt is not None:
             doc_to_choice = self.prompt
+        elif doc_to_choice is not None:
+            doc_to_choice = doc_to_choice
         elif self.config.doc_to_choice is None:
             eval_logger.error("doc_to_choice was called but not set in config")
         else:
......
 from . import (
     anthropic_llms,
+    api_models,
     dummy,
     gguf,
     huggingface,
......
-from typing import Any, List, Tuple
+import os
+from functools import cached_property
+from typing import Any, Dict, List, Tuple, Union
 
 from tqdm import tqdm
 
 from lm_eval import utils
 from lm_eval.api.model import LM
 from lm_eval.api.registry import register_model
+from lm_eval.models.openai_completions import LocalCompletionsAPI
 from lm_eval.models.utils import retry_on_specific_exceptions
@@ -138,7 +141,7 @@ please install anthropic via `pip install 'lm-eval[anthropic]'` or `pip install
     return messages()
 
-@register_model("anthropic")
+@register_model("anthropic-completions")
 class AnthropicLM(LM):
     REQ_CHUNK_SIZE = 20  # TODO: not used
@@ -271,90 +274,89 @@ please install anthropic via `pip install 'lm-eval[anthropic]'` or `pip install
 @register_model("anthropic-chat", "anthropic-chat-completions")
-class AnthropicChatLM(AnthropicLM):
-    REQ_CHUNK_SIZE = 20  # TODO: not used
+class AnthropicChat(LocalCompletionsAPI):
def __init__( def __init__(
self, self,
model: str, base_url="https://api.anthropic.com/v1/messages",
batch_size: int = 1, tokenizer_backend=None,
max_tokens: int = 256, **kwargs,
temperature: float = 0, # defaults to 1 ):
**kwargs, # top_p, top_k, etc. super().__init__(
) -> None: base_url=base_url, tokenizer_backend=tokenizer_backend, **kwargs
"""Anthropic API wrapper. )
eval_logger.warning(
:param model: str "Chat completions does not support batching. Defaulting to batch size 1."
Anthropic model e.g. 'claude-3-opus-20240229', 'claude-3-sonnet-20240229' )
:param max_tokens: int self._batch_size = 1
Maximum number of tokens to sample from the model self.anthropic_version = "2023-06-01"
:param temperature: float eval_logger.warning(
Sampling temperature f"Using Anthropic Version: {self.anthropic_version}. Confirm the current version here: https://docs.anthropic.com/en/api/versioning"
:param kwargs: Any )
Additional model_args to pass to the API client
"""
super().__init__()
try: @cached_property
import anthropic def api_key(self):
except ModuleNotFoundError: """Override this property to return the API key for the API request."""
raise Exception( key = os.environ.get("ANTHROPIC_API_KEY", None)
"attempted to use 'anthropic' LM type, but package `anthropic` is not installed. \ if key is None:
please install anthropic via `pip install 'lm-eval[anthropic]'` or `pip install -e '.[anthropic]'`", raise ValueError(
"API key not found. Please set the ANTHROPIC_API_KEY environment variable."
) )
return key
self.model = model
# defaults to os.environ.get("ANTHROPIC_API_KEY") @cached_property
self.client = anthropic.Anthropic() def header(self):
self.temperature = temperature return {
self.max_tokens = max_tokens "x-api-key": f"{self.api_key}",
self.tokenizer = self.client.get_tokenizer() "anthropic-version": self.anthropic_version,
self.kwargs = kwargs }
@property def _create_payload(
def max_gen_toks(self) -> int: self, messages: List[Dict], generate=True, gen_kwargs: dict = None, **kwargs
return self.max_tokens ) -> dict:
system = (
def generate_until(self, requests) -> List[str]: messages[0].get("content") if messages[0].get("role") == "system" else None
try: )
import anthropic if system:
except ModuleNotFoundError: messages = messages[1:]
raise Exception( gen_kwargs.pop("do_sample", False)
"attempted to use 'anthropic' LM type, but package `anthropic` is not installed. \ max_tokens = gen_kwargs.pop("max_gen_toks", self._max_gen_toks)
please install anthropic via `pip install 'lm-eval[anthropic]'` or `pip install -e '.[anthropic]'`", temperature = gen_kwargs.pop("temperature", 0)
) stop = gen_kwargs.pop("until", ["\n\nHuman:"])
if not isinstance(stop, list):
if not requests: stop = [stop]
return [] out = {
"messages": messages,
_requests: List[Tuple[str, dict]] = [req.args for req in requests] "model": self.model,
"max_tokens": max_tokens,
"temperature": temperature,
"stop_sequences": stop,
**gen_kwargs,
}
if system:
out["system"] = system
return out
def parse_generations(
self, outputs: Union[Dict, List[Dict]], **kwargs
) -> List[str]:
res = [] res = []
for request in tqdm(_requests): if not isinstance(outputs, list):
try: outputs = [outputs]
inp = request[0] for out in outputs:
request_args = request[1] for choices in out["content"]:
# generation_kwargs res.append(choices["text"])
until = request_args.get("until")
max_tokens = request_args.get("max_gen_toks", self.max_length)
temperature = request_args.get("temperature", self.temperature)
response = anthropic_chat(
client=self.client,
model=self.model,
prompt=inp,
max_tokens=max_tokens,
temperature=temperature, # TODO: implement non-greedy sampling for Anthropic
stop=until, # type: ignore
**self.kwargs,
)
res.append(response)
self.cache_hook.add_partial("generate_until", request, response)
except anthropic.APIConnectionError as e: # type: ignore # noqa: F821
eval_logger.critical(f"Server unreachable: {e.__cause__}")
break
except anthropic.APIStatusError as e: # type: ignore # noqa: F821
eval_logger.critical(f"API error {e.status_code}: {e.message}")
break
return res return res
def tok_encode(
self,
string: str,
left_truncate_len=None,
add_special_tokens=None,
**kwargs,
) -> List[str]:
return [string]
def loglikelihood(self, requests, **kwargs):
raise NotImplementedError(
"Anthropic Chat Completions API does not support the return of loglikelihood"
)
This diff is collapsed.
...@@ -9,10 +9,10 @@ import torch.nn.functional as F ...@@ -9,10 +9,10 @@ import torch.nn.functional as F
import transformers import transformers
from accelerate import ( from accelerate import (
Accelerator, Accelerator,
DistributedType,
InitProcessGroupKwargs, InitProcessGroupKwargs,
find_executable_batch_size, find_executable_batch_size,
) )
from accelerate.utils import get_max_memory
from huggingface_hub import HfApi from huggingface_hub import HfApi
from packaging import version from packaging import version
from peft import PeftModel from peft import PeftModel
...@@ -40,31 +40,6 @@ from lm_eval.models.utils import ( ...@@ -40,31 +40,6 @@ from lm_eval.models.utils import (
eval_logger = utils.eval_logger eval_logger = utils.eval_logger
def _get_accelerate_args(
device_map_option: Optional[str] = "auto",
max_memory_per_gpu: Optional[Union[int, str]] = None,
max_cpu_memory: Optional[Union[int, str]] = None,
offload_folder: Optional[str] = "./offload",
gpus: Optional[int] = None,
) -> dict:
"""Returns the kwargs needed to apply `accelerate` in `AutoModel.from_pretrained`."""
max_memory = {}
if max_memory_per_gpu is not None:
max_memory_per_gpu_map = {
device_idx: max_memory_per_gpu for device_idx in range(gpus)
}
max_memory.update(max_memory_per_gpu_map)
if max_cpu_memory is not None:
max_memory["cpu"] = max_cpu_memory
args = {}
if max_memory:
args["max_memory"] = max_memory
args["device_map"] = device_map_option
args["offload_folder"] = offload_folder
return args
@register_model("hf-auto", "hf", "huggingface") @register_model("hf-auto", "hf", "huggingface")
class HFLM(TemplateLM): class HFLM(TemplateLM):
""" """
...@@ -105,7 +80,6 @@ class HFLM(TemplateLM): ...@@ -105,7 +80,6 @@ class HFLM(TemplateLM):
# arguments used for splitting a model across GPUs naively. # arguments used for splitting a model across GPUs naively.
# only used if `parallelize=True`. # only used if `parallelize=True`.
parallelize: Optional[bool] = False, parallelize: Optional[bool] = False,
device_map_option: Optional[str] = "auto",
max_memory_per_gpu: Optional[Union[int, str]] = None, max_memory_per_gpu: Optional[Union[int, str]] = None,
max_cpu_memory: Optional[Union[int, str]] = None, max_cpu_memory: Optional[Union[int, str]] = None,
offload_folder: Optional[Union[str, os.PathLike]] = "./offload", offload_folder: Optional[Union[str, os.PathLike]] = "./offload",
...@@ -128,21 +102,6 @@ class HFLM(TemplateLM): ...@@ -128,21 +102,6 @@ class HFLM(TemplateLM):
self._config = self._model.config self._config = self._model.config
gpus = 0 gpus = 0
if tokenizer:
assert isinstance(
tokenizer, transformers.PreTrainedTokenizer
) or isinstance(tokenizer, transformers.PreTrainedTokenizerFast)
self.tokenizer = tokenizer
else:
# Get tokenizer
model_name = self._model.name_or_path
self.tokenizer = transformers.AutoTokenizer.from_pretrained(
model_name,
revision=revision,
trust_remote_code=trust_remote_code,
use_fast=use_fast_tokenizer,
)
else: else:
assert isinstance(device, str) assert isinstance(device, str)
assert isinstance(pretrained, str) assert isinstance(pretrained, str)
...@@ -157,6 +116,7 @@ class HFLM(TemplateLM): ...@@ -157,6 +116,7 @@ class HFLM(TemplateLM):
if "npu" in accelerator.device.type: if "npu" in accelerator.device.type:
gpus = torch.npu.device_count() gpus = torch.npu.device_count()
# using one process with no model parallelism
if not (parallelize or accelerator.num_processes > 1): if not (parallelize or accelerator.num_processes > 1):
# use user-passed device # use user-passed device
device_list = set( device_list = set(
...@@ -182,14 +142,19 @@ class HFLM(TemplateLM): ...@@ -182,14 +142,19 @@ class HFLM(TemplateLM):
if torch.cuda.is_available() if torch.cuda.is_available()
else torch.device("cpu") else torch.device("cpu")
) )
else: else: # Parallelism managed by accelerate
if device != "cuda": if device != "cuda":
eval_logger.info( eval_logger.info(
f"Using `accelerate launch` or `parallelize=True`, device '{device}' will be overridden when placing model." f"Using `accelerate launch` or `parallelize=True`, device '{device}' will be overridden when placing model."
) )
# TODO: include in warning that `load_in_8bit` etc. affect this too # TODO: include in warning that `load_in_8bit` etc. affect this too
self._device = torch.device(device) self._device = (
self.accelerator.device
if hasattr(self, "accelerator")
else torch.device(device)
)
revision = str(revision) # cast to string if not already one
# TODO: update this to be less of a hack once subfolder is fixed in HF # TODO: update this to be less of a hack once subfolder is fixed in HF
revision = revision + ("/" + subfolder if subfolder is not None else "") revision = revision + ("/" + subfolder if subfolder is not None else "")
...@@ -222,7 +187,6 @@ class HFLM(TemplateLM): ...@@ -222,7 +187,6 @@ class HFLM(TemplateLM):
trust_remote_code=trust_remote_code, trust_remote_code=trust_remote_code,
parallelize=parallelize, parallelize=parallelize,
gpus=gpus, gpus=gpus,
device_map_option=device_map_option,
max_memory_per_gpu=max_memory_per_gpu, max_memory_per_gpu=max_memory_per_gpu,
max_cpu_memory=max_cpu_memory, max_cpu_memory=max_cpu_memory,
offload_folder=offload_folder, offload_folder=offload_folder,
...@@ -237,19 +201,6 @@ class HFLM(TemplateLM): ...@@ -237,19 +201,6 @@ class HFLM(TemplateLM):
self.model.eval() self.model.eval()
self.model.tie_weights() self.model.tie_weights()
if isinstance(pretrained, str) and (gpus >= 1 or str(self.device) == "mps"):
# TODO: can remove this whole snippet except in the mps case, perhaps?
if not (parallelize or autogptq or hasattr(self, "accelerator")):
# place model onto device requested manually,
# if not using HF Accelerate or device_map
# or any other option that preloads model onto device
try:
self.model.to(self.device)
except ValueError:
eval_logger.debug(
"Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes` or `device_map` is provided. If the desired GPU is being used, this message is safe to ignore."
)
self.truncation = truncation self.truncation = truncation
self.logits_cache = logits_cache self.logits_cache = logits_cache
self.vocab_size = self.tokenizer.vocab_size self.vocab_size = self.tokenizer.vocab_size
...@@ -257,10 +208,10 @@ class HFLM(TemplateLM): ...@@ -257,10 +208,10 @@ class HFLM(TemplateLM):
self.tokenizer = configure_pad_token(self.tokenizer, model_config=self.config) self.tokenizer = configure_pad_token(self.tokenizer, model_config=self.config)
self.add_bos_token = add_bos_token self.add_bos_token = add_bos_token
if getattr(self.config, "model_type", None) in ["gemma", "gemma2"]: if "gemma" in getattr(self.config, "model_type", ""):
self.add_bos_token = True self.add_bos_token = True
eval_logger.info( eval_logger.info(
f"Model type is '{self.config.model_type}', a BOS token will be used as Gemma underperforms without it." f"Model type is '{self.config.model_type}', part of the Gemma family--a BOS token will be used as Gemma underperforms without it."
) )
self._max_length = max_length self._max_length = max_length
...@@ -280,49 +231,46 @@ class HFLM(TemplateLM): ...@@ -280,49 +231,46 @@ class HFLM(TemplateLM):
self.batch_size_per_gpu = int(batch_size) self.batch_size_per_gpu = int(batch_size)
if isinstance(pretrained, str): if isinstance(pretrained, str):
if gpus >= 1 or str(self.device) == "mps":
# TODO: can remove this whole snippet except in the mps case, perhaps?
if not (parallelize or autogptq or hasattr(self, "accelerator")):
# place model onto device requested manually,
# if not using HF Accelerate or device_map
# or any other option that preloads model onto device
try:
self.model.to(self.device)
except ValueError:
eval_logger.debug(
"Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes` or `device_map` is provided. If the desired GPU is being used, this message is safe to ignore."
)
# multigpu data-parallel support when launched with accelerate # multigpu data-parallel support when launched with accelerate
if gpus > 1: if gpus > 1:
if parallelize: if accelerator.num_processes > 1:
if accelerator.num_processes > 1: if parallelize:
raise RuntimeError( eval_logger.warning(
"Attempted to use both a HF Accelerate `device_map` and to launch via `accelerate launch`. If this is the case, please either remove `parallelize=True` from --model_args or launch outside of the Accelerate launcher." "You are both using a HF Accelerate `device_map` (`--model_args parallelize=True`) and launching via `accelerate launch`. This will attempt to do model and data parallelism depending on the resources available."
) )
else: elif gpus > accelerator.num_processes:
pass
elif accelerator.num_processes == 1:
# if we aren't launching via accelerate, ditch
self._rank = 0
self._world_size = 1
else:
if gpus > accelerator.num_processes:
eval_logger.warning( eval_logger.warning(
"WARNING: The number of total system GPUs does not match the number of spawned processes. " "WARNING: The number of total system GPUs does not match the number of spawned processes. "
"If you would like to use data parallelism, please launch the script " "If you would like to use data parallelism, please launch the script "
"with 'accelerate launch *script*'. " "with 'accelerate launch *script*'. "
f"Current run will proceed with {accelerator.num_processes} devices." f"Current run will proceed with {accelerator.num_processes} devices."
) )
assert ( if self.accelerator.is_local_main_process:
accelerator.distributed_type eval_logger.info(
in [ f"Using {gpus} devices with data parallelism"
DistributedType.FSDP, )
DistributedType.MULTI_GPU,
DistributedType.MULTI_NPU,
]
), "Unsupported distributed type provided. Only DDP and FSDP are supported."
if accelerator.distributed_type == DistributedType.FSDP:
self._model = accelerator.prepare(self.model)
else:
self._model = accelerator.prepare_model(
self.model, evaluation_mode=True
)
self._device = torch.device(f"{accelerator.device}") self._device = torch.device(f"{accelerator.device}")
self.accelerator = accelerator self.accelerator = accelerator
if self.accelerator.is_local_main_process:
eval_logger.info(f"Using {gpus} devices with data parallelism")
self._rank = self.accelerator.local_process_index self._rank = self.accelerator.local_process_index
self._world_size = self.accelerator.num_processes self._world_size = self.accelerator.num_processes
else:
# if we aren't launching via accelerate, ditch
self._rank = 0
self._world_size = 1
else: else:
# if a PreTrainedModel was passed into HFLM, we forgo distributed setup. # if a PreTrainedModel was passed into HFLM, we forgo distributed setup.
eval_logger.warning( eval_logger.warning(
...@@ -337,6 +285,94 @@ class HFLM(TemplateLM): ...@@ -337,6 +285,94 @@ class HFLM(TemplateLM):
f"Loglikelihood prefix token id used in evaluation: {self.prefix_token_id}" f"Loglikelihood prefix token id used in evaluation: {self.prefix_token_id}"
) )
def _get_accelerate_args(
self,
parallelize: bool = None,
device_map: Optional[str] = "auto",
max_memory_per_gpu: Optional[Union[int, str]] = None,
max_cpu_memory: Optional[Union[int, str]] = None,
offload_folder: Optional[str] = "./offload",
gpus: Optional[int] = None,
) -> dict:
"""Returns the kwargs needed to apply `accelerate` in `AutoModel.from_pretrained`."""
num_local_processes = int(os.environ.get("LOCAL_WORLD_SIZE", 1))
num_machines = int(os.environ.get("WORLD_SIZE", 0)) // num_local_processes
if (
num_machines == 0
and hasattr(self, "accelerator")
and self.accelerator is not None
):
eval_logger.info(
"We are not in a distributed setting for accelerate. Setting model_parallel to False."
)
parallelize = False
if parallelize is None:
# If parallelism is unset by the user, we automatically assign model parallelism
# if enough extra GPUs are available
max_memory_all_gpus = get_max_memory()
# We just want gpu, not cpu, max memory
if "cpu" in max_memory_all_gpus:
del max_memory_all_gpus["cpu"]
parallelize = bool(num_local_processes < len(max_memory_all_gpus))
eval_logger.info(
f"Setting model parallel to {parallelize} since "
f"the number of local processes is {num_local_processes} "
f"and the number of GPUs is {len(max_memory_all_gpus)}"
)
args = {}
if parallelize: # Model parallelism will be used
max_memory = {}
if max_memory_per_gpu is not None: # Using the provided memory requirements
max_memory_per_gpu_map = {
device_idx: max_memory_per_gpu for device_idx in range(gpus)
}
else: # Estimating the possible memory requirements
max_memory_all_gpus = get_max_memory()
if "cpu" in max_memory_all_gpus:
del max_memory_all_gpus["cpu"]
if not hasattr(self, "accelerator"):
max_memory_per_gpu_map = {
k: v for k, v in max_memory_all_gpus.items()
}
else:
# use only 1 / num_processes of the GPUs if we are running under accelerate launch
max_memory_per_gpu_map = {
k: v
for k, v in max_memory_all_gpus.items()
if k % num_local_processes
== (self.accelerator.process_index % num_local_processes)
}
args["max_memory"] = max_memory_per_gpu_map
args["device_map"] = "auto"
eval_logger.info(
f"Model parallel was set to True, setting max memory per GPU to {max_memory_per_gpu_map} and device map to 'auto'"
)
if max_cpu_memory is not None:
max_memory["cpu"] = max_cpu_memory
args["offload_folder"] = offload_folder
elif (
device_map is None
): # No model parallelism, we use the default provided device for our model
if hasattr(self, "accelerator"):
device_map = {"": f"{self.accelerator.device}"}
else:
device_map = {"": str(self.device)}
args["max_memory"] = None
args["device_map"] = device_map
eval_logger.info(
f"Model parallel was set to False, max memory was not set, and device map was set to {device_map}"
)
else:
args["max_memory"] = None
args["device_map"] = None
eval_logger.info("Model parallel was set to False.")
return args
@property @property
def config(self): def config(self):
# return the associated transformers.AutoConfig for the given pretrained model. # return the associated transformers.AutoConfig for the given pretrained model.
...@@ -483,7 +519,6 @@ class HFLM(TemplateLM): ...@@ -483,7 +519,6 @@ class HFLM(TemplateLM):
# (accelerate naive PP (device_map) options) # (accelerate naive PP (device_map) options)
parallelize: Optional[bool] = False, parallelize: Optional[bool] = False,
gpus: Optional[int] = None, gpus: Optional[int] = None,
device_map_option: Optional[str] = "auto",
max_memory_per_gpu: Optional[Union[int, str]] = None, max_memory_per_gpu: Optional[Union[int, str]] = None,
max_cpu_memory: Optional[Union[int, str]] = None, max_cpu_memory: Optional[Union[int, str]] = None,
offload_folder: Optional[str] = "./offload", offload_folder: Optional[str] = "./offload",
...@@ -507,25 +542,16 @@ class HFLM(TemplateLM): ...@@ -507,25 +542,16 @@ class HFLM(TemplateLM):
model_kwargs = kwargs if kwargs else {} model_kwargs = kwargs if kwargs else {}
if parallelize: model_kwargs.update(
model_kwargs.update( self._get_accelerate_args(
_get_accelerate_args( parallelize=parallelize,
device_map_option, # TODO: phase out device_map_option? device_map=kwargs.get("device_map", None),
max_memory_per_gpu, max_memory_per_gpu=max_memory_per_gpu,
max_cpu_memory, max_cpu_memory=max_cpu_memory,
offload_folder, offload_folder=offload_folder,
gpus, gpus=gpus,
)
) )
elif "device_map" not in model_kwargs: )
# set a device_map to initialize model on the right GPU.
# this is needed because it seems that the default behavior
# for quantized models now seems to be device_map="auto"
# which breaks data-parallel mode.
if hasattr(self, "accelerator"):
model_kwargs.update({"device_map": {"": f"{self.accelerator.device}"}})
else:
model_kwargs.update({"device_map": {"": str(self.device)}})
if not autogptq: if not autogptq:
if model_kwargs.get("load_in_4bit", None): if model_kwargs.get("load_in_4bit", None):
...@@ -538,6 +564,7 @@ class HFLM(TemplateLM): ...@@ -538,6 +564,7 @@ class HFLM(TemplateLM):
model_kwargs["bnb_4bit_compute_dtype"] = get_dtype( model_kwargs["bnb_4bit_compute_dtype"] = get_dtype(
model_kwargs["bnb_4bit_compute_dtype"] model_kwargs["bnb_4bit_compute_dtype"]
) )
self._model = self.AUTO_MODEL_CLASS.from_pretrained( self._model = self.AUTO_MODEL_CLASS.from_pretrained(
pretrained, pretrained,
revision=revision, revision=revision,
......
@@ -231,6 +231,7 @@ class NEURON_HF(TemplateLM):
             " For inf2.48xlarge, set it to `24`."
         )
 
+        revision = str(revision)  # cast to string if not already one
         # TODO: update this to be less of a hack once subfolder is fixed in HF
         revision = revision + ("/" + subfolder if subfolder is not None else "")
......
This diff is collapsed.
@@ -29,7 +29,7 @@
 | [ceval](ceval/README.md) | Tasks that evaluate language understanding and reasoning in an educational context. | Chinese |
 | [cmmlu](cmmlu/README.md) | Multi-subject multiple choice question tasks for comprehensive academic assessment. | Chinese |
 | code_x_glue | Tasks that involve understanding and generating code across multiple programming languages. | Go, Java, JS, PHP, Python, Ruby |
-| [commonsense_qa](commmonsense_qa/README.md) | CommonsenseQA, a multiple-choice QA dataset for measuring commonsense knowledge. | English |
+| [commonsense_qa](commonsense_qa/README.md) | CommonsenseQA, a multiple-choice QA dataset for measuring commonsense knowledge. | English |
 | [copal_id](copal_id/README.md) | Indonesian causal commonsense reasoning dataset that captures local nuances. | Indonesian |
 | [coqa](coqa/README.md) | Conversational question answering tasks to test dialog understanding. | English |
 | [crows_pairs](crows_pairs/README.md) | Tasks designed to test model biases in various sociodemographic groups. | English, French |
@@ -65,6 +65,7 @@
 | [logiqa2](logiqa2/README.md) | Large-scale logical reasoning dataset adapted from the Chinese Civil Service Examination. | English, Chinese |
 | [mathqa](mathqa/README.md) | Question answering tasks involving mathematical reasoning and problem-solving. | English |
 | [mc_taco](mc_taco/README.md) | Question-answer pairs that require temporal commonsense comprehension. | English |
+| [med_concepts_qa](med_concepts_qa/README.md) | Benchmark for evaluating LLMs on their abilities to interpret medical codes and distinguish between medical concept. | English |
 | medmcqa | Medical multiple choice questions assessing detailed medical knowledge. | English |
 | medqa | Multiple choice question answering based on the United States Medical License Exams. | |
 | [mgsm](mgsm/README.md) | Benchmark of multilingual grade-school math problems. | Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, Telugu |
@@ -75,9 +76,9 @@
 | [mutual](mutual/README.md) | A retrieval-based dataset for multi-turn dialogue reasoning. | English |
 | [nq_open](nq_open/README.md) | Open domain question answering tasks based on the Natural Questions dataset. | English |
 | [okapi/arc_multilingual](okapi/arc_multilingual/README.md) | Tasks that involve reading comprehension and information retrieval challenges. | Multiple (31 languages) **Machine Translated.** |
-| [okapi/hellaswag_multilingual](okapi/hellaswag_multilingual/README.md) | Tasks that involve reading comprehension and information retrieval challenges. | Multiple (30 languages) |
+| [okapi/hellaswag_multilingual](okapi/hellaswag_multilingual/README.md) | Tasks that involve reading comprehension and information retrieval challenges. | Multiple (30 languages) **Machine Translated.** |
-| okapi/mmlu_multilingual | Tasks that involve reading comprehension and information retrieval challenges. | Multiple (34 languages) |
+| okapi/mmlu_multilingual | Tasks that involve reading comprehension and information retrieval challenges. | Multiple (34 languages) **Machine Translated.** |
-| [okapi/truthfulqa_multilingual](okapi/truthfulqa_multilingual/README.md) | Tasks that involve reading comprehension and information retrieval challenges. | Multiple (31 languages) |
+| [okapi/truthfulqa_multilingual](okapi/truthfulqa_multilingual/README.md) | Tasks that involve reading comprehension and information retrieval challenges. | Multiple (31 languages) **Machine Translated.** |
 | [openbookqa](openbookqa/README.md) | Open-book question answering tasks that require external knowledge and reasoning. | English |
 | [paloma](paloma/README.md) | Paloma is a comprehensive benchmark designed to evaluate open language models across a wide range of domains, ranging from niche artist communities to mental health forums on Reddit. | English |
 | [paws-x](paws-x/README.md) | Paraphrase Adversaries from Word Scrambling, focusing on cross-lingual capabilities. | English, French, Spanish, German, Chinese, Japanese, Korean |
@@ -115,7 +116,7 @@
 | [wmt2016](wmt2016/README.md) | Tasks from the WMT 2016 shared task, focusing on translation between multiple languages. | English, Czech, German, Finnish, Russian, Romanian, Turkish |
 | [wsc273](wsc273/README.md) | The Winograd Schema Challenge, a test of commonsense reasoning and coreference resolution. | English |
 | [xcopa](xcopa/README.md) | Cross-lingual Choice of Plausible Alternatives, testing reasoning in multiple languages. | Estonian, Haitian, Indonesian, Italian, Quechua, Swahili, Tamil, Thai, Turkish, Vietnamese, Chinese |
-| [xnli](xnli/README.md) | Cross-Lingual Natural Language Inference to test understanding across different languages. | Arabic, Bulgarian, German, Greekm English, Spanish, French, Hindi, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, Chinese |
+| [xnli](xnli/README.md) | Cross-Lingual Natural Language Inference to test understanding across different languages. | Arabic, Bulgarian, German, Greek, English, Spanish, French, Hindi, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, Chinese |
 | [xnli_eu](xnli_eu/README.md) | Cross-lingual Natural Language Inference tasks in Basque. | Basque |
 | [xstorycloze](xstorycloze/README.md) | Cross-lingual narrative understanding tasks to predict story endings in multiple languages. | Russian, Simplified Chinese, Spanish, Arabic, Hindi, Indonesian, Telugu, Swahili, Basque, Burmese |
 | [xwinograd](xwinograd/README.md) | Cross-lingual Winograd schema tasks for coreference resolution in multiple languages. | English, French, Japanese, Portuguese, Russian, Chinese |
-from sklearn.metrics import f1_score
+from lm_eval.utils import weighted_f1_score
 
 
 def doc_to_choice(doc):
@@ -30,11 +30,3 @@ def doc_to_text(doc):
         choice4=choices[3],
     )
     return text
-
-
-def weighted_f1_score(items):
-    unzipped_list = list(zip(*items))
-    golds = unzipped_list[0]
-    preds = unzipped_list[1]
-    fscore = f1_score(golds, preds, average="weighted")
-    return fscore