"test/vscode:/vscode.git/clone" did not exist on "c1c7dc4534f82fbe8cdc9a5b6ab81745bfb176fe"
Commit b58e5556 authored by Baber

Merge branch 'main' into tasklist

# Conflicts:
#	pyproject.toml
parents 6e1866f5 4f8195f1
...@@ -43,8 +43,10 @@ repos:
      - id: codespell
        exclude: >
          (?x)^(
              .*\.json|ignore.txt|lm_eval/tasks/.*|.*yaml|.*\.ipynb
          )$
        args: [--check-filenames, --check-hidden, --ignore-words=ignore.txt]
  - repo: https://github.com/jackdewinter/pymarkdown
    rev: v0.9.30
...@@ -52,9 +54,3 @@ repos:
      - id: pymarkdown
        exclude: ^(lm_eval/tasks/.*|docs/footguns\.md)$
        args: [fix, -r]
# - repo: https://github.com/pre-commit/mirrors-mypy
# rev: v1.5.1
# hooks:
# - id: mypy
# additional_dependencies: [".[sentencepiece,multilingual,promptsource,gptq]", "types-PyYAML", "types-requests"]
# exclude: ^tests/.*$
...@@ -5,7 +5,7 @@
---
## Latest News 📣
- [2025/07] Added `think_end_token` arg to `hf` (token/str), `vllm` and `sglang` (str) for stripping CoT reasoning traces from models that support it.
- [2025/03] Added support for steering HF models!
- [2025/02] Added [SGLang](https://docs.sglang.ai/) support!
- [2024/09] We are prototyping allowing users of LM Evaluation Harness to create and evaluate on text+image multimodal input, text output tasks, and have just added the `hf-multimodal` and `vllm-vlm` model types and `mmmu` task as a prototype feature. We welcome users to try out this in-progress feature and stress-test it for themselves, and suggest they check out [`lmms-eval`](https://github.com/EvolvingLMMs-Lab/lmms-eval), a wonderful project originally forking off of the lm-evaluation-harness, for a broader range of multimodal tasks, models, and features.
......
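A minimal, untested sketch of passing the new `think_end_token` argument to the `hf` backend through `model_args` (the model name, task, and `</think>` marker are placeholders, not part of this commit):

```python
# Hypothetical usage: strip everything up to and including the end-of-think marker
# from generations before metrics are computed.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen3-0.6B,think_end_token=</think>",
    tasks=["gsm8k"],
)
```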
...@@ -21,7 +21,11 @@ When subclassing `TemplateAPI`, you need to implement the following methods:
1. `_create_payload`: Creates the JSON payload for API requests.
2. `parse_logprobs`: Parses log probabilities from API responses.
3. `parse_generations`: Parses generated text from API responses.

Optional Properties:

4. `header`: Returns the headers for the API request.
5. `api_key`: Returns the API key for authentication (if required).

You may also need to override other methods or properties depending on your API's specific requirements.
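For orientation, a rough sketch of such a subclass (the endpoint's payload and response shapes, and the import path `lm_eval.models.api_models`, are assumptions rather than part of this commit):

```python
from typing import List

from lm_eval.api.registry import register_model
from lm_eval.models.api_models import TemplateAPI  # import path assumed


@register_model("my-api")
class MyAPILM(TemplateAPI):
    def _create_payload(self, messages, *, generate=True, gen_kwargs=None, **kwargs) -> dict:
        # 1. build the JSON body the (hypothetical) endpoint expects
        return {"model": self.model, "prompt": messages, **(gen_kwargs or {})}

    @staticmethod
    def parse_logprobs(outputs, tokens=None, ctxlens=None, **kwargs):
        # 2. this toy endpoint does not return logprobs
        raise NotImplementedError("loglikelihood is not supported by this endpoint")

    @staticmethod
    def parse_generations(outputs, **kwargs) -> List[str]:
        # 3. pull the generated text out of each response dict
        outputs = outputs if isinstance(outputs, list) else [outputs]
        return [out["choices"][0]["text"] for out in outputs]

    @property
    def header(self) -> dict:
        # 4. custom auth header instead of the default Bearer token
        return {"x-api-key": self.api_key}
```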
...@@ -97,6 +101,10 @@ When initializing a `TemplateAPI` instance or a subclass, you can provide several
  - Whether to validate the certificate of the API endpoint (if HTTPS).
  - Default is True.
- `header` (dict, optional):
  - Custom headers for API requests.
  - If not provided, uses `{"Authorization": f"Bearer {self.api_key}"}` by default.
Example usage:
```python
......
...@@ -435,9 +435,14 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
        # because it's already been determined based on the prior env var before launching our
        # script--`datasets` gets imported by lm_eval internally before these lines can update the env.
        import datasets
        from packaging.version import parse as vparse

        if vparse(datasets.__version__) < vparse("4.0.0"):
            datasets.config.HF_DATASETS_TRUST_REMOTE_CODE = True
        if isinstance(args.model_args, dict):
            args.model_args["trust_remote_code"] = True
        else:
            args.model_args = args.model_args + ",trust_remote_code=True"

    (
        eval_logger.info(f"Selected Tasks: {task_names}")
......
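For reference, a small sketch of what the string branch above produces and how the harness parses it back into model kwargs (the `pretrained` value is a placeholder):

```python
from lm_eval.utils import simple_parse_args_string

model_args = "pretrained=gpt2" + ",trust_remote_code=True"
print(simple_parse_args_string(model_args))
# expected: {'pretrained': 'gpt2', 'trust_remote_code': True}
```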
...@@ -505,7 +505,6 @@ def bootstrap_stderr(
    if not os.getenv("DISABLE_MULTIPROC"):
        import multiprocessing as mp

        pool = mp.Pool(mp.cpu_count())
        # this gives a biased estimate of the stderr (i.e w/ the mean, it gives something
        # equivalent to stderr calculated without Bessel's correction in the stddev.
        # Unfortunately, I haven't been able to figure out what the right correction is
...@@ -517,6 +516,7 @@ def bootstrap_stderr(
        from tqdm import tqdm

        print("bootstrapping for stddev:", f.__name__)
        with mp.Pool(mp.cpu_count()) as pool:
            for bootstrap in tqdm(
                pool.imap(
                    _bootstrap_internal(f, chunk_size),
...@@ -526,8 +526,6 @@ def bootstrap_stderr(
            ):
                # sample w replacement
                res.extend(bootstrap)
        pool.close()
    else:
        res = _bootstrap_internal_no_mp(f, xs, iters)
......
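For intuition, a single-process sketch of the quantity the pooled loop above estimates (the mean statistic, iteration count, and sample values are arbitrary choices, not harness defaults):

```python
import random
import statistics


def bootstrap_stderr_simple(xs, iters=1000, seed=1234):
    rnd = random.Random(seed)
    # resample with replacement and recompute the statistic each time
    means = [statistics.fmean(rnd.choices(xs, k=len(xs))) for _ in range(iters)]
    # the spread of the resampled means approximates the stderr of the mean
    # (population stddev, i.e. without Bessel's correction, as the comment above notes)
    return statistics.pstdev(means)


print(bootstrap_stderr_simple([0.0, 1.0, 1.0, 0.0, 1.0]))
```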
...@@ -3,18 +3,15 @@ import ast ...@@ -3,18 +3,15 @@ import ast
import logging import logging
import random import random
import re import re
from collections.abc import Callable from collections.abc import Callable, Iterable, Iterator, Mapping
from copy import deepcopy from copy import deepcopy
from dataclasses import asdict, dataclass from dataclasses import asdict, dataclass
from inspect import getsource from inspect import getsource
from typing import ( from typing import (
Any, Any,
Dict, Dict,
Iterable,
Iterator,
List, List,
Literal, Literal,
Mapping,
Optional, Optional,
Tuple, Tuple,
Union, Union,
...@@ -113,7 +110,7 @@ class TaskConfig(dict): ...@@ -113,7 +110,7 @@ class TaskConfig(dict):
if "until" not in self.generation_kwargs: if "until" not in self.generation_kwargs:
eval_logger.warning( eval_logger.warning(
f"{self.task}: No `until` specified in `generation_kwargs`! Defaulting to the fewshot_delimiter={repr(self.fewshot_delimiter)}" f"{self.task}: No `until` specified in `generation_kwargs`! Defaulting to the fewshot_delimiter={self.fewshot_delimiter!r}"
) )
self.generation_kwargs["until"] = [self.fewshot_delimiter] self.generation_kwargs["until"] = [self.fewshot_delimiter]
else: else:
...@@ -289,17 +286,14 @@ class Task(abc.ABC): ...@@ -289,17 +286,14 @@ class Task(abc.ABC):
@abc.abstractmethod @abc.abstractmethod
def has_training_docs(self): def has_training_docs(self):
"""Whether the task has a training set""" """Whether the task has a training set"""
pass
@abc.abstractmethod @abc.abstractmethod
def has_validation_docs(self): def has_validation_docs(self):
"""Whether the task has a validation set""" """Whether the task has a validation set"""
pass
@abc.abstractmethod @abc.abstractmethod
def has_test_docs(self): def has_test_docs(self):
"""Whether the task has a test set""" """Whether the task has a test set"""
pass
def training_docs(self) -> Iterable: def training_docs(self) -> Iterable:
""" """
...@@ -518,7 +512,6 @@ class Task(abc.ABC): ...@@ -518,7 +512,6 @@ class Task(abc.ABC):
The number of times each instance in a dataset is inferred on. Defaults to 1, The number of times each instance in a dataset is inferred on. Defaults to 1,
can be increased for techniques like majority voting. can be increased for techniques like majority voting.
""" """
pass
@abc.abstractmethod @abc.abstractmethod
def process_results(self, doc, results): def process_results(self, doc, results):
...@@ -531,7 +524,6 @@ class Task(abc.ABC): ...@@ -531,7 +524,6 @@ class Task(abc.ABC):
:param results: :param results:
The results of the requests created in construct_requests. The results of the requests created in construct_requests.
""" """
pass
@abc.abstractmethod @abc.abstractmethod
def aggregation(self): def aggregation(self):
...@@ -540,7 +532,6 @@ class Task(abc.ABC): ...@@ -540,7 +532,6 @@ class Task(abc.ABC):
A dictionary where keys are the names of submetrics and values are A dictionary where keys are the names of submetrics and values are
functions that aggregate a list of metric scores functions that aggregate a list of metric scores
""" """
pass
@abc.abstractmethod @abc.abstractmethod
def higher_is_better(self): def higher_is_better(self):
...@@ -549,7 +540,6 @@ class Task(abc.ABC): ...@@ -549,7 +540,6 @@ class Task(abc.ABC):
A dictionary where keys are the names of submetrics and values are A dictionary where keys are the names of submetrics and values are
whether a higher value of the submetric is better whether a higher value of the submetric is better
""" """
pass
def get_config(self, key: str) -> Any: def get_config(self, key: str) -> Any:
return getattr(self._config, key, None) return getattr(self._config, key, None)
...@@ -675,8 +665,8 @@ class Task(abc.ABC): ...@@ -675,8 +665,8 @@ class Task(abc.ABC):
self.aggregation = lambda: { self.aggregation = lambda: {
metric_name: get_metric_aggregation(metric_name) metric_name: get_metric_aggregation(metric_name)
} }
setattr(self._config, "metric_list", [{"metric": metric_name}]) self._config.metric_list = [{"metric": metric_name}]
setattr(self._config, "process_results", None) self._config.process_results = None
def set_fewshot_seed(self, seed: Optional[int] = None) -> None: def set_fewshot_seed(self, seed: Optional[int] = None) -> None:
self.fewshot_rnd = random.Random(seed) self.fewshot_rnd = random.Random(seed)
...@@ -835,7 +825,7 @@ class ConfigurableTask(Task): ...@@ -835,7 +825,7 @@ class ConfigurableTask(Task):
agg_name = metric_config["aggregation"] agg_name = metric_config["aggregation"]
if isinstance(agg_name, str): if isinstance(agg_name, str):
self._aggregation_list[metric_name] = get_aggregation(agg_name) self._aggregation_list[metric_name] = get_aggregation(agg_name)
elif callable(agg_name): # noqa: E721 elif callable(agg_name):
self._aggregation_list[metric_name] = metric_config[ self._aggregation_list[metric_name] = metric_config[
"aggregation" "aggregation"
] ]
...@@ -980,6 +970,10 @@ class ConfigurableTask(Task): ...@@ -980,6 +970,10 @@ class ConfigurableTask(Task):
def download( def download(
self, dataset_kwargs: Optional[Dict[str, Any]] = None, **kwargs self, dataset_kwargs: Optional[Dict[str, Any]] = None, **kwargs
) -> None: ) -> None:
from packaging.version import parse as vparse
if dataset_kwargs and vparse(datasets.__version__) >= vparse("4.0.0"):
dataset_kwargs.pop("trust_remote_code", None)
if isinstance(self.config.custom_dataset, Callable): if isinstance(self.config.custom_dataset, Callable):
eval_logger.warning( eval_logger.warning(
f"{self.config.task}: Custom kwargs can be passed to `--metadata` in console (as json string) or to the TaskManager." f"{self.config.task}: Custom kwargs can be passed to `--metadata` in console (as json string) or to the TaskManager."
...@@ -1498,7 +1492,7 @@ class ConfigurableTask(Task): ...@@ -1498,7 +1492,7 @@ class ConfigurableTask(Task):
): # TODO: ensure that non-multimodal tasks aren't getting visual args ): # TODO: ensure that non-multimodal tasks aren't getting visual args
multimodal_arg = { multimodal_arg = {
**multimodal_arg, **multimodal_arg,
**{"visual": self.doc_to_image(doc)}, "visual": self.doc_to_image(doc),
} }
if ( if (
...@@ -1506,7 +1500,7 @@ class ConfigurableTask(Task): ...@@ -1506,7 +1500,7 @@ class ConfigurableTask(Task):
): # TODO: ensure that non-multimodal tasks aren't getting audio args ): # TODO: ensure that non-multimodal tasks aren't getting audio args
multimodal_arg = { multimodal_arg = {
**multimodal_arg, **multimodal_arg,
**{"audio": self.doc_to_audio(doc)}, "audio": self.doc_to_audio(doc),
} }
if bool(multimodal_arg): if bool(multimodal_arg):
...@@ -1769,7 +1763,7 @@ class MultipleChoiceTask(Task): ...@@ -1769,7 +1763,7 @@ class MultipleChoiceTask(Task):
Instance( Instance(
request_type="loglikelihood", request_type="loglikelihood",
doc=doc, doc=doc,
arguments=(ctx, " {}".format(choice)), arguments=(ctx, f" {choice}"),
idx=i, idx=i,
**kwargs, **kwargs,
) )
......
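A condensed sketch of what the `construct_requests` snippet above produces for one multiple-choice document (the helper name and surrounding plumbing are illustrative only):

```python
from lm_eval.api.instance import Instance


def build_mc_requests(doc, ctx, choices, **kwargs):
    # one loglikelihood request per answer choice; the leading space keeps the
    # continuation tokenization consistent with the preceding context
    return [
        Instance(
            request_type="loglikelihood",
            doc=doc,
            arguments=(ctx, f" {choice}"),
            idx=i,
            **kwargs,
        )
        for i, choice in enumerate(choices)
    ]
```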
...@@ -35,6 +35,7 @@ from lm_eval.utils import ( ...@@ -35,6 +35,7 @@ from lm_eval.utils import (
positional_deprecated, positional_deprecated,
setup_logging, setup_logging,
simple_parse_args_string, simple_parse_args_string,
wrap_text,
) )
...@@ -169,8 +170,11 @@ def simple_evaluate( ...@@ -169,8 +170,11 @@ def simple_evaluate(
) )
) and not apply_chat_template: ) and not apply_chat_template:
eval_logger.warning( eval_logger.warning(
"Model appears to be an instruct or chat variant but chat template is not applied. " wrap_text(
"Recommend setting `apply_chat_template` (optionally `fewshot_as_multiturn`)." f"""pretrained={model_args.get("pretrained") if isinstance(model_args, dict) else model_args} appears to be an
instruct or chat variant but chat template is not applied.
Recommend setting `apply_chat_template` (optionally `fewshot_as_multiturn`).""",
)
) )
if delete_requests_cache: if delete_requests_cache:
...@@ -234,8 +238,10 @@ def simple_evaluate( ...@@ -234,8 +238,10 @@ def simple_evaluate(
else: else:
eval_logger.info( eval_logger.info(
wrap_text(
f"Initializing {model} model, with arguments: {simple_parse_args_string(model_args)}" f"Initializing {model} model, with arguments: {simple_parse_args_string(model_args)}"
) )
)
lm = lm_eval.api.registry.get_model(model).create_from_arg_string( lm = lm_eval.api.registry.get_model(model).create_from_arg_string(
model_args, model_args,
{ {
......
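A sketch of the call the warning above is nudging users toward (model and task names are placeholders):

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args={"pretrained": "meta-llama/Llama-3.2-1B-Instruct"},
    tasks=["gsm8k"],
    num_fewshot=5,
    apply_chat_template=True,   # silences the instruct-model warning above
    fewshot_as_multiturn=True,  # render few-shot examples as prior chat turns
)
```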
...@@ -135,6 +135,7 @@ class TemplateAPI(TemplateLM): ...@@ -135,6 +135,7 @@ class TemplateAPI(TemplateLM):
eos_string: str = None, eos_string: str = None,
# timeout in seconds # timeout in seconds
timeout: int = 300, timeout: int = 300,
header: Optional[Dict[str, str]] = None,
max_images: int = 1, max_images: int = 1,
**kwargs, **kwargs,
) -> None: ) -> None:
...@@ -152,6 +153,7 @@ class TemplateAPI(TemplateLM): ...@@ -152,6 +153,7 @@ class TemplateAPI(TemplateLM):
self.model = model or pretrained self.model = model or pretrained
self.base_url = base_url self.base_url = base_url
self.tokenizer = tokenizer self.tokenizer = tokenizer
self._header = header
if not isinstance(batch_size, int) and "auto" in batch_size: if not isinstance(batch_size, int) and "auto" in batch_size:
eval_logger.warning( eval_logger.warning(
"Automatic batch size is not supported for API models. Defaulting to batch size 1." "Automatic batch size is not supported for API models. Defaulting to batch size 1."
...@@ -296,7 +298,7 @@ class TemplateAPI(TemplateLM): ...@@ -296,7 +298,7 @@ class TemplateAPI(TemplateLM):
@cached_property @cached_property
def header(self) -> dict: def header(self) -> dict:
"""Override this property to return the headers for the API request.""" """Override this property to return the headers for the API request."""
return {"Authorization": f"Bearer {self.api_key}"} return self._header or {"Authorization": f"Bearer {self.api_key}"}
@property @property
def tokenizer_name(self) -> str: def tokenizer_name(self) -> str:
...@@ -447,6 +449,7 @@ class TemplateAPI(TemplateLM): ...@@ -447,6 +449,7 @@ class TemplateAPI(TemplateLM):
async def amodel_call( async def amodel_call(
self, self,
session: ClientSession, session: ClientSession,
sem: asyncio.Semaphore,
messages: Union[List[List[int]], List[str], List[JsonChatStr]], messages: Union[List[List[int]], List[str], List[JsonChatStr]],
*, *,
generate: bool = True, generate: bool = True,
...@@ -465,6 +468,7 @@ class TemplateAPI(TemplateLM): ...@@ -465,6 +468,7 @@ class TemplateAPI(TemplateLM):
**kwargs, **kwargs,
) )
cache_method = "generate_until" if generate else "loglikelihood" cache_method = "generate_until" if generate else "loglikelihood"
acquired = await sem.acquire()
try: try:
async with session.post( async with session.post(
self.base_url, self.base_url,
...@@ -474,7 +478,8 @@ class TemplateAPI(TemplateLM): ...@@ -474,7 +478,8 @@ class TemplateAPI(TemplateLM):
if not response.ok: if not response.ok:
error_text = await response.text() error_text = await response.text()
eval_logger.warning( eval_logger.warning(
f"API request failed with error message: {error_text}. Retrying..." f"API request failed! Status code: {response.status}, "
f"Response text: {error_text}. Retrying..."
) )
# raising exception will retry the request # raising exception will retry the request
response.raise_for_status() response.raise_for_status()
...@@ -495,11 +500,12 @@ class TemplateAPI(TemplateLM): ...@@ -495,11 +500,12 @@ class TemplateAPI(TemplateLM):
self.cache_hook.add_partial(cache_method, cache, res) self.cache_hook.add_partial(cache_method, cache, res)
return answers return answers
# If the retries also fail # If the retries also fail
except RetryError: except BaseException as e:
eval_logger.error( eval_logger.error(f"Exception:{repr(e)}, {outputs}, retrying.")
"API request failed after multiple retries. Please check the API status." raise e
) finally:
return None if acquired:
sem.release()
def batch_loglikelihood_requests( def batch_loglikelihood_requests(
self, chunks: Iterable[List[LogLikelihoodInputs]] self, chunks: Iterable[List[LogLikelihoodInputs]]
...@@ -535,6 +541,7 @@ class TemplateAPI(TemplateLM): ...@@ -535,6 +541,7 @@ class TemplateAPI(TemplateLM):
) -> Union[List[List[str]], List[List[Tuple[float, bool]]]]: ) -> Union[List[List[str]], List[List[Tuple[float, bool]]]]:
ctxlens = ctxlens if ctxlens else [None] * len(requests) ctxlens = ctxlens if ctxlens else [None] * len(requests)
conn = TCPConnector(limit=self._concurrent, ssl=self.verify_certificate) conn = TCPConnector(limit=self._concurrent, ssl=self.verify_certificate)
sem = asyncio.Semaphore(self._concurrent)
async with ClientSession( async with ClientSession(
connector=conn, timeout=ClientTimeout(total=self.timeout) connector=conn, timeout=ClientTimeout(total=self.timeout)
) as session: ) as session:
...@@ -542,12 +549,16 @@ class TemplateAPI(TemplateLM): ...@@ -542,12 +549,16 @@ class TemplateAPI(TemplateLM):
stop=stop_after_attempt(self.max_retries), stop=stop_after_attempt(self.max_retries),
wait=wait_exponential(multiplier=0.5, min=1, max=10), wait=wait_exponential(multiplier=0.5, min=1, max=10),
reraise=True, reraise=True,
before_sleep=lambda retry_state: eval_logger.info(
f"Retry attempt {retry_state.attempt_number}"
),
)(self.amodel_call) )(self.amodel_call)
# Create tasks for each batch of request # Create tasks for each batch of request
tasks = [ tasks = [
asyncio.create_task( asyncio.create_task(
retry_( retry_(
session=session, session=session,
sem=sem,
messages=message, messages=message,
cache_keys=cache_key, cache_keys=cache_key,
generate=generate, generate=generate,
......
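A hedged end-to-end example of the new `header` argument (endpoint, key, and model name are placeholders; `local-completions` is one of the existing API-based model types):

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args={
        "model": "my-model",
        "base_url": "https://example.com/v1/completions",
        # replaces the default {"Authorization": f"Bearer {api_key}"} header
        "header": {"x-api-key": "sk-placeholder"},
        # depending on the endpoint, tokenizer-related settings may also be needed
    },
    tasks=["gsm8k"],
)
```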
from __future__ import annotations
import copy import copy
import logging import logging
import os import os
from collections.abc import Iterator, Sequence
from datetime import timedelta from datetime import timedelta
from pathlib import Path from pathlib import Path
from typing import TYPE_CHECKING, Dict, List, Literal, Optional, Tuple, Union from typing import TYPE_CHECKING, Any, Literal
import jinja2 import jinja2
import torch import torch
...@@ -17,6 +20,7 @@ from accelerate import ( ...@@ -17,6 +20,7 @@ from accelerate import (
from accelerate.utils import get_max_memory from accelerate.utils import get_max_memory
from huggingface_hub import HfApi from huggingface_hub import HfApi
from packaging import version from packaging import version
from packaging.version import parse as vparse
from tqdm import tqdm from tqdm import tqdm
from transformers.models.auto.modeling_auto import ( from transformers.models.auto.modeling_auto import (
MODEL_FOR_CAUSAL_LM_MAPPING_NAMES, MODEL_FOR_CAUSAL_LM_MAPPING_NAMES,
...@@ -24,7 +28,6 @@ from transformers.models.auto.modeling_auto import ( ...@@ -24,7 +28,6 @@ from transformers.models.auto.modeling_auto import (
) )
from lm_eval import utils from lm_eval import utils
from lm_eval.api.instance import Instance
from lm_eval.api.model import TemplateLM from lm_eval.api.model import TemplateLM
from lm_eval.api.registry import register_model from lm_eval.api.registry import register_model
from lm_eval.models.utils import ( from lm_eval.models.utils import (
...@@ -34,20 +37,23 @@ from lm_eval.models.utils import ( ...@@ -34,20 +37,23 @@ from lm_eval.models.utils import (
get_dtype, get_dtype,
handle_stop_sequences, handle_stop_sequences,
pad_and_concat, pad_and_concat,
postprocess_generated_text,
stop_sequences_criteria, stop_sequences_criteria,
) )
if TYPE_CHECKING: if TYPE_CHECKING:
from transformers.quantizers import AutoQuantizationConfig from transformers.quantizers.auto import AutoQuantizationConfig
from lm_eval.api.instance import Instance
eval_logger = logging.getLogger(__name__) eval_logger = logging.getLogger(__name__)
TOKENIZER_INFINITY = 1000000000000000019884624838656
@register_model("hf-auto", "hf", "huggingface") @register_model("hf-auto", "hf", "huggingface")
class HFLM(TemplateLM): class HFLM(TemplateLM):
""" """An abstracted Huggingface model class. Enables usage with both models of
An abstracted Huggingface model class. Enables usage with both models of
`transformers.AutoModelForCausalLM` and `transformers.AutoModelForSeq2SeqLM` classes. `transformers.AutoModelForCausalLM` and `transformers.AutoModelForSeq2SeqLM` classes.
Supports data-parallel multi-GPU with HF Accelerate. Supports data-parallel multi-GPU with HF Accelerate.
...@@ -58,42 +64,45 @@ class HFLM(TemplateLM): ...@@ -58,42 +64,45 @@ class HFLM(TemplateLM):
def __init__( def __init__(
self, self,
pretrained: Union[str, transformers.PreTrainedModel], pretrained: str | transformers.PreTrainedModel,
backend: Literal["default", "causal", "seq2seq"] = "default", backend: Literal["default", "causal", "seq2seq"] = "default",
# override whether the model should be treated as decoder-only (causal) or encoder-decoder (seq2seq) # override whether the model should be treated as decoder-only (causal) or encoder-decoder (seq2seq)
revision: Optional[str] = "main", revision: str | None = "main",
subfolder: str = "", subfolder: str = "",
tokenizer: Optional[ tokenizer: str
Union[ | transformers.PreTrainedTokenizer
str, | transformers.PreTrainedTokenizerFast
transformers.PreTrainedTokenizer, | None = None,
transformers.PreTrainedTokenizerFast, truncation: bool | None = False,
]
] = None,
truncation: Optional[bool] = False,
logits_cache: bool = True, logits_cache: bool = True,
max_length: Optional[int] = None, max_length: int | None = None,
device: Optional[str] = "cuda", device: str | None = "cuda",
dtype: Optional[Union[str, torch.dtype]] = "auto", dtype: str | torch.dtype | None = "auto",
softmax_dtype: Optional[Union[str, torch.dtype]] = None, softmax_dtype: str | torch.dtype | None = None,
batch_size: Optional[Union[int, str]] = 1, mixed_precision_dtype: str | torch.dtype | None = None,
max_batch_size: Optional[int] = 64, batch_size: int | str | None = 1,
trust_remote_code: Optional[bool] = False, max_batch_size: int | None = 64,
use_fast_tokenizer: Optional[bool] = True, trust_remote_code: bool | None = False,
add_bos_token: Optional[bool] = False, use_fast_tokenizer: bool | None = True,
prefix_token_id: Optional[int] = None, add_bos_token: bool | None = False,
prefix_token_id: int | None = None,
# arguments used for splitting a model across GPUs naively. # arguments used for splitting a model across GPUs naively.
# only used if `parallelize=True`. # only used if `parallelize=True`.
parallelize: Optional[bool] = False, parallelize: bool | None = False,
max_memory_per_gpu: Optional[Union[int, str]] = None, max_memory_per_gpu: int | str | None = None,
max_cpu_memory: Optional[Union[int, str]] = None, max_cpu_memory: int | str | None = None,
offload_folder: Optional[Union[str, os.PathLike]] = "./offload", offload_folder: str | os.PathLike | None = "./offload",
# PEFT, delta weights and quantization options # PEFT, delta weights and quantization options
peft: Optional[str] = None, peft: str | None = None,
delta: Optional[str] = None, delta: str | None = None,
autogptq: Optional[Union[bool, str]] = False, autogptq: bool | str | None = False,
gptqmodel: Optional[bool] = False, gptqmodel: bool | None = False,
gguf_file: Optional[str] = None, gguf_file: str | None = None,
# end token for thinking, either the string or int token id.
# splits to get response after this token (if provided).
think_end_token: str | int | None = None,
enable_thinking: bool | None = None,
chat_template_args: dict[str, Any] | None = None,
**kwargs, **kwargs,
) -> None: ) -> None:
super().__init__() super().__init__()
...@@ -223,11 +232,21 @@ class HFLM(TemplateLM): ...@@ -223,11 +232,21 @@ class HFLM(TemplateLM):
self.model.eval() self.model.eval()
self.model.tie_weights() self.model.tie_weights()
self.think_end_token = (
int(think_end_token)
if (isinstance(think_end_token, str) and think_end_token.isdigit())
else think_end_token
)
self.truncation = truncation self.truncation = truncation
self.logits_cache = logits_cache self.logits_cache = logits_cache
self.vocab_size = self.tokenizer.vocab_size self.vocab_size = self.tokenizer.vocab_size
# select (or create) a pad token to use # select (or create) a pad token to use
self.tokenizer = configure_pad_token(self.tokenizer, model_config=self.config) self.tokenizer = configure_pad_token(self.tokenizer, model_config=self.config)
self.chat_template_args = (
chat_template_args or {} | dict(enable_thinking=enable_thinking)
if enable_thinking is not None
else {}
)
self.add_bos_token = add_bos_token self.add_bos_token = add_bos_token
if "gemma" in getattr(self.config, "model_type", ""): if "gemma" in getattr(self.config, "model_type", ""):
...@@ -247,6 +266,11 @@ class HFLM(TemplateLM): ...@@ -247,6 +266,11 @@ class HFLM(TemplateLM):
self.softmax_dtype = ( self.softmax_dtype = (
get_dtype(softmax_dtype) if softmax_dtype is not None else None get_dtype(softmax_dtype) if softmax_dtype is not None else None
) )
self.mixed_precision_dtype = (
get_dtype(mixed_precision_dtype)
if mixed_precision_dtype is not None
else None
)
if str(batch_size).startswith("auto"): if str(batch_size).startswith("auto"):
batch_size = batch_size.split(":") batch_size = batch_size.split(":")
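Taken together, a sketch of constructing `HFLM` directly with the arguments introduced above (the model name is a placeholder):

```python
from lm_eval.models.huggingface import HFLM

lm = HFLM(
    pretrained="Qwen/Qwen3-0.6B",
    think_end_token="</think>",        # or an int token id; output up to and including it is stripped
    enable_thinking=True,              # forwarded to the chat template via chat_template_args
    mixed_precision_dtype="bfloat16",  # runs forward/generate under torch.autocast
    batch_size=8,
)
```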
...@@ -256,9 +280,10 @@ class HFLM(TemplateLM): ...@@ -256,9 +280,10 @@ class HFLM(TemplateLM):
self.batch_size_per_gpu = int(batch_size) self.batch_size_per_gpu = int(batch_size)
if isinstance(pretrained, str): if isinstance(pretrained, str):
if gpus >= 1 or str(self.device) == "mps": if (gpus >= 1 or str(self.device) == "mps") and not (
parallelize or autogptq or hasattr(self, "accelerator")
):
# TODO: can remove this whole snippet except in the mps case, perhaps? # TODO: can remove this whole snippet except in the mps case, perhaps?
if not (parallelize or autogptq or hasattr(self, "accelerator")):
# place model onto device requested manually, # place model onto device requested manually,
# if not using HF Accelerate or device_map # if not using HF Accelerate or device_map
# or any other option that preloads model onto device # or any other option that preloads model onto device
...@@ -312,12 +337,12 @@ class HFLM(TemplateLM): ...@@ -312,12 +337,12 @@ class HFLM(TemplateLM):
def _get_accelerate_args( def _get_accelerate_args(
self, self,
parallelize: Optional[bool] = None, parallelize: bool | None = None,
device_map: Optional[str] = "auto", device_map: str | None = "auto",
max_memory_per_gpu: Optional[Union[int, str]] = None, max_memory_per_gpu: int | str | None = None,
max_cpu_memory: Optional[Union[int, str]] = None, max_cpu_memory: int | str | None = None,
offload_folder: Optional[str] = "./offload", offload_folder: str | None = "./offload",
gpus: Optional[int] = None, gpus: int | None = None,
) -> dict: ) -> dict:
"""Returns the kwargs needed to apply `accelerate` in `AutoModel.from_pretrained`.""" """Returns the kwargs needed to apply `accelerate` in `AutoModel.from_pretrained`."""
num_local_processes = int(os.environ.get("LOCAL_WORLD_SIZE", 1)) num_local_processes = int(os.environ.get("LOCAL_WORLD_SIZE", 1))
...@@ -355,13 +380,8 @@ class HFLM(TemplateLM): ...@@ -355,13 +380,8 @@ class HFLM(TemplateLM):
} }
else: # Estimating the possible memory requirements else: # Estimating the possible memory requirements
max_memory_all_gpus = get_max_memory() max_memory_all_gpus = get_max_memory()
if "cpu" in max_memory_all_gpus: max_memory_all_gpus.pop("cpu", None)
del max_memory_all_gpus["cpu"] if hasattr(self, "accelerator"):
if not hasattr(self, "accelerator"):
max_memory_per_gpu_map = {
k: v for k, v in max_memory_all_gpus.items()
}
else:
# use only 1 / num_processes of the GPUs if we are running under accelerate launch # use only 1 / num_processes of the GPUs if we are running under accelerate launch
max_memory_per_gpu_map = { max_memory_per_gpu_map = {
k: v k: v
...@@ -369,6 +389,9 @@ class HFLM(TemplateLM): ...@@ -369,6 +389,9 @@ class HFLM(TemplateLM):
if k % num_local_processes if k % num_local_processes
== (self.accelerator.process_index % num_local_processes) == (self.accelerator.process_index % num_local_processes)
} }
else:
max_memory_per_gpu_map = max_memory_all_gpus
args["max_memory"] = max_memory_per_gpu_map args["max_memory"] = max_memory_per_gpu_map
args["device_map"] = "auto" if device_map is None else device_map args["device_map"] = "auto" if device_map is None else device_map
eval_logger.info( eval_logger.info(
...@@ -412,12 +435,12 @@ class HFLM(TemplateLM): ...@@ -412,12 +435,12 @@ class HFLM(TemplateLM):
return self._model return self._model
@property @property
def eot_token_id(self): def eot_token_id(self) -> int:
# we use EOT because end of *text* is more accurate for what we're doing than end of *sentence* # we use EOT because end of *text* is more accurate for what we're doing than end of *sentence*
return self.tokenizer.eos_token_id return self.tokenizer.eos_token_id
@property @property
def prefix_token_id(self): def prefix_token_id(self) -> int:
# it is used as prefix for loglikelihood # it is used as prefix for loglikelihood
if self.custom_prefix_token_id is not None: if self.custom_prefix_token_id is not None:
return self.custom_prefix_token_id return self.custom_prefix_token_id
...@@ -426,7 +449,7 @@ class HFLM(TemplateLM): ...@@ -426,7 +449,7 @@ class HFLM(TemplateLM):
return self.tokenizer.eos_token_id return self.tokenizer.eos_token_id
@property @property
def max_length(self): def max_length(self) -> int:
if self._max_length: # if max length manually set, return it if self._max_length: # if max length manually set, return it
return self._max_length return self._max_length
seqlen_config_attrs = ("n_positions", "max_position_embeddings", "n_ctx") seqlen_config_attrs = ("n_positions", "max_position_embeddings", "n_ctx")
...@@ -434,7 +457,7 @@ class HFLM(TemplateLM): ...@@ -434,7 +457,7 @@ class HFLM(TemplateLM):
if hasattr(self.model.config, attr): if hasattr(self.model.config, attr):
return getattr(self.model.config, attr) return getattr(self.model.config, attr)
if hasattr(self.tokenizer, "model_max_length"): if hasattr(self.tokenizer, "model_max_length"):
if self.tokenizer.model_max_length == 1000000000000000019884624838656: if self.tokenizer.model_max_length == TOKENIZER_INFINITY:
return self._DEFAULT_MAX_LENGTH return self._DEFAULT_MAX_LENGTH
return self.tokenizer.model_max_length return self.tokenizer.model_max_length
return self._DEFAULT_MAX_LENGTH return self._DEFAULT_MAX_LENGTH
...@@ -465,12 +488,12 @@ class HFLM(TemplateLM): ...@@ -465,12 +488,12 @@ class HFLM(TemplateLM):
def _get_backend( def _get_backend(
self, self,
config: Union[transformers.PretrainedConfig, transformers.AutoConfig], config: transformers.PretrainedConfig | transformers.AutoConfig,
backend: Literal["default", "causal", "seq2seq"] = "default", backend: Literal["default", "causal", "seq2seq"] = "default",
trust_remote_code: Optional[bool] = False, trust_remote_code: bool | None = False,
) -> None: ) -> None:
""" """Helper method during initialization.
Helper method during initialization.
Determines the backend ("causal" (decoder-only) or "seq2seq" (encoder-decoder)) model type to be used. Determines the backend ("causal" (decoder-only) or "seq2seq" (encoder-decoder)) model type to be used.
sets `self.AUTO_MODEL_CLASS` appropriately if not already set. sets `self.AUTO_MODEL_CLASS` appropriately if not already set.
...@@ -482,9 +505,7 @@ class HFLM(TemplateLM): ...@@ -482,9 +505,7 @@ class HFLM(TemplateLM):
if backend != "default": if backend != "default":
# if we've settled on non-default backend, use that manually # if we've settled on non-default backend, use that manually
if backend == "causal": if backend in ["causal", "seq2seq"]:
self.backend = backend
elif backend == "seq2seq":
self.backend = backend self.backend = backend
eval_logger.info( eval_logger.info(
f"Overrode HF model backend type, and using type '{self.backend}'" f"Overrode HF model backend type, and using type '{self.backend}'"
...@@ -492,7 +513,7 @@ class HFLM(TemplateLM): ...@@ -492,7 +513,7 @@ class HFLM(TemplateLM):
else: else:
# determine and use the default HF backend for this model, based on its config + metadata. # determine and use the default HF backend for this model, based on its config + metadata.
if ( if (
getattr(config, "model_type") getattr(config, "model_type", None)
in MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES in MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES
): ):
# first check if model type is listed under seq2seq models, since some # first check if model type is listed under seq2seq models, since some
...@@ -501,7 +522,7 @@ class HFLM(TemplateLM): ...@@ -501,7 +522,7 @@ class HFLM(TemplateLM):
self.backend = "seq2seq" self.backend = "seq2seq"
eval_logger.debug(f"Using model type '{self.backend}'") eval_logger.debug(f"Using model type '{self.backend}'")
elif ( elif (
getattr(self.config, "model_type") in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES getattr(config, "model_type", None) in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES
): ):
self.backend = "causal" self.backend = "causal"
eval_logger.debug(f"Using model type '{self.backend}'") eval_logger.debug(f"Using model type '{self.backend}'")
...@@ -530,10 +551,10 @@ class HFLM(TemplateLM): ...@@ -530,10 +551,10 @@ class HFLM(TemplateLM):
pretrained: str, pretrained: str,
revision: str = "main", revision: str = "main",
trust_remote_code: bool = False, trust_remote_code: bool = False,
gguf_file: Optional[str] = None, gguf_file: str | None = None,
subfolder: str = "", subfolder: str = "",
) -> None: ) -> None:
"""Return the model config for HuggingFace models""" """Return the model config for HuggingFace models."""
self._config = transformers.AutoConfig.from_pretrained( self._config = transformers.AutoConfig.from_pretrained(
pretrained, pretrained,
revision=revision, revision=revision,
...@@ -545,29 +566,28 @@ class HFLM(TemplateLM): ...@@ -545,29 +566,28 @@ class HFLM(TemplateLM):
def _create_model( def _create_model(
self, self,
pretrained: str, pretrained: str,
revision: Optional[str] = "main", revision: str | None = "main",
dtype: Optional[Union[str, torch.dtype]] = "auto", dtype: str | torch.dtype | None = "auto",
trust_remote_code: Optional[bool] = False, trust_remote_code: bool | None = False,
# arguments used for splitting a model across GPUs naively. # arguments used for splitting a model across GPUs naively.
# only used if `parallelize=True`. # only used if `parallelize=True`.
# (accelerate naive PP (device_map) options) # (accelerate naive PP (device_map) options)
parallelize: Optional[bool] = False, parallelize: bool | None = False,
gpus: Optional[int] = None, gpus: int | None = None,
max_memory_per_gpu: Optional[Union[int, str]] = None, max_memory_per_gpu: int | str | None = None,
max_cpu_memory: Optional[Union[int, str]] = None, max_cpu_memory: int | str | None = None,
offload_folder: Optional[str] = "./offload", offload_folder: str | None = "./offload",
# PEFT, delta weights and quantization options # PEFT, delta weights and quantization options
peft: Optional[str] = None, peft: str | None = None,
delta: Optional[str] = None, delta: str | None = None,
autogptq: Optional[Union[bool, str]] = False, autogptq: bool | str | None = False,
gptqmodel: Optional[bool] = False, gptqmodel: bool | None = False,
gguf_file: Optional[str] = None, gguf_file: str | None = None,
quantization_config: Optional["AutoQuantizationConfig"] = None, quantization_config: AutoQuantizationConfig | None = None,
subfolder: str = "", subfolder: str = "",
**kwargs, **kwargs,
) -> None: ) -> None:
""" """Initializes an HF or HF-compatible PreTrainedModel from scratch
Initializes an HF or HF-compatible PreTrainedModel from scratch
inside HFLM, using the kwargs passed into self.__init__(). inside HFLM, using the kwargs passed into self.__init__().
Also handles functionality such as AutoGPTQ usage and PEFT wrapping. Also handles functionality such as AutoGPTQ usage and PEFT wrapping.
...@@ -578,12 +598,12 @@ class HFLM(TemplateLM): ...@@ -578,12 +598,12 @@ class HFLM(TemplateLM):
please consider subclassing HFLM and overriding this and other methods as needed. please consider subclassing HFLM and overriding this and other methods as needed.
""" """
model_kwargs = kwargs if kwargs else {} model_kwargs = kwargs or {}
model_kwargs.update( model_kwargs.update(
self._get_accelerate_args( self._get_accelerate_args(
parallelize=parallelize, parallelize=parallelize,
device_map=kwargs.get("device_map", None), device_map=kwargs.get("device_map"),
max_memory_per_gpu=max_memory_per_gpu, max_memory_per_gpu=max_memory_per_gpu,
max_cpu_memory=max_cpu_memory, max_cpu_memory=max_cpu_memory,
offload_folder=offload_folder, offload_folder=offload_folder,
...@@ -592,16 +612,12 @@ class HFLM(TemplateLM): ...@@ -592,16 +612,12 @@ class HFLM(TemplateLM):
) )
if not autogptq and not gptqmodel: if not autogptq and not gptqmodel:
if model_kwargs.get("load_in_4bit", None): if model_kwargs.get("load_in_4bit"):
assert transformers.__version__ >= "4.30.0", ( assert vparse(transformers.__version__) >= vparse("4.30.0"), (
"load_in_4bit requires transformers >= 4.30.0" "load_in_4bit requires transformers >= 4.30.0"
) )
if transformers.__version__ >= "4.30.0": if compute_dtype := model_kwargs.get("bnb_4bit_compute_dtype"):
if model_kwargs.get("load_in_4bit", None): model_kwargs["bnb_4bit_compute_dtype"] = get_dtype(compute_dtype)
if model_kwargs.get("bnb_4bit_compute_dtype", None):
model_kwargs["bnb_4bit_compute_dtype"] = get_dtype(
model_kwargs["bnb_4bit_compute_dtype"]
)
self._model = self.AUTO_MODEL_CLASS.from_pretrained( self._model = self.AUTO_MODEL_CLASS.from_pretrained(
pretrained, pretrained,
...@@ -626,7 +642,7 @@ class HFLM(TemplateLM): ...@@ -626,7 +642,7 @@ class HFLM(TemplateLM):
raise type(exception)( raise type(exception)(
"Tried to load auto_gptq, but auto-gptq is not installed ", "Tried to load auto_gptq, but auto-gptq is not installed ",
"please install auto-gptq via pip install lm-eval[gptq] or pip install -e .[gptq]", "please install auto-gptq via pip install lm-eval[gptq] or pip install -e .[gptq]",
) ) from exception
self._model = AutoGPTQForCausalLM.from_quantized( self._model = AutoGPTQForCausalLM.from_quantized(
pretrained, pretrained,
...@@ -645,7 +661,7 @@ class HFLM(TemplateLM): ...@@ -645,7 +661,7 @@ class HFLM(TemplateLM):
raise type(exception)( raise type(exception)(
"Tried to load gptqmodel, but gptqmodel is not installed ", "Tried to load gptqmodel, but gptqmodel is not installed ",
"please install gptqmodel via `pip install gptqmodel --no-build-isolation` or `pip install lm-eval[gptqmodel] --no-build-isolation`", "please install gptqmodel via `pip install gptqmodel --no-build-isolation` or `pip install lm-eval[gptqmodel] --no-build-isolation`",
) ) from exception
self._model = GPTQModel.from_quantized( self._model = GPTQModel.from_quantized(
pretrained, trust_remote_code=trust_remote_code, **model_kwargs pretrained, trust_remote_code=trust_remote_code, **model_kwargs
...@@ -660,8 +676,9 @@ class HFLM(TemplateLM): ...@@ -660,8 +676,9 @@ class HFLM(TemplateLM):
from peft import PeftModel from peft import PeftModel
from peft import __version__ as PEFT_VERSION from peft import __version__ as PEFT_VERSION
if model_kwargs.get("load_in_4bit", None): if model_kwargs.get("load_in_4bit") and vparse(PEFT_VERSION) < vparse(
if version.parse(PEFT_VERSION) < version.parse("0.4.0"): "0.4.0"
):
raise AssertionError("load_in_4bit requires peft >= 0.4.0") raise AssertionError("load_in_4bit requires peft >= 0.4.0")
if self._model.config.vocab_size != len(self.tokenizer): if self._model.config.vocab_size != len(self.tokenizer):
# resize model for LoRAs with added tokens # resize model for LoRAs with added tokens
...@@ -687,36 +704,32 @@ class HFLM(TemplateLM): ...@@ -687,36 +704,32 @@ class HFLM(TemplateLM):
for name, param in self._model.state_dict().items(): for name, param in self._model.state_dict().items():
try: try:
param.data += _model_delta.state_dict()[name] param.data += _model_delta.state_dict()[name]
except KeyError: except KeyError as e:
raise KeyError(f"Delta model is missing weights for layer: {name}") raise KeyError(
f"Delta model is missing weights for layer: {name}"
) from e
except Exception as e: except Exception as e:
raise RuntimeError( raise RuntimeError(
f"Failed to add delta weights to layer {name}. Error: {e}" f"Failed to add delta weights to layer {name}. Error: {e}"
) ) from e
del _model_delta del _model_delta
return None
def _create_tokenizer( def _create_tokenizer(
self, self,
pretrained: Union[str, transformers.PreTrainedModel], pretrained: str | transformers.PreTrainedModel,
tokenizer: Optional[ tokenizer: str
Union[ | transformers.PreTrainedTokenizer
str, | transformers.PreTrainedTokenizerFast
transformers.PreTrainedTokenizer, | None,
transformers.PreTrainedTokenizerFast, revision: str | None = "main",
] trust_remote_code: bool | None = False,
], use_fast_tokenizer: bool | None = True,
revision: Optional[str] = "main", gguf_file: str | None = None,
trust_remote_code: Optional[bool] = False, add_bos_token: bool | None = False,
use_fast_tokenizer: Optional[bool] = True, subfolder: str | None = "",
gguf_file: Optional[str] = None,
add_bos_token: Optional[bool] = False,
subfolder: Optional[str] = "",
) -> None: ) -> None:
""" """Helper method during initialization.
Helper method during initialization.
Create a tokenizer object corresponding to the correct Create a tokenizer object corresponding to the correct
tokenizer for value of `pretrained`, or use the pre-initialized tokenizer passed. tokenizer for value of `pretrained`, or use the pre-initialized tokenizer passed.
...@@ -745,8 +758,12 @@ class HFLM(TemplateLM): ...@@ -745,8 +758,12 @@ class HFLM(TemplateLM):
) )
else: else:
assert isinstance( assert isinstance(
tokenizer, transformers.PreTrainedTokenizer tokenizer,
) or isinstance(tokenizer, transformers.PreTrainedTokenizerFast) (
transformers.PreTrainedTokenizer,
transformers.PreTrainedTokenizerFast,
),
)
self.tokenizer = tokenizer self.tokenizer = tokenizer
else: else:
# Get tokenizer based on 'pretrained' # Get tokenizer based on 'pretrained'
...@@ -758,9 +775,8 @@ class HFLM(TemplateLM): ...@@ -758,9 +775,8 @@ class HFLM(TemplateLM):
self.tokenizer = transformers.AutoTokenizer.from_pretrained( self.tokenizer = transformers.AutoTokenizer.from_pretrained(
model_name, **kwargs model_name, **kwargs
) )
return None
def _detect_batch_size(self, requests=None, pos: int = 0): def _detect_batch_size(self, requests: Sequence | None = None, pos: int = 0):
if requests: if requests:
_, context_enc, continuation_enc = requests[pos] _, context_enc, continuation_enc = requests[pos]
max_length = len( max_length = len(
...@@ -775,7 +791,7 @@ class HFLM(TemplateLM): ...@@ -775,7 +791,7 @@ class HFLM(TemplateLM):
# if OOM, then halves batch_size and tries again # if OOM, then halves batch_size and tries again
@find_executable_batch_size(starting_batch_size=self.max_batch_size) @find_executable_batch_size(starting_batch_size=self.max_batch_size)
def forward_batch(batch_size): def forward_batch(batch_size: int):
if self.backend == "seq2seq": if self.backend == "seq2seq":
length = max(max_context_enc, max_cont_enc) length = max(max_context_enc, max_cont_enc)
batched_conts = torch.ones( batched_conts = torch.ones(
...@@ -822,8 +838,11 @@ class HFLM(TemplateLM): ...@@ -822,8 +838,11 @@ class HFLM(TemplateLM):
return batch_size return batch_size
def tok_encode( def tok_encode(
self, string: str, left_truncate_len=None, add_special_tokens=None self,
) -> List[int]: string: str,
left_truncate_len: int | None = None,
add_special_tokens: bool | None = None,
) -> list[int]:
""" """ """ """
# default for None - empty dict, use predefined tokenizer param # default for None - empty dict, use predefined tokenizer param
# used for all models except for CausalLM or predefined value # used for all models except for CausalLM or predefined value
...@@ -849,11 +868,11 @@ class HFLM(TemplateLM): ...@@ -849,11 +868,11 @@ class HFLM(TemplateLM):
def tok_batch_encode( def tok_batch_encode(
self, self,
strings: List[str], strings: list[str],
padding_side: str = "left", padding_side: str = "left",
left_truncate_len: int = None, left_truncate_len: int | None = None,
truncation: bool = False, truncation: bool = False,
) -> Tuple[torch.Tensor, torch.Tensor]: ) -> tuple[torch.Tensor, torch.Tensor]:
# encode a batch of strings. converts to tensors and pads automatically, unlike tok_encode. # encode a batch of strings. converts to tensors and pads automatically, unlike tok_encode.
old_padding_side = self.tokenizer.padding_side old_padding_side = self.tokenizer.padding_side
self.tokenizer.padding_side = padding_side self.tokenizer.padding_side = padding_side
...@@ -872,7 +891,7 @@ class HFLM(TemplateLM): ...@@ -872,7 +891,7 @@ class HFLM(TemplateLM):
if left_truncate_len: if left_truncate_len:
original_lengths = encoding["input_ids"].size(1) original_lengths = encoding["input_ids"].size(1)
if original_lengths > left_truncate_len: if original_lengths > left_truncate_len:
eval_logger.warn( eval_logger.warning(
f"Left truncation applied. Original sequence length was {original_lengths}, " f"Left truncation applied. Original sequence length was {original_lengths}, "
f"truncating to last {left_truncate_len} tokens. Some content will be lost.", f"truncating to last {left_truncate_len} tokens. Some content will be lost.",
) )
...@@ -884,11 +903,17 @@ class HFLM(TemplateLM): ...@@ -884,11 +903,17 @@ class HFLM(TemplateLM):
return encoding["input_ids"], encoding["attention_mask"] return encoding["input_ids"], encoding["attention_mask"]
def tok_decode(self, tokens, skip_special_tokens=True): def tok_decode(self, tokens: Iterator[list[str]], skip_special_tokens: bool = True):
return self.tokenizer.decode(tokens, skip_special_tokens=skip_special_tokens) return self.tokenizer.decode(tokens, skip_special_tokens=skip_special_tokens)
def _model_call(self, inps, attn_mask=None, labels=None): def _model_call(
self,
inps: torch.Tensor,
attn_mask: torch.Tensor | None = None,
labels: torch.Tensor | None = None,
) -> torch.Tensor:
""" """
:param inps: torch.Tensor :param inps: torch.Tensor
A torch tensor of shape [batch, (sequence_ctx + sequence_cont)] or of shape A torch tensor of shape [batch, (sequence_ctx + sequence_cont)] or of shape
[batch, sequence_ctx]. the size of sequence may vary from call to call [batch, sequence_ctx]. the size of sequence may vary from call to call
...@@ -902,27 +927,40 @@ class HFLM(TemplateLM): ...@@ -902,27 +927,40 @@ class HFLM(TemplateLM):
A torch tensor of shape [batch, sequence, vocab] with the A torch tensor of shape [batch, sequence, vocab] with the
logits returned from the model's decoder logits returned from the model's decoder
""" """
with torch.no_grad(): with (
torch.no_grad(),
torch.autocast(
device_type=self.device.type,
dtype=self.mixed_precision_dtype,
enabled=self.mixed_precision_dtype is not None,
),
):
if attn_mask is not None or labels is not None: if attn_mask is not None or labels is not None:
assert attn_mask is not None and labels is not None assert attn_mask is not None and labels is not None
assert self.AUTO_MODEL_CLASS == transformers.AutoModelForSeq2SeqLM assert transformers.AutoModelForSeq2SeqLM == self.AUTO_MODEL_CLASS
return self.model( return self.model(
input_ids=inps, attention_mask=attn_mask, labels=labels input_ids=inps, attention_mask=attn_mask, labels=labels
).logits ).logits
else:
assert self.AUTO_MODEL_CLASS in ( assert self.AUTO_MODEL_CLASS in (
transformers.AutoModelForCausalLM, transformers.AutoModelForCausalLM,
transformers.AutoModelForVision2Seq, transformers.AutoModelForVision2Seq,
) )
return self.model(inps).logits return self.model(inps).logits
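The same guard in isolation, as a sketch: with `enabled=False` the autocast context is a no-op, so a single code path serves both full-precision and mixed-precision runs.

```python
import torch


def forward_logits(model, inps, mixed_precision_dtype=None):
    # inference only; autocast is active only when a dtype was requested
    with (
        torch.no_grad(),
        torch.autocast(
            device_type=inps.device.type,
            dtype=mixed_precision_dtype,
            enabled=mixed_precision_dtype is not None,
        ),
    ):
        return model(inps).logits
```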
def _model_generate(self, context, max_length, stop, **generation_kwargs): def _model_generate(
self,
context,
max_length: int,
stop: list[str],
**generation_kwargs: dict[str, Any],
) -> torch.Tensor:
# temperature = 0.0 if not set # temperature = 0.0 if not set
# if do_sample is false and temp==0.0: # if do_sample is false and temp==0.0:
# remove temperature, as do_sample=False takes care of this # remove temperature, as do_sample=False takes care of this
# and we don't want a warning from HF # and we don't want a warning from HF
generation_kwargs["temperature"] = generation_kwargs.get("temperature", 0.0) generation_kwargs["temperature"] = generation_kwargs.get("temperature", 0.0)
do_sample = generation_kwargs.get("do_sample", None) do_sample = generation_kwargs.get("do_sample")
# The temperature has to be a strictly positive float -- if it is 0.0, use greedy decoding strategies # The temperature has to be a strictly positive float -- if it is 0.0, use greedy decoding strategies
if generation_kwargs.get("temperature") == 0.0 and do_sample is None: if generation_kwargs.get("temperature") == 0.0 and do_sample is None:
...@@ -934,6 +972,11 @@ class HFLM(TemplateLM): ...@@ -934,6 +972,11 @@ class HFLM(TemplateLM):
stopping_criteria = stop_sequences_criteria( stopping_criteria = stop_sequences_criteria(
self.tokenizer, stop, context.shape[1], context.shape[0] self.tokenizer, stop, context.shape[1], context.shape[0]
) )
with torch.autocast(
device_type=self.device.type,
dtype=self.mixed_precision_dtype,
enabled=self.mixed_precision_dtype is not None,
):
return self.model.generate( return self.model.generate(
input_ids=context, input_ids=context,
max_length=max_length, max_length=max_length,
...@@ -944,7 +987,10 @@ class HFLM(TemplateLM): ...@@ -944,7 +987,10 @@ class HFLM(TemplateLM):
) )
def _select_cont_toks( def _select_cont_toks(
self, logits: torch.Tensor, contlen: int = None, inplen: int = None self,
logits: torch.Tensor,
contlen: int | None = None,
inplen: int | None = None,
) -> torch.Tensor: ) -> torch.Tensor:
if self.backend == "causal": if self.backend == "causal":
assert contlen and inplen, ( assert contlen and inplen, (
...@@ -964,8 +1010,8 @@ class HFLM(TemplateLM): ...@@ -964,8 +1010,8 @@ class HFLM(TemplateLM):
return logits return logits
def loglikelihood_rolling( def loglikelihood_rolling(
self, requests: List[Instance], disable_tqdm: bool = False self, requests: list[Instance], disable_tqdm: bool = False
) -> List[float]: ) -> list[float]:
adaptive_batch_size = None adaptive_batch_size = None
if self.batch_size == "auto": if self.batch_size == "auto":
# using rolling window with maximum context # using rolling window with maximum context
...@@ -984,7 +1030,7 @@ class HFLM(TemplateLM): ...@@ -984,7 +1030,7 @@ class HFLM(TemplateLM):
disable=(disable_tqdm or (self.rank != 0)), disable=(disable_tqdm or (self.rank != 0)),
) )
): ):
rolling_token_windows: List[Tuple[List[int], List[int]]] = list( rolling_token_windows: list[tuple[list[int], list[int]]] = list(
map( map(
utils.make_disjoint_window, utils.make_disjoint_window,
utils.get_rolling_token_windows( utils.get_rolling_token_windows(
...@@ -1068,15 +1114,15 @@ class HFLM(TemplateLM): ...@@ -1068,15 +1114,15 @@ class HFLM(TemplateLM):
def _loglikelihood_tokens( def _loglikelihood_tokens(
self, self,
requests: List[Tuple[Tuple[str, str], List[int], List[int]]], requests: list[tuple[tuple[str, str], list[int], list[int]]],
disable_tqdm: bool = False, disable_tqdm: bool = False,
override_bs: int = None, override_bs: int | None = None,
) -> List[Tuple[float, bool]]: ) -> list[tuple[float, bool]]:
# TODO: implement some kind of efficient-request-middleware that lumps together requests with the same context # TODO: implement some kind of efficient-request-middleware that lumps together requests with the same context
res = [] res = []
def _collate(req: Tuple[Tuple[str, str], List[int], List[int]]): def _collate(req: tuple[tuple[str, str], list[int], list[int]]):
"""Defines the key for the sorted method""" """Defines the key for the sorted method."""
# the negative sign on len(toks) sorts descending - this has a few advantages: # the negative sign on len(toks) sorts descending - this has a few advantages:
# - time estimates will always be over not underestimates, which is more useful for planning # - time estimates will always be over not underestimates, which is more useful for planning
# - to know the size of a batch when going through the list, you know the first one is always the batch # - to know the size of a batch when going through the list, you know the first one is always the batch
...@@ -1087,8 +1133,8 @@ class HFLM(TemplateLM): ...@@ -1087,8 +1133,8 @@ class HFLM(TemplateLM):
toks = req[1] + req[2] toks = req[1] + req[2]
return -len(toks), tuple(toks) return -len(toks), tuple(toks)
def _lookup_one_token_cont(req: Tuple[Tuple[str, str], List[int], List[int]]): def _lookup_one_token_cont(req: tuple[tuple[str, str], list[int], list[int]]):
"""Defines the key to group and lookup one-token continuations""" """Defines the key to group and lookup one-token continuations."""
# Use with group_by="contexts" (optional)" # Use with group_by="contexts" (optional)"
# allows for the creation of a lookup, so we can reuse logits in case of one-token continuations. # allows for the creation of a lookup, so we can reuse logits in case of one-token continuations.
# speeds up some multiple-choice tasks proportionally to the number of choices. # speeds up some multiple-choice tasks proportionally to the number of choices.
...@@ -1261,7 +1307,7 @@ class HFLM(TemplateLM): ...@@ -1261,7 +1307,7 @@ class HFLM(TemplateLM):
# original args. Otherwise, expands the logits batch dimension and yields each # original args. Otherwise, expands the logits batch dimension and yields each
# batch along with matching continuation tokens and prompt strings. # batch along with matching continuation tokens and prompt strings.
# logits -> [1, seq, vocab] # logits -> [1, seq, vocab]
for request_str, cont_toks, logits in re_ord.get_cache( for request_str, cont_toks, logits in re_ord.get_cache( # noqa
req_str=request_str, req_str=request_str,
cxt_toks=ctx_tokens, cxt_toks=ctx_tokens,
cont_toks=cont_toks, cont_toks=cont_toks,
...@@ -1302,11 +1348,11 @@ class HFLM(TemplateLM): ...@@ -1302,11 +1348,11 @@ class HFLM(TemplateLM):
return re_ord.get_original(res) return re_ord.get_original(res)
def generate_until( def generate_until(
self, requests: List[Instance], disable_tqdm: bool = False self, requests: list[Instance], disable_tqdm: bool = False
) -> List[str]: ) -> list[str]:
res = [] res = []
def _collate(req: Tuple[str, dict]): def _collate(req: tuple[str, dict]):
"""Defines the key for the sorted method""" """Defines the key for the sorted method"""
# the negative sign on len(toks) sorts descending - this has a few advantages: # the negative sign on len(toks) sorts descending - this has a few advantages:
# - time estimates will always be over not underestimates, which is more useful for planning # - time estimates will always be over not underestimates, which is more useful for planning
...@@ -1366,10 +1412,10 @@ class HFLM(TemplateLM): ...@@ -1366,10 +1412,10 @@ class HFLM(TemplateLM):
# add EOS token to stop sequences # add EOS token to stop sequences
until = handle_stop_sequences(kwargs.pop("until", None), eos=eos) until = handle_stop_sequences(kwargs.pop("until", None), eos=eos)
else: else:
raise ValueError( raise TypeError(
f"Expected `kwargs` to be of type `dict` but got {type(gen_kwargs)}" f"Expected `kwargs` to be of type `dict` but got {type(gen_kwargs)}"
) )
if "max_gen_toks" in kwargs.keys(): if "max_gen_toks" in kwargs:
max_gen_toks = kwargs.pop("max_gen_toks") max_gen_toks = kwargs.pop("max_gen_toks")
else: else:
max_gen_toks = self.max_gen_toks max_gen_toks = self.max_gen_toks
...@@ -1411,15 +1457,30 @@ class HFLM(TemplateLM): ...@@ -1411,15 +1457,30 @@ class HFLM(TemplateLM):
if self.backend == "causal": if self.backend == "causal":
cont_toks = cont_toks[context_enc.shape[1] :] cont_toks = cont_toks[context_enc.shape[1] :]
# Handle integer think_end_token: find last occurrence and strip tokens after it
if isinstance(self.think_end_token, int):
think_token_indices = [
i
for i, token in enumerate(cont_toks)
if token == self.think_end_token
]
if think_token_indices:
cont_toks = cont_toks[think_token_indices[-1] + 1 :]
s = self.tok_decode(cont_toks) s = self.tok_decode(cont_toks)
# use secondary stop seqs to cut off should-have-been-stopped content post-hoc # Strip leading whitespace if we removed thinking tokens
for term in until: if isinstance(self.think_end_token, int):
if len(term) > 0: s = s.lstrip()
# ignore '' separator,
# for seq2seq case where self.tok_decode(self.eot_token_id) = ''
s = s.split(term)[0]
# Apply post-processing: remove stop sequences and string-based thinking tokens
s = postprocess_generated_text(
generation=s,
stop=until,
think_end_token=self.think_end_token
if isinstance(self.think_end_token, str)
else None,
)
res.append(s) res.append(s)
self.cache_hook.add_partial("generate_until", (context, gen_kwargs), s) self.cache_hook.add_partial("generate_until", (context, gen_kwargs), s)
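A minimal sketch of the integer `think_end_token` handling added above: keep only the tokens after the last occurrence of the end-of-thinking id, then decode and strip leading whitespace. The token id and helper name here are assumptions for illustration:

```python
THINK_END_ID = 151668  # assumed id standing in for an end-of-thinking token

def strip_think_tokens(cont_toks: list[int], think_end_token: int) -> list[int]:
    # find the last occurrence of the end-of-thinking id and drop everything up to it
    idxs = [i for i, tok in enumerate(cont_toks) if tok == think_end_token]
    return cont_toks[idxs[-1] + 1 :] if idxs else cont_toks

assert strip_think_tokens([5, 7, THINK_END_ID, 9, 11], THINK_END_ID) == [9, 11]
assert strip_think_tokens([5, 7, 9], THINK_END_ID) == [5, 7, 9]  # no marker: unchanged
```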
...@@ -1432,17 +1493,16 @@ class HFLM(TemplateLM): ...@@ -1432,17 +1493,16 @@ class HFLM(TemplateLM):
return res return res
def apply_chat_template( def apply_chat_template(
self, chat_history: List[Dict[str, str]], add_generation_prompt: bool = True self, chat_history: list[dict[str, str]], add_generation_prompt: bool = True
) -> str: ) -> str:
""" """Method to apply a chat template to a list of chat history between user and model."""
Method to apply a chat template to a list of chat history between user and model.
"""
try: try:
chat_templated = self.tokenizer.apply_chat_template( chat_templated = self.tokenizer.apply_chat_template(
chat_history, chat_history,
tokenize=False, tokenize=False,
add_generation_prompt=add_generation_prompt, add_generation_prompt=add_generation_prompt,
continue_final_message=not add_generation_prompt, continue_final_message=not add_generation_prompt,
**self.chat_template_args,
) )
except jinja2.exceptions.TemplateError: except jinja2.exceptions.TemplateError:
eval_logger.warning( eval_logger.warning(
...@@ -1454,14 +1514,13 @@ class HFLM(TemplateLM): ...@@ -1454,14 +1514,13 @@ class HFLM(TemplateLM):
tokenize=False, tokenize=False,
add_generation_prompt=add_generation_prompt, add_generation_prompt=add_generation_prompt,
continue_final_message=not add_generation_prompt, continue_final_message=not add_generation_prompt,
**self.chat_template_args,
) )
return chat_templated return chat_templated
def get_model_info(self) -> dict: def get_model_info(self) -> dict:
""" """Method to get Hugging Face model information for experiment reproducibility."""
Method to get Hugging Face model information for experiment reproducibility.
"""
def get_model_num_params(model) -> int: def get_model_num_params(model) -> int:
if hasattr(model, "num_parameters"): if hasattr(model, "num_parameters"):
......
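For context, a hedged sketch of how the extra `chat_template_args` forwarded above reach the tokenizer's chat template. The model name and the `enable_thinking` template kwarg are assumptions; only some reasoning models' templates accept it:

```python
# Sketch, assuming a tokenizer whose chat template understands enable_thinking.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")  # hypothetical model choice
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2?"}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # forwarded to the Jinja template, as **self.chat_template_args is above
)
```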
...@@ -16,8 +16,8 @@ eval_logger = logging.getLogger(__name__) ...@@ -16,8 +16,8 @@ eval_logger = logging.getLogger(__name__)
class LocalCompletionsAPI(TemplateAPI): class LocalCompletionsAPI(TemplateAPI):
def __init__( def __init__(
self, self,
base_url=None, base_url: str = None,
tokenizer_backend="huggingface", tokenizer_backend: str = "huggingface",
**kwargs, **kwargs,
): ):
super().__init__( super().__init__(
...@@ -108,9 +108,9 @@ class LocalCompletionsAPI(TemplateAPI): ...@@ -108,9 +108,9 @@ class LocalCompletionsAPI(TemplateAPI):
class LocalChatCompletion(LocalCompletionsAPI): class LocalChatCompletion(LocalCompletionsAPI):
def __init__( def __init__(
self, self,
base_url=None, base_url: str = None,
tokenizer_backend=None, tokenizer_backend: str = None,
tokenized_requests=False, tokenized_requests: bool = False,
**kwargs, **kwargs,
): ):
eval_logger.warning( eval_logger.warning(
...@@ -236,6 +236,7 @@ class OpenAIChatCompletion(LocalChatCompletion): ...@@ -236,6 +236,7 @@ class OpenAIChatCompletion(LocalChatCompletion):
eval_logger.warning( eval_logger.warning(
"o1 models do not support `stop` and only support temperature=1" "o1 models do not support `stop` and only support temperature=1"
) )
super().__init__( super().__init__(
base_url=base_url, base_url=base_url,
tokenizer_backend=tokenizer_backend, tokenizer_backend=tokenizer_backend,
......
...@@ -11,6 +11,7 @@ from lm_eval.api.registry import register_model ...@@ -11,6 +11,7 @@ from lm_eval.api.registry import register_model
from lm_eval.models.utils import ( from lm_eval.models.utils import (
Collator, Collator,
handle_stop_sequences, handle_stop_sequences,
postprocess_generated_text,
) )
from lm_eval.utils import ( from lm_eval.utils import (
get_rolling_token_windows, get_rolling_token_windows,
...@@ -59,6 +60,8 @@ class SGLangLM(TemplateLM): ...@@ -59,6 +60,8 @@ class SGLangLM(TemplateLM):
dp_size: int = 1, dp_size: int = 1,
tp_size: int = 1, tp_size: int = 1,
prefix_token_id: Optional[int] = None, prefix_token_id: Optional[int] = None,
# End marker for thinking tags - splits to get response after this token (if provided).
think_end_token: Optional[str] = None,
**kwargs, **kwargs,
): ):
super().__init__() super().__init__()
...@@ -74,6 +77,7 @@ class SGLangLM(TemplateLM): ...@@ -74,6 +77,7 @@ class SGLangLM(TemplateLM):
"Either context_length or max_model_len may be provided, but not both" "Either context_length or max_model_len may be provided, but not both"
) )
# Initialize your sglang model here # Initialize your sglang model here
self.think_end_token = think_end_token
self._max_length = ( self._max_length = (
max_model_len if max_model_len is not None else context_length max_model_len if max_model_len is not None else context_length
) )
...@@ -263,6 +267,9 @@ class SGLangLM(TemplateLM): ...@@ -263,6 +267,9 @@ class SGLangLM(TemplateLM):
# cache generations # cache generations
for output, context in zip(cont, context): for output, context in zip(cont, context):
generated_text = output.get("text", "") generated_text = output.get("text", "")
generated_text = postprocess_generated_text(
generated_text, until, self.think_end_token
)
res.append(generated_text) res.append(generated_text)
self.cache_hook.add_partial( self.cache_hook.add_partial(
"generate_until", (context, gen_kwargs), generated_text "generate_until", (context, gen_kwargs), generated_text
......
...@@ -852,3 +852,32 @@ def truncate_tokens( ...@@ -852,3 +852,32 @@ def truncate_tokens(
right_length = max_length - left_length right_length = max_length - left_length
return tokens[:left_length] + tokens[-right_length:] return tokens[:left_length] + tokens[-right_length:]
return None return None
def postprocess_generated_text(
generation: str, stop: Union[list[str], str, None], think_end_token: Optional[str]
) -> str:
"""
Post-processes the generated text by stripping stop sequences and optional thinking markers.
Args:
generation (str): The generated text to be processed.
stop (Union[list[str], str, None]): Stop sequence(s) to remove. Text is truncated
at the first occurrence of any stop sequence.
think_end_token (Optional[str]): Token marking end of thinking section. If provided,
returns only the text after this token (discarding thinking content).
Returns:
str: The processed generation - text before stop sequences and after thinking sections.
"""
if stop:
stop = [stop] if isinstance(stop, str) else stop
for term in stop:
if len(term) > 0:
# ignore '' separator,
# for seq2seq case where self.tok_decode(self.eot_token_id) = ''
generation = generation.split(term)[0]
if think_end_token:
generation = generation.split(think_end_token)[-1].lstrip()
return generation
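Illustrative behaviour of `postprocess_generated_text` as defined above; the example strings are made up. Stop sequences are applied first, then everything up to and including the last `think_end_token` is discarded and leading whitespace stripped:

```python
from lm_eval.models.utils import postprocess_generated_text

out = postprocess_generated_text(
    generation="I should add 2 and 2.</think> The answer is 4.\nQuestion: what is 3 + 3?",
    stop=["\nQuestion:"],
    think_end_token="</think>",
)
assert out == "The answer is 4."
```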
...@@ -22,6 +22,7 @@ from lm_eval.models.utils import ( ...@@ -22,6 +22,7 @@ from lm_eval.models.utils import (
Collator, Collator,
configure_pad_token, configure_pad_token,
handle_stop_sequences, handle_stop_sequences,
postprocess_generated_text,
undistribute, undistribute,
) )
from lm_eval.utils import ( from lm_eval.utils import (
...@@ -130,10 +131,14 @@ class VLLM(TemplateLM): ...@@ -130,10 +131,14 @@ class VLLM(TemplateLM):
max_model_len: int = None, max_model_len: int = None,
seed: int = 1234, seed: int = 1234,
gpu_memory_utilization: float = 0.9, gpu_memory_utilization: float = 0.9,
device: str = "cuda",
data_parallel_size: int = 1, data_parallel_size: int = 1,
lora_local_path: str = None, lora_local_path: str = None,
enable_thinking: bool = False, # VLLM: enable thinking tags in the prompt.
enable_thinking: bool = True,
chat_template_args: Optional[dict] = None,
# End marker for thinking tags - splits to get response after this token (if provided).
think_end_token: Optional[str] = None,
max_lora_rank: int = 16,
**kwargs, **kwargs,
): ):
super().__init__() super().__init__()
...@@ -147,6 +152,8 @@ class VLLM(TemplateLM): ...@@ -147,6 +152,8 @@ class VLLM(TemplateLM):
assert max_length is None or max_model_len is None, ( assert max_length is None or max_model_len is None, (
"Either max_length or max_model_len may be provided, but not both" "Either max_length or max_model_len may be provided, but not both"
) )
kwargs.pop("device", None)
self.think_end_token = think_end_token
self.V1 = os.environ.get("VLLM_USE_V1", "1") != "0" self.V1 = os.environ.get("VLLM_USE_V1", "1") != "0"
self._max_length = max_model_len if max_model_len is not None else max_length self._max_length = max_model_len if max_model_len is not None else max_length
self.tensor_parallel_size = int(tensor_parallel_size) self.tensor_parallel_size = int(tensor_parallel_size)
...@@ -166,7 +173,8 @@ class VLLM(TemplateLM): ...@@ -166,7 +173,8 @@ class VLLM(TemplateLM):
"swap_space": int(swap_space), "swap_space": int(swap_space),
"quantization": quantization, "quantization": quantization,
"seed": int(seed), "seed": int(seed),
"device": str(device), "enable_lora": True if lora_local_path else False,
"max_lora_rank": int(max_lora_rank),
} }
self.model_args.update(kwargs) self.model_args.update(kwargs)
self.batch_size = ( self.batch_size = (
...@@ -201,7 +209,10 @@ class VLLM(TemplateLM): ...@@ -201,7 +209,10 @@ class VLLM(TemplateLM):
add_bos_token=add_bos_token, add_bos_token=add_bos_token,
) )
self.tokenizer = configure_pad_token(self.tokenizer, model_config=self._config) self.tokenizer = configure_pad_token(self.tokenizer, model_config=self._config)
self.enable_thinking = enable_thinking self.chat_template_args = chat_template_args or {}
self.enable_thinking = self.chat_template_args.pop(
"enable_thinking", enable_thinking
)
self.add_bos_token = add_bos_token self.add_bos_token = add_bos_token
if "gemma" in pretrained.lower(): if "gemma" in pretrained.lower():
self.add_bos_token = True self.add_bos_token = True
...@@ -309,6 +320,7 @@ class VLLM(TemplateLM): ...@@ -309,6 +320,7 @@ class VLLM(TemplateLM):
continue_final_message=not add_generation_prompt, continue_final_message=not add_generation_prompt,
chat_template=self.hf_chat_template, chat_template=self.hf_chat_template,
enable_thinking=self.enable_thinking, enable_thinking=self.enable_thinking,
**self.chat_template_args,
) )
except jinja2.exceptions.TemplateError: except jinja2.exceptions.TemplateError:
eval_logger.warning( eval_logger.warning(
...@@ -321,6 +333,7 @@ class VLLM(TemplateLM): ...@@ -321,6 +333,7 @@ class VLLM(TemplateLM):
continue_final_message=not add_generation_prompt, continue_final_message=not add_generation_prompt,
chat_template=self.hf_chat_template, chat_template=self.hf_chat_template,
enable_thinking=self.enable_thinking, enable_thinking=self.enable_thinking,
**self.chat_template_args,
) )
return chat_templated return chat_templated
...@@ -627,11 +640,11 @@ class VLLM(TemplateLM): ...@@ -627,11 +640,11 @@ class VLLM(TemplateLM):
# cache generations # cache generations
for output, context in zip(cont, context): for output, context in zip(cont, context):
generated_text = output.outputs[0].text generated_text: str = output.outputs[0].text
# use secondary stop seqs to cut off should-have-been-stopped content post-hoc # use secondary stop seqs to cut off should-have-been-stopped content post-hoc
for term in until: generated_text = postprocess_generated_text(
if len(term) > 0: generated_text, until, self.think_end_token
generated_text = generated_text.split(term)[0] )
res.append(generated_text) res.append(generated_text)
self.cache_hook.add_partial( self.cache_hook.add_partial(
"generate_until", (context, gen_kwargs), generated_text "generate_until", (context, gen_kwargs), generated_text
......
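A minimal sketch of the precedence applied in the vLLM constructor above: an `enable_thinking` entry in `chat_template_args` overrides the `enable_thinking` argument, which now defaults to `True`. The helper name is illustrative only:

```python
# Mirrors: self.enable_thinking = self.chat_template_args.pop("enable_thinking", enable_thinking)
def resolve_enable_thinking(chat_template_args=None, enable_thinking=True):
    args = dict(chat_template_args or {})
    return args.pop("enable_thinking", enable_thinking), args

assert resolve_enable_thinking() == (True, {})
assert resolve_enable_thinking({"enable_thinking": False}) == (False, {})
```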
...@@ -6,7 +6,7 @@ ...@@ -6,7 +6,7 @@
For more information, including a full list of task names and their precise meanings or sources, follow the links provided to the individual README.md files for each subfolder. For more information, including a full list of task names and their precise meanings or sources, follow the links provided to the individual README.md files for each subfolder.
| Task Family | Description | Language(s) | | Task Family | Description | Language(s) |
|--------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------| |--------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|
| [aclue](aclue/README.md) | Tasks focusing on ancient Chinese language understanding and cultural aspects. | Ancient Chinese | | [aclue](aclue/README.md) | Tasks focusing on ancient Chinese language understanding and cultural aspects. | Ancient Chinese |
| [acp_bench](acpbench/README.md) | Tasks evaluating the reasoning ability about Action, Change, and Planning | English | | [acp_bench](acpbench/README.md) | Tasks evaluating the reasoning ability about Action, Change, and Planning | English |
| [acp_bench_hard](acpbench/README.md) | Tasks evaluating the reasoning ability about Action, Change, and Planning | English | | [acp_bench_hard](acpbench/README.md) | Tasks evaluating the reasoning ability about Action, Change, and Planning | English |
...@@ -17,7 +17,7 @@ ...@@ -17,7 +17,7 @@
| [arabic_leaderboard_light](arabic_leaderboard_light/README.md) | A light version of the tasks in the Open Arabic LLM Leaderboard (i.e., 10% samples of the test set in the original benchmarks), focusing on the evaluation of models that reflect the characteristics of Arabic language understanding and comprehension, culture, and heritage. Note that some of these tasks are machine-translated. | Arabic (Some MT) | | [arabic_leaderboard_light](arabic_leaderboard_light/README.md) | A light version of the tasks in the Open Arabic LLM Leaderboard (i.e., 10% samples of the test set in the original benchmarks), focusing on the evaluation of models that reflect the characteristics of Arabic language understanding and comprehension, culture, and heritage. Note that some of these tasks are machine-translated. | Arabic (Some MT) |
| [arabicmmlu](arabicmmlu/README.md) | Localized Arabic version of MMLU with multiple-choice questions from 40 subjects. | Arabic | | [arabicmmlu](arabicmmlu/README.md) | Localized Arabic version of MMLU with multiple-choice questions from 40 subjects. | Arabic |
| [ArabCulture](arab_culture/README.md) | Benchmark for evaluating models' commonsense cultural knowledge across 13 different Arab countries. | Arabic | | [ArabCulture](arab_culture/README.md) | Benchmark for evaluating models' commonsense cultural knowledge across 13 different Arab countries. | Arabic |
[AraDICE](aradice/README.md) | A collection of multiple tasks carefully designed to evaluate dialectal and cultural capabilities in large language models (LLMs). | Arabic | | [AraDICE](aradice/README.md) | A collection of multiple tasks carefully designed to evaluate dialectal and cultural capabilities in large language models (LLMs). | Arabic |
| [arc](arc/README.md) | Tasks involving complex reasoning over a diverse set of questions. | English | | [arc](arc/README.md) | Tasks involving complex reasoning over a diverse set of questions. | English |
| [arithmetic](arithmetic/README.md) | Tasks involving numerical computations and arithmetic reasoning. | English | | [arithmetic](arithmetic/README.md) | Tasks involving numerical computations and arithmetic reasoning. | English |
| [asdiv](asdiv/README.md) | Tasks involving arithmetic and mathematical reasoning challenges. | English | | [asdiv](asdiv/README.md) | Tasks involving arithmetic and mathematical reasoning challenges. | English |
...@@ -44,8 +44,10 @@ ...@@ -44,8 +44,10 @@
| csatqa | Tasks related to SAT and other standardized testing questions for academic assessment. | Korean | | csatqa | Tasks related to SAT and other standardized testing questions for academic assessment. | Korean |
| [darija_bench](darija_bench/README.md) | Traditional NLP tasks (Translation, Summarization, etc.) for Moroccan Darija | Moroccan Darija (some MT) | | [darija_bench](darija_bench/README.md) | Traditional NLP tasks (Translation, Summarization, etc.) for Moroccan Darija | Moroccan Darija (some MT) |
| [darijahellaswag](darijahellaswag/README.md) | Moroccan Darija version of HellaSwag. | Moroccan Darija (MT) | | [darijahellaswag](darijahellaswag/README.md) | Moroccan Darija version of HellaSwag. | Moroccan Darija (MT) |
| [darijammlu](darijammlu/README.md)| Multiple-choice QA in Moroccan Darija (an Arabic dialect). | Moroccan Darija (MT) | | [darijammlu](darijammlu/README.md) | Multiple-choice QA in Moroccan Darija (an Arabic dialect). | Moroccan Darija (MT) |
| [drop](drop/README.md) | Tasks requiring numerical reasoning, reading comprehension, and question answering. | English | | [drop](drop/README.md) | Tasks requiring numerical reasoning, reading comprehension, and question answering. | English |
| [egyhellaswag](egyhellaswag/README.md) | Egyptian Arabic (Masri) version of HellaSwag. | Egyptian Arabic (MT) |
| [egymmlu](egymmlu/README.md) | Multiple-choice QA in Egyptian Arabic. | Egyptian Arabic (MT) |
| [eq_bench](eq_bench/README.md) | Tasks focused on equality and ethics in question answering and decision-making. | English | | [eq_bench](eq_bench/README.md) | Tasks focused on equality and ethics in question answering and decision-making. | English |
| [eus_exams](eus_exams/README.md) | Tasks based on various professional and academic exams in the Basque language. | Basque | | [eus_exams](eus_exams/README.md) | Tasks based on various professional and academic exams in the Basque language. | Basque |
| [eus_proficiency](eus_proficiency/README.md) | Tasks designed to test proficiency in the Basque language across various topics. | Basque | | [eus_proficiency](eus_proficiency/README.md) | Tasks designed to test proficiency in the Basque language across various topics. | Basque |
...@@ -83,6 +85,7 @@ ...@@ -83,6 +85,7 @@
| [lambada_multilingual_stablelm](lambada_multilingual_stablelm/README.md) | Multilingual LAMBADA dataset. Users should prefer evaluating on this version of the multilingual dataset instead of on `lambada_multilingual`. | German, English, Spanish, French, Italian, Dutch, Portuguese | | [lambada_multilingual_stablelm](lambada_multilingual_stablelm/README.md) | Multilingual LAMBADA dataset. Users should prefer evaluating on this version of the multilingual dataset instead of on `lambada_multilingual`. | German, English, Spanish, French, Italian, Dutch, Portuguese |
| [leaderboard](leaderboard/README.md) | Task group used by Hugging Face's [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard). These tasks are static and will not change over time. | English | | [leaderboard](leaderboard/README.md) | Task group used by Hugging Face's [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard). These tasks are static and will not change over time. | English |
| [lingoly](lingoly/README.md) | Challenging logical reasoning benchmark in low-resource languages with controls for memorization | English, Multilingual | | [lingoly](lingoly/README.md) | Challenging logical reasoning benchmark in low-resource languages with controls for memorization | English, Multilingual |
| [libra](libra/README.md) | Evaluates long-context understanding in Russian across four complexity levels | Russian (MT) |
| [logiqa](logiqa/README.md) | Logical reasoning tasks requiring advanced inference and deduction. | English, Chinese | | [logiqa](logiqa/README.md) | Logical reasoning tasks requiring advanced inference and deduction. | English, Chinese |
| [logiqa2](logiqa2/README.md) | Large-scale logical reasoning dataset adapted from the Chinese Civil Service Examination. | English, Chinese | | [logiqa2](logiqa2/README.md) | Large-scale logical reasoning dataset adapted from the Chinese Civil Service Examination. | English, Chinese |
| [mastermind](mastermind/README.md) | Reasoning benchmark based on the board game of Mastermind. | English | | [mastermind](mastermind/README.md) | Reasoning benchmark based on the board game of Mastermind. | English |
...@@ -109,6 +112,7 @@ ...@@ -109,6 +112,7 @@
| model_written_evals | Evaluation tasks auto-generated for evaluating a collection of AI Safety concerns. | | | model_written_evals | Evaluation tasks auto-generated for evaluating a collection of AI Safety concerns. | |
| [moral_stories](moral_stories/README.md) | A crowd-sourced dataset of structured narratives that describe normative and norm-divergent actions taken by individuals to accomplish certain intentions in concrete situations. | English | | [moral_stories](moral_stories/README.md) | A crowd-sourced dataset of structured narratives that describe normative and norm-divergent actions taken by individuals to accomplish certain intentions in concrete situations. | English |
| [mts_dialog](mts_dialog/README.md) | Open-ended healthcare QA from the MTS-Dialog dataset. | English | | [mts_dialog](mts_dialog/README.md) | Open-ended healthcare QA from the MTS-Dialog dataset. | English |
| [multiblimp](multiblimp/README.md) | MultiBLiMP is a (synthetic) multilingual benchmark testing models on linguistic minimal pairs to judge grammatical acceptability | Multiple (101 languages) - Synthetic |
| [mutual](mutual/README.md) | A retrieval-based dataset for multi-turn dialogue reasoning. | English | | [mutual](mutual/README.md) | A retrieval-based dataset for multi-turn dialogue reasoning. | English |
| [noreval](noreval/README.md) | A human-created Norwegian language understanding and generation benchmark. | Norwegian (Bokmål and Nynorsk) | | [noreval](noreval/README.md) | A human-created Norwegian language understanding and generation benchmark. | Norwegian (Bokmål and Nynorsk) |
| [nq_open](nq_open/README.md) | Open domain question answering tasks based on the Natural Questions dataset. | English | | [nq_open](nq_open/README.md) | Open domain question answering tasks based on the Natural Questions dataset. | English |
......
...@@ -4,9 +4,9 @@ include: _boolq_cot_2shot_yaml ...@@ -4,9 +4,9 @@ include: _boolq_cot_2shot_yaml
fewshot_config: fewshot_config:
sampler: first_n sampler: first_n
samples: samples:
- context: 'This is a ferry domain, where the task is to transport cars from their start to their goal locations, using a ferry. Each location is accessible by ferry from each other location. The cars can be debarked or boarded, and the ferry can carry only one car at a time. There are 2 locations and 5 cars, numbered consecutively. Currently, the ferry is at l1, with the car c4 on board. The cars are at locations as follows: c0 and c3 are at l1; c1 and c2 are at l0.' - context: "This is a ferry domain, where the task is to transport cars from their start to their goal locations, using a ferry. Each location is accessible by ferry from each other location. The cars can be debarked or boarded, and the ferry can carry only one car at a time. There are 2 locations and 5 cars, numbered consecutively. Currently, the ferry is at l1, with the car c4 on board. The cars are at locations as follows: c0 and c3 are at l1; c1 and c2 are at l0."
question: 'Is it possible to transition to a state where the action "travel by sea from location l0 to location l1" can be applied?' question: 'Is it possible to transition to a state where the action "travel by sea from location l0 to location l1" can be applied?'
answer: "Let's think step by step. Step 1: Verify if there is a sequence of actions which transforms the current state into a state where the precondition of the action \"travel by sea from location l0 to location l1\" hold. Step 2: The following sequence of actions would transition to such a state: sail from location l1 to location l0, unload the car c4 from the ferry to location l0, board car c1 at location l0. **Final Answer**: Yes." answer: "Let's think step by step. Step 1: Verify if there is a sequence of actions which transforms the current state into a state where the precondition of the action \"travel by sea from location l0 to location l1\" hold. Step 2: The following sequence of actions would transition to such a state: sail from location l1 to location l0, unload the car c4 from the ferry to location l0, board car c1 at location l0. **Final Answer**: Yes."
- context: 'There are several cities, each containing several locations, some of which are airports. There are also trucks, which can drive within a single city, and airplanes, which can fly between airports. The goal is to get some packages from various locations to various new locations. There are 2 trucks and 1 airplane, as well as 4 packages. There are 6 locations across 2 cities. The locations are in cities as follows: l0-0, l0-1, and l0-2 are in c0; l1-1, l1-2, and l1-0 are in c1. Currently, a0 is at l1-0, t1 is at l1-1, t0 is at l0-0, p2 and p1 are in t1, p0 and p3 are in a0.' - context: "There are several cities, each containing several locations, some of which are airports. There are also trucks, which can drive within a single city, and airplanes, which can fly between airports. The goal is to get some packages from various locations to various new locations. There are 2 trucks and 1 airplane, as well as 4 packages. There are 6 locations across 2 cities. The locations are in cities as follows: l0-0, l0-1, and l0-2 are in c0; l1-1, l1-2, and l1-0 are in c1. Currently, a0 is at l1-0, t1 is at l1-1, t0 is at l0-0, p2 and p1 are in t1, p0 and p3 are in a0."
question: 'Is it possible to transition to a state where the action "offload the object p0 from the truck p0 at location p1" can be applied?' question: 'Is it possible to transition to a state where the action "offload the object p0 from the truck p0 at location p1" can be applied?'
answer: "Let's think step by step. Step 1: Verify if there is a sequence of actions which transforms the current state into a state where the precondition of the action \"offload the object p0 from the truck p0 at location p1\" hold. Step 2: Action preconditions are \"p0 is in p0 and p0 is at p1\". Step 3: These facts are not reachable together, as they include mutually exclusive facts \"p0 is in p0 and p0 is at p1\". **Final Answer**: No." answer: "Let's think step by step. Step 1: Verify if there is a sequence of actions which transforms the current state into a state where the precondition of the action \"offload the object p0 from the truck p0 at location p1\" hold. Step 2: Action preconditions are \"p0 is in p0 and p0 is at p1\". Step 3: These facts are not reachable together, as they include mutually exclusive facts \"p0 is in p0 and p0 is at p1\". **Final Answer**: No."
...@@ -67,7 +67,7 @@ def span_f1_agg(items): ...@@ -67,7 +67,7 @@ def span_f1_agg(items):
def remove_blank_spaces(text): def remove_blank_spaces(text):
text = re.sub(pattern=get_blank_spaces_pattern(), repl="", string=text) text = re.sub(pattern=get_blank_spaces_pattern(), repl="", string=text)
text = re.sub("\s+", " ", text) text = re.sub(r"\s+", " ", text)
return text return text
def remove_punctuation(text): def remove_punctuation(text):
......
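The change above (repeated below for each affected task utils file) only switches the pattern to a raw string; the compiled regex is the same, but a plain `"\s+"` literal contains an unrecognized escape that recent CPython versions warn about. A small sketch, assuming current warning behaviour (DeprecationWarning, SyntaxWarning in 3.12+):

```python
import re

# r"\s+" avoids the invalid-escape warning; the matching behaviour is unchanged.
assert re.sub(r"\s+", " ", "a\t b\n c") == "a b c"
```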
...@@ -67,7 +67,7 @@ def span_f1_agg(items): ...@@ -67,7 +67,7 @@ def span_f1_agg(items):
def remove_blank_spaces(text): def remove_blank_spaces(text):
text = re.sub(pattern=get_blank_spaces_pattern(), repl="", string=text) text = re.sub(pattern=get_blank_spaces_pattern(), repl="", string=text)
text = re.sub("\s+", " ", text) text = re.sub(r"\s+", " ", text)
return text return text
def remove_punctuation(text): def remove_punctuation(text):
......
...@@ -67,7 +67,7 @@ def span_f1_agg(items): ...@@ -67,7 +67,7 @@ def span_f1_agg(items):
def remove_blank_spaces(text): def remove_blank_spaces(text):
text = re.sub(pattern=get_blank_spaces_pattern(), repl="", string=text) text = re.sub(pattern=get_blank_spaces_pattern(), repl="", string=text)
text = re.sub("\s+", " ", text) text = re.sub(r"\s+", " ", text)
return text return text
def remove_punctuation(text): def remove_punctuation(text):
......
...@@ -67,7 +67,7 @@ def span_f1_agg(items): ...@@ -67,7 +67,7 @@ def span_f1_agg(items):
def remove_blank_spaces(text): def remove_blank_spaces(text):
text = re.sub(pattern=get_blank_spaces_pattern(), repl="", string=text) text = re.sub(pattern=get_blank_spaces_pattern(), repl="", string=text)
text = re.sub("\s+", " ", text) text = re.sub(r"\s+", " ", text)
return text return text
def remove_punctuation(text): def remove_punctuation(text):
......
...@@ -67,7 +67,7 @@ def span_f1_agg(items): ...@@ -67,7 +67,7 @@ def span_f1_agg(items):
def remove_blank_spaces(text): def remove_blank_spaces(text):
text = re.sub(pattern=get_blank_spaces_pattern(), repl="", string=text) text = re.sub(pattern=get_blank_spaces_pattern(), repl="", string=text)
text = re.sub("\s+", " ", text) text = re.sub(r"\s+", " ", text)
return text return text
def remove_punctuation(text): def remove_punctuation(text):
......