Commit 3e5e9da2 authored by lintangsutawika's avatar lintangsutawika

merged from main

parents d429b47f 7852985b
......@@ -129,6 +129,53 @@ These two options (`accelerate launch` and `parallelize=True`) are mutually excl
**Note: we do not currently support multi-node evaluations natively, and advise using either an externally hosted server to run inference requests against, or creating a custom integration with your distributed framework [as is done for the GPT-NeoX library](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py).**
### NVIDIA `nemo` models
[NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo) is a generative AI framework built for researchers and PyTorch developers working on language models.
To evaluate a `nemo` model, start by installing NeMo following [the documentation](https://github.com/NVIDIA/NeMo?tab=readme-ov-file#installation). We highly recommend using the NVIDIA PyTorch or NeMo container, especially if you run into issues installing Apex or other dependencies (see [latest released containers](https://github.com/NVIDIA/NeMo/releases)). Please also install the lm-evaluation-harness library following the instructions in [the Install section](https://github.com/EleutherAI/lm-evaluation-harness/tree/main?tab=readme-ov-file#install).
NeMo models can be obtained through the [NVIDIA NGC Catalog](https://catalog.ngc.nvidia.com/models) or from [NVIDIA's Hugging Face page](https://huggingface.co/nvidia). The [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo/tree/main/scripts/nlp_language_modeling) provides conversion scripts for converting the Hugging Face checkpoints of popular models such as Llama, Falcon, Mixtral, or MPT to the `nemo` format.
Run a `nemo` model on one GPU:
```bash
lm_eval --model nemo_lm \
--model_args path=<path_to_nemo_model> \
--tasks hellaswag \
--batch_size 32
```
It is recommended to unpack the `nemo` model beforehand rather than inside the Docker container, where it may overflow the available disk space. To do so, run:
```bash
mkdir MY_MODEL
tar -xvf MY_MODEL.nemo -C MY_MODEL
```
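The unpacked directory can then be passed directly as the model path. For example, a minimal sketch reusing the `MY_MODEL` directory created above:
```bash
lm_eval --model nemo_lm \
    --model_args path=MY_MODEL \
    --tasks hellaswag \
    --batch_size 32
```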
#### Multi-GPU evaluation with NVIDIA `nemo` models
By default, only one GPU is used. However, we do support either data replication or tensor/pipeline parallelism during evaluation, on a single node.
1) To enable data replication, set the `devices` option in `model_args` to the number of data replicas to run. For example, the command to run 8 data replicas over 8 GPUs is:
```bash
torchrun --nproc-per-node=8 --no-python lm_eval \
--model nemo_lm \
--model_args path=<path_to_nemo_model>,devices=8 \
--tasks hellaswag \
--batch_size 32
```
2) To enable tensor and/or pipeline parallelism, set `tensor_model_parallel_size` and/or `pipeline_model_parallel_size` in `model_args`. In addition, set `devices` equal to the product of `tensor_model_parallel_size` and `pipeline_model_parallel_size`. For example, the command to use one node of 4 GPUs with tensor parallelism of 2 and pipeline parallelism of 2 is:
```bash
torchrun --nproc-per-node=4 --no-python lm_eval \
--model nemo_lm \
--model_args path=<path_to_nemo_model>,devices=4,tensor_model_parallel_size=2,pipeline_model_parallel_size=2 \
--tasks hellaswag \
--batch_size 32
```
Note that it is recommended to replace the `python` command with `torchrun --nproc-per-node=<number of devices> --no-python` to facilitate loading the model onto the GPUs. This is especially important for large checkpoints loaded onto multiple GPUs.
Not supported yet: multi-node evaluation and combinations of data replication with tensor or pipeline parallelism.
### Tensor + Data Parallel and Optimized Inference with `vLLM`
......@@ -175,6 +222,7 @@ Note that for externally hosted models, configs such as `--device` and `--batch_
| OpenAI Completions | :heavy_check_mark: | `openai-completions`, `local-completions` | All OpenAI Completions API models | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| OpenAI ChatCompletions | :heavy_check_mark: | `openai-chat-completions`, `local-chat-completions` | [All ChatCompletions API models](https://platform.openai.com/docs/guides/gpt) | `generate_until` (no logprobs) |
| Anthropic | :heavy_check_mark: | `anthropic` | [Supported Anthropic Engines](https://docs.anthropic.com/claude/reference/selecting-a-model) | `generate_until` (no logprobs) |
| Anthropic Chat | :heavy_check_mark: | `anthropic-chat`, `anthropic-chat-completions` | [Supported Anthropic Engines](https://docs.anthropic.com/claude/docs/models-overview) | `generate_until` (no logprobs) |
| Textsynth | :heavy_check_mark: | `textsynth` | [All supported engines](https://textsynth.com/documentation.html#engines) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Cohere | [:hourglass: - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| [Llama.cpp](https://github.com/ggerganov/llama.cpp) (via [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)) | :heavy_check_mark: | `gguf`, `ggml` | [All models supported by llama.cpp](https://github.com/ggerganov/llama.cpp) | `generate_until`, `loglikelihood`, (perplexity evaluation not yet implemented) |
......@@ -188,6 +236,10 @@ Models which do not supply logits or logprobs can be used with tasks of type `ge
For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface).
> [!Note]
> For best performance with closed chat model APIs such as Anthropic Claude 3 and GPT-4, we recommend carefully looking at a few sample outputs using `--limit 10` first to confirm that answer extraction and scoring on generative tasks are performing as expected. Providing `system="<some system prompt here>"` within `--model_args` for `anthropic-chat-completions`, to instruct the model what format to respond in, may be useful.
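For example, a hedged invocation of the Anthropic chat backend (the system prompt, task, and output path are illustrative placeholders; `ANTHROPIC_API_KEY` must be set in your environment):
```bash
export ANTHROPIC_API_KEY=<your_api_key>
lm_eval --model anthropic-chat-completions \
    --model_args "model=claude-3-sonnet-20240229,system=Respond with only the final answer" \
    --tasks gsm8k \
    --limit 10 \
    --log_samples \
    --output_path anthropic_check/
```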
### Other Frameworks
A number of other libraries contain scripts for calling the eval harness through their library. These include [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py), [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md), and [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).
......@@ -198,7 +250,7 @@ To create your own custom integration you can follow instructions from [this tut
> [!Note]
> For tasks unsuitable for direct evaluation, either due to risks associated with executing untrusted code or complexities in the evaluation process, the `--predict_only` flag is available to obtain decoded generations for post-hoc evaluation.
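For instance, a minimal sketch (the model and task are placeholders; `--log_samples` together with `--output_path` writes the decoded generations to disk for later scoring):
```bash
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks gsm8k \
    --predict_only \
    --log_samples \
    --output_path predictions/
```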
If you have a Metal compatible Mac, you can run the eval harness using the MPS back-end by replacing `--device cuda:0` with `--device mps` (requires PyTorch version 2.1 or higher). **Note that the PyTorch MPS backend is still in early stages of development, so correctness issues or unsupported operations may exist. If you observe oddities in model performance on the MPS back-end, we recommend first checking that a forward pass of your model on `--device cpu` and `--device mps` match.**
> [!Note]
> You can inspect what the LM inputs look like by running the following command:
......
......@@ -387,7 +387,8 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
if args.log_samples:
for task_name, config in results["configs"].items():
output_name = "{}_{}".format(
re.sub("/|=", "__", args.model_args), task_name
re.sub(r"[\"<>:/\|\\?\*\[\]]+", "__", args.model_args),
task_name,
)
filename = path.joinpath(f"{output_name}.jsonl")
samples_dumped = json.dumps(
......
......@@ -6,6 +6,7 @@ import os
from typing import List, Optional, Tuple, Type, TypeVar
import transformers
from sqlitedict import SqliteDict
from tqdm import tqdm
......@@ -303,17 +304,17 @@ class TemplateLM(LM):
continuation = context[-n_spaces:] + continuation
context = context[:-n_spaces]
model_class = getattr(self, "AUTO_MODEL_CLASS", None)
# The encoder may require the context to end with special tokens
if model_class == transformers.AutoModelForSeq2SeqLM:
context_enc = self.tok_encode(context)
continuation_enc = self.tok_encode(continuation, add_special_tokens=False)
else:
whole_enc = self.tok_encode(context + continuation)
context_enc = self.tok_encode(context)
context_enc_len = len(context_enc)
continuation_enc = whole_enc[context_enc_len:]
return context_enc, continuation_enc
......
......@@ -4,6 +4,7 @@ from . import (
gguf,
huggingface,
mamba_lm,
nemo_lm,
neuron_optimum,
openai_completions,
optimum_lm,
......
......@@ -45,7 +45,7 @@ def anthropic_completion(
except ModuleNotFoundError:
raise Exception(
"attempted to use 'anthropic' LM type, but package `anthropic` is not installed. \
please install anthropic via `pip install lm-eval[anthropic]` or `pip install -e .[anthropic]`",
please install anthropic via `pip install 'lm-eval[anthropic]'` or `pip install -e '.[anthropic]'`",
)
def _exception_callback(e: Exception, sleep_time: float) -> None:
......@@ -74,6 +74,70 @@ please install anthropic via `pip install lm-eval[anthropic]` or `pip install -e
return completion()
def anthropic_chat(
client, #: anthropic.Anthropic,
model: str,
prompt: str,
max_tokens: int,
temperature: float,
stop: List[str],
**kwargs: Any,
) -> str:
"""Wrapper function around the Anthropic completion API client with exponential back-off
in case of RateLimitError.
params:
client: anthropic.Anthropic
Anthropic API client
model: str
Anthropic model e.g. 'claude-3-opus-20240229', 'claude-3-sonnet-20240229'
prompt: str
Prompt to feed to the model
max_tokens: int
Maximum number of tokens to sample from the model
temperature: float
Sampling temperature
stop: List[str]
List of stop sequences
kwargs: Any
Additional model_args to pass to the API client
"""
try:
import anthropic
except ModuleNotFoundError:
raise Exception(
"attempted to use 'anthropic' LM type, but package `anthropic` is not installed. \
please install anthropic via `pip install 'lm-eval[anthropic]'` or `pip install -e '.[anthropic]'`",
)
def _exception_callback(e: Exception, sleep_time: float) -> None:
eval_logger.warning(
f"RateLimitError occurred: {e.__cause__}\n Retrying in {sleep_time} seconds"
)
@retry_on_specific_exceptions(
on_exceptions=[
anthropic.RateLimitError,
anthropic.APIConnectionError,
anthropic.APIStatusError,
],
max_retries=None, # retry forever, consider changing
on_exception_callback=_exception_callback,
)
def messages():
response = client.messages.create(
model=model,
max_tokens=max_tokens,
temperature=temperature,
messages=[{"role": "user", "content": f"{prompt}"}],
**kwargs,
)
return response.content[0].text
return messages()
@register_model("anthropic")
class AnthropicLM(LM):
REQ_CHUNK_SIZE = 20 # TODO: not used
......@@ -104,7 +168,7 @@ class AnthropicLM(LM):
except ModuleNotFoundError:
raise Exception(
"attempted to use 'anthropic' LM type, but package `anthropic` is not installed. \
please install anthropic via `pip install lm-eval[anthropic]` or `pip install -e .[anthropic]`",
please install anthropic via `pip install 'lm-eval[anthropic]'` or `pip install -e '.[anthropic]'`",
)
self.model = model
......@@ -153,7 +217,7 @@ please install anthropic via `pip install lm-eval[anthropic]` or `pip install -e
except ModuleNotFoundError:
raise Exception(
"attempted to use 'anthropic' LM type, but package `anthropic` is not installed. \
please install anthropic via `pip install lm-eval[anthropic]` or `pip install -e .[anthropic]`",
please install anthropic via `pip install 'lm-eval[anthropic]'` or `pip install -e '.[anthropic]'`",
)
if not requests:
......@@ -204,3 +268,93 @@ please install anthropic via `pip install lm-eval[anthropic]` or `pip install -e
def loglikelihood_rolling(self, requests, disable_tqdm: bool = False):
raise NotImplementedError("No support for logits.")
@register_model("anthropic-chat", "anthropic-chat-completions")
class AnthropicChatLM(AnthropicLM):
REQ_CHUNK_SIZE = 20 # TODO: not used
def __init__(
self,
model: str,
batch_size: int = 1,
max_tokens: int = 256,
temperature: float = 0, # defaults to 1
**kwargs, # top_p, top_k, etc.
) -> None:
"""Anthropic API wrapper.
:param model: str
Anthropic model e.g. 'claude-3-opus-20240229', 'claude-3-sonnet-20240229'
:param max_tokens: int
Maximum number of tokens to sample from the model
:param temperature: float
Sampling temperature
:param kwargs: Any
Additional model_args to pass to the API client
"""
super().__init__()
try:
import anthropic
except ModuleNotFoundError:
raise Exception(
"attempted to use 'anthropic' LM type, but package `anthropic` is not installed. \
please install anthropic via `pip install 'lm-eval[anthropic]'` or `pip install -e '.[anthropic]'`",
)
self.model = model
# defaults to os.environ.get("ANTHROPIC_API_KEY")
self.client = anthropic.Anthropic()
self.temperature = temperature
self.max_tokens = max_tokens
self.tokenizer = self.client.get_tokenizer()
self.kwargs = kwargs
@property
def max_gen_toks(self) -> int:
return self.max_tokens
def generate_until(self, requests) -> List[str]:
try:
import anthropic
except ModuleNotFoundError:
raise Exception(
"attempted to use 'anthropic' LM type, but package `anthropic` is not installed. \
please install anthropic via `pip install 'lm-eval[anthropic]'` or `pip install -e '.[anthropic]'`",
)
if not requests:
return []
_requests: List[Tuple[str, dict]] = [req.args for req in requests]
res = []
for request in tqdm(_requests):
try:
inp = request[0]
request_args = request[1]
# generation_kwargs
until = request_args.get("until")
max_tokens = request_args.get("max_gen_toks", self.max_length)
temperature = request_args.get("temperature", self.temperature)
response = anthropic_chat(
client=self.client,
model=self.model,
prompt=inp,
max_tokens=max_tokens,
temperature=temperature, # TODO: implement non-greedy sampling for Anthropic
stop=until, # type: ignore
**self.kwargs,
)
res.append(response)
self.cache_hook.add_partial("generate_until", request, response)
except anthropic.APIConnectionError as e: # type: ignore # noqa: F821
eval_logger.critical(f"Server unreachable: {e.__cause__}")
break
except anthropic.APIStatusError as e: # type: ignore # noqa: F821
eval_logger.critical(f"API error {e.status_code}: {e.message}")
break
return res
......@@ -565,7 +565,8 @@ class HFLM(TemplateLM):
if peft:
if model_kwargs.get("load_in_4bit", None):
assert PEFT_VERSION >= "0.4.0", "load_in_4bit requires peft >= 0.4.0"
if version.parse(PEFT_VERSION) < version.parse("0.4.0"):
raise AssertionError("load_in_4bit requires peft >= 0.4.0")
self._model = PeftModel.from_pretrained(
self._model, peft, revision=revision
)
......@@ -680,14 +681,21 @@ class HFLM(TemplateLM):
self, string: str, left_truncate_len=None, add_special_tokens=None
) -> List[int]:
""" """
# default for None - empty dict, use predefined tokenizer param
# used for all models except for CausalLM or predefined value
special_tokens_kwargs = {}
# by default for CausalLM - false or self.add_bos_token is set
if add_special_tokens is None:
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM:
special_tokens_kwargs = {
"add_special_tokens": False or self.add_bos_token
}
# otherwise the method explicitly defines the value
else:
special_tokens_kwargs = {"add_special_tokens": add_special_tokens}
encoding = self.tokenizer.encode(string, **special_tokens_kwargs)
# left-truncate the encoded context to be at most `left_truncate_len` tokens long
if left_truncate_len:
......@@ -708,9 +716,7 @@ class HFLM(TemplateLM):
add_special_tokens = {}
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM:
add_special_tokens = {"add_special_tokens": False or self.add_bos_token}
encoding = self.tokenizer(
strings,
......@@ -729,10 +735,7 @@ class HFLM(TemplateLM):
return encoding["input_ids"], encoding["attention_mask"]
def tok_decode(self, tokens, skip_special_tokens=True):
return self.tokenizer.decode(tokens, skip_special_tokens=skip_special_tokens)
def _model_call(self, inps, attn_mask=None, labels=None):
"""
......
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import importlib
import pathlib
from copy import deepcopy
from typing import List, Literal
import filelock
import numpy as np
import torch
from tqdm import tqdm
from lm_eval.api.instance import Instance
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model
from lm_eval.models.utils import Collator
from lm_eval.utils import (
eval_logger,
get_rolling_token_windows,
make_disjoint_window,
simple_parse_args_string,
)
def _patch_pretrained_cfg(
pretrained_cfg, trainer, tensor_model_parallel_size, pipeline_model_parallel_size
):
try:
import omegaconf
except ModuleNotFoundError:
raise Exception(
"Attempted to use 'nemo_lm' model type, but package `nemo` is not installed"
"Please install nemo following the instructions in the README: either with a NVIDIA PyTorch or NeMo container, "
"or installing nemo following https://github.com/NVIDIA/NeMo.",
)
omegaconf.OmegaConf.set_struct(pretrained_cfg, True)
with omegaconf.open_dict(pretrained_cfg):
attributes_to_update = {
"sequence_parallel": False,
"activations_checkpoint_granularity": None,
"activations_checkpoint_method": None,
"precision": trainer.precision,
"global_batch_size": None,
"tensor_model_parallel_size": tensor_model_parallel_size,
"pipeline_model_parallel_size": pipeline_model_parallel_size,
"apply_rope_fusion": False,
}
for name, value in attributes_to_update.items():
if hasattr(pretrained_cfg, name):
pretrained_cfg[name] = value
return pretrained_cfg
def _get_target_from_class(target_class) -> str:
return f"{target_class.__module__}.{target_class.__name__}"
def load_model(
model_path: str,
trainer,
tensor_model_parallel_size: int,
pipeline_model_parallel_size: int,
) -> torch.nn.Module:
try:
from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import (
MegatronGPTModel,
)
from nemo.collections.nlp.parts.nlp_overrides import NLPSaveRestoreConnector
except ModuleNotFoundError:
raise Exception(
"Attempted to use 'nemo_lm' model type, but package `nemo` is not installed"
"Please install nemo following the instructions in the README: either with a NVIDIA PyTorch or NeMo container, "
"or installing nemo following https://github.com/NVIDIA/NeMo.",
)
model_path = pathlib.Path(model_path)
save_restore_connector = NLPSaveRestoreConnector()
if model_path.is_dir():
save_restore_connector.model_extracted_dir = model_path.as_posix()
pretrained_cfg = save_restore_connector.restore_from(
None, model_path.as_posix(), return_config=True, trainer=trainer
)
if not hasattr(pretrained_cfg, "target"):
pretrained_cfg["target"] = _get_target_from_class(MegatronGPTModel)
pretrained_cfg = _patch_pretrained_cfg(
pretrained_cfg,
trainer,
tensor_model_parallel_size=tensor_model_parallel_size,
pipeline_model_parallel_size=pipeline_model_parallel_size,
)
model_to_load_path = model_path
override_config = pretrained_cfg
module_name, class_name = override_config.target.rsplit(".", 1)
model_class = getattr(importlib.import_module(module_name), class_name)
# monkeypatch _build_tokenizer method to be process-safe
tokenizer_lock = filelock.FileLock(f"/tmp/{model_path.name}.tokenizer.lock")
def _synced_build_tokenizer(self):
with tokenizer_lock:
self._original_build_tokenizer()
model_class._original_build_tokenizer = model_class._build_tokenizer
model_class._build_tokenizer = _synced_build_tokenizer
model = model_class.restore_from(
restore_path=model_to_load_path.as_posix(),
trainer=trainer,
override_config_path=override_config,
save_restore_connector=save_restore_connector,
map_location=f"cuda:{trainer.local_rank}",
)
model.freeze()
model.training = False
try:
# Have to turn off activations_checkpoint_method for inference
model.model.language_model.encoder.activations_checkpoint_method = None
except AttributeError:
pass
return model
def setup_distributed_environment(trainer):
try:
from nemo.utils.app_state import AppState
except ModuleNotFoundError:
raise Exception(
"Attempted to use 'nemo_lm' model type, but package `nemo` is not installed"
"Please install nemo following the instructions in the README: either with a NVIDIA PyTorch or NeMo container, "
"or installing nemo following https://github.com/NVIDIA/NeMo.",
)
def dummy():
return
if trainer.strategy.launcher is not None:
trainer.strategy.launcher.launch(dummy, trainer=trainer)
trainer.strategy.setup_environment()
app_state = AppState()
return app_state
@register_model("nemo_lm")
class NeMoLM(LM):
def __init__(
self,
path: str,
max_length: int = 4096,
batch_size: int = 1,
max_gen_toks: int = 256,
devices: int = 1,
num_nodes: int = 1,
tensor_model_parallel_size: int = 1,
pipeline_model_parallel_size: int = 1,
precision: Literal[
"16-mixed",
"bf16-mixed",
"32-true",
"64-true",
64,
32,
16,
"64",
"32",
"16",
"bf16",
] = "bf16",
**kwargs,
):
try:
from nemo.collections.nlp.modules.common.text_generation_utils import (
generate,
)
from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy
from pytorch_lightning.trainer.trainer import Trainer
self.generate = generate
except ModuleNotFoundError:
raise Exception(
"Attempted to use 'nemo_lm' model type, but package `nemo` is not installed"
"Please install nemo following the instructions in the README: either with a NVIDIA PyTorch or NeMo container, "
"or installing nemo following https://github.com/NVIDIA/NeMo.",
)
super().__init__()
if (
tensor_model_parallel_size == 1
and pipeline_model_parallel_size == 1
and devices > 1
):
eval_logger.info(
f"The number of data replicas for evaluation is {devices}."
)
eval_logger.info(f"The total number of devices is {devices}.")
eval_logger.info(
"No tensor parallelism or pipeline parallelism is applied."
)
elif tensor_model_parallel_size * pipeline_model_parallel_size == devices:
eval_logger.info(
f"Setting tensor parallelism to {tensor_model_parallel_size} and pipeline parallelism to {pipeline_model_parallel_size}."
)
eval_logger.info(f"The total number of devices is {devices}.")
eval_logger.info("No data parallelism is applied.")
else:
raise ValueError(
"Please set the product of tensor_model_parallel_size and pipeline_model_parallel_size"
"equal to the specified number of devices."
)
if num_nodes > 1:
raise ValueError(
"A number of nodes greater than 1 is not supported yet. Please set num_nodes as 1."
)
trainer = Trainer(
strategy=NLPDDPStrategy(),
devices=devices,
accelerator="gpu",
num_nodes=num_nodes,
precision=precision,
logger=False,
enable_checkpointing=False,
use_distributed_sampler=False,
)
# Modify the following flags only for data replication
if (
tensor_model_parallel_size == 1
and pipeline_model_parallel_size == 1
and devices > 1
):
self._device = torch.device(f"cuda:{trainer.global_rank}")
self._rank = trainer.global_rank
self._world_size = trainer.world_size
self.model = load_model(
path,
trainer,
tensor_model_parallel_size=tensor_model_parallel_size,
pipeline_model_parallel_size=pipeline_model_parallel_size,
).cuda()
self.tokenizer = self.model.tokenizer
self.app_state = setup_distributed_environment(trainer)
self._max_length = max_length
self._batch_size = int(batch_size)
self._max_gen_toks = max_gen_toks
@classmethod
def create_from_arg_string(cls, arg_string, additional_config=None):
args = simple_parse_args_string(arg_string)
if additional_config:
args["batch_size"] = additional_config.get("batch_size", 1)
return cls(**args)
@property
def eot_token_id(self):
try:
return self.tokenizer.eos_id
except AttributeError:
return None
@property
def max_length(self):
return self._max_length
@property
def max_gen_toks(self):
return self._max_gen_toks
@property
def batch_size(self):
return self._batch_size
@property
def device(self):
return self._device
@property
def rank(self):
return self._rank
@property
def world_size(self):
return self._world_size
@property
def accelerator(self):
return self._Accelerator(self.world_size)
class _Accelerator:
def __init__(self, world_size):
self.world_size = world_size
def wait_for_everyone(self):
torch.distributed.barrier()
def gather(self, local_tensor):
gathered_tensors = [
torch.zeros(1, dtype=local_tensor.dtype).cuda()
for _ in range(self.world_size)
]
torch.distributed.all_gather(gathered_tensors, local_tensor)
return torch.cat(gathered_tensors)
def tok_encode(self, string: str):
return self.tokenizer.text_to_ids(string)
def tok_decode(self, tokens):
return self.tokenizer.ids_to_text(tokens)
def _encode_pair(self, context, continuation):
n_spaces = len(context) - len(context.rstrip())
if n_spaces > 0:
continuation = context[-n_spaces:] + continuation
context = context[:-n_spaces]
whole_enc = self.tok_encode(context + continuation)
context_enc = self.tok_encode(context)
context_enc_len = len(context_enc)
continuation_enc = whole_enc[context_enc_len:]
return context_enc, continuation_enc
def loglikelihood(self, requests):
new_reqs = []
for context, continuation in [req.args for req in requests]:
if context == "":
# end of text as context
context_enc, continuation_enc = (
[self.eot_token_id],
self.tok_encode(continuation),
)
else:
context_enc, continuation_enc = self._encode_pair(context, continuation)
new_reqs.append(((context, continuation), context_enc, continuation_enc))
return self._loglikelihood_tokens(new_reqs)
def loglikelihood_rolling(
self, requests: List[Instance], disable_tqdm: bool = False
) -> List[float]:
loglikelihoods = []
for (string,) in tqdm([req.args for req in requests], disable=disable_tqdm):
rolling_token_windows = list(
map(
make_disjoint_window,
get_rolling_token_windows(
token_list=self.tok_encode(string),
prefix_token=self.eot_token_id,
max_seq_len=self.max_length - 1,
context_len=1,
),
)
)
rolling_token_windows = [(None,) + x for x in rolling_token_windows]
string_nll = self._loglikelihood_tokens(
rolling_token_windows,
)
# discard is_greedy
string_nll = [x[0] for x in string_nll]
string_nll = sum(string_nll)
loglikelihoods.append(string_nll)
return loglikelihoods
def _loglikelihood_tokens(self, requests, disable_tqdm=False):
res = []
def _collate(x):
toks = x[1] + x[2]
return -len(toks), tuple(toks)
re_ord = Collator(requests, sort_fn=_collate)
chunks = re_ord.get_batched(n=self.batch_size, batch_fn=None)
pbar = tqdm(
total=len(requests),
disable=(disable_tqdm or (self.rank != 0)),
desc="Running loglikelihood requests",
)
for chunk in chunks:
inps = []
ctxlens = []
contlens = []
for _, context_enc, continuation_enc in chunk:
# Leave one token for generation. Tokens_to_generate = 0 breaks NeMo.
inp = (context_enc + continuation_enc)[-(self.max_length - 1) :]
ctxlen = len(context_enc) - max(
0, len(context_enc) + len(continuation_enc) - (self.max_length - 1)
)
ctxlens.append(ctxlen)
contlens.append(len(continuation_enc))
inps.append(self.tok_decode(inp))
output = self.generate(
self.model,
inputs=inps,
tokens_to_generate=1,
min_tokens_to_generate=1,
compute_logprob=True,
all_probs=True,
)
batch_token_ids = np.asarray(output["token_ids"])[:, :-1]
batch_logprobs = output["logprob"][:, :-1]
batch_full_logprob = output["full_logprob"][:, :-1, :]
# Compute greedy tokens for entire batch rather than calling it with proper ctxlen for each sample.
# Additional tokens for each sample will be trimmed later.
min_ctxlen = min(ctxlens)
# Use min_ctxlen-1 instead of min_ctxlen since full_logprobs are not returned for the first token.
batch_greedy_tokens = (
torch.argmax(batch_full_logprob[:, min_ctxlen - 1 :, :], -1)
.cpu()
.numpy()
)
for token_ids, greedy_tokens, logprobs, ctxlen, contlen, (
cache_key,
_,
_,
) in zip(
batch_token_ids,
batch_greedy_tokens,
batch_logprobs,
ctxlens,
contlens,
chunk,
):
# Trim at contlen since shorter contexts in a batch will have more than one token generated.
# Use ctxlen-1 instead of ctxlen, the same as for full_logprob in the batch_greedy_tokens calculation.
logprobs = (logprobs[ctxlen - 1 :])[:contlen]
logprob = sum(logprobs).tolist()
continuation_tokens = (token_ids[ctxlen:])[:contlen]
len_diff = ctxlen - min_ctxlen
is_greedy = continuation_tokens == (greedy_tokens[len_diff:])[:contlen]
if not isinstance(is_greedy, bool):
is_greedy = is_greedy.all()
answer = (logprob, is_greedy)
if cache_key is not None:
self.cache_hook.add_partial("loglikelihood", cache_key, answer)
res.append(answer)
pbar.update(1)
pbar.close()
return re_ord.get_original(res)
def generate_until(self, requests):
if not requests:
return []
res = []
def get_until(req_args):
until = req_args.get("until", [])
until = deepcopy(until) # prevent from modifying req_args for cache_key
if self.eot_token_id not in until:
until.append(self.eot_token_id)
return until
def _collate(x):
toks = self.tok_encode(x[0])
return len(toks), x[0]
re_ords = Collator(
[reg.args for reg in requests], sort_fn=_collate, group_by="gen_kwargs"
)
chunks = re_ords.get_batched(n=self.batch_size, batch_fn=None)
for chunk in chunks:
contexts, all_gen_kwargs = zip(*chunk)
# we assume all gen kwargs in the batch are the same
# this is safe to assume because the `grouper` object ensures it.
req_args = all_gen_kwargs[0]
# unpack our keyword arguments.
until = get_until(req_args)
max_gen_toks = req_args.get("max_gen_toks", self.max_gen_toks)
remaining_length = self.max_length - max_gen_toks
contexts = []
for context, _ in chunk:
encoded_context = self.tok_encode(context)
encoded_context = encoded_context[-remaining_length:]
contexts.append(self.tok_decode(encoded_context))
output = self.generate(
self.model,
inputs=contexts,
tokens_to_generate=max_gen_toks,
end_strings=until,
greedy=True,
)
answers = output["sentences"]
continuations = []
for context, answer in zip(contexts, answers):
continuations.append(answer[len(context) :])
for term in until:
continuations = [answer.split(term)[0] for answer in continuations]
for request, answer in zip(chunk, continuations):
self.cache_hook.add_partial("greedy_until", request, answer)
res.append(answer)
return re_ords.get_original(res)
......@@ -111,7 +111,7 @@ class OpenaiCompletionsLM(TemplateLM):
self.base_url = base_url
self.tokenizer_backend = tokenizer_backend
self.truncate = truncate
self._batch_size = batch_size
self._batch_size = int(batch_size)
self._max_gen_toks = max_gen_toks
self._max_length = max_length
......
# BasqueGLUE
### Paper
Title: `BasqueGLUE: A Natural Language Understanding Benchmark for Basque`
Abstract: `https://aclanthology.org/2022.lrec-1.172/`
Natural Language Understanding (NLU) technology has improved significantly over the last few years and multitask benchmarks such as GLUE are key to evaluate this improvement in a robust and general way. These benchmarks take into account a wide and diverse set of NLU tasks that require some form of language understanding, beyond the detection of superficial, textual clues. However, they are costly to develop and language-dependent, and therefore they are only available for a small number of languages. In this paper, we present BasqueGLUE, the first NLU benchmark for Basque, a less-resourced language, which has been elaborated from previously existing datasets and following similar criteria to those used for the construction of GLUE and SuperGLUE. We also report the evaluation of two state-of-the-art language models for Basque on BasqueGLUE, thus providing a strong baseline to compare upon. BasqueGLUE is freely available under an open license.
Homepage: `https://github.com/orai-nlp/BasqueGLUE`
Title: `Latxa: An Open Language Model and Evaluation Suite for Basque`
Abstract: `https://arxiv.org/abs/2403.20266`
The use of BasqueGLUE for evaluating the performance of decoder models in Basque is presented in this paper.
Homepage: `https://github.com/hitz-zentroa/latxa`
### Citation
```
@InProceedings{urbizu2022basqueglue,
author = {Urbizu, Gorka and San Vicente, Iñaki and Saralegi, Xabier and Agerri, Rodrigo and Soroa, Aitor},
title = {BasqueGLUE: A Natural Language Understanding Benchmark for Basque},
booktitle = {Proceedings of the Language Resources and Evaluation Conference},
month = {June},
year = {2022},
address = {Marseille, France},
publisher = {European Language Resources Association},
pages = {1603--1612},
url = {https://aclanthology.org/2022.lrec-1.172}
}
@misc{etxaniz2024latxa,
title={Latxa: An Open Language Model and Evaluation Suite for Basque},
author={Julen Etxaniz and Oscar Sainz and Naiara Perez and Itziar Aldabe and German Rigau and Eneko Agirre and Aitor Ormazabal and Mikel Artetxe and Aitor Soroa},
year={2024},
eprint={2403.20266},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Groups
* `basque-glue`: First version of the implementation
#### Tasks
* `bhtc_v2`: Topic classification of news extracts with 12 categories.
* `bec`: Sentiment analysis on tweets about the campaign for the 2016 Basque elections.
* `vaxx_stance`: Stance detection on tweets around the anti-vaccine movement.
* `qnlieu`: Q&A NLI as in [glue/qnli](../glue/qnli).
* `wiceu`: Word-in-Context as in [super_glue/wic](../super_glue/wic).
* `epec_koref_bin`: Coreference detection as in [super_glue/wsc](../super_glue/wsc).
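These tasks can be run with the standard harness CLI, for example (a sketch; the model name is a placeholder):
```bash
lm_eval --model hf \
    --model_args pretrained=<your_hf_model> \
    --tasks basque-glue \
    --batch_size 8
```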
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group: basque-glue
task: bec2016eu
dataset_path: orai-nlp/basqueGLUE
dataset_name: bec
output_type: multiple_choice
validation_split: validation
test_split: test
doc_to_text: "Testua: {{text}}\nGaldera: Nolako jarrera agertzen du aurreko testuak?\nErantzuna:"
doc_to_target: label
doc_to_choice: ['negatiboa', 'neutrala', 'positiboa']
metric_list:
- metric: f1
aggregation: !function utils.micro_f1_score
higher_is_better: true
metadata:
- version: 1.0
group: basque-glue
task: bhtc_v2
dataset_path: orai-nlp/basqueGLUE
dataset_name: bhtc
output_type: multiple_choice
validation_split: validation
test_split: test
doc_to_text: "Testua: {{text}}\nGaldera: Zein da aurreko testuaren gaia?\nErantzuna:"
doc_to_target: label
doc_to_choice: ['Ekonomia', 'Euskal Herria', 'Euskara', 'Gizartea', 'Historia', 'Ingurumena', 'Iritzia', 'Komunikazioa', 'Kultura', 'Nazioartea', 'Politika', 'Zientzia']
metric_list:
- metric: f1
aggregation: !function utils.micro_f1_score
higher_is_better: true
metadata:
- version: 1.0
group: basque-glue
task: epec_koref_bin
dataset_path: orai-nlp/basqueGLUE
dataset_name: coref
output_type: multiple_choice
validation_split: validation
test_split: test
doc_to_text: !function utils.coref_doc_to_text
doc_to_target: label
doc_to_choice: ['ez', 'bai']
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
metadata:
- version: 1.0
group: basque-glue
task: qnlieu
dataset_path: orai-nlp/basqueGLUE
dataset_name: qnli
output_type: multiple_choice
validation_split: validation
test_split: test
doc_to_text: "{{question}}\n{{sentence}}\nGaldera: aurreko galderari erantzuten al dio emandako testuak?\nErantzuna:"
doc_to_target: label
doc_to_choice: ['bai', 'ez']
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
metadata:
- version: 1.0
import html
import re
from datasets import load_metric
def general_detokenize(string):
string = re.sub(r"\s+([.,;:!?)])", r"\1", string)
string = re.sub(r"(\s+|^)\(\s+([^)]+)\s+\)", r"\1(\2)", string)
string = re.sub(r"(\s+|^)\[\s+([^)]+)\s+\]", r"\1[\2]", string)
string = re.sub(r'(\s+|^)"\s+([^"]+)\s+"', r'\1"\2"', string)
string = re.sub(r"(\s+|^)'\s+([^']+)\s+'", r"\1'\2'", string)
return string
def process_doc(string):
string = html.unescape(string)
string = general_detokenize(string)
return string
def process_wic_docs(dataset):
def _helper(doc):
# there are some issues with the encoding on this one
doc["sentence1"] = (
process_doc(doc["sentence1"]).encode("latin-1").decode("utf-8")
)
doc["sentence2"] = (
process_doc(doc["sentence2"]).encode("latin-1").decode("utf-8")
)
return doc
return dataset.map(_helper)
def coref_doc_to_text(x):
def _span_in_context(span_index, span_text):
span_start = span_index
span_end = span_start + len(span_text.split(" ")) - 1
tokens[span_start] = f"*{tokens[span_start]}"
tokens[span_end] = f"{tokens[span_end]}*"
tokens = x["text"].split(" ")
_span_in_context(x["span1_index"], x["span1_text"])
_span_in_context(
x["span2_index"] - 1, x["span2_text"]
) # span1_index is 0-based but span2_index is 1-based ??
context = process_doc(" ".join(tokens))
span_1 = process_doc(x["span1_text"])
span_2 = process_doc(x["span2_text"])
text = (
f"Testua: {context}\n"
+ f'Galdera: Aurreko testuan, "*{span_1}*" eta "*{span_2}*" gauza bera dira?\n'
+ "Erantzuna:"
)
return text
# Measure F1 as in the benchmark repo: https://github.com/orai-nlp/BasqueGLUE/blob/main/eval_basqueglue.py
def micro_f1_score(items):
f1_metric = load_metric("f1")
golds, preds = list(zip(*items))
f1_score = f1_metric.compute(references=golds, predictions=preds, average="micro")[
"f1"
]
return f1_score
def vaxx_f1_score(items):
f1_metric = load_metric("f1")
golds, preds = list(zip(*items))
f1_class = f1_metric.compute(
references=golds, predictions=preds, labels=[0, 2], average=None
)["f1"]
f1_score = sum(f1_class) / len(f1_class)
return f1_score
group: basque-glue
task: vaxx_stance
dataset_path: orai-nlp/basqueGLUE
dataset_name: vaxx
output_type: multiple_choice
validation_split: validation
test_split: test
doc_to_text: "Testua: {{text}}\nGaldera: Nolako jarrera agertzen du aurreko testuak txertoei buruz?\nErantzuna:"
doc_to_target: label
doc_to_choice: ['aurka', 'neutrala', 'alde']
metric_list:
- metric: f1
aggregation: !function utils.vaxx_f1_score
higher_is_better: true
metadata:
- version: 1.0
group: basque-glue
task: wiceu
dataset_path: orai-nlp/basqueGLUE
dataset_name: wic
output_type: multiple_choice
validation_split: validation
test_split: test
process_docs: !function utils.process_wic_docs
doc_to_text: "1. esaldia: {{sentence1}}\n2. esaldia: {{sentence2}}\nGaldera: Aurreko bi esaldietan, \"{{word}}\" hitzak esanahi berdina du?\nErantzuna:"
doc_to_target: label
doc_to_choice: ['ez', 'bai']
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
metadata:
- version: 1.0
# EusExams
### Paper
Title: Latxa: An Open Language Model and Evaluation Suite for Basque
Abstract: https://arxiv.org/abs/2403.20266
EusExams is a collection of tests designed to prepare individuals for Public Service examinations conducted by several Basque institutions, including the public health system Osakidetza, the Basque Government, the City Councils of Bilbao and Gasteiz, and the University of the Basque Country (UPV/EHU). Within each of these groups, there are different exams for public positions, such as administrative and assistant roles. Each multiple-choice question contains 2 to 4 choices (3.90 on average) and one correct answer. The dataset is mostly parallel with 16k questions in Basque and 18k in Spanish.
Homepage: https://github.com/hitz-zentroa/latxa
### Citation
```
@misc{etxaniz2024latxa,
title={Latxa: An Open Language Model and Evaluation Suite for Basque},
author={Julen Etxaniz and Oscar Sainz and Naiara Perez and Itziar Aldabe and German Rigau and Eneko Agirre and Aitor Ormazabal and Mikel Artetxe and Aitor Soroa},
year={2024},
eprint={2403.20266},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Groups
* `eus_exams_eu`: The Basque version of the exams.
* `eus_exams_es`: The Spanish version of the exams.
#### Tasks
Basque and Spanish versions of the exams are available as separate tasks starting with `eus_exams_eu` and `eus_exams_es` respectively.
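As elsewhere in the harness, these can be invoked by group or individual task name, for example (a sketch; the model name is a placeholder):
```bash
lm_eval --model hf \
    --model_args pretrained=<your_hf_model> \
    --tasks eus_exams_eu \
    --batch_size 8
```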
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
import argparse
import json
import requests
import yaml
# get configs from huggingface datasets server by doing a request
response = requests.get(
"https://datasets-server.huggingface.co/splits?dataset=HiTZ%2FEusExams", timeout=5
)
response_json = json.loads(response.text)
CONFIGS = [split["config"] for split in response_json["splits"]]
def gen_config_yamls(output_dir: str, overwrite: bool) -> None:
"""
Generate a yaml file for each config.
:param output_dir: The directory to output the files to.
:param overwrite: Whether to overwrite files if they already exist.
"""
err = []
for config in CONFIGS:
file_name = f"eus_exams_{config}.yaml"
try:
with open(f"{output_dir}/{file_name}", "w" if overwrite else "x") as f:
f.write("# Generated by utils.py\n")
yaml.dump(
{
"include": "eus_exams_es"
if "eus_exams_es" in config
else "eus_exams_eu",
"dataset_name": config,
"task": f"eus_exams_{config}",
},
f,
)
except FileExistsError:
err.append(file_name)
if len(err) > 0:
raise FileExistsError(
"Files were not created because they already exist (use --overwrite flag):"
f" {', '.join(err)}"
)
def main() -> None:
"""Parse CLI args and generate configuage-specific yaml files."""
parser = argparse.ArgumentParser()
parser.add_argument(
"--overwrite",
default=False,
action="store_true",
help="Overwrite files if they already exist",
)
parser.add_argument(
"--output-dir", default=".", help="Directory to write yaml files to"
)
args = parser.parse_args()
gen_config_yamls(output_dir=args.output_dir, overwrite=args.overwrite)
if __name__ == "__main__":
main()
dataset_path: HiTZ/EusExams
dataset_name: null
validation_split: null
test_split: test
fewshot_split: test
process_docs: !function utils.process_docs
output_type: multiple_choice
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: answer
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 0.0
include: eus_exams
group:
- eus_exams_es
doc_to_text: "Pregunta: {{question}}\nA: {{candidates[0]}}\nB: {{candidates[1]}}\nC: {{candidates[2]}}\nD: {{candidates[3]}}\nRespuesta:"