Commit 8da401e0 authored by lintangsutawika

Merge branch 'big-refactor' of https://github.com/EleutherAI/lm-evaluation-harness into alt_worlds

parents c21240c0 30936bc7
@@ -146,7 +146,7 @@ A full accounting of the supported and planned libraries + APIs can be seen below:
| GooseAI | :heavy_check_mark: (not separately maintained) | `openai`, `openai-completions`, `gooseai` (same interface as OpenAI Completions) | | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Textsynth | Needs testing | `textsynth` | ??? | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Cohere | :hourglass: - blocked on Cohere API bug | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| GGML/[Llama.cpp](https://github.com/ggerganov/llama.cpp) (via [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)) | :heavy_check_mark: | `gguf`, `ggml` | Llama-architecture models (Llama, Llama 2, Llemma, Mistral(?), Llama finetunes) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| vLLM | :x: Not yet - needs help! | N/A | All HF models | `generate_until` (no logprobs) |
| Your inference server here! | ... | ... | ... | ... |
@@ -194,7 +194,7 @@ python -m lm_eval \
    --check_integrity
```
## Advanced Usage Tips
For models loaded with the HuggingFace `transformers` library, any arguments provided via `--model_args` get passed to the relevant constructor directly. This means that anything you can do with `AutoModel` can be done with our library. For example, you can pass a local path via `pretrained=` or use models finetuned with [PEFT](https://github.com/huggingface/peft) by taking the call you would run to evaluate the base model and add `,peft=PATH` to the `model_args` argument:
```bash
@@ -216,21 +216,31 @@ python -m lm_eval \
We support wildcards in task names, for example you can run all of the machine-translated lambada tasks via `--task lambada_openai_mt_*`.
To save evaluation results, provide an `--output_path`. We also support logging model responses with the `--log_samples` flag for post-hoc analysis.
Additionally, one can provide a directory with `--use_cache` to cache the results of prior runs. This allows you to avoid repeated execution of the same (model, task) pairs for re-scoring.
For a full list of supported arguments, check out the [interface](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/interface.md) guide in our documentation!
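For example, a run that writes results to disk, logs per-sample model outputs, and caches requests might look like the following (the `hf` model backend and the model and task names here are only illustrative):

```bash
python -m lm_eval \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks lambada_openai \
    --output_path ./results \
    --log_samples \
    --use_cache ./lm_cache
```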
## How to Contribute or Learn More?
For more information on the library and how everything fits together, check out all of our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor/docs)! We plan to post a larger roadmap of desired + planned library improvements soon, with more information on how contributors can help.
You can also ask for help, or discuss new features with the maintainers in the #lm-thunderdome channel of the EleutherAI discord! If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!
### Implementing new tasks
To implement a new task in the eval harness, see [this guide](./docs/new_task_guide.md).
In general, we follow this priority list for addressing concerns about prompting and other eval details:
1. If there is widespread agreement among people who train LLMs, use the agreed upon procedure.
2. If there is a clear and unambiguous official implementation, use that procedure.
3. If there is widespread agreement among people who evaluate LLMs, use the agreed upon procedure.
4. If there are multiple common implementations but not universal or widespread agreement, use our preferred option among the common implementations. As before, prioritize choosing from among the implementations found in LLM training papers.
These are guidelines and not rules, and can be overruled in special circumstances.

As a start, we currently only support one prompt per task, which we strive to make the "standard" as defined by the benchmark's authors. If you would like to study how varying prompts causes changes in the evaluation score, we support prompts authored in the [Promptsource Library](https://github.com/bigscience-workshop/promptsource/tree/main) as described further in [the task guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/new_task_guide.md) and [the advanced task guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/advanced_task_guide.md), and welcome contributions of novel task templates and task variants.
We try to prioritize agreement with the procedures used by other groups to decrease the harm when people inevitably compare runs across different papers despite our discouragement of the practice. Historically, we also prioritized the implementation from "Language Models are Few Shot Learners" as our original goal was specifically to compare results with that paper.
## Cite as
...
# New Model Guide
The `lm-evaluation-harness` is intended to be a model-agnostic framework for evaluating language models. We provide first-class support for HuggingFace `AutoModelForCausalLM` and `AutoModelForSeq2SeqLM` type models, but other models can be evaluated by implementing the wrapper class described in this guide.
This guide may be of special interest to users who are using the library outside of the repository, i.e., installing the library from PyPI and calling `lm_eval.evaluator.evaluate()` to evaluate an existing model.
In order to properly evaluate a given LM, we require a wrapper class that subclasses `lm_eval.api.model.LM` and defines how the Evaluation Harness should interface with your model. This guide walks through how to write and add this `LM` subclass to the library!
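As a minimal sketch (the backend name and constructor argument below are hypothetical; the required methods mirror those implemented by `GGUFLM` later in this commit), such a subclass has the following shape:

```python
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model


@register_model("my-backend")  # hypothetical name used to select this backend
class MyBackendLM(LM):
    def __init__(self, base_url=None, **kwargs):
        super().__init__()
        self.base_url = base_url  # hypothetical constructor argument

    def loglikelihood(self, requests):
        # each request carries (context, continuation) in `req.args`;
        # return one (total_logprob, is_greedy) tuple per request
        raise NotImplementedError

    def loglikelihood_rolling(self, requests):
        # full-text perplexity scoring; may raise if the backend cannot support it
        raise NotImplementedError

    def generate_until(self, requests):
        # each request carries (context, gen_kwargs) in `req.args`, where
        # gen_kwargs includes the "until" stop sequences; return one string per request
        raise NotImplementedError
```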
...
@@ -273,6 +273,24 @@ to the top of any Python file that is run or imported when performing evaluation
Passing `--tasks /path/to/yaml/file` is also accepted.
## Beautifying Table Display
To avoid conflicts, each task needs to be registered with a unique name. Because of this, slight variations of a task are still counted as unique tasks, and each variation needs its own unique name. This can be done by appending a qualifier that refers to the variation, as in MMLU, where tasks evaluated with the Flan template are differentiated from the default by the prefix `mmlu_flan_*`. Printing the full task names can easily clutter the results table at the end of an evaluation, especially when you have a long list of tasks or are using a benchmark that comprises many tasks. To make the table more legible, you can use `task_alias` and `group_alias` to provide alternative task and group names to be printed.
For example, in `mmlu_abstract_algebra.yaml` we set `group_alias` to `stem` and `task_alias` to `abstract_algebra`:
```
"dataset_name": "abstract_algebra"
"description": "The following are multiple choice questions (with answers) about abstract\
\ algebra.\n\n"
"group": "mmlu_stem"
"group_alias": "stem"
"include": "_default_template_yaml"
"task": "mmlu_abstract_algebra"
"task_alias": "abstract_algebra"
```
Note: Even though `group` can be a list, for now, `group_alias` can only be a single string.
## Checking validity
After registering your task, you can now check on your data downloading and verify that the few-shot samples look as intended. Run the following command with your desired args:
...
This folder is meant to contain instructions and task setups required to evaluate certain papers that use non-standard evaluation setups.
Tasks may already be supported in the library under `lm_eval/tasks` or, if highly paper-specific, may remain as YAMLs in the respective `examples/paper-title` folder.
## Verified Papers:
* [WIP] [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/abs/2201.11903)
* Further details can be found in the `chain_of_thought` subfolder.
## Candidates to Support:
* Least-to-Most Prompting
* Algorithmic Prompting
* Other in-scope prompting techniques
* Multi-turn prompting strategies are likely out of scope for the repository.
* Pythia Suite: Term Frequencies over training
* All setups from GPT-3 Paper
* Varying few-shot orderings + selection; varying the label choices for multiple-choice tasks
* Your Paper Here!
# Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
https://arxiv.org/abs/2201.11903
## All Tasks in Paper
* ...
* ...
* ...
## Reproduction Scripts
* ...
@@ -3,6 +3,7 @@ from . import openai_completions
from . import textsynth
from . import dummy
from . import anthropic_llms
from . import gguf
# TODO: implement __all__
import requests
import logging
import time
from tqdm import tqdm
from requests.exceptions import RequestException
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model
logger = logging.getLogger(__name__)
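# Parse the `logprobs` dict from a completion response: compute the summed
# log-probability of the continuation tokens and whether the continuation
# was the greedy (top-1) choice at every position.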
def get_result(logprobs, context_length):
is_greedy = True
offsets = logprobs["text_offset"]
tokens = logprobs["tokens"]
tokens_logprobs = logprobs["token_logprobs"]
idx = 0
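# skip tokens whose character offset falls within the context string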
while offsets[idx] < context_length:
idx += 1
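# sum log-probs over the continuation tokens, excluding the final entry
# (the single newly generated token from the echo request)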
continuation_logprobs = sum(tokens_logprobs[idx:-1])
for i in range(idx, len(tokens)):
token = tokens[i]
top_tokens = logprobs["top_logprobs"][i]
top_token = max(top_tokens.keys(), key=lambda x: top_tokens[x])
if top_token != token:
is_greedy = False
break
return continuation_logprobs, is_greedy
@register_model("gguf", "ggml")
class GGUFLM(LM):
def __init__(self, base_url=None, max_length=2048, **kwargs):
super().__init__()
self.base_url = base_url
assert self.base_url, "must pass `base_url` to use GGUF LM!"
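# request the top-10 logprobs per position from the server; used for the greedy check in loglikelihood scoring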
self.logprobs = 10
self.temperature = 0.0
self.max_length = max_length
def gguf_completion(
self, context, continuation=None, stop=None, retries=3, delay=5, **kwargs
):
for _ in range(retries):
try:
prompt = context
request = {
"prompt": prompt,
"logprobs": self.logprobs,
"temperature": self.temperature,
}
if continuation:
prompt += continuation
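# for loglikelihood scoring, send context + continuation, generate only a single
# new token, and have the server echo back per-token logprobs for the prompt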
request.update({"prompt": prompt, "max_tokens": 1, "echo": True})
if stop is not None:
request["stop"] = stop
response = requests.post(
f"{self.base_url}/v1/completions", json=request
)
response.raise_for_status()
return response.json()
except RequestException as e:
logger.error(f"RequestException: {e}")
time.sleep(delay) # wait before retrying
else:
raise Exception(f"Failed to get a valid response after {retries} retries.")
def loglikelihood(self, requests):
if not requests:
return []
res = []
for context, continuation in tqdm([req.args for req in requests]):
response = self.gguf_completion(context=context, continuation=continuation)
if response and "choices" in response and response["choices"]:
choice = response["choices"][0]
logprobs = choice.get("logprobs")
if (
logprobs
and "token_logprobs" in logprobs
and logprobs["token_logprobs"]
):
logprob, is_greedy = get_result(logprobs, len(context))
res.append((logprob, is_greedy))
else:
logger.warning(
"Invalid logprobs data. Expected 'logprobs' to contain 'token_logprobs' list."
)
else:
logger.error(
f"Invalid response for loglikelihood. Response: {response}"
)
assert False
return res
def generate_until(self, requests):
if not requests:
return []
res = []
for request in tqdm([req.args for req in requests]):
inp = request[0]
request_args = request[1]
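# stop sequences supplied by the task; fall back to "</s>" if none are given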
until = request_args.get("until", ["</s>"])
response = self.gguf_completion(context=inp, stop=until)
if response and "choices" in response and response["choices"]:
choice = response["choices"][0]
if "text" in choice:
generated_text = choice["text"].strip()
res.append(generated_text)
else:
logger.error(
f"Invalid response for greedy_until. Response: {response}"
)
res.append(None) # Add default value in case of error
else:
logger.error(f"Invalid response for greedy_until. Response: {response}")
res.append(None) # Add default value in case of error
return res
def loglikelihood_rolling(self, requests):
raise NotImplementedError(
"loglikelihood_rolling not yet supported for GGUF models"
)
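As a rough usage sketch (not part of this diff): with a llama.cpp server exposing the `/v1/completions` route used above, this backend could plausibly be selected as follows, assuming `--model_args` are forwarded to the constructor as for other backends; the URL and task name are illustrative.

```bash
python -m lm_eval \
    --model gguf \
    --model_args base_url=http://localhost:8000 \
    --tasks lambada_openai
```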
@@ -889,8 +889,6 @@ class HFLM(LM):
max_gen_toks = kwargs.pop("max_gen_toks")
else:
max_gen_toks = self.max_gen_toks
# set the max length in tokens of inputs ("context_enc")
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM:
@@ -916,7 +914,7 @@ class HFLM(LM):
cont = self._model_generate(
context=context_enc,
attention_mask=attn_masks,
stop=until,
**kwargs,
)
...
@@ -586,7 +586,14 @@ class MultiTokenEOSCriteria(transformers.StoppingCriteria):
self.done_tracker = [False] * batch_size
self.sequence = sequence
self.sequence_ids = tokenizer.encode(sequence, add_special_tokens=False)
# we look back for 2 more tokens than it takes to encode our stop sequence
# because tokenizers suck, and a model might generate `['\n', '\n']` but our `sequence` is `['\n\n']`
# and we don't want to mistakenly not stop a generation because our
# (string) stop sequence was output in a different tokenization
# NOTE: there is a minor danger that this will end up looking back 2 tokens into the past, into the inputs to the model,
# and stopping generation immediately as a result. With only 2 extra tokens of lookback, this risk is minimized
self.sequence_id_len = len(self.sequence_ids) + 2
self.tokenizer = tokenizer
def __call__(self, input_ids, scores, **kwargs) -> bool:
@@ -596,7 +603,6 @@ class MultiTokenEOSCriteria(transformers.StoppingCriteria):
]
lookback_tokens_batch = self.tokenizer.batch_decode(lookback_ids_batch)
for i, done in enumerate(self.done_tracker):
if not done:
self.done_tracker[i] = self.sequence in lookback_tokens_batch[i]
...
import unittest
from unittest.mock import patch
import hashlib
import json
import os
import pickle
from lm_eval.models.gguf import GGUFLM
from lm_eval.api.instance import Instance
base_url = "https://matthoffner-ggml-llm-api.hf.space"
def gguf_completion_mock(base_url=None, **kwargs):
# Generate a hash from the parameters
hash_kwargs = {"base_url": base_url, **kwargs}
hash = hashlib.sha256(
json.dumps(hash_kwargs, sort_keys=True).encode("utf-8")
).hexdigest()
fname = f"./tests/testdata/gguf_test_{hash}.pkl"
if os.path.exists(fname):
with open(fname, "rb") as fh:
return pickle.load(fh)
else:
print("The file does not exist, attempting to write...")
if "stop" in kwargs:
result = {
"choices": [
{
"text": f"generated text until {kwargs['stop']}",
"logprobs": {"token_logprobs": [-1.2345], "text_offset": 0},
"finish_reason": "length",
}
]
}
else:
# generated with # curl -X 'POST' 'http://localhost:8000/v1/completions' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"prompt": "string", "logprobs": 10, "temperature": 0.0, "max_tokens": 1, "echo": true}'
result = {
"id": "cmpl-4023976b-bc6a-43b0-a5a9-629f4216c7f3",
"object": "text_completion",
"created": 1700511361,
"model": "../llama-2-7b.Q8_0.gguf",
"choices": [
{
"text": "string(",
"index": 0,
"logprobs": {
"text_offset": [0, 7],
"token_logprobs": [None, -1.033263319857306],
"tokens": [" string", "("],
"top_logprobs": [
None,
{
"(": -1.033263319857306,
"[]": -2.6530743779017394,
".": -3.0377145947291324,
"\n": -3.0399156750513976,
"_": -3.510376089937872,
" =": -3.6957918347193663,
",": -3.9309459866358702,
" of": -4.2834550083949035,
'("': -4.322762841112799,
"()": -4.426229113466925,
},
],
},
"finish_reason": "length",
}
],
"usage": {
"prompt_tokens": 2,
"completion_tokens": 1,
"total_tokens": 3,
},
}
try:
os.makedirs(os.path.dirname(fname), exist_ok=True)
print("Writing file at", fname)
with open(fname, "wb") as fh:
pickle.dump(result, fh)
print("File written successfully")
except Exception as e:
print("File writing failed:", e)
return result
class GGUFLMTest(unittest.TestCase):
@patch(
"lm_eval.models.gguf.GGUFLM.gguf_completion", side_effect=gguf_completion_mock
)
def test_loglikelihood(self, gguf_completion_mock):
lm = GGUFLM(base_url)
# Test loglikelihood
requests = [
Instance(
request_type="loglikelihood",
doc=args,
arguments=args,
idx=i,
)
for i, args in enumerate([("str", "ing"), ("str", "ing")])
]
res = lm.loglikelihood(requests)
# Assert the loglikelihood response is correct
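# with the mocked response, the continuation token slice is empty, so each summed logprob is 0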
expected_res = [(logprob, True) for logprob in [0, 0]]
self.assertEqual(res, expected_res)
@patch(
"lm_eval.models.gguf.GGUFLM.gguf_completion", side_effect=gguf_completion_mock
)
def test_generate_until(self, gguf_completion_mock):
lm = GGUFLM(base_url)
# Test generate_until
requests = [
Instance(
request_type="generate_until",
doc={"input": doc},
arguments=(doc, {"until": stop}),
idx=i,
)
for i, (doc, stop) in enumerate([("input1", "stop1"), ("input2", "stop2")])
]
res = lm.generate_until(requests)
# Assert the generate_until response is correct
expected_res = ["generated text until stop1", "generated text until stop2"]
self.assertEqual(res, expected_res)
# @patch('lm_eval.models.gguf.GGUFLM.gguf_completion', side_effect=gguf_completion_mock)
# def test_loglikelihood_rolling(self, gguf_completion_mock):
# lm = GGUFLM(base_url)
# # Test loglikelihood_rolling
# requests = ["input1", "input2"]
# res = lm.loglikelihood_rolling(requests)
# # Assert the loglikelihood_rolling response is correct
# expected_res = [(-1.2345, True), (-1.2345, True)]
# self.assertEqual(res, expected_res)
if __name__ == "__main__":
unittest.main()