"src/vscode:/vscode.git/clone" did not exist on "5b686addc0081b90653b3ff61b68d373c3d82f80"
Unverified commit 070d31df authored by KonradSzafer, committed by GitHub

Add chat template (#1873)



* initial chat template

* tokenizer attribute check

* variable rename

* interface update

* system instruction

* system inst default update

* fewshot as multiturn

* typing update

* indent update

* added comments

* Adding a fewshot in a more readable way

* linting

* Moved apply chat template to LM

* multiturn alternation fix

* cache key update

* apply chat template method fix

* add system prompt hash to cache_key

* tokenizer name property for cache_key

* property name fix

* linting backward compatibility fix

* docs and errors update

* add documentation on adding chat template compatibility to model_guide

* fewshot as multiturn check fix

* saving system inst and chat template in results

* eval tracker update

* docs update

* Apply suggestions from code review
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
parent 3e500e9d
@@ -44,6 +44,12 @@ This mode supports a number of command-line arguments, the details of which can
- `--include_path` : Accepts a path to a folder. If passed, then all YAML files containing `lm-eval` compatible task configurations will be added to the task registry as available tasks. Used when one is writing config files for their own task in a folder other than `lm_eval/tasks/`.
- `--system_instruction`: Specifies a system instruction string to prepend to the prompt.
- `--apply_chat_template` : If this flag is on, a chat template will be applied to the prompt. For Hugging Face models, the chat template is taken from the tokenizer; if the tokenizer does not have a chat template, a default one is applied. For other models, chat templating is not currently implemented.
- `--fewshot_as_multiturn` : If this flag is on, the few-shot examples are treated as a multi-turn conversation: questions are provided as user turns and answers as assistant responses. Requires `--num_fewshot` to be greater than 0 and `--apply_chat_template` to be on. A programmatic sketch of these options follows this list.
- `--predict_only`: Generates the model outputs without computing metrics. Use with `--log_samples` to retrieve decoded results.
- `--seed`: Set seed for python's random, numpy and torch. Accepts a comma-separated list of 3 values for python's random, numpy, and torch seeds, respectively, or a single integer to set the same seed for all three. Each value is either an integer or 'None' to leave that seed unset. Default is `0,1234,1234` (for backward compatibility). E.g. `--seed 0,None,8` sets `random.seed(0)` and `torch.manual_seed(8)`; numpy's seed is not set since the second value is `None`. E.g. `--seed 42` sets all three seeds to 42.
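These options are also exposed programmatically. A minimal sketch (the model id and task below are illustrative placeholders, not a recommendation) of driving the same options through `simple_evaluate`:

```python
# Sketch: evaluate a HF model with a chat template, a system instruction,
# and few-shot examples rendered as a multi-turn conversation.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder model id
    tasks=["gsm8k"],  # placeholder task
    num_fewshot=5,  # must be > 0 when fewshot_as_multiturn=True
    system_instruction="You are a helpful math-focused chatbot.",
    apply_chat_template=True,
    fewshot_as_multiturn=True,
)
```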
......
@@ -107,6 +107,53 @@ Using this decorator results in the class being added to an accounting of the us
We also recommend that new model contributions be accompanied by short tests of their 3 core functionalities, at minimum. To see an example of such tests, look at https://github.com/EleutherAI/lm-evaluation-harness/blob/35bdecd379c0cefad6897e67db892f4a6026a128/tests/test_ggml.py .
## Chat Templating
Many models are fine-tuned with a [Chat Template](https://huggingface.co/docs/transformers/main/en/chat_templating) in order to enable back-and-forth interaction between a "User"'s queries and the responses of the model (often called the "Assistant"). It can be desirable to evaluate fine-tuned models on evaluation tasks while wrapped in the conversational format they expect.
In order to make your model optionally compatible with a chat format, three additional methods must be implemented:
```python
class MyCustomLM(LM):
    #...
    @property
    def tokenizer_name(self) -> str:
        # should return a string denoting the name of the model's tokenizer and/or the accompanying chat template.
        ...

    @property
    def chat_template(self) -> str:
        # should return a chat template formatting string that is used to build prompt from a user/assistant chat history.
        # this will be saved in the evaluation results for reproducibility.
        ...

    def apply_chat_template(self, chat_history: List[Dict[str, str]]) -> str:
        # responsible for taking as input a chat history that would be fed into the model, and
        # rendering it as a string that can then be tokenized and input into the model.
        ...
    #...
```
- `apply_chat_template`
    - This method performs the bulk of the work required for chat-formatting.
    - As input, a `chat_history: List[Dict[str, str]]` is passed in. This is a transcript of a conversation of a form similar to
    ```
    [
        {"role": "system", "content": <user-provided system message such as "You are a helpful math-focused chatbot">},
        {"role": "user", "content": <task example - a few-shot example 'input'>},
        {"role": "assistant", "content": <correct response to the above example>},
        # ... more few-shot examples, potentially
        {"role": "user", "content": <test set query whose response we will evaluate>},
    ]
    ```
    which can then be converted into a string input.
    - The output is a string representing this conversation that can be fed into the model.
    - For example, for HFLM this consists of simply calling `tokenizer.apply_chat_template`; see the implementation there for reference.
- `tokenizer_name`
    - LM Eval Harness supports [caching requests](https://github.com/EleutherAI/lm-evaluation-harness/blob/4902aaaf1f374682f95ac25fe2e13b23faddc91a/lm_eval/__main__.py#L140) that are sent to a model, for faster setup when repeating an already-performed evaluation.
    - However, we don't want to reuse the cache of chat transcripts rendered with one chat template or system prompt for a model with a different template! So, we use this `lm.tokenizer_name` string to distinguish caches for a given model (and chat template) from one another.
- `chat_template`
    - Chat templates are typically provided as a Jinja template string, or as a string formatted with `str.format`, that composes user and assistant messages into a single prompt. This template string is saved in the evaluation results to ensure reproducibility.
If these are not implemented for a given model type, the flags `--apply_chat_template`, `--fewshot_as_multiturn`, and `--system_instruction` cannot be used with it.
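As a concrete illustration, here is a minimal sketch of the three methods for a toy custom model. The formatting scheme below is invented for this example; a real implementation would return whatever its tokenizer or serving stack actually uses:

```python
from typing import Dict, List

from lm_eval.api.model import LM


class MyCustomLM(LM):
    # ... loglikelihood / loglikelihood_rolling / generate_until omitted ...

    @property
    def tokenizer_name(self) -> str:
        # any string that uniquely fingerprints the tokenizer + chat template,
        # so cached requests from other templates are not reused
        return "my-org__my-model"

    @property
    def chat_template(self) -> str:
        # illustrative template string; saved verbatim in the results file
        return "{role}: {content}\n"

    def apply_chat_template(self, chat_history: List[Dict[str, str]]) -> str:
        # render the transcript into one prompt string, ending with a cue
        # for the assistant's next turn
        rendered = "".join(
            f"{turn['role']}: {turn['content']}\n" for turn in chat_history
        )
        return rendered + "assistant:"
```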
## Other

**Pro tip**: In order to make the Evaluation Harness overestimate total runtimes rather than underestimate them, HuggingFace models come in-built with the ability to provide responses on data points in *descending order by total input length* via `lm_eval.utils.Reorderer`. Take a look at `lm_eval.models.hf_causal.HFLM` to see how this is done, and see if you can implement it in your own model!
......
@@ -162,6 +162,24 @@ def setup_parser() -> argparse.ArgumentParser:
        default=False,
        help="If True, write out all model outputs and documents for per-sample measurement and post-hoc analysis. Use with --output_path.",
    )
    parser.add_argument(
        "--system_instruction",
        type=str,
        default=None,
        help="System instruction to be used in the prompt",
    )
    parser.add_argument(
        "--apply_chat_template",
        action="store_true",
        default=False,
        help="If True, applies the chat template to the prompt",
    )
    parser.add_argument(
        "--fewshot_as_multiturn",
        action="store_true",
        default=False,
        help="If True, uses the fewshot as a multi-turn conversation",
    )
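    # Example invocation enabled by the flags above (values are illustrative):
    #   lm_eval --model hf --model_args pretrained=EleutherAI/pythia-160m \
    #       --tasks gsm8k --num_fewshot 5 --system_instruction "Be concise." \
    #       --apply_chat_template --fewshot_as_multiturn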
    parser.add_argument(
        "--show_config",
        action="store_true",

@@ -261,10 +279,6 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
        args.hf_hub_log_args += f",token={os.environ.get('HF_TOKEN')}"
    evaluation_tracker_args = simple_parse_args_string(args.hf_hub_log_args)
    evaluation_tracker = EvaluationTracker(**evaluation_tracker_args)
    # removed by this commit: experiment args are now logged inside simple_evaluate
    evaluation_tracker.general_config_tracker.log_experiment_args(
        model_source=args.model,
        model_args=args.model_args,
    )

    if args.predict_only:
        args.log_samples = True

@@ -273,6 +287,18 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
            "Specify --output_path if providing --log_samples or --predict_only"
        )
    if args.fewshot_as_multiturn and args.apply_chat_template is False:
        raise ValueError(
            "If fewshot_as_multiturn is set, apply_chat_template must be set to True."
        )

    if (
        args.num_fewshot is None or args.num_fewshot == 0
    ) and args.fewshot_as_multiturn:
        raise ValueError(
            "If fewshot_as_multiturn is set, num_fewshot must be greater than 0."
        )

    if args.include_path is not None:
        eval_logger.info(f"Including path: {args.include_path}")
        task_manager = TaskManager(args.verbosity, include_path=args.include_path)
@@ -353,6 +379,10 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
            check_integrity=args.check_integrity,
            write_out=args.write_out,
            log_samples=args.log_samples,
            evaluation_tracker=evaluation_tracker,
            system_instruction=args.system_instruction,
            apply_chat_template=args.apply_chat_template,
            fewshot_as_multiturn=args.fewshot_as_multiturn,
            gen_kwargs=args.gen_kwargs,
            task_manager=task_manager,
            verbosity=args.verbosity,
......
@@ -3,7 +3,7 @@ import hashlib
import json
import logging
import os
from typing import Dict, List, Optional, Tuple, Type, TypeVar

import transformers
from sqlitedict import SqliteDict

@@ -114,6 +114,20 @@ class LM(abc.ABC):
        """
        pass
    def apply_chat_template(self, chat_history: List[Dict[str, str]]) -> str:
        """
        Defines how to transform few-shot examples provided as chat history into a format that can be used as input to the LM.

        :param chat_history: list[dict[str, str]]
            A list of dictionaries with keys 'role' and 'content'.
            Values are strings representing the role name and the content of the message, respectively.
        :return: str
            A string representing the chat history in a format that can be used as input to the LM.
        """
        raise NotImplementedError(
            "To use this model with chat templates, please implement the 'apply_chat_template' method for your model type."
        )
    @classmethod
    def create_from_arg_string(
        cls: Type[T], arg_string: str, additional_config: Optional[dict] = None

@@ -169,6 +183,26 @@ class LM(abc.ABC):
        # not support multi-device parallelism nor expect it.
        return self._world_size
    @property
    def tokenizer_name(self) -> str:
        """Must be defined for LM subclasses which implement Chat Templating.
        Should return the name of the tokenizer or chat template used.
        Used only to properly fingerprint caches when requests are being cached with `--cache_requests`, otherwise not used.
        """
        raise NotImplementedError(
            "To use this model with chat templates, please implement the 'tokenizer_name' property."
        )

    @property
    def chat_template(self) -> str:
        """Must be defined for LM subclasses that implement Chat Templating.
        Should return the structure of the chat template applied to user/assistant messages.
        This is used only to save in the experiment results for reproducibility.
        """
        raise NotImplementedError(
            "To use this model with chat templates, please implement the 'chat_template' property."
        )
    def set_cache_hook(self, cache_hook) -> None:
        self.cache_hook = cache_hook
......
@@ -42,37 +42,79 @@ class ContextSampler:
        # TODO: should we just stop people from using fewshot from same split as evaluating?
        selected_docs = [x for x in fewshotex if x != doc][:num_fewshot]
        labeled_examples = ""
        for doc in selected_docs:
            doc_content = self.doc_to_text(doc)
            doc_target = self.doc_to_target(doc)
            labeled_examples += (
                doc_content
                if self.config.doc_to_choice is None or isinstance(doc_content, str)
                else self.doc_to_choice(doc)[doc_content]
            )
            labeled_examples += self.target_delimiter
            labeled_examples += (
                str(doc_target[0])
                if isinstance(doc_target, list)
                else doc_target
                if self.config.doc_to_choice is None or isinstance(doc_target, str)
                else str(self.doc_to_choice(doc)[doc_target])
            )
            labeled_examples += self.fewshot_delimiter

        return labeled_examples
    def get_chat_context(
        self,
        doc,
        num_fewshot,
        fewshot_as_multiturn: bool = False,
    ):
        chat_history = []
        # draw an extra fewshot sample if using same split as evaluating on
        n_samples = (
            num_fewshot + 1
            if self.config.fewshot_split == self.config.test_split
            else num_fewshot
        )
        # draw `n_samples` docs from fewshot_docs
        fewshotex = self.sample(n_samples)

        # get rid of the doc that's the one we're evaluating, if it's in the fewshot
        # TODO: should we just stop people from using fewshot from same split as evaluating?
        selected_docs = [x for x in fewshotex if x != doc][:num_fewshot]

        if fewshot_as_multiturn:
            for doc in selected_docs:
                doc_content = self.doc_to_text(doc)
                doc_target = self.doc_to_target(doc)
                chat_history.append(
                    {
                        "role": "user",
                        "content": doc_content
                        if self.config.doc_to_choice is None
                        or isinstance(doc_content, str)
                        else self.doc_to_choice(doc)[doc_content],
                    }
                )
                chat_history.append(
                    {
                        "role": "assistant",
                        "content": str(doc_target[0])
                        if isinstance(doc_target, list)
                        else doc_target
                        if self.config.doc_to_choice is None
                        or isinstance(doc_target, str)
                        else str(self.doc_to_choice(doc)[doc_target]),
                    }
                )
        else:
            # get fewshot context as one user turn
            chat_history.append(
                {"role": "user", "content": self.get_context(doc, num_fewshot)}
            )

        return chat_history
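    # Illustrative sketch (hypothetical doc contents): with num_fewshot=1 and
    # fewshot_as_multiturn=True, the returned chat_history might look like
    #   [
    #       {"role": "user", "content": "Q: What is 2+2?\nA:"},
    #       {"role": "assistant", "content": "4"},
    #   ]
    # whereas with fewshot_as_multiturn=False the same examples are rendered by
    # get_context() and packed into a single user turn.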
    def sample(self, n):
        """
        Draw `n` samples from our fewshot docs. This method should be overridden by subclasses.
......
@@ -373,6 +373,10 @@ class Task(abc.ABC):
        world_size=None,
        cache_requests=False,
        rewrite_requests_cache=False,
        system_instruction=None,
        apply_chat_template=False,
        fewshot_as_multiturn=False,
        lm=None,
    ) -> None:
        """Build a set of Instances for a task, and store them in task.instances"""

@@ -380,6 +384,14 @@ class Task(abc.ABC):
        og_limit = limit

        cache_key = f"requests-{self._config.task}-{self.config.num_fewshot}shot-rank{rank}-world_size{world_size}"
cache_key += "-chat_template" if apply_chat_template else ""
cache_key += "-fewshot_as_multiturn" if fewshot_as_multiturn else ""
cache_key += (
f"-system_prompt_hash{utils.hash_string(system_instruction)}"
if system_instruction is not None
else ""
)
cache_key += f"-tokenizer{lm.tokenizer_name}" if apply_chat_template else ""
        cached_instances = load_from_cache(file_name=cache_key)

@@ -421,6 +433,10 @@ class Task(abc.ABC):
                fewshot_ctx = self.fewshot_context(
                    doc,
                    0 if self.config.num_fewshot is None else self.config.num_fewshot,
                    system_instruction,
                    apply_chat_template,
                    fewshot_as_multiturn,
                    lm,
                )

        # TODO: we should override self.config.repeats if doing greedy gen so users don't waste time+compute
@@ -969,8 +985,37 @@ class ConfigurableTask(Task):
            )
        return super().fewshot_docs()
    @staticmethod
    def append_target_question(
        labeled_examples: List[Dict[str, str]],
        question: str,
        fewshot_as_multiturn: bool = False,
    ) -> None:
        """Adds a target question to the labeled examples list.
        If fewshot_as_multiturn is True, or labeled_examples is empty, or the last entry is a system turn, appends the question as a new user entry.
        Otherwise, it is appended to the last user entry, ensuring that the conversation alternates between the user and the assistant.
        """
        if not fewshot_as_multiturn:
            # if no messages or last message is system, append as new user entry
            if len(labeled_examples) == 0 or labeled_examples[-1]["role"] == "system":
                labeled_examples.append({"role": "user", "content": question})
            # if last message is user, append to it to avoid two user messages in a row
            else:
                labeled_examples[-1]["content"] += question
        else:
            # if fewshot_as_multiturn is True, append as next user entry (last is always assistant)
            labeled_examples.append({"role": "user", "content": question})
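    # Illustrative: with fewshot_as_multiturn=False, the few-shot block already
    # occupies a single user turn, so the test question is concatenated onto it:
    #   [..., {"role": "user", "content": "<few-shot examples><test question>"}]
    # guaranteeing the rendered conversation never has two consecutive user turns.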
    @utils.positional_deprecated
    def fewshot_context(
        self,
        doc: str,
        num_fewshot: int,
        system_instruction: Optional[str] = None,
        apply_chat_template: bool = False,
        fewshot_as_multiturn: bool = False,
        lm=None,
    ) -> str:
        """Returns a fewshot context string that is made up of a prepended description
        (if provided), the `num_fewshot` number of examples, and an appended prompt example.

@@ -978,22 +1023,90 @@ class ConfigurableTask(Task):
            The document as returned from training_docs, validation_docs, or test_docs.
        :param num_fewshot: int
            The number of fewshot examples to provide in the returned context string.
        :param system_instruction: str
            System instruction to be applied to the prompt.
        :param apply_chat_template: bool
            Whether to apply the chat template to the fewshot context.
        :param fewshot_as_multiturn: bool
            Whether to provide the fewshot examples as a multiturn conversation or a single user turn.
        :param lm:
            Language model with definition of the tokenizer/function to use for applying the chat template.
        :returns: str
            The fewshot context.
        """
        if apply_chat_template:
            labeled_examples = []
        else:
            labeled_examples = ""

        # get task description
        if description := self.config.description:
            description = utils.apply_template(self.config.description, doc)

        # create system prompt based on the provided system instruction and description
        if system_instruction is not None and description:
            system_prompt = (
                f"{system_instruction}{self.sampler.fewshot_delimiter}{description}"
            )
        elif system_instruction is not None:
            system_prompt = system_instruction
        elif description:
            system_prompt = description
        else:
            system_prompt = ""

        # add system prompt if specified
        if system_prompt:
            if apply_chat_template:
                labeled_examples.append({"role": "system", "content": system_prompt})
            else:
                labeled_examples = system_prompt

        # if few-shot - append examples after the system prompt
        if num_fewshot > 0:
            if apply_chat_template:
                labeled_examples.extend(
                    self.sampler.get_chat_context(
                        doc, num_fewshot, fewshot_as_multiturn
                    )
                )
            else:
                labeled_examples += self.sampler.get_context(doc, num_fewshot)

        example = self.doc_to_text(doc)
        if apply_chat_template:
            if self.multiple_input:
                return lm.apply_chat_template(labeled_examples)
            if isinstance(example, str):
                self.append_target_question(
                    labeled_examples, example, fewshot_as_multiturn
                )
            # for loglikelihood create a list of questions with appended choices
            elif isinstance(example, list):
                labeled_examples_list = []
                # copy chat history for each example and append the answer
                for ex in example:
                    chat = deepcopy(labeled_examples)
                    self.append_target_question(chat, ex, fewshot_as_multiturn)
                    labeled_examples_list.append(lm.apply_chat_template(chat))
                return labeled_examples_list
            # if example is an integer, append the choice or convert to string
            elif isinstance(example, int):
                if self.config.doc_to_choice is not None:
                    choices = self.doc_to_choice(doc)
                    self.append_target_question(
                        labeled_examples, choices[example], fewshot_as_multiturn
                    )
                else:
                    self.append_target_question(
                        labeled_examples, str(example), fewshot_as_multiturn
                    )
            return lm.apply_chat_template(labeled_examples)
        else:
            if self.multiple_input:
                return labeled_examples
            if isinstance(example, str):
                return labeled_examples + example
            elif isinstance(example, list):
......
@@ -21,6 +21,7 @@ from lm_eval.evaluator_utils import (
    print_writeout,
    run_task_tests,
)
from lm_eval.loggers import EvaluationTracker
from lm_eval.loggers.utils import add_env_info, get_git_commit_hash
from lm_eval.tasks import TaskManager, get_task_dict
from lm_eval.utils import (

@@ -55,6 +56,10 @@ def simple_evaluate(
    check_integrity: bool = False,
    write_out: bool = False,
    log_samples: bool = True,
    evaluation_tracker: Optional[EvaluationTracker] = None,
    system_instruction: Optional[str] = None,
    apply_chat_template: bool = False,
    fewshot_as_multiturn: bool = False,
    gen_kwargs: Optional[str] = None,
    task_manager: Optional[TaskManager] = None,
    verbosity: str = "INFO",
@@ -99,6 +104,12 @@ def simple_evaluate(
        If True, write out an example document and model input for checking task integrity
    :param log_samples: bool
        If True, write out all model outputs and documents for per-sample measurement and post-hoc analysis
    :param system_instruction: str
        System instruction to be applied to the prompt
    :param apply_chat_template: bool
        If True, apply chat template to the prompt
    :param fewshot_as_multiturn: bool
        Whether to provide the fewshot examples as a multiturn conversation or a single user turn.
    :param gen_kwargs: str
        String arguments for model generation
        Ignored for all tasks with loglikelihood output_type
@@ -254,6 +265,14 @@ def simple_evaluate(
    if check_integrity:
        run_task_tests(task_list=tasks)

    if evaluation_tracker is not None:
        evaluation_tracker.general_config_tracker.log_experiment_args(
            model_source=model,
            model_args=model_args,
            system_instruction=system_instruction,
            chat_template=lm.chat_template if apply_chat_template else None,
        )

    results = evaluate(
        lm=lm,
        task_dict=task_dict,
@@ -263,6 +282,9 @@ def simple_evaluate(
        bootstrap_iters=bootstrap_iters,
        write_out=write_out,
        log_samples=log_samples,
        system_instruction=system_instruction,
        apply_chat_template=apply_chat_template,
        fewshot_as_multiturn=fewshot_as_multiturn,
        verbosity=verbosity,
    )
@@ -318,6 +340,9 @@ def evaluate(
    bootstrap_iters: Optional[int] = 100000,
    write_out: bool = False,
    log_samples: bool = True,
    system_instruction: Optional[str] = None,
    apply_chat_template: bool = False,
    fewshot_as_multiturn: bool = False,
    verbosity: str = "INFO",
):
    """Instantiate and evaluate a model on a list of tasks.
@@ -334,6 +359,12 @@ def evaluate(
        If True, write out an example document and model input for checking task integrity
    :param log_samples: bool
        If True, write out all model outputs and documents for per-sample measurement and post-hoc analysis
    :param system_instruction: str
        System instruction to be applied to the prompt
    :param apply_chat_template: bool
        If True, apply chat template to the prompt
    :param fewshot_as_multiturn: bool
        Whether to provide the fewshot examples as a multiturn conversation or a single user turn.
    :return
        Dictionary of results
    """
@@ -363,6 +394,10 @@ def evaluate(
                world_size=lm.world_size,
                cache_requests=cache_requests,
                rewrite_requests_cache=rewrite_requests_cache,
                system_instruction=system_instruction,
                apply_chat_template=apply_chat_template,
                fewshot_as_multiturn=fewshot_as_multiturn,
                lm=lm,
            )

        eval_logger.debug(
            f"Task: {task_output.task_name}; number of requests on this rank: {len(task.instances)}"
......
@@ -40,6 +40,10 @@ class GeneralConfigTracker:
    model_source: str = None
    model_name: str = None
    model_name_sanitized: str = None
    system_instruction: str = None
    system_instruction_sha: str = None
    chat_template: str = None
    chat_template_sha: str = None
    start_time: float = None
    end_time: float = None
    total_evaluation_time_seconds: str = None
@@ -68,6 +72,8 @@ class GeneralConfigTracker:
        self,
        model_source: str,
        model_args: str,
        system_instruction: str,
        chat_template: str,
    ) -> None:
        """Logs model parameters and job ID."""
        self.model_source = model_source
@@ -75,6 +81,12 @@ class GeneralConfigTracker:
        self.model_name_sanitized = re.sub(
            r"[\"<>:/\|\\?\*\[\]]+", "__", self.model_name
        )
        self.system_instruction = system_instruction
        self.system_instruction_sha = (
            hash_string(system_instruction) if system_instruction else None
        )
        self.chat_template = chat_template
        self.chat_template_sha = hash_string(chat_template) if chat_template else None

    def log_end_time(self) -> None:
        """Logs the end time of the evaluation and calculates the total evaluation time."""
......
@@ -2,7 +2,7 @@ import copy
import os
from datetime import timedelta
from pathlib import Path
from typing import Dict, List, Literal, Optional, Tuple, Union

import torch
import torch.nn.functional as F

@@ -419,6 +419,16 @@ class HFLM(TemplateLM):
    def world_size(self):
        return self._world_size
    @property
    def tokenizer_name(self) -> str:
        return self.tokenizer.name_or_path.replace("/", "__")

    @property
    def chat_template(self) -> str:
        if self.tokenizer.chat_template is not None:
            return self.tokenizer.chat_template
        return self.tokenizer.default_chat_template
    def _get_backend(
        self,
        config: Union[transformers.PretrainedConfig, transformers.AutoConfig],

@@ -1290,6 +1300,14 @@ class HFLM(TemplateLM):
        return res
    def apply_chat_template(self, chat_history: List[Dict[str, str]]) -> str:
        """
        Method to apply a chat template to a list of chat history between user and model.
        """
        return self.tokenizer.apply_chat_template(
            chat_history, tokenize=False, add_generation_prompt=True
        )
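    # Usage sketch (model id is illustrative): for a tokenizer that ships a
    # chat template, the call above behaves like
    #   tok = transformers.AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
    #   tok.apply_chat_template(
    #       [{"role": "user", "content": "What is 2+2?"}],
    #       tokenize=False,
    #       add_generation_prompt=True,
    #   )
    # which returns the rendered prompt string with the assistant turn opened.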
    def get_model_info(self) -> dict:
        """
        Method to get Hugging Face model information for experiment reproducibility.
......