Unverified Commit 6c029a23 authored by Lintang Sutawika, committed by GitHub

Merge pull request #1048 from EleutherAI/even-more-docs

parents 6432933d b75ac377
...@@ -24,12 +24,19 @@
"id": "0gDoM0AJAvEc"
},
"source": [
"Our refactor stems from our beliefs of the following that we think are best practices\n", "Our refactor stems from our desires to make the following believed best practices easier to carry out. \n",
"\n", "\n",
"1. Never Copy Results from Other Papers\n", "1. Never copy results from other papers\n",
"2. Always share your exact prompts\n", "2. Always share your exact prompts\n",
"3. Always provide model outputs\n", "3. Always provide model outputs\n",
"4. Qualitatively review a small batch of outputs before running evaluation jobs at scale\n" "4. Qualitatively review a small batch of outputs before running evaluation jobs at scale\n",
"\n",
"We also wanted to make the library a better experience to use and to contribute or design evaluations within. New features in the new release that serve this purpose include:\n",
"\n",
"1. Faster Evaluation Runtimes (accelerated data-parallel inference with HF Transformers + Accelerate, and commonly used or faster inference libraries such as vLLM and Llama-CPP)\n",
"2. Easier addition and sharing of new tasks (YAML-based task config formats, allowing single-file sharing of custom tasks)\n",
"3. More configurability, for more advanced workflows and easier operation with modifying prompts\n",
"4. Better logging of data at runtime and post-hoc"
]
},
{
...@@ -256,9 +263,9 @@
"id": "8rfUeX6n_wkK"
},
"source": [
"## Make task evaluations with configurable tasks\n", "## Create new evaluation tasks with config-based tasks\n",
"\n", "\n",
"Even within the same task, many works have reported numbers based on different choices of evaluation. Some report on the test sets, validation sets, or even subset of the training sets. Others have specialized prompts and verbalizers. We introduce YAMLs to allow users to easily make different variations. By leveraging the YAML configs to configure evaluations, the refactored LM-Eval takes the methods of the `Tasks` object and makes them configurable by setting the appropriate attributes in the config file. There, users can set the tasks they want by setting the name of the HF dataset (local tasks are also possible) the dataset splits used and much more. Key configurations such as `doc_to_text`, previously implemented as a method of the same name, is now configurable with jinja2 to allow high-level scripting to transform a HF dataset to text string as input to the model.\n", "Even within the same task, many works have reported numbers based on different choices of evaluation. Some report on the test sets, validation sets, or even subset of the training sets. Others have specialized prompts and verbalizers. We introduce YAMLs to allow users to easily make different variations. By leveraging the YAML configs to configure evaluations, the refactored LM-Eval takes the methods of the `Task` object and makes them configurable by setting the appropriate attributes in the config file. There, users can set the tasks they want by setting the name of the HF dataset (local tasks are also possible), the dataset splits used, and much more. Key configurations relating to prompting, such as `doc_to_text`, previously implemented as a method of the same name, are now configurable with jinja2 to allow high-level scripting to transform a HF dataset to text string as input to the model.\n",
"\n" "\n"
] ]
}, },
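{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance, a `doc_to_text` jinja2 template for a dataset with `passage` and `question` columns (hypothetical column names here, purely for illustration) might look like:\n",
"\n",
"```yaml\n",
"doc_to_text: \"{{passage}}\\nQuestion: {{question}}?\\nAnswer:\"\n",
"```"
]
},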
...@@ -268,7 +275,9 @@
"id": "HYFUhhfOSJKe"
},
"source": [
"A core-feature to LM-Eval is to configure tasks with YAML configs. With configs, fill preset fields to easily setup a task." "A core-feature to LM-Eval is to configure tasks with YAML configs. With configs, you can fill preset fields to easily set up a task.\n",
"\n",
"Here, we write a demo YAML config for a multiple-choice evaluation of BoolQ:"
]
},
{
...@@ -299,11 +308,11 @@
]
},
{
"cell_type": "code", "cell_type": "markdown",
"execution_count": null,
"metadata": {}, "metadata": {},
"outputs": [], "source": [
"source": [] "And we can now run evaluation on this task, by pointing to the config file we've just created:"
]
},
{
"cell_type": "code",
...@@ -368,7 +377,7 @@
"id": "LOUHK7PtQfq4"
},
"source": [
"Often, tasks are part of a larger group used to meausre different capabilities. The dynamism of the field today means new dimensions of evaluation can come about which would mix and match new and older tasks alike. In LM-Eval, We can also group tasks and call that the group name to evaluate on a set of tasks easily. In this instance, let's evaluate the group `yes_or_no_tasks` which comprise of the tasks `demo_boolq` and `demo_cola`; tasks which are multiple choice tasks with options `yes` and `no` as the name suggests.\n", "Often, tasks are part of a larger group used to measure different capabilities. The dynamism of the field today means new dimensions of evaluation can come about which would mix and match new and older tasks alike. In LM-Eval, We can also group tasks and call that the group name to evaluate on a set of tasks easily. In this instance, let's evaluate the group `yes_or_no_tasks` which comprise of the tasks `demo_boolq` and `demo_cola`; tasks which are multiple choice tasks with options `yes` and `no` as the name suggests.\n",
"\n", "\n",
"<!-- making new groups is easier than ever, allowing user to work bottom-up by makiing individual tasks and linking them to a group or Top-Down, making a new group by listing existing tasks.\n", "<!-- making new groups is easier than ever, allowing user to work bottom-up by makiing individual tasks and linking them to a group or Top-Down, making a new group by listing existing tasks.\n",
"\n", "\n",
...@@ -685,18 +694,17 @@
"\n",
"`output_type`\n",
"The current provided evaluation types comprise of the following:\n", "The current provided evaluation types comprise of the following:\n",
"1. `loglikelihood`: Evaluates the loglikeihood of a continuation\n", "1. `loglikelihood`: Evaluates the loglikelihood of a continuation, conditioned on some input string.\n",
"2. `loglikelihood_rolling`\n", "2. `loglikelihood_rolling`: evaluate the loglikelihood of producing a string, conditioned on the empty string. (Used for perplexity evaluations)\n",
"3. `multiple_choice`: Evaluates loglikelihood among the a number of choices predicted by the model.\n", "3. `multiple_choice`: Evaluates loglikelihood among the a number of choices predicted by the model.\n",
"4. `greedy_until`: Model outputs greedy generation (can be configured to to use beam search and other generation-related parameters)\n", "4. `greedy_until`: Model outputs greedy generation (can be configured to to use beam search and other generation-related parameters)\n",
"\n", "\n",
"The core prompt revolves around 3 fields.\n", "The core prompt revolves around 3 fields.\n",
"1. `doc_to_text`: Denotes the prompt template that will be used as input to the model.\n", "1. `doc_to_text`: Denotes the prompt template that will be used as input to the model.\n",
"2. `doc_to_choice`: Available choices that will be used as continuation for the model. This is used when the `output_type` is `multiple_choice`.\n", "2. `doc_to_choice`: Available choices that will be used as continuation for the model. This is used when the `output_type` is `multiple_choice`, and otherwise can be left as `None`.\n",
"3. `doc_to_target`: When `output_type` is `multiple_choice` this can be an index that corresponds to the correct answer or the answer string itself (must be a subset of `doc_to_choice`). For other tasks, this is expected to be a string. You can fill this field with a feature name from the HF dataset so long as the resulting feature follows the conditioned described.\n", "3. `doc_to_target`: When `output_type` is `multiple_choice`, this can be an index that corresponds to the correct answer, or the answer string itself (must be a subset of `doc_to_choice`). For other tasks, this is expected to be a string. You can fill this field with a feature name from the HF dataset so long as the resulting feature follows the conditioned described.\n",
"\n", "\n",
"<!-- Advanced notes:\n", "These three fields can be expressed as strings, column names from the source dataset, or as Jinja2 templates that can use fields from the source dataset as variables.\n"
"In some cases, like Winograd, we want to alternate the left-hand side of a prompt and keep the answer the same -->\n"
]
},
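{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch (assuming a dataset with hypothetical columns `question`, `choices`, and `label`, rather than any of the demo tasks in this notebook), these fields might be combined in a config as follows:\n",
"\n",
"```yaml\n",
"output_type: multiple_choice\n",
"doc_to_text: \"{{question}}\\nAnswer:\"\n",
"doc_to_choice: \"{{choices}}\"\n",
"doc_to_target: label\n",
"```"
]
},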
{
...@@ -707,10 +715,17 @@
"source": [
"## What if Jinja is not Sufficient?\n",
"\n",
"There can be times where Jinja is not enough to make the prompt we had in mind.\n", "There can be times where the Jinja2 templating language is not enough to make the prompt we had in mind. There are a few ways to circumvent this limitation:\n",
"\n", "\n",
"1. Use `!function` operator for the prompt-related fields to pass a python function to build the prompt template.\n", "1. Use `!function` operator for the prompt-related fields to pass a python function that takes as input the dataset row, and will output the prompt template component.\n",
"2. Transform the dataset" "2. Perform a transformation on the dataset beforehand."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below, we show an example of using `!function` to create `doc_to_text` from a python function:"
] ]
}, },
{
...@@ -784,6 +799,68 @@
" --output output/demo_mmlu_high_school_geography_function_prompt/ \\\n", " --output output/demo_mmlu_high_school_geography_function_prompt/ \\\n",
" --log_samples\n" " --log_samples\n"
] ]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we'll also show how to do this via preprocessing the dataset as necessary using the `process_docs` config field:\n",
"\n",
"We will write a function that will modify each document in our evaluation dataset's split to add a field that is suitable for us to use in `doc_to_text`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"YAML_mmlu_geo_string = '''\n",
"include: mmlu_high_school_geography.yaml\n",
"task: demo_mmlu_high_school_geography_function_prompt_2\n",
"process_docs: !function utils_process_docs.process_docs\n",
"doc_to_text: \"{{input}}\"\n",
"doc_to_choice: \"{{choices}}\"\n",
"'''\n",
"with open('demo_mmlu_high_school_geography_process_docs.yaml', 'w') as f:\n",
" f.write(YAML_mmlu_geo_string)\n",
"\n",
"DOC_TO_TEXT = '''\n",
"def process_docs(dataset):\n",
" def _process_doc(x):\n",
" question = x[\"question\"].strip()\n",
" choices = x[\"choices\"]\n",
" option_a = choices[0]\n",
" option_b = choices[1]\n",
" option_c = choices[2]\n",
" option_d = choices[3]\n",
" doc[\"input\"] = f\"{question}\\\\nA. {option_a}\\\\nB. {option_b}\\\\nC. {option_c}\\\\nD. {option_d}\\\\nAnswer:\"\n",
" return out_doc\n",
"\n",
" return dataset.map(_process_doc)\n",
"'''\n",
"\n",
"with open('utils_process_docs.py', 'w') as f:\n",
" f.write(DOC_TO_TEXT)\n",
"\n",
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_mmlu_high_school_geography_function_prompt_2 \\\n",
" --limit 10 \\\n",
" --output output/demo_mmlu_high_school_geography_function_prompt_2/ \\\n",
" --log_samples\n"
]
},
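{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we passed `--log_samples`, the per-document inputs and model outputs are also written out, so we can qualitatively review a small batch of outputs (best practice 4 above). Below is a minimal sketch of peeking at those files; the exact file layout under the output directory is an assumption, so adjust the glob pattern to match what was actually written:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import glob\n",
"\n",
"# Hypothetical path pattern: point this at the sample files emitted by --log_samples.\n",
"for path in glob.glob(\"output/demo_mmlu_high_school_geography_function_prompt_2/**/*.json*\", recursive=True):\n",
"    with open(path) as f:\n",
"        print(f.read()[:2000])  # peek at the first few logged samples\n",
"    break\n"
]
},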
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We hope that this explainer gives you a sense of what can be done with and how to work with LM-Evaluation-Harnes v0.4.0 ! \n",
"\n",
"For more information, check out our documentation pages in the `docs/` folder, and if you have questions, please raise them in GitHub issues, or in #lm-thunderdome or #release-discussion on the EleutherAI discord server."
]
} }
],
"metadata": {
...