# Contrastive Learning From AI Revisions (CLAIR)
["Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment"](https://huggingface.co/papers/2408.06266) introduces both Contrastive
Learning from AI Revisions (CLAIR), a data-creation method which leads to more contrastive preference pairs, and Anchored Preference Optimization (APO), a controllable and more stable alignment objective. While APO can be found in [TRL](https://huggingface.co/docs/trl/dpo_trainer#loss-functions), we have implemented a task for CLAIR in `distilabel`.
CLAIR is a method for creating preference pairs which minimally revises one output to express a preference, resulting in a more precise learning signal as opposed to conventional methods which use a judge to select a preferred response.
![CLAIR overview](../../../assets/pipelines/clair.png)
The authors of the original paper shared a [collection of datasets from CLAIR and APO](https://huggingface.co/collections/ContextualAI/clair-and-apo-66b52868672bb1c984d1f3d5), where [ContextualAI/ultrafeedback_clair_32k](https://huggingface.co/datasets/ContextualAI/ultrafeedback_clair_32k) corresponds to the CLAIR implementation.
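In `distilabel`, the CLAIR task takes a `task` (the instruction) and a `student_solution` (the answer to revise) as inputs, and returns a minimally revised solution along with the reasoning behind the revision. As a rough illustration (the values, and even the exact output column names, are illustrative; check the task documentation for details), a processed row looks roughly like:

```python
# Illustrative CLAIR row (values are made up; output column names may differ):
row = {
    # Inputs
    "task": "Explain why the sky is blue in one sentence.",
    "student_solution": "The sky is blue because the ocean reflects onto it.",
    # Outputs: a minimal revision of the student solution plus the reasoning behind it
    "revision": "The sky is blue because air molecules scatter the shorter (blue) wavelengths of sunlight more strongly than the longer ones.",
    "rationale": "The original answer repeats a common misconception; the revision keeps the sentence structure but corrects the physical explanation.",
}
```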
### Replication
!!! NOTE
    The section is named `Replication` but in this case we are showing how to use the [`CLAIR`][distilabel.steps.tasks.clair.CLAIR] task to create revisions for your generations using `distilabel`.
To showcase CLAIR we will be using the [`CLAIR`][distilabel.steps.tasks.clair.CLAIR] task implemented in `distilabel`, reusing for testing a small sample of the dataset already generated by ContextualAI: [`ContextualAI/ultrafeedback_clair_32k`](https://huggingface.co/datasets/ContextualAI/ultrafeedback_clair_32k).
#### Installation
To reproduce the code below, one will need to install `distilabel` as follows:
```bash
pip install "distilabel>=1.4.0"
```
Depending on the LLM provider you want to use, the requirements may vary (take a look at the corresponding dependencies in that case). For this example we are using the free serverless Inference Endpoints from Hugging Face, but that won't be enough for a bigger dataset.
#### Building blocks
In this case where we already have instructions and their generations, we will just need to load the data and the corresponding CLAIR task for the revisions:
- [`CLAIR`](https://distilabel.argilla.io/dev/components-gallery/tasks/clair/) to generate the revisions.
#### Code
Let's see the full pipeline applied to `ContextualAI/ultrafeedback_clair_32k` in `distilabel`:
```python
from typing import Any, Dict
from datasets import load_dataset
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import CLAIR
from distilabel.models import InferenceEndpointsLLM
def transform_ultrafeedback(example: Dict[str, Any]) -> Dict[str, Any]:
return {
"task": example["prompt"],
"student_solution": example["rejected"][1]["content"],
}
dataset = (
load_dataset("ContextualAI/ultrafeedback_clair_32k", split="train")
.select(range(10)) # We collect just 10 examples
.map(transform_ultrafeedback) # Apply the transformation to get just the text
)
with Pipeline(name="CLAIR UltraFeedback sample") as pipeline:
clair = CLAIR( # (1)
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
generation_kwargs={
"temperature": 0.7,
"max_new_tokens": 4096
}
)
)
if __name__ == "__main__":
distiset = pipeline.run(dataset=dataset) # (2)
distiset.push_to_hub(repo_id="username/clair-test", include_script=True) # (3)
```
1. This Pipeline uses just CLAIR because we already have the generations, but one could include a first task to create generations from instructions, and then generate the revisions with CLAIR (see the sketch after these annotations).
2. Include the dataset directly in the run method for simplicity.
3. Push the distiset to the hub with the script for reproducibility.
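As a minimal sketch of the idea in annotation (1), assuming a dataset that only contains a `task` column, one could chain a `TextGeneration` task before `CLAIR`, mapping the generated column to the `student_solution` input that `CLAIR` expects:

```python
from distilabel.models import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import CLAIR, TextGeneration

llm = InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-70B-Instruct")

with Pipeline(name="generate-then-revise") as pipeline:
    # Generate a first answer for each instruction (the "task" column in the dataset).
    text_generation = TextGeneration(
        llm=llm,
        input_mappings={"instruction": "task"},
        output_mappings={"generation": "student_solution"},
    )
    # CLAIR then minimally revises that answer into the preferred response.
    clair = CLAIR(llm=llm)
    text_generation >> clair
```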
An example dataset can be found at: [distilabel-internal-testing/clair-test](https://huggingface.co/datasets/distilabel-internal-testing/clair-test).
# DeepSeek Prover
["DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data"](https://arxiv.org/abs/2405.14333) presents an approach to generate mathematical proofs for theorems generated from informal math problems. This approach shows promising results to advance the capabilities of models towards theorem proving using synthetic data. Until this moment the dataset and the model trained on top of it haven't been opened, let's see how the approach works to reproduce the pipeline using `distilabel`. The following figure depicts the approach taken to generate the dataset:
![DeepSeek-Prover approach](../../../assets/tutorials-assets/deepseek_prover.png)
The authors propose a method for generating [Lean 4](https://github.com/leanprover/lean4) proof data from informal mathematical problems. Their approach translates high-school and undergraduate-level mathematical competition problems into formal statements.
Here we show how to deal with steps 1 and 2. In addition, the authors check the theorems by running the [Lean 4](https://github.com/leanprover/lean4) program on the generated proofs, and iterate for a series of steps: fine-tuning a model on the synthetic data (DeepSeek-Prover 7B), regenerating the dataset, and continuing the process until no further improvement is found.
![DeepSeek-Prover pipeline overview](../../../assets/pipelines/deepseek.png)
### Replication
!!! Note
    The section is named `Replication` but we will show how we can use `distilabel` to create the different steps outlined in the `DeepSeek-Prover` approach. We intentionally leave some steps out of the pipeline, but it can easily be extended.
We will define the components needed to generate a dataset like the one depicted in the previous figure (we won't call lean4 or do the fine-tuning, this last step can be done outside of `distilabel`). The different blocks will have all the docstrings as we would have in the internal steps to showcase how they are done, but they can be omitted for brevity.
## Installation
To reproduce the code below, we need to install `distilabel` as follows:
```bash
pip install "distilabel[hf-inference-endpoints]"
```
We have decided to use [`InferenceEndpointsLLM`](https://distilabel.argilla.io/latest/components-gallery/llms/inferenceendpointsllm/?h=inference#inferenceendpointsllm), but any other provider with a strong model could work.
## Building blocks
There are three components we need to define for this pipeline, matching the stages in the paper: a task to formalize the original statements, another one to assess the relevance of the theorems, and a final one to generate proofs for the theorems.
!!! Note
    We will use the same `LLM` for all the tasks, so we will define it once and reuse it for the different tasks:
```python
llm = InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3-70B-Instruct",
)
```
### DeepSeekProverAutoFormalization
This `Task` corresponds to the first step in the figure. Given an informal statement, it will formalize it for us in [`Lean 4`](https://github.com/leanprover/lean4) language, meaning it will translate from an informal statement that could be gathered from the internet, to the lean4 structured language.
<details close>
<summary>DeepSeekProverAutoFormalization</summary>
```python
import re
from typing import Any, Dict, List, Optional, Union

from jinja2 import Template
from pydantic import PrivateAttr
from typing_extensions import override

from distilabel.steps.tasks.base import Task
from distilabel.steps.tasks.typing import ChatType

_PARSE_DEEPSEEK_PROVER_AUTOFORMAL_REGEX = r"```lean4(.*?)```"
template_deepseek_prover_auto_formalization = """\
Mathematical Problem in Natural Language:
{{ informal_statement }}
{%- if few_shot %}
Please use the following examples to guide you with the answer:
{%- for example in examples %}
- {{ example }}
{%- endfor %}
{% endif -%}"""
class DeepSeekProverAutoFormalization(Task):
examples: Optional[List[str]] = None
system_prompt: str = "Translate the problem to Lean 4 (only the core declaration):\n```lean4\nformal statement goes here\n```"
_template: Union[Template, None] = PrivateAttr(...)
_few_shot: bool = PrivateAttr(default=False)
def load(self) -> None:
super().load()
self._template = Template(template_deepseek_prover_auto_formalization)
@property
def inputs(self) -> List[str]:
return ["informal_statement"]
@property
def outputs(self):
return ["formal_statement", "model_name"]
def format_input(self, input: str) -> ChatType: # type: ignore
return [
{
"role": "system",
"content": self.system_prompt,
},
{
"role": "user",
"content": self._template.render(
informal_statement=input[self.inputs[0]],
few_shot=bool(self.examples),
examples=self.examples,
),
},
]
@override
def format_output( # type: ignore
self, output: Union[str, None], input: Dict[str, Any] = None
) -> Dict[str, Any]: # type: ignore
match = re.search(_PARSE_DEEPSEEK_PROVER_AUTOFORMAL_REGEX, output, re.DOTALL)
if match:
match = match.group(1).strip()
return {"formal_statement": match}
```
</details>
Following the paper, the authors found that the model yields better results when given examples in a few-shot setting, so this class allows passing some examples to help in generating the formulation. Let's see an example of how we can instantiate it:
```python
from textwrap import dedent
examples = [
dedent("""
## Statement in natural language:
For real numbers k and x:
If x is equal to (13 - √131) / 4, and
If the equation 2x² - 13x + k = 0 is satisfied,
Then k must be equal to 19/4.
## Formalized:
theorem mathd_algebra_116 (k x : ℝ) (h₀ : x = (13 - Real.sqrt 131) / 4)
(h₁ : 2 * x ^ 2 - 13 * x + k = 0) : k = 19 / 4 :="""),
dedent("""
## Statement in natural language:
The greatest common divisor (GCD) of 20 factorial (20!) and 200,000 is equal to 40,000.
## Formalized:
theorem mathd_numbertheory_169 : Nat.gcd (Nat.factorial 20) 200000 = 40000 :="""),
dedent("""
## Statement in natural language:
Given two integers x and y:
If y is positive (greater than 0),
And y is less than x,
And the equation x + y + xy = 80 is true,
Then x must be equal to 26.
## Formalized:
theorem mathd_algebra_107 (x y : ℤ) (h₀ : 0 < y) (h₁ : y < x)
(h₂ : x + y + x * y = 80) : x = 26 :="""),
]
auto_formalization = DeepSeekProverAutoFormalization(
name="auto_formalization",
input_batch_size=8,
llm=llm,
examples=examples
)
```
### DeepSeekProverScorer
The next `Task` corresponds to the second step, the model scoring and assessment. It uses an LLM as judge to evaluate the relevance of the theorem, and assigns a score so it can be filtered afterwards.
<details close>
<summary>DeepSeekProverScorer</summary>
```python
template_deepseek_prover_scorer = """\
To evaluate whether a formal Lean4 statement will be of interest to the community, consider the following criteria:
1. Relevance to Current Research: Does the statement address a problem or concept that is actively being researched in mathematics or related fields? Higher relevance scores indicate greater potential interest.
2. Complexity and Depth: Is the statement complex enough to challenge existing theories and methodologies, yet deep enough to provide significant insights or advancements? Complexity and depth showcase Lean4's capabilities and attract interest.
3. Interdisciplinary Potential: Does the statement offer opportunities for interdisciplinary research, connecting mathematics with other fields such as computer science, physics, or biology? Interdisciplinary projects often garner wide interest.
4. Community Needs and Gaps: Does the statement fill an identified need or gap within the Lean4 community or the broader mathematical community? Addressing these needs directly correlates with interest.
5. Innovativeness: How innovative is the statement? Does it propose new methods, concepts, or applications? Innovation drives interest and engagement.
Customize your evaluation for each problem accordingly, assessing it as 'excellent', 'good', 'above average', 'fair' or 'poor'.
You should respond in the following format for each statement:
'''
Natural language: (Detailed explanation of the informal statement, including any relevant background information, assumptions, and definitions.)
Analysis: (Provide a brief justification for each score, highlighting why the statement scored as it did across the criteria.)
Assessment: (Based on the criteria, rate the statement as 'excellent', 'good', 'above average', 'fair' or 'poor'. JUST the Assessment.)
'''"""
class DeepSeekProverScorer(Task):
_template: Union[Template, None] = PrivateAttr(...)
def load(self) -> None:
super().load()
self._template = Template(template_deepseek_prover_scorer)
@property
def inputs(self) -> List[str]:
return ["informal_statement", "formal_statement"]
@property
def outputs(self):
return ["natural_language", "analysis", "assessment", "model_name"]
def format_input(self, input: str) -> ChatType:
return [
{
"role": "system",
"content": self._template.render(),
},
{
"role": "user",
"content": f"## Informal statement:\n{input[self.inputs[0]]}\n\n ## Formal statement:\n{input[self.inputs[1]]}",
},
]
@override
def format_output(
self, output: Union[str, None], input: Dict[str, Any] = None
) -> Dict[str, Any]:
try:
result = output.split("Natural language:")[1].strip()
natural_language, analysis = result.split("Analysis:")
analysis, assessment = analysis.split("Assessment:")
natural_language = natural_language.strip()
analysis = analysis.strip()
assessment = assessment.strip()
except Exception:
natural_language = analysis = assessment = None
return {
"natural_language": natural_language,
"analysis": analysis,
"assessment": assessment
}
```
</details>
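As with the auto-formalization task, we can instantiate the scorer reusing the same `llm` defined above (the batch size here is just a suggestion):

```python
prover_scorer = DeepSeekProverScorer(
    name="prover_scorer",
    input_batch_size=8,
    llm=llm,
)
```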
### DeepSeekProverSolver
The last task is in charge of generating a proof for the theorems generated in the previous steps.
<details close>
<summary>DeepSeekProverSolver</summary>
```python
class DeepSeekProverSolver(Task):
system_prompt: str = (
"You are an expert in proving mathematical theorems formalized in lean4 language. "
"Your answers consist just in the proof to the theorem given, and nothing else."
)
@property
def inputs(self) -> List[str]:
return ["formal_statement"]
@property
def outputs(self):
return ["proof"]
def format_input(self, input: str) -> ChatType:
prompt = dedent("""
Give me a proof for the following theorem:
```lean4
{theorem}
```"""
)
return [
{
"role": "system",
"content": self.system_prompt,
},
{
"role": "user",
"content": prompt.format(theorem=input["formal_statement"]),
},
]
def format_output(
self, output: Union[str, None], input: Dict[str, Any] = None
) -> Dict[str, Any]:
import re
match = re.search(_PARSE_DEEPSEEK_PROVER_AUTOFORMAL_REGEX, output, re.DOTALL)
if match:
match = match.group(1).strip()
return {"proof": match}
```
</details>
Additionally, the original pipeline defined in the paper includes a step to check the final proofs using Lean 4, which we have omitted for simplicity. The fine-tuning can be done completely offline, coming back to the pipeline after each iteration/training run.
*All the docstrings have been removed from the code blocks, but can be seen in the full pipeline.*
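For reference, a minimal sketch of what such a verification step could look like is shown below, assuming a local Lean 4 toolchain with `lean` available on the `PATH` (in practice the authors run this check at scale within a proper Lean project with Mathlib available):

```python
import subprocess
import tempfile
from pathlib import Path


def check_proof(formal_statement: str, proof: str, timeout: int = 60) -> bool:
    """Naive check: write the statement and proof to a file and see if `lean` accepts it."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "check.lean"
        source.write_text(f"{formal_statement}\n{proof}\n")
        try:
            # A zero exit code means Lean elaborated the file without errors.
            result = subprocess.run(
                ["lean", str(source)], capture_output=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0
```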
## Code
Let's put the building blocks together to create the final pipeline with `distilabel`. For this example we have generated a sample dataset [plaguss/informal-mathematical-statements-tiny](https://huggingface.co/datasets/plaguss/informal-mathematical-statements-tiny) of informal mathematical statements starting from [casey-martin/multilingual-mathematical-autoformalization](https://huggingface.co/datasets/casey-martin/multilingual-mathematical-autoformalization), but as the paper mentions, we can create formal statements and their corresponding proofs starting from informal ones:
<details close>
<summary>Click to see the full pipeline</summary>
```python title="deepseek_prover.py"
--8<-- "examples/deepseek_prover.py"
```
</details>
The script can be run in dry-run mode or not depending on the argument (by default the pipeline runs without a dry run), and the dataset will be pushed to the Hub under the name `your_username/test_deepseek_prover`:
```bash
python deepseek_prover.py [-d | --dry-run | --no-dry-run]
```
Final dataset: [plaguss/test_deepseek_prover](https://huggingface.co/datasets/plaguss/test_deepseek_prover).
# DEITA
[DEITA (Data-Efficient Instruction Tuning for Alignment)](https://arxiv.org/abs/2312.15685) studies an automatic data selection process: first, quantifying data quality based on complexity, quality and diversity; second, selecting the best potential combination from an open-source dataset that fits into the budget you allocate to tune your own LLM.
In most settings we cannot allocate unlimited resources for instruction-tuning LLMs. Therefore, the DEITA authors investigated how to select qualitative data for instruction tuning based on the principle of fewer high-quality samples. Liu et al. tackle the issue of first defining good data and second identifying it to respect an initial budget to instruction-tune your LLM.
The strategy utilizes **LLMs to replace human effort in time-intensive data quality tasks on instruction-tuning datasets**. DEITA introduces a way to measure data quality across three critical dimensions: complexity, quality and diversity.
![DEITA pipeline overview](../../../assets/tutorials-assets/deita/overview.png)
Here we encounter a dataset of instructions/responses again, and we roughly reproduce the second step: learning how to optimize responses for an instruction by comparing several possibilities.
![DEITA pipeline overview](../../../assets/pipelines/deita.png)
### Datasets and budget
We will dive deeper into the whole process. We will investigate each stage to efficiently select the final dataset used for supervised fine-tuning with a budget constraint. We will tackle technical challenges by explaining exactly how you would assess good data as presented in the paper.
As a reminder, we're looking for a strategy to automatically select good data for the instruction-tuning step when you want to fine-tune an LLM to your own use case taking into account a resource constraint. This means that you cannot blindly train a model on any data you encounter on the internet.
The DEITA authors assume that you have access to open-source datasets that fit your use case. This may not be the case entirely. But with open-source communities tackling many use cases, with projects such as [BLOOM](https://arxiv.org/pdf/2110.08207.pdf) or [AYA](https://cohere.com/research/aya), it's likely that your use case will be tackled at some point. Furthermore, you could generate your own instruction/response pairs with methods such as [self-generated instructions](https://aclanthology.org/2023.acl-long.754/) using distilabel. This tutorial assumes that we have a data pool with excessive samples for the project's cost constraint. In short, we aim to achieve adequate performance from fewer samples.
The authors claim that the subsample size "correlates proportionally with the computation consumed in instruction tuning". Hence, to a first approximation, reducing the sample size means reducing computation consumption, and so the total development cost. Following the paper's notation, we will associate the budget $m$ with a number of instruction/response pairs that you can set depending on your real budget.
![Datasets table](../../../assets/tutorials-assets/deita/datasets.png)
To match the experimental set-up, dataset *X\_sota* is a meta-dataset combining major open-source datasets available to instruct-tune LLMs. This dataset is composed of [ShareGPT](https://huggingface.co/datasets/shibing624/sharegpt_gpt4) (58k instruction/response pairs), [UltraChat](https://huggingface.co/datasets/stingning/ultrachat) (105k instruction/response pairs) and [WizardLM](https://github.com/nlpxucan/WizardLM) (143k instruction/response pairs). It sums to more than 300k instruction/response pairs. We aim to reduce the final subsample to 6k instruction/response pairs.
## Setup the notebook and packages
Let's prepare our dependencies:
```bash
pip install "distilabel[openai,hf-transformers]>=1.0.0"
pip install pynvml huggingface_hub argilla
```
Import distilabel:
```python
from distilabel.models import TransformersLLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import ConversationTemplate, DeitaFiltering, ExpandColumns, LoadDataFromHub
from distilabel.steps.tasks import ComplexityScorer, EvolInstruct, EvolQuality, GenerateEmbeddings, QualityScorer
```
Define the distilabel Pipeline and load the dataset from the Hugging Face Hub.
```python
pipeline = Pipeline(name="DEITA")
load_data = LoadDataFromHub(
name="load_data", batch_size=100, output_mappings={"prompt": "instruction"}, pipeline=pipeline
)
```
## EVOL-INSTRUCT: Generate Instructions with an LLM
[Evol-Instruct](https://arxiv.org/abs/2304.12244) automates the creation of complex instruction data for training large language models (LLMs) by progressively rewriting an initial set of instructions into more complex forms. This generated data is then used to fine-tune a model named WizardLM.
Evaluations show that instructions from Evol-Instruct are superior to human-created ones, and WizardLM achieves performance close to or exceeding GPT-3.5-turbo in many skills. In `distilabel`, we initialise each step of the data generation pipeline. Later, we'll connect them together.
```python
evol_instruction_complexity = EvolInstruct(
name="evol_instruction_complexity",
llm=OpenAILLM(model="gpt-3.5-turbo"),
num_evolutions=5,
store_evolutions=True,
generate_answers=True,
include_original_instruction=True,
pipeline=pipeline,
)
evol_instruction_complexity.load()
_evolved_instructions = next(evol_instruction_complexity.process(
([{"instruction": "How many fish are there in a dozen fish?"}]))
)
print(*_evolved_instructions, sep="\n")
```
Output:
```bash
( 1, 'How many fish are there in a dozen fish?')
( 2, 'How many rainbow trout are there in a dozen rainbow trout?')
( 3, 'What is the average weight in pounds of a dozen rainbow trout caught in a specific river in Alaska during the month of May?')
```
## EVOL COMPLEXITY: Evaluate complexity of generated instructions
The second step is the evaluation of *complexity* for an instruction in a given instruction-response pair. Like EVOL-INSTRUCT, this method uses LLMs instead of humans to automatically improve instructions, specifically through their complexity. From any instruction-response pair, $(I, R)$, we first generate new instructions following the In-Depth Evolving strategy. We generate more complex instructions through prompting, as explained by the authors, by adding constraints or reasoning steps. Let's take an example from [GPT-4-LLM](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM), which aims to generate observations with GPT-4 to instruction-tune LLMs with supervised fine-tuning. And we have the instruction $instruction_0$:
```python
instruction_0 = "Give three tips for staying healthy."
```
To make it more complex, you can use, as the authors did, some prompt templates to add constraints or deepen the instruction. They provided some prompts in the paper appendix. For instance, this one was used to add constraints:
```python
PROMPT = """I want you act as a Prompt Rewriter.
Your objective is to rewrite a given prompt into a more complex version to
make those famous AI systems (e.g., ChatGPT and GPT4) a bit harder to handle.
But the rewritten prompt must be reasonable and must be understood and
responded by humans.
Your rewriting cannot omit the non-text parts such as the table and code in
#Given Prompt#:. Also, please do not omit the input in #Given Prompt#.
You SHOULD complicate the given prompt using the following method:
Please add one more constraints/requirements into #Given Prompt#
You should try your best not to make the #Rewritten Prompt# become verbose,
#Rewritten Prompt# can only add 10 to 20 words into #Given Prompt#.
‘#Given Prompt#’, ‘#Rewritten Prompt#’, ‘given prompt’ and ‘rewritten prompt’
are not allowed to appear in #Rewritten Prompt#
#Given Prompt#:
<Here is instruction>
#Rewritten Prompt#:
"""
```
Prompting this to an LLM, you automatically get a more complex instruction, called $instruction_1$, from an initial instruction $instruction_0$:
```python
instruction_1 = "Provide three recommendations for maintaining well-being, ensuring one focuses on mental health."
```
With sequences of evolved instructions, we use a further LLM to automatically rank and score them. We provide all 6 instructions at the same time. By providing all instructions together, we force the scoring model to look at minor complexity differences between evolved instructions, encouraging it to discriminate between them. Taking the example below, $instruction_0$ and $instruction_1$ could deserve the same score independently, but when compared together we would notice the slight difference that makes $instruction_1$ more complex.
In `distilabel`, we implement this like so:
```python
instruction_complexity_scorer = ComplexityScorer(
name="instruction_complexity_scorer",
llm=OpenAILLM(model="gpt-3.5-turbo"),
input_mappings={"instructions": "evolved_instructions"},
pipeline=pipeline,
)
expand_evolved_instructions = ExpandColumns(
name="expand_evolved_instructions",
columns=["evolved_instructions", "answers", "scores"],
output_mappings={
"evolved_instructions": "evolved_instruction",
"answers": "answer",
"scores": "evol_instruction_score",
},
pipeline=pipeline,
)
instruction_complexity_scorer.load()
_evolved_instructions = next(instruction_complexity_scorer.process(([{"evolved_instructions": [PROMPT + instruction_1]}])))
print("Original Instruction:")
print(instruction_1)
print("\nEvolved Instruction:")
print(_evolved_instructions[0]["evolved_instructions"][0].split("#Rewritten Prompt#:\n")[1])
```
Output:
```
Original Instruction:
Provide three recommendations for maintaining well-being, ensuring one focuses on mental health.
Evolved Instruction:
Suggest three strategies for nurturing overall well-being, with the stipulation that at least one explicitly addresses the enhancement of mental health, incorporating evidence-based practices.
```
## EVOL-QUALITY: Quality Evaluation
Now that we have scored the *complexity* of the instructions, we will focus on the *quality* of the responses. Similar to *EVOL COMPLEXITY*, the authors introduced *EVOL QUALITY*, a method based on LLMs, instead of humans, to automatically score the quality of the response.
From an instruction-response pair, $(I, R)$, the goal is to make the response evolve into a more helpful and relevant response. The key difference is that we need to also provide the first instruction to guide evolution. Let's take back our example from [GPT-4-LLM](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM).
Here we have the response $response_0$ and its initial instruction $instruction_0$:
```python
instruction_0 = "Give three tips for staying healthy."
reponse_0 = "1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases. 2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week. 3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night."
```
Again the authors provided several prompts you could use to make your response evolve according to some guidelines. For example, this one was used to enrich the answer:
```python
PROMPT = """I want you to act as a Response Rewriter
Your goal is to enhance the quality of the response given by an AI assistant
to the #Given Prompt# through rewriting.
But the rewritten response must be reasonable and must be understood by humans.
Your rewriting cannot omit the non-text parts such as the table and code in
#Given Prompt# and #Given Response#. Also, please do not omit the input
in #Given Prompt#.
You Should enhance the quality of the response using the following method:
Please make the Response more in-depth
You should try your best not to make the #Rewritten Response# become verbose,
#Rewritten Response# can only add 10 to 20 words into #Given Response#.
‘#Given Response#’, ‘#Rewritten Response#’, ‘given response’ and ‘rewritten response’
are not allowed to appear in #Rewritten Response#
#Given Prompt#:
<instruction_0>
#Given Response#:
<response_0>
#Rewritten Response#:
"""
```
Prompting this to an LLM, you will automatically get a more enriched response, called $response_1$, from an initial response $response_0$ and initial instruction $instruction_0$:
```python
evol_response_quality = EvolQuality(
name="evol_response_quality",
llm=OpenAILLM(model="gpt-3.5-turbo"),
num_evolutions=5,
store_evolutions=True,
include_original_response=True,
input_mappings={
"instruction": "evolved_instruction",
"response": "answer",
},
pipeline=pipeline,
)
evol_response_quality.load()
_evolved_responses = next(evol_response_quality.process([{"instruction": PROMPT + instruction_0, "response": response_0}]))
print("Original Response:")
print(response_0)
print("\nEvolved Response:")
print(*_evolved_responses[0]['evolved_responses'], sep="\n")
```
And now, as in EVOL COMPLEXITY, you iterate through this process and use different prompts to make your responses more relevant, helpful or creative. In the paper, they run 4 more iterations to get 5 evolved responses $(R_0, R_1, R_2, R_3, R_4)$, which makes 5 different responses for one initial instruction at the end of this step.
```python
response_quality_scorer = QualityScorer(
name="response_quality_scorer",
llm=OpenAILLM(model="gpt-3.5-turbo"),
input_mappings={
"instruction": "evolved_instruction",
"responses": "evolved_responses",
},
pipeline=pipeline,
)
expand_evolved_responses = ExpandColumns(
name="expand_evolved_responses",
columns=["evolved_responses", "scores"],
output_mappings={
"evolved_responses": "evolved_response",
"scores": "evol_response_score",
},
pipeline=pipeline,
)
response_quality_scorer.load()
_scored_responses = next(response_quality_scorer.process([{"instruction": PROMPT + instruction_0, "responses": _evolved_responses[0]['evolved_responses']}]))
print("Original Response:")
print(response_0)
print("\nScore, Evolved Response:")
print(*zip(_scored_responses[0]["scores"], _evolved_responses[0]['evolved_responses']), sep="\n")
```
Output:
```bash
Original Response:
1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases. 2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week. 3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.
Score, Evolved Response:
(4.0, 'Here are three essential tips for maintaining good health: \n1. Prioritize regular exercise \n2. Eat a balanced diet with plenty of fruits and vegetables \n3. Get an adequate amount of sleep each night.')
(2.0, 'Here are three effective strategies to maintain a healthy lifestyle.')
(5.0, 'Here are three practical tips to maintain good health: Ensure a balanced diet, engage in regular exercise, and prioritize sufficient sleep. These practices support overall well-being.')
```
## Improving Data Diversity
One main component of good data to instruction-tune LLMs is diversity. Real-world data can often contain [redundancy](https://openreview.net/forum?id=u96ZBg_Shna) due to repetitive and homogeneous data.
The authors of the DEITA paper tackle the challenge of ensuring data diversity when instruction tuning LLMs, to avoid the pitfalls of data redundancy that can lead to over-fitting or poor generalization. They propose an embedding-based method to filter data for diversity. This method, called Repr Filter, uses embeddings generated by the *Llama 1 13B* model to represent instruction-response pairs in a vector space. The diversity of a new data sample is assessed based on the cosine distance between its embedding and that of its nearest neighbor in the already selected dataset. If this distance is greater than a specified threshold, the sample is considered diverse and is added to the selection. This process prioritizes diversity by assessing each sample's contribution to the variety of the dataset until the data selection budget is met. This approach effectively maintains the diversity of the data used for instruction tuning, as demonstrated by the DEITA models outperforming or matching state-of-the-art models with significantly less training data. In this implementation of DEITA we use the hidden state of the last layer of a Llama 2 model to generate embeddings, instead of a sentence transformer model, because we found that it improved the diversity of the data selection.
![DEITA diversity filtering](../../../assets/tutorials-assets/deita/diversity.png)
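To make the Repr Filter concrete, here is a minimal, self-contained sketch of the greedy selection loop using plain NumPy; the threshold and budget values are placeholders, and the `DeitaFiltering` step below is what actually implements this in `distilabel`:

```python
import numpy as np


def repr_filter(embeddings: np.ndarray, data_budget: int, threshold: float = 0.04) -> list[int]:
    """Greedily select sample indices whose nearest selected neighbor is farther than `threshold`."""
    # Normalize rows so cosine distance becomes 1 minus the dot product.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    selected: list[int] = []
    for i in range(len(normed)):  # assumes rows are pre-sorted by the evol score
        if not selected:
            selected.append(i)
            continue
        # Cosine distance between candidate i and every sample kept so far.
        distances = 1 - normed[selected] @ normed[i]
        if distances.min() > threshold:  # diverse enough w.r.t. the selection
            selected.append(i)
        if len(selected) == data_budget:
            break
    return selected
```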
```python
generate_conversation = ConversationTemplate(
name="generate_conversation",
input_mappings={
"instruction": "evolved_instruction",
"response": "evolved_response",
},
pipeline=pipeline,
)
generate_embeddings = GenerateEmbeddings(
name="generate_embeddings",
llm=TransformersLLM(
model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
device="cuda",
torch_dtype="float16",
),
input_mappings={"text": "conversation"},
input_batch_size=5,
pipeline=pipeline,
)
deita_filtering = DeitaFiltering(name="deita_filtering", pipeline=pipeline)
```
## Build the ⚗ distilabel `Pipeline`
Now we're ready to build a `distilabel` pipeline using the DEITA method:
```python
load_data.connect(evol_instruction_complexity)
evol_instruction_complexity.connect(instruction_complexity_scorer)
instruction_complexity_scorer.connect(expand_evolved_instructions)
expand_evolved_instructions.connect(evol_response_quality)
evol_response_quality.connect(response_quality_scorer)
response_quality_scorer.connect(expand_evolved_responses)
expand_evolved_responses.connect(generate_conversation)
generate_conversation.connect(generate_embeddings)
generate_embeddings.connect(deita_filtering)
```
Now we can run the pipeline. We use the step names to reference them in the pipeline configuration:
```python
distiset = pipeline.run(
parameters={
"load_data": {
"repo_id": "distilabel-internal-testing/instruction-dataset-50",
"split": "train",
},
"evol_instruction_complexity": {
"llm": {"generation_kwargs": {"max_new_tokens": 512, "temperature": 0.7}}
},
"instruction_complexity_scorer": {
"llm": {"generation_kwargs": {"temperature": 0.0}}
},
"evol_response_quality": {
"llm": {"generation_kwargs": {"max_new_tokens": 512, "temperature": 0.7}}
},
"response_quality_scorer": {"llm": {"generation_kwargs": {"temperature": 0.0}}},
"deita_filtering": {"data_budget": 500, "diversity_threshold": 0.04},
},
use_cache=False,
)
```
We can push the results to the Hugging Face Hub:
```python
distiset.push_to_hub("distilabel-internal-testing/deita-colab")
```
## Results
Again, to show the relevance of the EVOL QUALITY method, the authors evaluated on MT-Bench models fine-tuned with different data selections, according to how quality responses were defined for an instruction. Each time they selected 6k samples according to the quality score:
![DEITA results](../../../assets/tutorials-assets/deita/results.png)
Credit: Liu et al. (2023)
The score is much better when selecting data with the EVOL QUALITY method than when selecting randomly or by length (where longer responses are assumed to be more qualitative). Nevertheless, the margin is thinner than the one we saw for the complexity score, and we'll discuss the strategy in a later part. Still, this strategy looks to improve the fine-tuning compared to the baselines, and now we're interested in mixing quality and complexity assessment with a diversity evaluation to find the right trade-off in our selection process.
## Conclusion
In conclusion, if you are looking for an efficient method to align an open-source LLM to your business case with a constrained budget, the solutions provided by DEITA are well worth a shot. This data-centric approach enables one to focus on the content of the dataset to get the best results, instead of "just" scaling up instruction-tuning with more, and surely less qualitative, data. In a nutshell, the strategy developed, through automatically scoring instructions and responses, aims to substitute the human preference step that proprietary models such as GPT-4 have been trained with. There are a few improvements we could think about when it comes to selecting good data, but it opens a really great way of instruction-tuning LLMs with lower computational needs, making the whole process intellectually relevant and more sustainable than most other methods. We'd be happy to help you align an LLM to your business case, drawing inspiration from such a methodology.
# Instruction Backtranslation
["Self Alignment with Instruction Backtranslation"](https://arxiv.org/abs/2308.06259) presents a scalable method to build high-quality instruction following a language model by automatically labeling human-written text with corresponding instructions. Their approach, named instruction backtranslation, starts with a language model finetuned on a small amount of seed data, and a given web corpus. The seed model is used to construct training examples by generating instruction prompts for web documents (self-augmentation), and then selecting high-quality examples from among these candidates (self-curation). This data is then used to finetune a stronger model.
![Instruction Backtranslation pipeline overview](../../../assets/pipelines/instruction_backtranslation.png)
Their self-training approach assumes access to a base language model, a small amount of seed data, and a collection of unlabelled examples, e.g. a web corpus. The unlabelled data is a large, diverse set of human-written documents that includes writing about all manner of topics humans are interested in – but crucially is not paired with instructions.
A first key assumption is that there exists some subset of this very large human-written text that would be suitable as gold generations for some user instructions. A second key assumption is that they can predict instructions for these candidate gold answers that can be used as high-quality example pairs to train an instruction-following model.
Their overall process, called instruction backtranslation, performs two core steps:
1. Self-augment: Generate instructions for unlabelled data, i.e. the web corpus, to produce candidate training data of (instruction, output) pairs for instruction tuning.
2. Self-curate: Self-select high-quality demonstration examples as training data to finetune the base model to follow instructions. This approach is done iteratively where a better intermediate instruction-following model can improve on selecting data for finetuning in the next iteration.
This replication covers the self-curation step, i.e. the second step above, using the proposed prompting approach to rate the quality of the generated text, which can be either synthetically generated or real human-written text.
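As a rough illustration of what the self-curation step produces for each candidate pair (the values below are made up; in `distilabel` the score and reason come from the `InstructionBacktranslation` task shown later):

```python
# Illustrative candidate pair after self-curation scoring (values are made up):
candidate = {
    "instruction": "Explain the difference between a list and a tuple in Python.",
    "generation": "A list is mutable, while a tuple is immutable...",
    "score": 4,  # 1-5 rating following the paper's curation prompt
    "reason": "The answer is accurate and on-topic, but it lacks concrete examples.",
}
# Keeping only the highest-rated pairs (e.g. score >= 4) yields the curated training set.
```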
### Replication
To replicate the paper we will be using `distilabel` and a smaller dataset created by the Hugging Face H4 team named [`HuggingFaceH4/instruction-dataset`](https://huggingface.co/datasets/HuggingFaceH4/instruction-dataset) for testing purposes.
#### Installation
To replicate Self Alignment with Instruction Backtranslation one will need to install `distilabel` as follows:
```bash
pip install "distilabel[hf-inference-endpoints,openai]>=1.0.0"
```
And since we will be using [`InferenceEndpointsLLM`][distilabel.models.InferenceEndpointsLLM] (installed via the extra `hf-inference-endpoints`), we will need to deploy an Inference Endpoint in advance, either locally or in the Hugging Face Hub (alternatively, the serverless endpoints can also be used, but inference is usually slower and there's a limited quota, as those are free). We also need to set the `HF_TOKEN` environment variable (to use [`InferenceEndpointsLLM`][distilabel.models.InferenceEndpointsLLM]) and the `OPENAI_API_KEY` one (to use [`OpenAILLM`][distilabel.models.OpenAILLM]).
#### Building blocks
- [`LoadDataFromHub`][distilabel.steps.LoadDataFromHub]: Generator Step to load a dataset from the Hugging Face Hub.
- [`TextGeneration`][distilabel.steps.tasks.TextGeneration]: Task to generate responses for a given instruction using an LLM.
- [`InferenceEndpointsLLM`][distilabel.models.InferenceEndpointsLLM]: LLM that runs a model from an Inference Endpoint in the Hugging Face Hub.
- [`InstructionBacktranslation`][distilabel.steps.tasks.InstructionBacktranslation]: Task that generates a score and a reason for a response for a given instruction using the Self Alignment with Instruction Backtranslation prompt.
- [`OpenAILLM`][distilabel.models.OpenAILLM]: LLM that loads a model from OpenAI.
#### Code
As mentioned before, we will put the previously mentioned building blocks together to replicate Self Alignment with Instruction Backtranslation.
```python
from distilabel.models import InferenceEndpointsLLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub, KeepColumns
from distilabel.steps.tasks import InstructionBacktranslation, TextGeneration
with Pipeline(name="self-alignment-with-instruction-backtranslation") as pipeline:
load_hub_dataset = LoadDataFromHub(
name="load_dataset",
output_mappings={"prompt": "instruction"},
)
text_generation = TextGeneration(
name="text_generation",
llm=InferenceEndpointsLLM(
base_url="<INFERENCE_ENDPOINT_URL>",
tokenizer_id="argilla/notus-7b-v1",
model_display_name="argilla/notus-7b-v1",
),
input_batch_size=10,
output_mappings={"model_name": "generation_model"},
)
instruction_backtranslation = InstructionBacktranslation(
name="instruction_backtranslation",
llm=OpenAILLM(model="gpt-4"),
input_batch_size=10,
output_mappings={"model_name": "scoring_model"},
)
keep_columns = KeepColumns(
name="keep_columns",
columns=[
"instruction",
"generation",
"generation_model",
"score",
"reason",
"scoring_model",
],
)
load_hub_dataset >> text_generation >> instruction_backtranslation >> keep_columns
```
Then we need to call `pipeline.run` with the runtime parameters so that the pipeline can be launched.
```python
distiset = pipeline.run(
parameters={
load_hub_dataset.name: {
"repo_id": "HuggingFaceH4/instruction-dataset",
"split": "test",
},
text_generation.name: {
"llm": {
"generation_kwargs": {
"max_new_tokens": 1024,
"temperature": 0.7,
},
},
},
instruction_backtranslation.name: {
"llm": {
"generation_kwargs": {
"max_new_tokens": 1024,
"temperature": 0.7,
},
},
},
},
)
```
Finally, we can optionally push the generated dataset, a [`Distiset`][distilabel.distiset.Distiset], to the Hugging Face Hub via the `push_to_hub` method, so that each subset generated in the leaf steps is pushed to the Hub.
```python
distiset.push_to_hub(
"instruction-backtranslation-instruction-dataset",
private=True,
)
```
# Create datasets to train a Process Reward Model using Math-Shepherd
This example introduces [Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations](https://arxiv.org/abs/2312.08935), an innovative math process reward model (PRM) which assigns reward scores to each step of math problem solutions. Specifically, we will present a recipe to create datasets to train such models. The final sections contain two pipeline examples, to run the pipeline with more or fewer resources.
## Replication
Unlike traditional models that only look at final answers (Outcome Reward Models or ORM), this system evaluates each step of a mathematical solution and assigns reward scores to individual solution steps. Let's look at Figure 2 from the paper, which summarizes the labelling approach presented in their work.
![Math-Shepherd framework](../../../assets/tutorials-assets/math-sheperd.png)
In the traditional ORM approach, the annotation was done depending on the final outcome, while the Process Reward Model (PRM) allows labelling the different steps that lead to a solution, making for a richer set of information.
### Steps involved
- [`MathShepherdGenerator`](https://distilabel.argilla.io/dev/components-gallery/task/mathshepherdgenerator/): This step is in charge of generating solutions for the instruction. Depending on the value set for `M`, this step can be used to generate both the `golden_solution`, to be used as a reference for the labeller, and the set of `solutions` to be labelled. For the `solutions` column we want some diversity, to allow the model to reach both good and bad solutions so we have a representative sample for the labeller, hence it may be better to use a "weaker" model.
- [`MathShepherdCompleter`](https://distilabel.argilla.io/dev/components-gallery/task/mathshepherdcompleter/). This task does the job of the `completer` in the paper, generating completions as presented in Figure 2, section 3.3.2. It doesn't generate a column on its own, but updates the steps generated in the `solutions` column from the [`MathShepherdGenerator`](https://distilabel.argilla.io/dev/components-gallery/task/mathshepherdgenerator/), using the `golden_solution` as a reference to label the data. So in order for this step to work, we need both of these columns in our dataset. Depending on the type of dataset, we may already have access to the `golden_solution`, even if under a different name, but that's usually not the case for the `solutions`.
- [`FormatPRM`](https://distilabel.argilla.io/dev/components-gallery/task/formatprm/). This step does the auxiliary job of preparing the data to follow the format defined in the paper, with two columns `input` and `label`. After running the [`MathShepherdCompleter`](https://distilabel.argilla.io/dev/components-gallery/task/mathshepherdcompleter/), we have raw data that can be formatted as the user wants. Using [`ExpandColumns`](https://distilabel.argilla.io/latest/components-gallery/steps/expandcolumns/) and this step, one can directly obtain the same format presented in the dataset shared in the paper, [peiyi9979/Math-Shepherd](https://huggingface.co/datasets/peiyi9979/Math-Shepherd?row=0), as illustrated below.
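As a rough illustration of that target format (the step tag and the `+`/`-` labels follow the original [peiyi9979/Math-Shepherd](https://huggingface.co/datasets/peiyi9979/Math-Shepherd?row=0) dataset; the problem and steps below are made up):

```python
# Illustrative `input`/`label` pair in the Math-Shepherd PRM format (made-up content):
example = {
    # `input`: the problem followed by the solution steps,
    # each step ending with the step tag used in the original dataset ("ки").
    "input": (
        "Janet has 3 apples and buys 2 more. How many apples does she have? "
        "Step 1: Janet starts with 3 apples. ки\n"
        "Step 2: 3 + 2 = 5. The answer is: 5 ки"
    ),
    # `label`: the same text with each step tag replaced by its +/- step label.
    "label": (
        "Janet has 3 apples and buys 2 more. How many apples does she have? "
        "Step 1: Janet starts with 3 apples. +\n"
        "Step 2: 3 + 2 = 5. The answer is: 5 +"
    ),
}
```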
## Data preparation
For this example, just as in the original paper, we are using the [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k) dataset. We only need a dataset with instructions to be solved (in this case the `question` column), and we can generate everything else using our predefined steps.
## Building the pipeline
The pipeline uses `openai/gsm8k` as a reference, but it can be applied to different datasets; keep in mind that the prompts can be modified from the current definition by tweaking the `extra_rules` and `few_shots` in each task:
```python
from datasets import load_dataset
from distilabel.steps.tasks import MathShepherdCompleter, MathShepherdGenerator, FormatPRM
from distilabel.models import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import CombineOutputs, ExpandColumns
ds_name = "openai/gsm8k"
ds = load_dataset(ds_name, "main", split="test").rename_column("question", "instruction").select(range(3)) # (1)
with Pipeline(name="Math-Shepherd") as pipe:
model_id_70B = "meta-llama/Meta-Llama-3.1-70B-Instruct"
model_id_8B = "meta-llama/Meta-Llama-3.1-8B-Instruct"
llm_70B = InferenceEndpointsLLM(
model_id=model_id_70B,
tokenizer_id=model_id_70B,
generation_kwargs={"max_new_tokens": 1024, "temperature": 0.6},
)
llm_8B = InferenceEndpointsLLM(
model_id=model_id_8B,
tokenizer_id=model_id_8B,
generation_kwargs={"max_new_tokens": 2048, "temperature": 0.6},
) # (2)
generator_golden = MathShepherdGenerator(
name="golden_generator",
llm=llm_70B,
) # (3)
generator = MathShepherdGenerator(
name="generator",
llm=llm_8B,
use_default_structured_output=True, # (9)
M=5
) # (4)
completer = MathShepherdCompleter(
name="completer",
llm=llm_8B,
use_default_structured_output=True,
N=4
) # (5)
combine = CombineOutputs()
expand = ExpandColumns(
name="expand_columns",
columns=["solutions"],
split_statistics=True,
) # (6)
formatter = FormatPRM(name="format_prm") # (7)
[generator_golden, generator] >> combine >> completer >> expand >> formatter # (8)
```
1. We will use just 3 rows from the sample dataset, and rename the "question" column to "instruction", to set the column expected by the [`MathShepherdGenerator`](https://distilabel.argilla.io/dev/components-gallery/task/mathshepherdgenerator/).
2. We will use 2 different LLMs, `meta-llama/Meta-Llama-3.1-70B-Instruct` (a stronger model for the `golden_solution`) and `meta-llama/Meta-Llama-3.1-8B-Instruct` (a weaker one to generate candidate solutions, and the completions).
3. This [`MathShepherdGenerator`](https://distilabel.argilla.io/dev/components-gallery/task/mathshepherdgenerator/) task, which uses the *stronger* model, will generate the `golden_solution` for us: the step-by-step solution for the task.
4. Another [`MathShepherdGenerator`](https://distilabel.argilla.io/dev/components-gallery/task/mathshepherdgenerator/) task, but in this case using the *weaker* model, will generate the candidate `solutions` (`M=5` in total).
5. Now the [`MathShepherdCompleter`](https://distilabel.argilla.io/dev/components-gallery/task/mathshepherdcompleter/) task will generate `N=4` *completions* for each step of each candidate solution in the `solutions` column, and label them using the `golden_solution` as a reference, as shown in Figure 2 of the paper. This step adds the labels (using the + and - tags following the implementation in the paper, though these values can be modified) to the `solutions` column in place, instead of generating an additional column; the intermediate completions won't be shown at the end.
6. The [`ExpandColumns`](https://distilabel.argilla.io/latest/components-gallery/steps/expandcolumns/) step expands the solutions to match the instruction, so if we had set M=5, we would now have 5 instruction-solution pairs. We set `split_statistics` to True to ensure the `distilabel_metadata` is split accordingly, otherwise the number of tokens for each solution would count as the tokens needed for the whole list of solutions generated. One can omit both this and the following step and process the data for training as preferred.
7. And finally, the [`FormatPRM`](https://distilabel.argilla.io/dev/components-gallery/task/formatprm/) task generates two columns, `input` and `label`, which prepare the data for training as presented in the original Math-Shepherd dataset.
8. Both the `generator_golden` and `generator` can run in parallel as there's no dependency between them; after that we combine the results and pass them to the `completer`. Finally, we use the `expand` and `formatter` steps to prepare the data in the format expected to train the Process Reward Model as defined in the original paper.
9. Generate structured outputs to ensure they are easier to parse; otherwise the models often fail to produce an easily parseable list.
## Script and final dataset
To see all the pieces in place, take a look at the full pipeline:
??? Run
    ```bash
python examples/pipe_math_shepherd.py
```
??? "Full pipeline"
```python title="pipe_math_shepherd.py"
--8<-- "examples/pipe_math_shepherd.py"
```
The resulting dataset can be seen at: [plaguss/test_math_shepherd_prm](https://huggingface.co/datasets/plaguss/test_math_shepherd_prm).
### Pipeline with vLLM and ray
This section contains an alternative way of running the pipeline at a bigger scale. To showcase how to scale the pipeline, we use [Qwen/Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) for the 3 generation tasks, highly improving the final quality, as it follows the given prompt much more closely. We also use `vLLM` and 3 nodes (one per task in this case) to scale up the generation process.
??? Tip "Math-Shepherd's bigger pipeline"
````python
from datasets import load_dataset
from distilabel.models import vLLM
from distilabel.steps import StepResources
from distilabel.pipeline import Pipeline
from distilabel.steps import CombineOutputs, ExpandColumns
from distilabel.steps.tasks import (
FormatPRM,
MathShepherdCompleter,
MathShepherdGenerator,
)
ds_name = "openai/gsm8k"
ds = (
load_dataset(ds_name, "main", split="test")
.rename_column("question", "instruction")
)
with Pipeline(name="Math-Shepherd").ray() as pipe: # (1)
model_id_72B = "Qwen/Qwen2.5-72B-Instruct"
llm_72B = vLLM(
model=model_id_72B,
tokenizer=model_id_72B,
extra_kwargs={
"tensor_parallel_size": 8, # Number of GPUs per node
"max_model_len": 2048,
},
generation_kwargs={
"temperature": 0.5,
"max_new_tokens": 4096,
},
)
generator_golden = MathShepherdGenerator(
name="golden_generator",
llm=llm_72B,
input_batch_size=50,
output_mappings={"model_name": "model_name_golden_generator"},
resources=StepResources(replicas=1, gpus=8) # (2)
)
generator = MathShepherdGenerator(
name="generator",
llm=llm_72B,
input_batch_size=50,
M=5,
use_default_structured_output=True,
output_mappings={"model_name": "model_name_generator"},
resources=StepResources(replicas=1, gpus=8)
)
completer = MathShepherdCompleter(
name="completer",
llm=llm_72B,
N=8,
use_default_structured_output=True,
output_mappings={"model_name": "model_name_completer"},
resources=StepResources(replicas=1, gpus=8)
)
combine = CombineOutputs()
expand = ExpandColumns(
name="expand_columns",
columns=["solutions"],
split_statistics=True,
)
formatter = FormatPRM(name="format_prm", format="trl") # (3)
[generator_golden, generator] >> combine >> completer >> expand >> formatter
if __name__ == "__main__":
distiset = pipe.run(use_cache=False, dataset=ds, dataset_batch_size=50)
if distiset:
distiset.push_to_hub("plaguss/test_math_shepherd_prm_ray")
````
1. Transform the pipeline to run using `ray` backend.
2. Assign the resources: the number of replicas is 1, as we want a single instance of the task per node, and the number of GPUs is 8, using a whole node. Given that we defined the script in the slurm file to use 3 nodes, this will use all 3 available nodes, with 8 GPUs for each of these tasks.
3. Prepare the columns in the format expected by `TRL` for training.
Click to see the slurm file used to run the previous pipeline. It's our go-to `slurm` file, using 3 8xH100 nodes.
??? Tip "Slurm file"
```bash
#!/bin/bash
#SBATCH --job-name=math-shepherd-test-ray
#SBATCH --partition=hopper-prod
#SBATCH --qos=normal
#SBATCH --nodes=3
#SBATCH --exclusive
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --output=./logs/%x-%j.out
#SBATCH --err=./logs/%x-%j.err
#SBATCH --time=48:00:00
set -ex
module load cuda/12.1
echo "SLURM_JOB_ID: $SLURM_JOB_ID"
echo "SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST"
source .venv/bin/activate
# Getting the node names
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
# Get the IP address of the head node
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
# Start Ray head node
port=6379
ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"
# Generate a unique Ray tmp dir for the head node
head_tmp_dir="/tmp/ray_tmp_${SLURM_JOB_ID}_head"
echo "Starting HEAD at $head_node"
srun --nodes=1 --ntasks=1 -w "$head_node" \
ray start --head --node-ip-address="$head_node_ip" --port=$port \
--dashboard-host=0.0.0.0 \
--dashboard-port=8265 \
--temp-dir="$head_tmp_dir" \
--block &
# Give some time to head node to start...
sleep 10
# Start Ray worker nodes
worker_num=$((SLURM_JOB_NUM_NODES - 1))
# Start from 1 (0 is head node)
for ((i = 1; i <= worker_num; i++)); do
node_i=${nodes_array[$i]}
worker_tmp_dir="/tmp/ray_tmp_${SLURM_JOB_ID}_worker_$i"
echo "Starting WORKER $i at $node_i"
srun --nodes=1 --ntasks=1 -w "$node_i" \
ray start --address "$ip_head" \
--temp-dir="$worker_tmp_dir" \
--block &
sleep 5
done
# Give some time to the Ray cluster to gather info
sleep 60
# Finally submit the job to the cluster
RAY_ADDRESS="http://$head_node_ip:8265" ray job submit --working-dir pipeline -- python -u pipeline_math_shepherd_ray.py
```
??? Tip "Final dataset"
The resulting dataset can be seen at: [plaguss/test_math_shepherd_prm_ray](https://huggingface.co/datasets/plaguss/test_math_shepherd_prm_ray).
# Prometheus 2
["Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models"](https://arxiv.org/pdf/2405.01535) presents Prometheus 2, a new and more powerful evaluator LLM compared to Prometheus (its predecessor) presented in ["Prometheus: Inducing Fine-grained Evaluation Capability in Language Models"](https://arxiv.org/abs/2310.08491); since GPT-4, as well as other proprietary LLMs, are commonly used to assess the quality of the responses for various LLMs, but there are concerns about transparency, controllability, and affordability, that motivate the need of open-source LLMs specialized in evaluations.
![Prometheus 2 pipeline overview](../../../assets/pipelines/prometheus.png)
Existing open evaluator LMs exhibit critical shortcomings:
1. They issue scores that significantly diverge from those assigned by humans.
2. They lack the flexibility to perform both direct assessment and pairwise ranking, the two most prevalent forms of assessment.
Additionally, they do not possess the ability to evaluate based on custom evaluation criteria, focusing instead on general attributes like helpfulness and harmlessness. Prometheus 2 is capable of processing both direct assessment and pairwise ranking formats together with user-defined evaluation criteria.
Prometheus 2 was released in two variants:
- [`prometheus-eval/prometheus-7b-v2.0`](https://hf.co/prometheus-eval/prometheus-7b-v2.0): fine-tuned on top of [`mistralai/Mistral-7B-Instruct-v0.2`](https://hf.co/mistralai/Mistral-7B-Instruct-v0.2)
- [`prometheus-eval/prometheus-8x7b-v2.0`](https://hf.co/prometheus-eval/prometheus-8x7b-v2.0): fine-tuned on top of [`mistralai/Mixtral-8x7B-Instruct-v0.1`](https://hf.co/mistralai/Mixtral-8x7B-Instruct-v0.1)
Both models have been fine-tuned for both direct assessment and pairwise ranking, i.e. assessing the quality of a single isolated response for a given instruction (with or without a reference answer) and assessing the quality of one response against another for a given instruction (with or without a reference answer), respectively.
On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges among all tested open evaluator LMs. Their models, code, and data are all publicly available at [`prometheus-eval/prometheus-eval`](https://github.com/prometheus-eval/prometheus-eval).
### Replication
!!! NOTE
The section is named `Replication` but in this case we're not replicating the Prometheus 2 paper per se, but rather showing how to use the [`PrometheusEval`][distilabel.steps.tasks.PrometheusEval] task implemented within `distilabel` to evaluate the quality of the responses from a given instruction using the Prometheus 2 model.
To showcase Prometheus 2 we will be using the [`PrometheusEval`][distilabel.steps.tasks.PrometheusEval] task implemented in `distilabel` and a smaller dataset created by the Hugging Face H4 team named [`HuggingFaceH4/instruction-dataset`](https://hf.co/datasets/HuggingFaceH4/instruction-dataset) for testing purposes.
#### Installation
To reproduce the code below, one will need to install `distilabel` as follows:
```bash
pip install "distilabel[vllm]>=1.1.0"
```
Additionally, it's recommended to install [`Dao-AILab/flash-attention`](https://github.com/Dao-AILab/flash-attention) to benefit from Flash Attention 2 speed-ups during inference via `vllm`.
```bash
pip install flash-attn --no-build-isolation
```
!!! NOTE
The installation notes above assume that you are using a VM with one GPU accelerator with at least the required VRAM to fit [`prometheus-eval/prometheus-7b-v2.0`](https://hf.co/prometheus-eval/prometheus-7b-v2.0) in bfloat16 (28GB); but if you have enough VRAM to fit their 8x7B model in bfloat16 (~90GB) you can use [`prometheus-eval/prometheus-8x7b-v2.0`](https://hf.co/prometheus-eval/prometheus-8x7b-v2.0) instead.
#### Building blocks
- [`LoadDataFromHub`][distilabel.steps.LoadDataFromHub]: [`GeneratorStep`][distilabel.steps.GeneratorStep] to load a dataset from the Hugging Face Hub.
- [`PrometheusEval`][distilabel.steps.tasks.PrometheusEval]: [`Task`][distilabel.steps.tasks.Task] that assesses the quality of a response for a given instruction using any of the Prometheus 2 models.
- [`vLLM`][distilabel.models.vLLM]: [`LLM`][distilabel.models.LLM] that loads a model from the Hugging Face Hub via [vllm-project/vllm](https://github.com/vllm-project/vllm).
!!! NOTE
Since the Prometheus 2 models use a slightly different chat template than [`mistralai/Mistral-7B-Instruct-v0.2`](https://hf.co/mistralai/Mistral-7B-Instruct-v0.2), we need to set the `chat_template` parameter to `[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]` so as to properly format the input for Prometheus 2.
- (Optional) [`KeepColumns`][distilabel.steps.KeepColumns]: [`Step`][distilabel.steps.Step] that keeps only the specified columns in the dataset, used to remove the undesired columns.
#### Code
As mentioned before, we will put the previously mentioned building blocks together to see how Prometheus 2 can be used via `distilabel`.
```python
from distilabel.models import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import KeepColumns, LoadDataFromHub
from distilabel.steps.tasks import PrometheusEval
if __name__ == "__main__":
with Pipeline(name="prometheus") as pipeline:
load_dataset = LoadDataFromHub(
name="load_dataset",
repo_id="HuggingFaceH4/instruction-dataset",
split="test",
output_mappings={"prompt": "instruction", "completion": "generation"},
)
task = PrometheusEval(
name="task",
llm=vLLM(
model="prometheus-eval/prometheus-7b-v2.0",
chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
),
mode="absolute",
rubric="factual-validity",
reference=False,
num_generations=1,
group_generations=False,
)
keep_columns = KeepColumns(
name="keep_columns",
columns=["instruction", "generation", "feedback", "result", "model_name"],
)
load_dataset >> task >> keep_columns
```
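The snippet above uses the direct assessment (`absolute`) mode. As a minimal sketch (the rest of the pipeline stays the same), the pairwise ranking variant only requires switching `mode` to `relative` and providing a `generations` input column with the two responses to compare per instruction:

```python
# Sketch of the pairwise-ranking variant: `mode="relative"` compares the two
# responses provided via the `generations` input column for each instruction.
task = PrometheusEval(
    name="task",
    llm=vLLM(
        model="prometheus-eval/prometheus-7b-v2.0",
        chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
    ),
    mode="relative",
    rubric="factual-validity",
    reference=False,
    num_generations=1,
    group_generations=False,
)
```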
Then we need to call `pipeline.run` with the runtime parameters so that the pipeline can be launched.
```python
distiset = pipeline.run(
parameters={
task.name: {
"llm": {
"generation_kwargs": {
"max_new_tokens": 1024,
"temperature": 0.7,
},
},
},
},
)
```
Finally, we can optionally push the generated [`Distiset`][distilabel.distiset.Distiset] to the Hugging Face Hub via the `push_to_hub` method, so that each subset generated by the leaf steps is pushed to the Hub.
```python
distiset.push_to_hub(
"instruction-dataset-prometheus",
private=True,
)
```
# UltraFeedback
[UltraFeedback: Boosting Language Models with High-quality Feedback](https://arxiv.org/abs/2310.01377) is a paper published by [OpenBMB](https://www.openbmb.cn/home) which proposes `UltraFeedback`, a large-scale, fine-grained, diverse preference dataset, used for training powerful reward models and critic models.
UltraFeedback collects about 64k prompts from diverse resources (including UltraChat, ShareGPT, Evol-Instruct, TruthfulQA, FalseQA, and FLAN); these prompts are then used to query multiple LLMs (commercial models, Llama models ranging from 7B to 70B, and non-Llama models), generating four different responses for each prompt, resulting in a total of 256k samples, i.e. UltraFeedback rates four responses on every OpenAI request.
![UltraFeedback pipeline overview](../../../assets/pipelines/ultrafeedback.png)
To collect high-quality preference and textual feedback, they design a fine-grained annotation instruction, which contains four different aspects, namely instruction-following, truthfulness, honesty and helpfulness (even though within the paper they also mention a fifth one named verbalized calibration). Finally, GPT-4 is used to generate the ratings for the generated responses to the given prompt using the previously mentioned aspects.
## Replication
To replicate the paper we will be using `distilabel` and a smaller dataset created by the Hugging Face H4 team named [`HuggingFaceH4/instruction-dataset`](https://huggingface.co/datasets/HuggingFaceH4/instruction-dataset) for testing purposes.
Also for testing purposes, we will just show how to evaluate the generated responses for a given prompt using a new global aspect named `overall-rating`, defined by Argilla, which computes the average of the four aspects, so as to reduce the number of requests sent to OpenAI; note, however, that all the aspects are implemented within `distilabel` and can be used instead for a more faithful reproduction (see the sketch below). Besides that, we will generate three responses for each instruction using three LLMs selected from a pool of six: [`HuggingFaceH4/zephyr-7b-beta`](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), [`argilla/notus-7b-v1`](https://huggingface.co/argilla/notus-7b-v1), [`google/gemma-1.1-7b-it`](https://huggingface.co/google/gemma-1.1-7b-it), [`meta-llama/Meta-Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), [`HuggingFaceH4/zephyr-7b-gemma-v0.1`](https://huggingface.co/HuggingFaceH4/zephyr-7b-gemma-v0.1) and [`mlabonne/UltraMerge-7B`](https://huggingface.co/mlabonne/UltraMerge-7B).
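As a reference, a minimal sketch of what using one of the paper's fine-grained aspects would look like instead (here `instruction-following`; `helpfulness`, `honesty` and `truthfulness` can be swapped in the same way):

```python
from distilabel.models import OpenAILLM
from distilabel.steps.tasks import UltraFeedback

# Sketch: rate responses with a fine-grained UltraFeedback aspect instead of
# the Argilla-defined `overall-rating` used in this example.
ultrafeedback_instruction_following = UltraFeedback(
    name="ultrafeedback_instruction_following",
    llm=OpenAILLM(model="gpt-4-turbo-2024-04-09"),
    aspect="instruction-following",
)
```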
### Installation
To replicate UltraFeedback, one will need to install `distilabel` as follows:
```bash
pip install "distilabel[argilla,openai,vllm]>=1.0.0"
```
And since we will be using `vllm`, we will need a VM with at least 6 NVIDIA GPUs with at least 16GB of memory each to run the text generation, and we will need to set the `OPENAI_API_KEY` environment variable, as shown below.
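For instance, it can be set from Python before running the pipeline (the key below is a placeholder):

```python
import os

# Placeholder value; use your real OpenAI API key here or export it in the shell
os.environ["OPENAI_API_KEY"] = "sk-..."
```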
### Building blocks
- [`LoadDataFromHub`][distilabel.steps.LoadDataFromHub]: Generator Step to load a dataset from the Hugging Face Hub.
- [`sample_n_steps`][distilabel.pipeline.sample_n_steps]: Function to create a `routing_batch_function` that samples `n` downstream steps for each batch generated by the upstream step. This is the key to replicating the LLM pooling mechanism described in the paper.
- [`TextGeneration`][distilabel.steps.tasks.TextGeneration]: Task to generate responses for a given instruction using an LLM.
- [`vLLM`][distilabel.models.vLLM]: LLM that loads a model from the Hugging Face Hub using `vllm`.
- [`GroupColumns`][distilabel.steps.GroupColumns]: Step that combines multiple columns into a single one, i.e. from string to list of strings. Useful when there are multiple parallel steps connected to the same node.
- [`UltraFeedback`][distilabel.steps.tasks.UltraFeedback]: Task that generates ratings for the responses of a given instruction using the UltraFeedback prompt.
- [`OpenAILLM`][distilabel.models.OpenAILLM]: LLM that loads a model from OpenAI.
- [`KeepColumns`][distilabel.steps.KeepColumns]: Step to keep the desired columns while removing the ones that are not needed, as well as defining their order.
- (optional) [`PreferenceToArgilla`][distilabel.steps.PreferenceToArgilla]: Step to optionally push the generated dataset to Argilla for further analysis and human annotation.
### Code
As mentioned before, we will put the previously mentioned building blocks together to replicate UltraFeedback.
```python
from distilabel.models import OpenAILLM, vLLM
from distilabel.pipeline import Pipeline, sample_n_steps
from distilabel.steps import (
GroupColumns,
KeepColumns,
LoadDataFromHub,
PreferenceToArgilla,
)
from distilabel.steps.tasks import TextGeneration, UltraFeedback
sample_three_llms = sample_n_steps(n=3)
with Pipeline(name="ultrafeedback-pipeline") as pipeline:
load_hub_dataset = LoadDataFromHub(
name="load_dataset",
output_mappings={"prompt": "instruction"},
batch_size=2,
)
text_generation_with_notus = TextGeneration(
name="text_generation_with_notus",
llm=vLLM(model="argilla/notus-7b-v1"),
input_batch_size=2,
output_mappings={"model_name": "generation_model"},
)
text_generation_with_zephyr = TextGeneration(
name="text_generation_with_zephyr",
llm=vLLM(model="HuggingFaceH4/zephyr-7b-gemma-v0.1"),
input_batch_size=2,
output_mappings={"model_name": "generation_model"},
)
text_generation_with_gemma = TextGeneration(
name="text_generation_with_gemma",
llm=vLLM(model="google/gemma-1.1-7b-it"),
input_batch_size=2,
output_mappings={"model_name": "generation_model"},
)
text_generation_with_zephyr_gemma = TextGeneration(
name="text_generation_with_zephyr_gemma",
llm=vLLM(model="HuggingFaceH4/zephyr-7b-gemma-v0.1"),
input_batch_size=2,
output_mappings={"model_name": "generation_model"},
)
text_generation_with_llama = TextGeneration(
name="text_generation_with_llama",
llm=vLLM(model="meta-llama/Meta-Llama-3-8B-Instruct"),
input_batch_size=2,
output_mappings={"model_name": "generation_model"},
)
text_generation_with_ultramerge = TextGeneration(
name="text_generation_with_ultramerge",
llm=vLLM(model="mlabonne/UltraMerge-7B"),
input_batch_size=2,
output_mappings={"model_name": "generation_model"},
)
combine_columns = GroupColumns(
name="combine_columns",
columns=["generation", "generation_model"],
output_columns=["generations", "generation_models"],
input_batch_size=2
)
ultrafeedback = UltraFeedback(
name="ultrafeedback_openai",
llm=OpenAILLM(model="gpt-4-turbo-2024-04-09"),
aspect="overall-rating",
output_mappings={"model_name": "ultrafeedback_model"},
)
keep_columns = KeepColumns(
name="keep_columns",
columns=[
"instruction",
"generations",
"generation_models",
"ratings",
"rationales",
"ultrafeedback_model",
],
)
(
load_hub_dataset
>> sample_three_llms
>> [
text_generation_with_notus,
text_generation_with_zephyr,
text_generation_with_gemma,
text_generation_with_llama,
text_generation_with_zephyr_gemma,
text_generation_with_ultramerge
]
>> combine_columns
>> ultrafeedback
>> keep_columns
)
# Optional: Push the generated dataset to Argilla, but will need to `pip install argilla` first
# push_to_argilla = PreferenceToArgilla(
# name="push_to_argilla",
# api_url="<ARGILLA_API_URL>",
# api_key="<ARGILLA_API_KEY>", # type: ignore
# dataset_name="ultrafeedback",
# dataset_workspace="admin",
# num_generations=2,
# )
# keep_columns >> push_to_argilla
```
!!! NOTE
    As we're using a relatively small dataset, we're setting a low `batch_size` and `input_batch_size` so we have more batches for the `routing_batch_function`, i.e. more variety in the LLMs used to generate the responses. When using a large dataset, it's recommended to use a larger `batch_size` and `input_batch_size` to benefit from the `vLLM` optimizations for larger batch sizes, which make the pipeline execution faster.
Then we need to call `pipeline.run` with the runtime parameters so that the pipeline can be launched.
```python
distiset = pipeline.run(
parameters={
load_hub_dataset.name: {
"repo_id": "HuggingFaceH4/instruction-dataset",
"split": "test",
},
text_generation_with_notus.name: {
"llm": {
"generation_kwargs": {
"max_new_tokens": 512,
"temperature": 0.7,
}
},
},
text_generation_with_zephyr.name: {
"llm": {
"generation_kwargs": {
"max_new_tokens": 512,
"temperature": 0.7,
}
},
},
text_generation_with_gemma.name: {
"llm": {
"generation_kwargs": {
"max_new_tokens": 512,
"temperature": 0.7,
}
},
},
text_generation_with_llama.name: {
"llm": {
"generation_kwargs": {
"max_new_tokens": 512,
"temperature": 0.7,
}
},
},
text_generation_with_zephyr_gemma.name: {
"llm": {
"generation_kwargs": {
"max_new_tokens": 512,
"temperature": 0.7,
}
},
},
text_generation_with_ultramerge.name: {
"llm": {
"generation_kwargs": {
"max_new_tokens": 512,
"temperature": 0.7,
}
},
},
ultrafeedback.name: {
"llm": {
"generation_kwargs": {
"max_new_tokens": 2048,
"temperature": 0.7,
}
},
},
}
)
```
Finally, we can optionally push the generated [`Distiset`][distilabel.distiset.Distiset] to the Hugging Face Hub via the `push_to_hub` method, so that each subset generated by the leaf steps is pushed to the Hub.
```python
distiset.push_to_hub(
"ultrafeedback-instruction-dataset",
private=True,
)
```
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Synthetic data generation for fine-tuning custom retrieval and reranking models\n",
"\n",
"- **Goal**: Bootstrap, optimize and maintain your embedding models and rerankers through synthetic data generation and human feedback.\n",
"- **Libraries**: [argilla](https://github.com/argilla-io/argilla), [hf-inference-endpoints](https://github.com/huggingface/huggingface_hub), [sentence-transformers](https://github.com/UKPLab/sentence-transformers)\n",
"- **Components**: [LoadDataFromHub](https://distilabel.argilla.io/latest/components-gallery/steps/loaddatafromhub/), [GenerateSentencePair](https://distilabel.argilla.io/latest/components-gallery/tasks/generatesentencepair/), [InferenceEndpointsLLM](https://distilabel.argilla.io/latest/components-gallery/llms/inferenceendpointsllm/)\n",
"\n",
"![GenerateSentencePair pipeline overview](../../../assets/pipelines/sentence-transformer.png)\n",
"\n",
"!!! note\n",
" For a comprehensive overview on optimizing the retrieval performance in a RAG pipeline, check this [guide](https://docs.zenml.io/user-guide/llmops-guide/finetuning-embeddings) in collaboration with [ZenML](https://github.com/zenml-io/zenml), an open-source MLOps framework designed for building portable and production-ready machine learning pipelines."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting started\n",
"\n",
"### Install the dependencies\n",
"\n",
"To complete this tutorial, you need to install the distilabel SDK and a few third-party libraries via pip. We will be using **the free but rate-limited Hugging Face serverless Inference API** for this tutorial, so we need to install this as an extra distilabel dependency. You can install them by running the following command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install \"distilabel[hf-inference-endpoints]\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install \"sentence-transformers~=3.0\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's make the needed imports:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from distilabel.models import InferenceEndpointsLLM\n",
"from distilabel.pipeline import Pipeline\n",
"from distilabel.steps.tasks import GenerateSentencePair\n",
"from distilabel.steps import LoadDataFromHub\n",
"\n",
"from sentence_transformers import SentenceTransformer, CrossEncoder\n",
"import torch"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You'll need an `HF_TOKEN` to use the HF Inference Endpoints. Login to use it directly within this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from huggingface_hub import login\n",
"\n",
"login(token=os.getenv(\"HF_TOKEN\"), add_to_git_credential=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### (optional) Deploy Argilla\n",
"\n",
"You can skip this step or replace it with any other data evaluation tool, but the quality of your model will suffer from a lack of data quality, so we do recommend looking at your data. If you already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla following [this guide](https://docs.argilla.io/latest/getting_started/quickstart/). \n",
"\n",
"Along with that, you will need to install Argilla as a distilabel extra."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install \"distilabel[argilla, hf-inference-endpoints]\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's make the extra needed imports:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"import argilla as rg"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The dataset\n",
"\n",
"Before starting any project, it is always important to look at your data. Our data is publicly available [on the Hugging Face Hub](https://huggingface.co/datasets/plaguss/argilla_sdk_docs_raw_unstructured?row=0) so we can have a quick look through [their dataset viewer within an embedded iFrame](https://huggingface.co/docs/hub/datasets-viewer-embed). \n",
"\n",
"<iframe src=\"https://huggingface.co/datasets/plaguss/argilla_sdk_docs_raw_unstructured/embed/viewer\" frameborder=\"0\" width=\"100%\" height=\"560px\"></iframe>\n",
"\n",
"As we can see, our dataset contains a column called `chunks`, which was obtained from the Argilla docs. Normally, you would need to download and chunk the data but we will not cover that in this tutorial. To read a full explanation for how this dataset was generated, please refer to [How we leveraged distilabel to create an Argilla 2.0 Chatbot](https://huggingface.co/blog/argilla-chatbot#downloading-and-chunking-data).\n",
"\n",
"Alternatively, we can load the entire dataset to disk with `datasets.load_dataset`."
]
},
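  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A minimal sketch: pull the same dataset locally with `datasets.load_dataset`\n",
    "from datasets import load_dataset\n",
    "\n",
    "ds = load_dataset(\"plaguss/argilla_sdk_docs_raw_unstructured\", split=\"train\")\n",
    "ds"
   ]
  },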
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Synthetic data generation\n",
"\n",
"The [`GenerateSentencePair`](https://distilabel.argilla.io/latest/components-gallery/tasks/generatesentencepair/) component from `distilabel` can be used to generate training datasets for embeddings models. \n",
"\n",
"It is a pre-defined `Task` that given an `anchor` sentence generate data for a specific `action`. Supported actions are: `\"paraphrase\", \"semantically-similar\", \"query\", \"answer\"`. In our case the `chunks` column corresponds to the `anchor`. This means we will use `query` to generate potential queries for a fine-tuning a retrieval model and that we will use `semantically-similar` to generate texts that are similar to the intial anchor for fine-tuning a reranking model.\n",
"\n",
"We will `triplet=True` in order to generate both positive and negative examples, which should help the model generalize better during fine-tuning and we will set `hard_negative=True` to generate more challenging examples that are closer to the anchor and discussed topics.\n",
"\n",
"Lastly, we can seed the LLM with `context` to generate more relevant examples."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"context = (\n",
"\"\"\"\n",
"The text is a chunk from technical Python SDK documentation of Argilla.\n",
"Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets.\n",
"Along with prose explanations, the text chunk may include code snippets and Python references.\n",
"\"\"\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Retrieval\n",
"\n",
"For retrieval, we will thus generate queries that are similar to the `chunks` column. We will use the `query` action to generate potential queries for a fine-tuning a retrieval model.\n",
"\n",
"```python\n",
"generate_sentence_pair = GenerateSentencePair(\n",
" triplet=True, \n",
" hard_negative=True,\n",
" action=\"query\",\n",
" llm=llm,\n",
" input_batch_size=10,\n",
" context=context,\n",
")\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Reranking\n",
"\n",
"For reranking, we will generate texts that are similar to the intial anchor. We will use the `semantically-similar` action to generate texts that are similar to the intial anchor for fine-tuning a reranking model. In this case, we set `hard_negative=False` to generate more diverse and potentially wrong examples, which can be used as negative examples for similarity fine-tuning because [rerankers cannot be fine-tuned using triplets](https://github.com/UKPLab/sentence-transformers/issues/2366).\n",
"\n",
"```python\n",
"generate_sentence_pair = GenerateSentencePair(\n",
" triplet=True,\n",
" hard_negative=False,\n",
" action=\"semantically-similar\",\n",
" llm=llm,\n",
" input_batch_size=10,\n",
" context=context,\n",
")\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Combined pipeline\n",
"\n",
"We will now use the `GenerateSentencePair` task to generate synthetic data for both retrieval and reranking models in a single pipeline. Note that, we map the `chunks` column to the `anchor` argument."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"llm = InferenceEndpointsLLM(\n",
" model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n",
" tokenizer_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n",
")\n",
"\n",
"with Pipeline(name=\"generate\") as pipeline:\n",
" load_dataset = LoadDataFromHub(\n",
" num_examples=15,\n",
" output_mappings={\"chunks\": \"anchor\"},\n",
" )\n",
" generate_retrieval_pairs = GenerateSentencePair(\n",
" name=\"generate_retrieval_pairs\",\n",
" triplet=True,\n",
" hard_negative=True,\n",
" action=\"query\",\n",
" llm=llm,\n",
" input_batch_size=10,\n",
" context=context,\n",
" )\n",
" generate_reranking_pairs = GenerateSentencePair(\n",
" name=\"generate_reranking_pairs\",\n",
" triplet=True,\n",
" hard_negative=False, # to potentially generate non-relevant pairs\n",
" action=\"semantically-similar\",\n",
" llm=llm,\n",
" input_batch_size=10,\n",
" context=context,\n",
" )\n",
"\n",
" load_dataset.connect(generate_retrieval_pairs, generate_reranking_pairs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we can execute this using `pipeline.run`. We will provide some `parameters` to specific components within our pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"generation_kwargs = {\n",
" \"llm\": {\n",
" \"generation_kwargs\": {\n",
" \"temperature\": 0.7,\n",
" \"max_new_tokens\": 512,\n",
" }\n",
" }\n",
"}\n",
"\n",
"distiset = pipeline.run( \n",
" parameters={\n",
" load_dataset.name: {\n",
" \"repo_id\": \"plaguss/argilla_sdk_docs_raw_unstructured\",\n",
" \"split\": \"train\",\n",
" },\n",
" generate_retrieval_pairs.name: generation_kwargs,\n",
" generate_reranking_pairs.name: generation_kwargs,\n",
" },\n",
" use_cache=False, # False for demo\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data generation can be a expensive, so it is recommended to store the data somewhere. For now, we will store it on the Hugging Face Hub, using our `push_to_hub` method."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"distiset.push_to_hub(\"[your-owner-name]/example-retrieval-reranking-dataset\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have got 2 different leaf/end nodes, therefore we've got a distil configurations we can access, one for the retrieval data, and one for the reranking data.\n",
"\n",
"<iframe\n",
" src=\"https://huggingface.co/datasets/distilabel-internal-testing/example-retrieval-reranking-dataset/embed/viewer/generate_reranking_pairs/train\"\n",
" frameborder=\"0\"\n",
" width=\"100%\"\n",
" height=\"560px\"\n",
"></iframe>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at these initial examples, we can see they nicely capture the essence of the `chunks` column but we will need to evaluate the quality of the data a bit more before we can use it for fine-tuning."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data quality evaluation \n",
"\n",
"Data is never as clean as it can be and this also holds for synthetically generated data too, therefore, it is always good to spent some time and look at your data.\n",
"\n",
"### Feature engineering\n",
"\n",
"In order to evaluate the quality of our data we will use features of the models that we intent to fine-tune as proxy for data quality. We can then use these features to filter out the best examples.\n",
"\n",
"In order to choose a good default model, we will use the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard). We want to optimize for size and speed, so we will set model size `<100M` and then filter for `Retrieval` and `Reranking` based on the highest average score, resulting in [Snowflake/snowflake-arctic-embed-s](https://huggingface.co/Snowflake/snowflake-arctic-embed-s) and [sentence-transformers/all-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2) respectively.\n",
"\n",
"<iframe\n",
"\tsrc=\"https://mteb-leaderboard.hf.space\"\n",
"\tframeborder=\"0\"\n",
"\twidth=\"100%\"\n",
"\theight=\"600\"\n",
"></iframe>\n",
"\n",
"#### Retrieval\n",
"\n",
"For retrieval, we will compute similarities for the current embeddings of `anchor-positive`, `positive-negative` and `anchor-negative` pairs. We assume that an overlap of these similarities will cause the model to have difficulties generalizing and therefore we can use these features to evaluate the quality of our data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model_id = \"Snowflake/snowflake-arctic-embed-m\" # Hugging Face model ID\n",
"\n",
"model_retrieval = SentenceTransformer(\n",
" model_id, device=\"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we will encode the generated text pairs and compute the similarities. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics.pairwise import cosine_similarity\n",
"\n",
"def get_embeddings(texts):\n",
" vectors = model_retrieval.encode(texts)\n",
" return [vector.tolist() for vector in vectors]\n",
"\n",
"\n",
"def get_similarities(vector_batch_a, vector_batch_b):\n",
" similarities = []\n",
" for vector_a, vector_b in zip(vector_batch_a, vector_batch_b):\n",
" similarity = cosine_similarity([vector_a], [vector_b])[0][0]\n",
" similarities.append(similarity)\n",
" return similarities\n",
"\n",
"def format_data_retriever(batch):# -> Any:\n",
" batch[\"anchor-vector\"] = get_embeddings(batch[\"anchor\"])\n",
" batch[\"positive-vector\"] = get_embeddings(batch[\"positive\"])\n",
" batch[\"negative-vector\"] = get_embeddings(batch[\"negative\"]) \n",
" batch[\"similarity-positive-negative\"] = get_similarities(batch[\"positive-vector\"], batch[\"negative-vector\"])\n",
" batch[\"similarity-anchor-positive\"] = get_similarities(batch[\"anchor-vector\"], batch[\"positive-vector\"])\n",
" batch[\"similarity-anchor-negative\"] = get_similarities(batch[\"anchor-vector\"], batch[\"negative-vector\"])\n",
" return batch\n",
"\n",
"dataset_generate_retrieval_pairs = distiset[\"generate_retrieval_pairs\"][\"train\"].map(format_data_retriever, batched=True, batch_size=250)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### Reranking\n",
"\n",
"For reranking, we will compute the compute the relevance scores from an existing reranker model for `anchor-positive`, `positive-negative` and `anchor-negative` pais and make a similar assumption as for the retrieval model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model_id = \"sentence-transformers/all-MiniLM-L12-v2\"\n",
"\n",
"model = CrossEncoder(model_id)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we will compute the similarity for the generated text pairs using the reranker. On top of that, we will compute an `anchor-vector` to allow for doing semantic search."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def format_data_retriever(batch):# -> Any:\n",
" batch[\"anchor-vector\"] = get_embeddings(batch[\"anchor\"])\n",
" batch[\"similarity-positive-negative\"] = model.predict(zip(batch[\"positive-vector\"], batch[\"negative-vector\"]))\n",
" batch[\"similarity-anchor-positive\"] = model.predict(zip(batch[\"anchor-vector\"], batch[\"positive-vector\"]))\n",
" batch[\"similarity-anchor-negative\"] = model.predict(zip(batch[\"anchor-vector\"], batch[\"negative-vector\"]))\n",
" return batch\n",
"\n",
"dataset_generate_reranking_pairs = distiset[\"generate_reranking_pairs\"][\"train\"].map(format_data_retriever, batched=True, batch_size=250)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And voila, we have our proxies for quality evaluation which we can use to filter out the best and worst examples."
]
},
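  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For instance, a minimal sketch of such a filter for the retrieval data (the `0.7` threshold is an arbitrary value chosen for illustration): keep the rows whose positive is clearly closer to the anchor than the negative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: keep rows where the anchor-positive similarity is reasonably high and\n",
    "# clearly above the anchor-negative similarity; the threshold is illustrative only.\n",
    "filtered_retrieval_pairs = dataset_generate_retrieval_pairs.filter(\n",
    "    lambda row: row[\"similarity-anchor-positive\"] > 0.7\n",
    "    and row[\"similarity-anchor-positive\"] > row[\"similarity-anchor-negative\"]\n",
    ")"
   ]
  },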
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### (Optional) Argilla\n",
"\n",
"To get the most out of you data and actually look at our data, we will use Argilla. If you are not familiar with Argilla, we recommend taking a look at the [Argilla quickstart docs](https://docs.argilla.io/latest/getting_started/quickstart/). Alternatively, you can use your Hugging Face account to login to the [Argilla demo Space](https://argilla-argilla-template-space.hf.space).\n",
"\n",
"To start exploring data, we first need to define an `argilla.Dataset`. We will create a basic datset with some input `TextFields` for the `anchor` and output `TextQuestions` for the `positive` and `negative` pairs. Additionally, we will use the `file_name` as `MetaDataProperty`. Lastly, we will be re-using the vectors obtained from our previous step to allow for semantic search and we will add te similarity scores for some basic filtering and sorting."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we need to define the setting for our Argilla dataset. We will create two different datasets, one for the retrieval data and one for the reranking data to ensure our annotators can focus on the task at hand."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import argilla as rg\n",
"from argilla._exceptions import ConflictError\n",
"\n",
"api_key = \"ohh so secret\"\n",
"api_url = \"https://[your-owner-name]-[your-space-name].hf.space\"\n",
"\n",
"client = rg.Argilla(api_url=api_url, api_key=api_key)\n",
"\n",
"settings = rg.Settings(\n",
" fields=[\n",
" rg.TextField(\"anchor\")\n",
" ],\n",
" questions=[\n",
" rg.TextQuestion(\"positive\"),\n",
" rg.TextQuestion(\"negative\"),\n",
" rg.LabelQuestion(\n",
" name=\"is_positive_relevant\",\n",
" title=\"Is the positive query relevant?\",\n",
" labels=[\"yes\", \"no\"],\n",
" ),\n",
" rg.LabelQuestion(\n",
" name=\"is_negative_irrelevant\",\n",
" title=\"Is the negative query irrelevant?\",\n",
" labels=[\"yes\", \"no\"],\n",
" )\n",
" ],\n",
" metadata=[\n",
" rg.TermsMetadataProperty(\"filename\"),\n",
" rg.FloatMetadataProperty(\"similarity-positive-negative\"),\n",
" rg.FloatMetadataProperty(\"similarity-anchor-positive\"),\n",
" rg.FloatMetadataProperty(\"similarity-anchor-negative\"),\n",
" ],\n",
" vectors=[\n",
" rg.VectorField(\"anchor-vector\", dimensions=model.get_sentence_embedding_dimension())\n",
" ]\n",
")\n",
"rg_datasets = []\n",
"for dataset_name in [\"generate_retrieval_pairs\", \"generate_reranking_pairs\"]:\n",
" ds = rg.Dataset(\n",
" name=dataset_name,\n",
" settings=settings\n",
" )\n",
" try:\n",
" ds.create()\n",
" except ConflictError:\n",
" ds = client.datasets(dataset_name)\n",
" rg_datasets.append(ds)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we've got our dataset definitions setup in Argilla, we can upload our data to Argilla."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ds_datasets = [dataset_generate_retrieval_pairs, dataset_generate_reranking_pairs]\n",
"\n",
"records = []\n",
"\n",
"for rg_dataset, ds_dataset in zip(rg_datasets, ds_datasets):\n",
" for idx, entry in enumerate(ds_dataset):\n",
" records.append(\n",
" rg.Record(\n",
" id=idx,\n",
" fields={\"anchor\": entry[\"anchor\"]},\n",
" suggestions=[\n",
" rg.Suggestion(\"positive\", value=entry[\"positive\"], agent=\"gpt-4o\", type=\"model\"),\n",
" rg.Suggestion(\"negative\", value=entry[\"negative\"], agent=\"gpt-4o\", type=\"model\"),\n",
" ],\n",
" metadata={\n",
" \"filename\": entry[\"filename\"],\n",
" \"similarity-positive-negative\": entry[\"similarity-positive-negative\"],\n",
" \"similarity-anchor-positive\": entry[\"similarity-anchor-positive\"],\n",
" \"similarity-anchor-negative\": entry[\"similarity-anchor-negative\"]\n",
" },\n",
" vectors={\"anchor-vector\": entry[\"anchor-vector\"]}\n",
" )\n",
" )\n",
" rg_dataset.records.log(records)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can explore the UI and add a final human touch to get the most out of our dataset. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fine-tuning\n",
"\n",
"At last, we can fine-tune our models. We will use the `sentence-transformers` library to fine-tune our models.\n",
"\n",
"### Retrieval\n",
"\n",
"For retrieval, we have created a script that fine-tunes a model on our generated data the generated data based [https://github.com/argilla-io/argilla-sdk-chatbot/blob/main/train_embedding.ipynb](https://github.com/argilla-io/argilla-sdk-chatbot/blob/main/train_embedding.ipynb).You can also [open it in Google Colab directly](https://githubtocolab.com/argilla-io/argilla-sdk-chatbot/blob/main/train_embedding.ipynb)."
]
},
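  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch (not the linked script), a fine-tuning run over the generated triplets could look like this with the `sentence-transformers` v3 trainer API; the base model choice and the default hyperparameters are assumptions made for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer\n",
    "from sentence_transformers.losses import MultipleNegativesRankingLoss\n",
    "\n",
    "# The (anchor, positive, negative) columns map directly onto a triplet-style loss\n",
    "train_dataset = dataset_generate_retrieval_pairs.select_columns(\n",
    "    [\"anchor\", \"positive\", \"negative\"]\n",
    ")\n",
    "\n",
    "st_model = SentenceTransformer(\"Snowflake/snowflake-arctic-embed-s\")\n",
    "trainer = SentenceTransformerTrainer(\n",
    "    model=st_model,\n",
    "    train_dataset=train_dataset,\n",
    "    loss=MultipleNegativesRankingLoss(st_model),\n",
    ")\n",
    "trainer.train()"
   ]
  },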
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Reranking\n",
"\n",
"For reranking, `sentence-transformers` provides a script that shows [how to fine-tune a CrossEncoder models](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/cross-encoder). Ad of now, there is [some uncertainty over fine-tuning CrossEncoder models with triplets](https://github.com/UKPLab/sentence-transformers/issues/2366) but you can still use the `positive` and `anchor`"
]
},
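  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A hedged sketch of that pair-based approach: build binary relevance labels from the `positive` and `negative` columns and fit a `CrossEncoder` on them; the base model and hyperparameters are illustrative assumptions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from torch.utils.data import DataLoader\n",
    "from sentence_transformers import CrossEncoder, InputExample\n",
    "\n",
    "# Build (anchor, text) pairs with binary relevance labels from the triplets\n",
    "train_examples = []\n",
    "for row in dataset_generate_reranking_pairs:\n",
    "    train_examples.append(InputExample(texts=[row[\"anchor\"], row[\"positive\"]], label=1.0))\n",
    "    train_examples.append(InputExample(texts=[row[\"anchor\"], row[\"negative\"]], label=0.0))\n",
    "\n",
    "ce_model = CrossEncoder(\"sentence-transformers/all-MiniLM-L12-v2\", num_labels=1)\n",
    "train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)\n",
    "ce_model.fit(train_dataloader=train_dataloader, epochs=1)"
   ]
  },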
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusions\n",
"\n",
"In this tutorial, we present an end-to-end example of fine-tuning retrievers and rerankers for RAG. This serves as a good starting point for optimizing and maintaining your data and model but need to be adapted to your specific use case.\n",
"\n",
"We started with some seed data from the Argilla docs, generated synthetic data for retrieval and reranking models, evaluated the quality of the data, and showed how to fine-tune the models. We also used Argilla to get a human touch on the data."
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".env",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Clean an existing preference dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- **Goal**: Clean an existing preference dataset by providing AI feedback on the quality of the data.\n",
"- **Libraries**: [argilla](https://github.com/argilla-io/argilla), [hf-inference-endpoints](https://github.com/huggingface/huggingface_hub)\n",
"- **Components**: [LoadDataFromDicts](https://distilabel.argilla.io/dev/components-gallery/steps/loaddatafromdicts/), [UltraFeedback](https://distilabel.argilla.io/latest/components-gallery/tasks/ultrafeedback/), [KeepColumns](https://distilabel.argilla.io/latest/components-gallery/steps/groupcolumns/), [PreferenceToArgilla](https://distilabel.argilla.io/latest/components-gallery/steps/textgenerationtoargilla/), [InferenceEndpointsLLM](https://distilabel.argilla.io/latest/components-gallery/llms/inferenceendpointsllm/), [GlobalStep](../../how_to_guides/basic/step/global_step.md)\n",
"\n",
"![Knowledge graph figure](../../../assets/pipelines/clean-dataset.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting Started"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Install the dependencies\n",
"\n",
"To complete this tutorial, you need to install the distilabel SDK and a few third-party libraries via pip. We will be using **the free but rate-limited Hugging Face serverless Inference API** for this tutorial, so we need to install this as an extra distilabel dependency. You can install them by running the following command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install \"distilabel[hf-inference-endpoints]\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install \"transformers~=4.0\" \"torch~=2.0\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's make the required imports:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"\n",
"from datasets import load_dataset\n",
"\n",
"from distilabel.models import InferenceEndpointsLLM\n",
"from distilabel.pipeline import Pipeline\n",
"from distilabel.steps import (\n",
" KeepColumns,\n",
" LoadDataFromDicts,\n",
" PreferenceToArgilla,\n",
")\n",
"from distilabel.steps.tasks import UltraFeedback"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You'll need an `HF_TOKEN` to use the HF Inference Endpoints. Login to use it directly within this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from huggingface_hub import login\n",
"\n",
"login(token=os.getenv(\"HF_TOKEN\"), add_to_git_credential=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### (optional) Deploy Argilla\n",
"\n",
"You can skip this step or replace it with any other data evaluation tool, but the quality of your model will suffer from a lack of data quality, so we do recommend looking at your data. If you already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla following [this guide](https://docs.argilla.io/latest/getting_started/quickstart/). \n",
"\n",
"Along with that, you will need to install Argilla as a distilabel extra."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install \"distilabel[argilla, hf-inference-endpoints]\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case, we will clean a preference dataset, so we will use the [`Intel/orca_dpo_pairs`](https://huggingface.co/datasets/Intel/orca_dpo_pairs) dataset from the Hugging Face Hub."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<iframe\n",
" src=\"https://huggingface.co/datasets/Intel/orca_dpo_pairs/embed/viewer/default/train\"\n",
" frameborder=\"0\"\n",
" width=\"100%\"\n",
" height=\"560px\"\n",
"></iframe>"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"dataset = load_dataset(\"Intel/orca_dpo_pairs\", split=\"train[:20]\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we will shuffle the `chosen` and `rejected` columns to avoid any bias in the dataset."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"def shuffle_and_track(chosen, rejected):\n",
" pair = [chosen, rejected]\n",
" random.shuffle(pair)\n",
" order = [\"chosen\" if x == chosen else \"rejected\" for x in pair]\n",
" return {\"generations\": pair, \"order\": order}\n",
"\n",
"dataset = dataset.map(lambda x: shuffle_and_track(x[\"chosen\"], x[\"rejected\"]))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"dataset = dataset.to_list()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"??? tip \"As a custom step\"\n",
" You can also [create a custom step](../../how_to_guides/basic/step/global_step.md) in a separate module, import it and add it to the pipeline after loading the `orca_dpo_pairs` dataset using the `LoadDataFromHub` step.\n",
"\n",
" ```python title=\"shuffle_step.py\"\n",
" from typing import TYPE_CHECKING, List\n",
" from distilabel.steps import GlobalStep, StepInput\n",
"\n",
" if TYPE_CHECKING:\n",
" from distilabel.typing import StepOutput\n",
" \n",
" import random\n",
"\n",
" class ShuffleStep(GlobalStep):\n",
" @property\n",
" def inputs(self):\n",
" \"\"\"Returns List[str]: The inputs of the step.\"\"\"\n",
" return [\"instruction\", \"chosen\", \"rejected\"]\n",
"\n",
" @property\n",
" def outputs(self):\n",
" \"\"\"Returns List[str]: The outputs of the step.\"\"\"\n",
" return [\"instruction\", \"generations\", \"order\"]\n",
"\n",
" def process(self, inputs: StepInput):\n",
" \"\"\"Returns StepOutput: The outputs of the step.\"\"\"\n",
" outputs = []\n",
"\n",
" for input in inputs:\n",
" chosen = input[\"chosen\"]\n",
" rejected = input[\"rejected\"]\n",
" pair = [chosen, rejected]\n",
" random.shuffle(pair)\n",
" order = [\"chosen\" if x == chosen else \"rejected\" for x in pair]\n",
" \n",
" outputs.append({\"instruction\": input[\"instruction\"], \"generations\": pair, \"order\": order})\n",
"\n",
" yield outputs\n",
" ```\n",
" \n",
" ```python\n",
" from shuffle_step import ShuffleStep\n",
" ```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define the pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To clean an existing preference dataset, we will need to define a `Pipeline` with all the necessary steps. However, a similar workflow can be used to clean a SFT dataset. Below, we will go over each step in detail."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load the dataset\n",
"We will use the dataset we just shuffled as source data.\n",
"\n",
"- Component: `LoadDataFromDicts`\n",
"- Input columns: `system`, `question`, `chosen`, `rejected`, `generations` and `order`, the same keys as in the loaded list of dictionaries.\n",
"- Output columns: `system`, `instruction`, `chosen`, `rejected`, `generations` and `order`. We will use `output_mappings` to rename the columns."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"([{'system': '',\n",
" 'question': \"You will be given a definition of a task first, then some input of the task.\\nThis task is about using the specified sentence and converting the sentence to Resource Description Framework (RDF) triplets of the form (subject, predicate object). The RDF triplets generated must be such that the triplets accurately capture the structure and semantics of the input sentence. The input is a sentence and the output is a list of triplets of the form [subject, predicate, object] that capture the relationships present in the sentence. When a sentence has more than 1 RDF triplet possible, the output must contain all of them.\\n\\nAFC Ajax (amateurs)'s ground is Sportpark De Toekomst where Ajax Youth Academy also play.\\nOutput:\",\n",
" 'chosen': '[\\n [\"AFC Ajax (amateurs)\", \"has ground\", \"Sportpark De Toekomst\"],\\n [\"Ajax Youth Academy\", \"plays at\", \"Sportpark De Toekomst\"]\\n]',\n",
" 'rejected': \" Sure, I'd be happy to help! Here are the RDF triplets for the input sentence:\\n\\n[AFC Ajax (amateurs), hasGround, Sportpark De Toekomst]\\n[Ajax Youth Academy, playsAt, Sportpark De Toekomst]\\n\\nExplanation:\\n\\n* AFC Ajax (amateurs) is the subject of the first triplet, and hasGround is the predicate that describes the relationship between AFC Ajax (amateurs) and Sportpark De Toekomst.\\n* Ajax Youth Academy is the subject of the second triplet, and playsAt is the predicate that describes the relationship between Ajax Youth Academy and Sportpark De Toekomst.\\n\\nNote that there may be other possible RDF triplets that could be derived from the input sentence, but the above triplets capture the main relationships present in the sentence.\",\n",
" 'generations': [\" Sure, I'd be happy to help! Here are the RDF triplets for the input sentence:\\n\\n[AFC Ajax (amateurs), hasGround, Sportpark De Toekomst]\\n[Ajax Youth Academy, playsAt, Sportpark De Toekomst]\\n\\nExplanation:\\n\\n* AFC Ajax (amateurs) is the subject of the first triplet, and hasGround is the predicate that describes the relationship between AFC Ajax (amateurs) and Sportpark De Toekomst.\\n* Ajax Youth Academy is the subject of the second triplet, and playsAt is the predicate that describes the relationship between Ajax Youth Academy and Sportpark De Toekomst.\\n\\nNote that there may be other possible RDF triplets that could be derived from the input sentence, but the above triplets capture the main relationships present in the sentence.\",\n",
" '[\\n [\"AFC Ajax (amateurs)\", \"has ground\", \"Sportpark De Toekomst\"],\\n [\"Ajax Youth Academy\", \"plays at\", \"Sportpark De Toekomst\"]\\n]'],\n",
" 'order': ['rejected', 'chosen']}],\n",
" True)"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"load_dataset = LoadDataFromDicts(\n",
" data=dataset[:1],\n",
" output_mappings={\"question\": \"instruction\"},\n",
" pipeline=Pipeline(name=\"showcase-pipeline\"),\n",
")\n",
"load_dataset.load()\n",
"next(load_dataset.process())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Evaluate the responses\n",
"\n",
"To evaluate the quality of the responses, we will use [`meta-llama/Meta-Llama-3.1-70B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct), applying the `UltraFeedback` task that judges the responses according to different dimensions (helpfulness, honesty, instruction-following, truthfulness). For an SFT dataset, you can use [`PrometheusEval`](../papers/prometheus.md) instead.\n",
"\n",
"- Component: `UltraFeedback` task with LLMs using `InferenceEndpointsLLM`\n",
"- Input columns: `instruction`, `generations`\n",
"- Output columns: `ratings`, `rationales`, `distilabel_metadata`, `model_name`\n",
"\n",
"For your use case and to improve the results, you can use any [other LLM of your choice](https://distilabel.argilla.io/latest/components-gallery/llms/)."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{'instruction': \"What's the capital of Spain?\",\n",
" 'generations': ['Madrid', 'Barcelona'],\n",
" 'ratings': [5, 1],\n",
" 'rationales': [\"The answer is correct, directly addressing the question, and is free of hallucinations or unnecessary details. It confidently provides the accurate information, aligning perfectly with the user's intent.\",\n",
" \"The answer is incorrect as Barcelona is not the capital of Spain. This introduces a significant inaccuracy, failing to provide helpful information and deviating entirely from the user's intent.\"],\n",
" 'distilabel_metadata': {'raw_output_ultra_feedback_0': \"#### Output for Text 1\\nRating: 5 (Excellent)\\nRationale: The answer is correct, directly addressing the question, and is free of hallucinations or unnecessary details. It confidently provides the accurate information, aligning perfectly with the user's intent.\\n\\n#### Output for Text 2\\nRating: 1 (Low Quality)\\nRationale: The answer is incorrect as Barcelona is not the capital of Spain. This introduces a significant inaccuracy, failing to provide helpful information and deviating entirely from the user's intent.\"},\n",
" 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"evaluate_responses = UltraFeedback(\n",
" aspect=\"overall-rating\",\n",
" llm=InferenceEndpointsLLM(\n",
" model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n",
" tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n",
" generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n",
" ),\n",
" pipeline=Pipeline(name=\"showcase-pipeline\"),\n",
")\n",
"evaluate_responses.load()\n",
"next(\n",
" evaluate_responses.process(\n",
" [\n",
" {\n",
" \"instruction\": \"What's the capital of Spain?\",\n",
" \"generations\": [\"Madrid\", \"Barcelona\"],\n",
" }\n",
" ]\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Keep only the required columns\n",
"\n",
"We will get rid of the unneeded columns.\n",
"\n",
"- Component: `KeepColumns`\n",
"- Input columns: `system`, `instruction`, `chosen`, `rejected`, `generations`, `ratings`, `rationales`, `distilabel_metadata` and `model_name`\n",
"- Output columns: `instruction`, `chosen`, `rejected`, `generations` and `order`"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{'instruction': \"What's the capital of Spain?\",\n",
" 'generations': ['Madrid', 'Barcelona'],\n",
" 'order': ['chosen', 'rejected'],\n",
" 'ratings': [5, 1],\n",
" 'rationales': ['', ''],\n",
" 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"keep_columns = KeepColumns(\n",
" columns=[\n",
" \"instruction\",\n",
" \"generations\",\n",
" \"order\",\n",
" \"ratings\",\n",
" \"rationales\",\n",
" \"model_name\",\n",
" ],\n",
" pipeline=Pipeline(name=\"showcase-pipeline\"),\n",
")\n",
"keep_columns.load()\n",
"next(\n",
" keep_columns.process(\n",
" [\n",
" {\n",
" \"system\": \"\",\n",
" \"instruction\": \"What's the capital of Spain?\",\n",
" \"chosen\": \"Madrid\",\n",
" \"rejected\": \"Barcelona\",\n",
" \"generations\": [\"Madrid\", \"Barcelona\"],\n",
" \"order\": [\"chosen\", \"rejected\"],\n",
" \"ratings\": [5, 1],\n",
" \"rationales\": [\"\", \"\"],\n",
" \"model_name\": \"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n",
" }\n",
" ]\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### (Optional) Further data curation\n",
"\n",
"You can use Argilla to further curate your data.\n",
"\n",
"- Component: `PreferenceToArgilla` step\n",
"- Input columns: `instruction`, `generations`, `generation_models`, `ratings`\n",
"- Output columns: `instruction`, `generations`, `generation_models`, `ratings`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"to_argilla = PreferenceToArgilla(\n",
" dataset_name=\"cleaned-dataset\",\n",
" dataset_workspace=\"argilla\",\n",
" api_url=\"https://[your-owner-name]-[your-space-name].hf.space\",\n",
" api_key=\"[your-api-key]\",\n",
" num_generations=2\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Run the pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below, you can see the full pipeline definition:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"with Pipeline(name=\"clean-dataset\") as pipeline:\n",
"\n",
" load_dataset = LoadDataFromDicts(\n",
" data=dataset, output_mappings={\"question\": \"instruction\"}\n",
" )\n",
"\n",
" evaluate_responses = UltraFeedback(\n",
" aspect=\"overall-rating\",\n",
" llm=InferenceEndpointsLLM(\n",
" model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n",
" tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n",
" generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n",
" ),\n",
" )\n",
"\n",
" keep_columns = KeepColumns(\n",
" columns=[\n",
" \"instruction\",\n",
" \"generations\",\n",
" \"order\",\n",
" \"ratings\",\n",
" \"rationales\",\n",
" \"model_name\",\n",
" ]\n",
" )\n",
"\n",
" to_argilla = PreferenceToArgilla(\n",
" dataset_name=\"cleaned-dataset\",\n",
" dataset_workspace=\"argilla\",\n",
" api_url=\"https://[your-owner-name]-[your-space-name].hf.space\",\n",
" api_key=\"[your-api-key]\",\n",
" num_generations=2,\n",
" )\n",
"\n",
" load_dataset.connect(evaluate_responses)\n",
" evaluate_responses.connect(keep_columns)\n",
" keep_columns.connect(to_argilla)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's now run the pipeline and clean our preference dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"distiset = pipeline.run()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's check it! If you have loaded the data to Argilla, you can [start annotating in the Argilla UI](https://docs.argilla.io/latest/how_to_guides/annotate/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can push the dataset to the Hub for sharing with the community and [embed it to explore the data](https://huggingface.co/docs/hub/datasets-viewer-embed)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"distiset.push_to_hub(\"[your-owner-name]/example-cleaned-preference-dataset\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<iframe\n",
" src=\"https://huggingface.co/datasets/distilabel-internal-testing/example-cleaned-preference-dataset/embed/viewer/default/train\"\n",
" frameborder=\"0\"\n",
" width=\"100%\"\n",
" height=\"560px\"\n",
"></iframe>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this tutorial, we showcased the detailed steps to build a pipeline for cleaning a preference dataset using distilabel. However, you can customize this pipeline for your own use cases, such as cleaning an SFT dataset or adding custom steps.\n",
"\n",
"We used a preference dataset as our starting point and shuffled the data to avoid any bias. Next, we evaluated the responses using a model through the serverless Hugging Face Inference API, following the UltraFeedback standards. Finally, we kept the needed columns and used Argilla for further curation."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "distilabel-tutorials",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Generate a preference dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- **Goal**: Generate a synthetic preference dataset for DPO/ORPO.\n",
"- **Libraries**: [argilla](https://github.com/argilla-io/argilla), [hf-inference-endpoints](https://github.com/huggingface/huggingface_hub)\n",
"- **Components**: [LoadDataFromHub](https://distilabel.argilla.io/latest/components-gallery/steps/loaddatafromhub/), [TextGeneration](https://distilabel.argilla.io/latest/components-gallery/tasks/textgeneration/), [UltraFeedback](https://distilabel.argilla.io/latest/components-gallery/tasks/ultrafeedback/), [GroupColumns](https://distilabel.argilla.io/latest/components-gallery/steps/groupcolumns/), [FormatTextGenerationDPO](https://distilabel.argilla.io/latest/components-gallery/steps/formattextgenerationdpo/), [PreferenceToArgilla](https://distilabel.argilla.io/latest/components-gallery/steps/textgenerationtoargilla/), [InferenceEndpointsLLM](https://distilabel.argilla.io/latest/components-gallery/llms/inferenceendpointsllm/)\n",
"\n",
"![Knowledge graph figure](../../../assets/pipelines/generate-preference-dataset.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting started"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Install the dependencies\n",
"\n",
"To complete this tutorial, you need to install the distilabel SDK and a few third-party libraries via pip. We will be using **the free but rate-limited Hugging Face serverless Inference API** for this tutorial, so we need to install this as an extra distilabel dependency. You can install them by running the following command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install \"distilabel[hf-inference-endpoints]\""
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"!pip install \"transformers~=4.0\" \"torch~=2.0\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's make the required imports:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from distilabel.models import InferenceEndpointsLLM\n",
"from distilabel.pipeline import Pipeline\n",
"from distilabel.steps import (\n",
" LoadDataFromHub,\n",
" GroupColumns,\n",
" FormatTextGenerationDPO,\n",
" PreferenceToArgilla,\n",
")\n",
"from distilabel.steps.tasks import TextGeneration, UltraFeedback"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You'll need an `HF_TOKEN` to use the HF Inference Endpoints. Log in to use it directly within this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from huggingface_hub import login\n",
"\n",
"login(token=os.getenv(\"HF_TOKEN\"), add_to_git_credential=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### (optional) Deploy Argilla\n",
"\n",
"You can skip this step or replace it with any other data evaluation tool, but the quality of your model will suffer from a lack of data quality, so we do recommend looking at your data. If you already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla following [this guide](https://docs.argilla.io/latest/getting_started/quickstart/). \n",
"\n",
"Along with that, you will need to install Argilla as a distilabel extra."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install \"distilabel[argilla, hf-inference-endpoints]\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define the pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To generate our preference dataset, we will need to define a `Pipeline` with all the necessary steps. Below, we will go over each step in detail."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load the dataset\n",
"\n",
"We will use as source data the [`argilla/10Kprompts-mini`](https://huggingface.co/datasets/argilla/10Kprompts-mini) dataset from the Hugging Face Hub.\n",
"\n",
"<iframe\n",
" src=\"https://huggingface.co/datasets/argilla/10Kprompts-mini/embed/viewer/default/train\"\n",
" frameborder=\"0\"\n",
" width=\"100%\"\n",
" height=\"560px\"\n",
"></iframe>\n",
"\n",
"- Component: `LoadDataFromHub`\n",
"- Input columns: `instruction` and `topic`, the same as in the loaded dataset\n",
"- Output columns: `instruction` and `topic`"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"([{'instruction': 'How can I create an efficient and robust workflow that utilizes advanced automation techniques to extract targeted data, including customer information, from diverse PDF documents and effortlessly integrate it into a designated Google Sheet? Furthermore, I am interested in establishing a comprehensive and seamless system that promptly activates an SMS notification on my mobile device whenever a new PDF document is uploaded to the Google Sheet, ensuring real-time updates and enhanced accessibility.',\n",
" 'topic': 'Software Development'}],\n",
" True)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"load_dataset = LoadDataFromHub(\n",
" repo_id= \"argilla/10Kprompts-mini\",\n",
" num_examples=1,\n",
" pipeline=Pipeline(name=\"showcase-pipeline\"),\n",
" )\n",
"load_dataset.load()\n",
"next(load_dataset.process())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generate responses\n",
"\n",
"We need to generate the responses for the given instructions. We will use two different models available on the Hugging Face Hub through the Serverless Inference API: [`meta-llama/Meta-Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and [`mistralai/Mixtral-8x7B-Instruct-v0.1`](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1). We will also indicate the generation parameters for each model.\n",
"\n",
"- Component: `TextGeneration` task with LLMs using `InferenceEndpointsLLM`\n",
"- Input columns: `instruction`\n",
"- Output columns: `generation`, `distilabel_metadata`, `model_name` for each model\n",
"\n",
"For your use case and to improve the results, you can use any [other LLM of your choice](https://distilabel.argilla.io/latest/components-gallery/llms/)."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[{'instruction': 'Which are the top cities in Spain?', 'generation': 'Spain is a country with a rich culture, history, and architecture, and it has many great cities to visit. Here are some of the top cities in Spain:\\n\\n1. **Madrid**: The capital city of Spain, known for its vibrant nightlife, museums, and historic landmarks like the Royal Palace and Prado Museum.\\n2. **Barcelona**: The second-largest city in Spain, famous for its modernist architecture, beaches, and iconic landmarks like La Sagrada Família and Park Güell, designed by Antoni Gaudí.\\n3. **Valencia**: Located on the Mediterranean coast, Valencia is known for its beautiful beaches, City of Arts and Sciences, and delicious local cuisine, such as paella.\\n4. **Seville**: The capital of Andalusia, Seville is famous for its stunning cathedral, Royal Alcázar Palace, and lively flamenco music scene.\\n5. **Málaga**: A coastal city in southern Spain, Málaga is known for its rich history, beautiful beaches, and being the birthplace of Pablo Picasso.\\n6. **Zaragoza**: Located in the northeastern region of Aragon, Zaragoza is a city with a rich history, known for its Roman ruins, Gothic cathedral, and beautiful parks.\\n7. **Granada**: A city in the Andalusian region, Granada is famous for its stunning Alhambra palace and generalife gardens, a UNESCO World Heritage Site.\\n8. **Bilbao**: A city in the Basque Country, Bilbao is known for its modern architecture, including the Guggenheim Museum, and its rich cultural heritage.\\n9. **Alicante**: A coastal city in the Valencia region, Alicante is famous for its beautiful beaches, historic castle, and lively nightlife.\\n10. **San Sebastián**: A city in the Basque Country, San Sebastián is known for its stunning beaches, gastronomic scene, and cultural events like the San Sebastián International Film Festival.\\n\\nThese are just a few of the many great cities in Spain, each with its own unique character and attractions.', 'distilabel_metadata': {'raw_output_text_generation_0': 'Spain is a country with a rich culture, history, and architecture, and it has many great cities to visit. Here are some of the top cities in Spain:\\n\\n1. **Madrid**: The capital city of Spain, known for its vibrant nightlife, museums, and historic landmarks like the Royal Palace and Prado Museum.\\n2. **Barcelona**: The second-largest city in Spain, famous for its modernist architecture, beaches, and iconic landmarks like La Sagrada Família and Park Güell, designed by Antoni Gaudí.\\n3. **Valencia**: Located on the Mediterranean coast, Valencia is known for its beautiful beaches, City of Arts and Sciences, and delicious local cuisine, such as paella.\\n4. **Seville**: The capital of Andalusia, Seville is famous for its stunning cathedral, Royal Alcázar Palace, and lively flamenco music scene.\\n5. **Málaga**: A coastal city in southern Spain, Málaga is known for its rich history, beautiful beaches, and being the birthplace of Pablo Picasso.\\n6. **Zaragoza**: Located in the northeastern region of Aragon, Zaragoza is a city with a rich history, known for its Roman ruins, Gothic cathedral, and beautiful parks.\\n7. **Granada**: A city in the Andalusian region, Granada is famous for its stunning Alhambra palace and generalife gardens, a UNESCO World Heritage Site.\\n8. **Bilbao**: A city in the Basque Country, Bilbao is known for its modern architecture, including the Guggenheim Museum, and its rich cultural heritage.\\n9. 
**Alicante**: A coastal city in the Valencia region, Alicante is famous for its beautiful beaches, historic castle, and lively nightlife.\\n10. **San Sebastián**: A city in the Basque Country, San Sebastián is known for its stunning beaches, gastronomic scene, and cultural events like the San Sebastián International Film Festival.\\n\\nThese are just a few of the many great cities in Spain, each with its own unique character and attractions.'}, 'model_name': 'meta-llama/Meta-Llama-3-8B-Instruct'}]\n",
"[{'instruction': 'Which are the top cities in Spain?', 'generation': ' Here are some of the top cities in Spain based on various factors such as tourism, culture, history, and quality of life:\\n\\n1. Madrid: The capital and largest city in Spain, Madrid is known for its vibrant nightlife, world-class museums (such as the Prado Museum and Reina Sofia Museum), stunning parks (such as the Retiro Park), and delicious food.\\n\\n2. Barcelona: Famous for its unique architecture, Barcelona is home to several UNESCO World Heritage sites designed by Antoni Gaudí, including the Sagrada Familia and Park Güell. The city also boasts beautiful beaches, a lively arts scene, and delicious Catalan cuisine.\\n\\n3. Valencia: A coastal city located in the east of Spain, Valencia is known for its City of Arts and Sciences, a modern architectural complex that includes a planetarium, opera house, and museum of interactive science. The city is also famous for its paella, a traditional Spanish dish made with rice, vegetables, and seafood.\\n\\n4. Seville: The capital of Andalusia, Seville is famous for its flamenco dancing, stunning cathedral (the largest Gothic cathedral in the world), and the Alcázar, a beautiful palace made up of a series of rooms and courtyards.\\n\\n5. Granada: Located in the foothills of the Sierra Nevada mountains, Granada is known for its stunning Alhambra palace, a Moorish fortress that dates back to the 9th century. The city is also famous for its tapas, a traditional Spanish dish that is often served for free with drinks.\\n\\n6. Bilbao: A city in the Basque Country, Bilbao is famous for its modern architecture, including the Guggenheim Museum, a contemporary art museum designed by Frank Gehry. The city is also known for its pintxos, a type of Basque tapas that are served in bars and restaurants.\\n\\n7. Málaga: A coastal city in Andalusia, Málaga is known for its beautiful beaches, historic sites (including the Alcazaba and Gibralfaro castles), and the Picasso Museum, which is dedicated to the famous Spanish artist who was born in the city.\\n\\nThese are just a few of the many wonderful cities in Spain.', 'distilabel_metadata': {'raw_output_text_generation_0': ' Here are some of the top cities in Spain based on various factors such as tourism, culture, history, and quality of life:\\n\\n1. Madrid: The capital and largest city in Spain, Madrid is known for its vibrant nightlife, world-class museums (such as the Prado Museum and Reina Sofia Museum), stunning parks (such as the Retiro Park), and delicious food.\\n\\n2. Barcelona: Famous for its unique architecture, Barcelona is home to several UNESCO World Heritage sites designed by Antoni Gaudí, including the Sagrada Familia and Park Güell. The city also boasts beautiful beaches, a lively arts scene, and delicious Catalan cuisine.\\n\\n3. Valencia: A coastal city located in the east of Spain, Valencia is known for its City of Arts and Sciences, a modern architectural complex that includes a planetarium, opera house, and museum of interactive science. The city is also famous for its paella, a traditional Spanish dish made with rice, vegetables, and seafood.\\n\\n4. Seville: The capital of Andalusia, Seville is famous for its flamenco dancing, stunning cathedral (the largest Gothic cathedral in the world), and the Alcázar, a beautiful palace made up of a series of rooms and courtyards.\\n\\n5. 
Granada: Located in the foothills of the Sierra Nevada mountains, Granada is known for its stunning Alhambra palace, a Moorish fortress that dates back to the 9th century. The city is also famous for its tapas, a traditional Spanish dish that is often served for free with drinks.\\n\\n6. Bilbao: A city in the Basque Country, Bilbao is famous for its modern architecture, including the Guggenheim Museum, a contemporary art museum designed by Frank Gehry. The city is also known for its pintxos, a type of Basque tapas that are served in bars and restaurants.\\n\\n7. Málaga: A coastal city in Andalusia, Málaga is known for its beautiful beaches, historic sites (including the Alcazaba and Gibralfaro castles), and the Picasso Museum, which is dedicated to the famous Spanish artist who was born in the city.\\n\\nThese are just a few of the many wonderful cities in Spain.'}, 'model_name': 'mistralai/Mixtral-8x7B-Instruct-v0.1'}]\n"
]
}
],
"source": [
"generate_responses = [\n",
" TextGeneration(\n",
" llm=InferenceEndpointsLLM(\n",
" model_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n",
" tokenizer_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n",
" generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n",
" ),\n",
" pipeline=Pipeline(name=\"showcase-pipeline\"),\n",
" ),\n",
" TextGeneration(\n",
" llm=InferenceEndpointsLLM(\n",
" model_id=\"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n",
" tokenizer_id=\"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n",
" generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n",
" ),\n",
" pipeline=Pipeline(name=\"showcase-pipeline\"),\n",
" ),\n",
"]\n",
"for task in generate_responses:\n",
" task.load()\n",
" print(next(task.process([{\"instruction\": \"Which are the top cities in Spain?\"}])))"
]
},
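{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of how another provider plugs in, here is the same `TextGeneration` task with `OpenAILLM` from the components gallery. The model name is only an example, and an `OPENAI_API_KEY` environment variable is assumed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from distilabel.models import OpenAILLM\n",
"\n",
"# Example swap: any LLM from the components gallery works the same way\n",
"generate_with_openai = TextGeneration(\n",
"    llm=OpenAILLM(\n",
"        model=\"gpt-4o-mini\",  # example model name, replace with your choice\n",
"        generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n",
"    ),\n",
"    pipeline=Pipeline(name=\"showcase-pipeline\"),\n",
")"
]
},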
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Group the responses\n",
"\n",
"The task to evaluate the responses needs as input a list of generations. However, each model response was saved in the generation column of the subsets `text_generation_0` and `text_generation_1`. We will combine these two columns into a single column and the `default` subset.\n",
"\n",
"- Component: `GroupColumns`\n",
"- Input columns: `generation` and `model_name`from `text_generation_0` and `text_generation_1`\n",
"- Output columns: `generations` and `model_names`"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{'generations': ['Madrid', 'Barcelona'],\n",
" 'model_names': ['meta-llama/Meta-Llama-3-8B-Instruct',\n",
" 'mistralai/Mixtral-8x7B-Instruct-v0.1']}]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"group_responses = GroupColumns(\n",
" columns=[\"generation\", \"model_name\"],\n",
" output_columns=[\"generations\", \"model_names\"],\n",
" pipeline=Pipeline(name=\"showcase-pipeline\"),\n",
")\n",
"next(\n",
" group_responses.process(\n",
" [\n",
" {\n",
" \"generation\": \"Madrid\",\n",
" \"model_name\": \"meta-llama/Meta-Llama-3-8B-Instruct\",\n",
" },\n",
" ],\n",
" [\n",
" {\n",
" \"generation\": \"Barcelona\",\n",
" \"model_name\": \"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n",
" }\n",
" ],\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Evaluate the responses\n",
"\n",
"To build our preference dataset, we need to evaluate the responses generated by the models. We will use [`meta-llama/Meta-Llama-3-70B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) for this, applying the `UltraFeedback` task that judges the responses according to different dimensions (helpfulness, honesty, instruction-following, truthfulness).\n",
"\n",
"- Component: `UltraFeedback` task with LLMs using `InferenceEndpointsLLM`\n",
"- Input columns: `instruction`, `generations`\n",
"- Output columns: `ratings`, `rationales`, `distilabel_metadata`, `model_name`\n",
"\n",
"For your use case and to improve the results, you can use any [other LLM of your choice](https://distilabel.argilla.io/latest/components-gallery/llms/)."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{'instruction': \"What's the capital of Spain?\",\n",
" 'generations': ['Madrid', 'Barcelona'],\n",
" 'ratings': [5, 1],\n",
" 'rationales': [\"The answer is correct, directly addressing the question, and is free of hallucinations or unnecessary details. It confidently provides the accurate information, aligning perfectly with the user's intent.\",\n",
" \"The answer is incorrect as Barcelona is not the capital of Spain. This introduces a significant inaccuracy, failing to provide helpful information and deviating entirely from the user's intent.\"],\n",
" 'distilabel_metadata': {'raw_output_ultra_feedback_0': \"#### Output for Text 1\\nRating: 5 (Excellent)\\nRationale: The answer is correct, directly addressing the question, and is free of hallucinations or unnecessary details. It confidently provides the accurate information, aligning perfectly with the user's intent.\\n\\n#### Output for Text 2\\nRating: 1 (Low Quality)\\nRationale: The answer is incorrect as Barcelona is not the capital of Spain. This introduces a significant inaccuracy, failing to provide helpful information and deviating entirely from the user's intent.\"},\n",
" 'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'}]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"evaluate_responses = UltraFeedback(\n",
" aspect=\"overall-rating\",\n",
" llm=InferenceEndpointsLLM(\n",
" model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n",
" tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n",
" generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n",
" ),\n",
" pipeline=Pipeline(name=\"showcase-pipeline\"),\n",
")\n",
"evaluate_responses.load()\n",
"next(\n",
" evaluate_responses.process(\n",
" [\n",
" {\n",
" \"instruction\": \"What's the capital of Spain?\",\n",
" \"generations\": [\"Madrid\", \"Barcelona\"],\n",
" }\n",
" ]\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Convert to a preference dataset\n",
"\n",
"- You can automatically convert it to a preference dataset with the `chosen` and `rejected` columns.\n",
" - Component: `FormatTextGenerationDPO` step\n",
" - Input columns: `instruction`, `generations`, `generation_models`, `ratings`\n",
" - Output columns: `prompt`, `prompt_id`, `chosen`, `chosen_model`, `chosen_rating`, `rejected`, `rejected_model`, `rejected_rating`"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{'instruction': \"What's the capital of Spain?\",\n",
" 'generations': ['Madrid', 'Barcelona'],\n",
" 'generation_models': ['Meta-Llama-3-8B-Instruct',\n",
" 'Mixtral-8x7B-Instruct-v0.1'],\n",
" 'ratings': [5, 1],\n",
" 'prompt': \"What's the capital of Spain?\",\n",
" 'prompt_id': '26174c953df26b3049484e4721102dca6b25d2de9e3aa22aa84f25ed1c798512',\n",
" 'chosen': [{'role': 'user', 'content': \"What's the capital of Spain?\"},\n",
" {'role': 'assistant', 'content': 'Madrid'}],\n",
" 'chosen_model': 'Meta-Llama-3-8B-Instruct',\n",
" 'chosen_rating': 5,\n",
" 'rejected': [{'role': 'user', 'content': \"What's the capital of Spain?\"},\n",
" {'role': 'assistant', 'content': 'Barcelona'}],\n",
" 'rejected_model': 'Mixtral-8x7B-Instruct-v0.1',\n",
" 'rejected_rating': 1}]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"format_dpo = FormatTextGenerationDPO(pipeline=Pipeline(name=\"showcase-pipeline\"))\n",
"format_dpo.load()\n",
"next(\n",
" format_dpo.process(\n",
" [\n",
" {\n",
" \"instruction\": \"What's the capital of Spain?\",\n",
" \"generations\": [\"Madrid\", \"Barcelona\"],\n",
" \"generation_models\": [\n",
" \"Meta-Llama-3-8B-Instruct\",\n",
" \"Mixtral-8x7B-Instruct-v0.1\",\n",
" ],\n",
" \"ratings\": [5, 1],\n",
" }\n",
" ]\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Or you can use Argilla to manually label the data and convert it to a preference dataset.\n",
" - Component: `PreferenceToArgilla` step\n",
" - Input columns: `instruction`, `generations`, `generation_models`, `ratings`\n",
" - Output columns: `instruction`, `generations`, `generation_models`, `ratings`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"to_argilla = PreferenceToArgilla(\n",
" dataset_name=\"preference-dataset\",\n",
" dataset_workspace=\"argilla\",\n",
" api_url=\"https://[your-owner-name]-[your-space-name].hf.space\",\n",
" api_key=\"[your-api-key]\",\n",
" num_generations=2\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Run the pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below, you can see the full pipeline definition:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"with Pipeline(name=\"generate-dataset\") as pipeline:\n",
"\n",
" load_dataset = LoadDataFromHub(repo_id=\"argilla/10Kprompts-mini\")\n",
"\n",
" generate_responses = [\n",
" TextGeneration(\n",
" llm=InferenceEndpointsLLM(\n",
" model_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n",
" tokenizer_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n",
" generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n",
" )\n",
" ),\n",
" TextGeneration(\n",
" llm=InferenceEndpointsLLM(\n",
" model_id=\"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n",
" tokenizer_id=\"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n",
" generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n",
" )\n",
" ),\n",
" ]\n",
"\n",
" group_responses = GroupColumns(\n",
" columns=[\"generation\", \"model_name\"],\n",
" output_columns=[\"generations\", \"model_names\"],\n",
" )\n",
"\n",
" evaluate_responses = UltraFeedback(\n",
" aspect=\"overall-rating\",\n",
" llm=InferenceEndpointsLLM(\n",
" model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n",
" tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n",
" generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n",
" )\n",
" )\n",
"\n",
" format_dpo = FormatTextGenerationDPO()\n",
"\n",
" to_argilla = PreferenceToArgilla(\n",
" dataset_name=\"preference-dataset\",\n",
" dataset_workspace=\"argilla\",\n",
" api_url=\"https://[your-owner-name]-[your-space-name].hf.space\",\n",
" api_key=\"[your-api-key]\",\n",
" num_generations=2\n",
" )\n",
"\n",
" for task in generate_responses:\n",
" load_dataset.connect(task)\n",
" task.connect(group_responses)\n",
" group_responses.connect(evaluate_responses)\n",
" evaluate_responses.connect(format_dpo, to_argilla)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's now run the pipeline and generate the preference dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"distiset = pipeline.run()"
]
},
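{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also override step parameters at runtime instead of editing the pipeline definition. A minimal sketch, using the step objects defined above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"distiset = pipeline.run(\n",
"    parameters={\n",
"        evaluate_responses.name: {\n",
"            \"llm\": {\"generation_kwargs\": {\"max_new_tokens\": 512, \"temperature\": 0.7}}\n",
"        }\n",
"    },\n",
"    use_cache=False,  # recompute every batch instead of reusing cached results\n",
")"
]
},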
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's check the preference dataset! If you have loaded the data to Argilla, you can [start annotating in the Argilla UI](https://docs.argilla.io/latest/how_to_guides/annotate/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can push the dataset to the Hub for sharing with the community and [embed it to explore the data](https://huggingface.co/docs/hub/datasets-viewer-embed)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"distiset.push_to_hub(\"[your-owner-name]/example-preference-dataset\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<iframe\n",
" src=\"https://huggingface.co/datasets/distilabel-internal-testing/example-generate-preference-dataset/embed/viewer/format_text_generation_d_p_o_0/train\"\n",
" frameborder=\"0\"\n",
" width=\"100%\"\n",
" height=\"560px\"\n",
"></iframe>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this tutorial, we showcased the detailed steps to build a pipeline for generating a preference dataset using distilabel. You can customize this pipeline for your own use cases and share your datasets with the community through the Hugging Face Hub, or use them to train a model for DPO or ORPO.\n",
"\n",
"We used a dataset containing prompts to generate responses using two different models through the serverless Hugging Face Inference API. Next, we evaluated the responses using a third model, following the UltraFeedback standards. Finally, we converted the data to a preference dataset and used Argilla for further curation."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "distilabel-tutorials",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Generate synthetic text classification data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- **Goal**: Generate synthetic text classification data to augment an imbalanced and limited dataset for training a topic classifier. In addition, generate new data for training a fact-based versus opinion-based classifier to add a new label.\n",
"- **Libraries**: [argilla](https://github.com/argilla-io/argilla), [hf-inference-endpoints](https://github.com/huggingface/huggingface_hub), [SetFit](https://github.com/huggingface/setfit)\n",
"- **Components**: [LoadDataFromDicts](https://distilabel.argilla.io/latest/components-gallery/steps/loaddatafromdicts/), [EmbeddingTaskGenerator](https://distilabel.argilla.io/latest/components-gallery/tasks/embeddingtaskgenerator/), [GenerateTextClassificationData](https://distilabel.argilla.io/latest/components-gallery/tasks/generatetextclassificationdata/)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting started\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Install the dependencies\n",
"\n",
"To complete this tutorial, you need to install the distilabel SDK and a few third-party libraries via pip. We will be using **the free but rate-limited Hugging Face serverless Inference API** for this tutorial, so we need to install this as an extra distilabel dependency. You can install them by running the following command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install \"distilabel[hf-inference-endpoints]\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install \"transformers~=4.40\" \"torch~=2.0\" \"setfit~=1.0\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's make the required imports:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"from collections import Counter\n",
"\n",
"from datasets import load_dataset, Dataset\n",
"from distilabel.models import InferenceEndpointsLLM\n",
"from distilabel.pipeline import Pipeline\n",
"from distilabel.steps import LoadDataFromDicts\n",
"from distilabel.steps.tasks import (\n",
" GenerateTextClassificationData,\n",
")\n",
"from setfit import SetFitModel, Trainer, sample_dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You'll need an `HF_TOKEN` to use the HF Inference Endpoints. Log in to use it directly within this notebook.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from huggingface_hub import login\n",
"\n",
"login(token=os.getenv(\"HF_TOKEN\"), add_to_git_credential=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### (optional) Deploy Argilla\n",
"\n",
"You can skip this step or replace it with any other data evaluation tool, but the quality of your model will suffer from a lack of data quality, so we do recommend looking at your data. If you already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla following [this guide](https://docs.argilla.io/latest/getting_started/quickstart/).\n",
"\n",
"Along with that, you will need to install Argilla as a distilabel extra.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install \"distilabel[argilla, hf-inference-endpoints]\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The dataset\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will use the [`fancyzhx/ag_news`](https://huggingface.co/datasets/fancyzhx/ag_news) dataset from the Hugging Face Hub as our original data source. To simulate a real-world scenario with imbalanced and limited data, we will load only 20 samples from this dataset.\n",
"\n",
"<iframe\n",
" src=\"https://huggingface.co/datasets/fancyzhx/ag_news/embed/viewer/default/train\"\n",
" frameborder=\"0\"\n",
" width=\"100%\"\n",
" height=\"560px\"\n",
"></iframe>\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hf_dataset = load_dataset(\"fancyzhx/ag_news\", split=\"train[-20:]\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can retrieve the available labels in the dataset and examine the current data distribution."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{0: 'World', 1: 'Sports', 2: 'Business', 3: 'Sci/Tech'}\n",
"Counter({0: 12, 1: 6, 2: 2})\n"
]
}
],
"source": [
"labels_topic = hf_dataset.features[\"label\"].names\n",
"id2str = {i: labels_topic[i] for i in range(len(labels_topic))}\n",
"print(id2str)\n",
"print(Counter(hf_dataset[\"label\"]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As observed, the dataset is imbalanced, with most samples falling under the `World` category, while the `Sci/Tech` category is entirely missing. Moreover, there are insufficient samples to effectively train a topic classification model.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will also define the labels for the new classification task."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"labels_fact_opinion = [\"Fact-based\", \"Opinion-based\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define the text classification task\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To generate the data we will use the `GenerateTextClassificationData` task. This task will use as input classification tasks and we can define the language, difficulty and clarity required for the generated data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[{'role': 'user', 'content': 'You have been assigned a text classification task: Classify the news article as fact-based or opinion-based\\n\\nYour mission is to write one text classification example for this task in JSON format. The JSON object must contain the following keys:\\n - \"input_text\": a string, the input text specified by the classification task.\\n - \"label\": a string, the correct label of the input text.\\n - \"misleading_label\": a string, an incorrect label that is related to the task.\\n\\nPlease adhere to the following guidelines:\\n - The \"input_text\" should be diverse in expression.\\n - The \"misleading_label\" must be a valid label for the given task, but not as appropriate as the \"label\" for the \"input_text\".\\n - The values for all fields should be in English.\\n - Avoid including the values of the \"label\" and \"misleading_label\" fields in the \"input_text\", that would make the task too easy.\\n - The \"input_text\" is clear and requires college level education to comprehend.\\n\\nYour output must always be a JSON object only, do not explain yourself or output anything else. Be creative!'}]\n"
]
}
],
"source": [
"task = GenerateTextClassificationData(\n",
" language=\"English\",\n",
" difficulty=\"college\",\n",
" clarity=\"clear\",\n",
" num_generations=1,\n",
" llm=InferenceEndpointsLLM(\n",
" model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
" tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
" generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.4},\n",
" ),\n",
" input_batch_size=5,\n",
")\n",
"task.load()\n",
"result = next(\n",
" task.process([{\"task\": \"Classify the news article as fact-based or opinion-based\"}])\n",
")\n",
"print(result[0][\"distilabel_metadata\"][\"raw_input_generate_text_classification_data_0\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For our use case, we only need to generate data for two tasks: a topic classification task and a fact versus opinion classification task. Therefore, we will define the tasks accordingly. As we will be using an smaller model for generation, we will select 2 random labels for each topic classification task and change the order for the fact versus opinion classification task ensuring more diversity in the generated data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"task_templates = [\n",
" \"Determine the news article as {}\",\n",
" \"Classify news article as {}\",\n",
" \"Identify the news article as {}\",\n",
" \"Categorize the news article as {}\",\n",
" \"Label the news article using {}\",\n",
" \"Annotate the news article based on {}\",\n",
" \"Determine the theme of a news article from {}\",\n",
" \"Recognize the topic of the news article as {}\",\n",
"]\n",
"\n",
"classification_tasks = [\n",
" {\"task\": action.format(\" or \".join(random.sample(labels_topic, 2)))}\n",
" for action in task_templates for _ in range(4)\n",
"] + [\n",
" {\"task\": action.format(\" or \".join(random.sample(labels_fact_opinion, 2)))}\n",
" for action in task_templates\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Run the pipeline\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, it's time to define and run the pipeline. As mentioned, we will load the written tasks and feed them into the `GenerateTextClassificationData` task. For our use case, we will be using `Meta-Llama-3.1-8B-Instruct` via the `InferenceEndpointsLLM`, with different degrees of difficulty and clarity.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"difficulties = [\"college\", \"high school\", \"PhD\"]\n",
"clarity = [\"clear\", \"understandable with some effort\", \"ambiguous\"]\n",
"\n",
"with Pipeline(\"texcat-generation-pipeline\") as pipeline:\n",
"\n",
" tasks_generator = LoadDataFromDicts(data=classification_tasks)\n",
"\n",
" generate_data = []\n",
" for difficulty in difficulties:\n",
" for clarity_level in clarity:\n",
" task = GenerateTextClassificationData(\n",
" language=\"English\",\n",
" difficulty=difficulty,\n",
" clarity=clarity_level,\n",
" num_generations=2,\n",
" llm=InferenceEndpointsLLM(\n",
" model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
" tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
" generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n",
" ),\n",
" input_batch_size=5,\n",
" )\n",
" generate_data.append(task)\n",
"\n",
" for task in generate_data:\n",
" tasks_generator.connect(task)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's now run the pipeline and generate the synthetic data.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"distiset = pipeline.run()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'task': 'Determine the news article as Business or World',\n",
" 'input_text': \"The recent decision by the European Central Bank to raise interest rates will likely have a significant impact on the eurozone's economic growth, with some analysts predicting a 0.5% contraction in GDP due to the increased borrowing costs. The move is seen as a measure to combat inflation, which has been rising steadily over the past year.\",\n",
" 'label': 'Business',\n",
" 'misleading_label': 'World',\n",
" 'distilabel_metadata': {'raw_output_generate_text_classification_data_0': '{\\n \"input_text\": \"The recent decision by the European Central Bank to raise interest rates will likely have a significant impact on the eurozone\\'s economic growth, with some analysts predicting a 0.5% contraction in GDP due to the increased borrowing costs. The move is seen as a measure to combat inflation, which has been rising steadily over the past year.\",\\n \"label\": \"Business\",\\n \"misleading_label\": \"World\"\\n}'},\n",
" 'model_name': 'meta-llama/Meta-Llama-3.1-8B-Instruct'}"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"distiset[\"generate_text_classification_data_0\"][\"train\"][0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can push the dataset to the Hub for sharing with the community and [embed it to explore the data](https://huggingface.co/docs/hub/datasets-viewer-embed).\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"distiset.push_to_hub(\"[your-owner-name]/example-texcat-generation-dataset\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<iframe\n",
" src=\"https://huggingface.co/datasets/distilabel-internal-testing/example-texcat-generation-dataset/embed/viewer/generate_text_classification_data_1/train\"\n",
" frameborder=\"0\"\n",
" width=\"100%\"\n",
" height=\"560px\"\n",
"></iframe>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By examining the distiset distribution, we can confirm that it includes at least the 8 required samples for each label to train our classification models with SetFit."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Counter({'Sci/Tech': 275,\n",
" 'Business': 130,\n",
" 'World': 86,\n",
" 'Fact-based': 86,\n",
" 'Sports': 64,\n",
" 'Opinion-based': 54,\n",
" None: 20,\n",
" 'Opinion Based': 1,\n",
" 'News/Opinion': 1,\n",
" 'Science': 1,\n",
" 'Environment': 1,\n",
" 'Opinion': 1})"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"all_labels = [\n",
" entry[\"label\"]\n",
" for dataset_name in distiset\n",
" for entry in distiset[dataset_name][\"train\"]\n",
"]\n",
"\n",
"Counter(all_labels)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will create two datasets with the required labels and data for our use cases."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def extract_rows(distiset, labels):\n",
" return [\n",
" {\n",
" \"text\": entry[\"input_text\"],\n",
" \"label\": entry[\"label\"],\n",
" \"id\": i\n",
" }\n",
" for dataset_name in distiset\n",
" for i, entry in enumerate(distiset[dataset_name][\"train\"])\n",
" if entry[\"label\"] in labels\n",
" ]\n",
"\n",
"data_topic = extract_rows(distiset, labels_topic)\n",
"data_fact_opinion = extract_rows(distiset, labels_fact_opinion)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## (Optional) Evaluate with Argilla\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"!!! note \"Get started in Argilla\"\n",
" If you are not familiar with Argilla, we recommend taking a look at the [Argilla quickstart docs](https://docs.argilla.io/latest/getting_started/quickstart/). Alternatively, you can use your Hugging Face account to login to the [Argilla demo Space](https://argilla-argilla-template-space.hf.space).\n",
"\n",
"To get the most out of our data, we will use Argilla. First, we need to connect to the Argilla instance.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import argilla as rg\n",
"\n",
"# Replace api_url with your url if using Docker\n",
"# Replace api_key with your API key under \"My Settings\" in the UI\n",
"# Uncomment the last line and set your HF_TOKEN if your space is private\n",
"client = rg.Argilla(\n",
" api_url=\"https://[your-owner-name]-[your_space_name].hf.space\",\n",
" api_key=\"[your-api-key]\",\n",
" # headers={\"Authorization\": f\"Bearer {HF_TOKEN}\"}\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will create a `Dataset` for each task, with an input `TextField` for the text classification text and a `LabelQuestion` to ensure the generated labels are correct.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def create_texcat_dataset(dataset_name, labels):\n",
" settings = rg.Settings(\n",
" fields=[rg.TextField(\"text\")],\n",
" questions=[\n",
" rg.LabelQuestion(\n",
" name=\"label\",\n",
" title=\"Classify the texts according to the following labels\",\n",
" labels=labels,\n",
" ),\n",
" ],\n",
" )\n",
" return rg.Dataset(name=dataset_name, settings=settings).create()\n",
"\n",
"\n",
"rg_dataset_topic = create_texcat_dataset(\"topic-classification\", labels_topic)\n",
"rg_dataset_fact_opinion = create_texcat_dataset(\n",
" \"fact-opinion-classification\", labels_fact_opinion\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can upload the generated data to Argilla and evaluate it. We will use the generated labels as suggestions.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"rg_dataset_topic.records.log(data_topic)\n",
"rg_dataset_fact_opinion.records.log(data_fact_opinion)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can start the annotation process. Just open the dataset in the Argilla UI and start annotating the records. If the suggestions are correct, you can just click on `Submit`. Otherwise, you can select the correct label.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"!!! note\n",
" Check this [how-to guide](https://docs.argilla.io/latest/how_to_guides/annotate/) to know more about annotating in the UI.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once, you get the annotations, let's continue by retrieving the data from Argilla and format it as a dataset with the required data.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"rg_dataset_topic = client.datasets(\"topic-classification\")\n",
"rg_dataset_fact_opinion = client.datasets(\"fact-opinion-classification\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"status_filter = rg.Query(filter=rg.Filter((\"response.status\", \"==\", \"submitted\")))\n",
"\n",
"submitted_topic = rg_dataset_topic.records(status_filter).to_list(flatten=True)\n",
"submitted_fact_opinion = rg_dataset_fact_opinion.records(status_filter).to_list(\n",
" flatten=True\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def format_submitted(submitted):\n",
" return [\n",
" {\n",
" \"text\": r[\"text\"],\n",
" \"label\": r[\"label.responses\"][0],\n",
" \"id\": i,\n",
" }\n",
" for i, r in enumerate(submitted)\n",
" ]\n",
"\n",
"data_topic = format_submitted(submitted_topic)\n",
"data_fact_opinion = format_submitted(submitted_fact_opinion)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train your models\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In our case, we will fine-tune using SetFit. However, you can select the one that best fits your requirements.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Formatting the data\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next step will be to format the data to be compatible with SetFit. In the case of the topic classification, we will need to combine the synthetic data with the original data.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hf_topic = hf_dataset.to_list()\n",
"num = len(data_topic)\n",
"\n",
"data_topic.extend(\n",
" [\n",
" {\n",
" \"text\": r[\"text\"],\n",
" \"label\": id2str[r[\"label\"]],\n",
" \"id\": num + i,\n",
" }\n",
" for i, r in enumerate(hf_topic)\n",
" ]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we check the data distribution now, we can see that we have enough samples for each label to train our models.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Counter({'Sci/Tech': 275, 'Business': 132, 'World': 98, 'Sports': 70})"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"labels = [record[\"label\"] for record in data_topic]\n",
"Counter(labels)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Counter({'Fact-based': 86, 'Opinion-based': 54})"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"labels = [record[\"label\"] for record in data_fact_opinion]\n",
"Counter(labels)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's create our training and validation datasets. The training dataset will gather 8 samples by label. In this case, the validation datasets will contain the remaining samples not included in the training datasets.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def sample_and_split(dataset, label_column, num_samples):\n",
" train_dataset = sample_dataset(\n",
" dataset, label_column=label_column, num_samples=num_samples\n",
" )\n",
" eval_dataset = dataset.filter(lambda x: x[\"id\"] not in set(train_dataset[\"id\"]))\n",
" return train_dataset, eval_dataset\n",
"\n",
"\n",
"dataset_topic_full = Dataset.from_list(data_topic)\n",
"dataset_fact_opinion_full = Dataset.from_list(data_fact_opinion)\n",
"\n",
"train_dataset_topic, eval_dataset_topic = sample_and_split(\n",
" dataset_topic_full, \"label\", 8\n",
")\n",
"train_dataset_fact_opinion, eval_dataset_fact_opinion = sample_and_split(\n",
" dataset_fact_opinion_full, \"label\", 8\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The actual training\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's train our models for each task! We will use [TaylorAI/bge-micro-v2](https://huggingface.co/TaylorAI/bge-micro-v2), available in the Hugging Face Hub. You can check the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) to select the best model for your use case."
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {},
"outputs": [],
"source": [
"def train_model(model_name, dataset, eval_dataset):\n",
" model = SetFitModel.from_pretrained(model_name)\n",
"\n",
" trainer = Trainer(\n",
" model=model,\n",
" train_dataset=dataset,\n",
" )\n",
" trainer.train()\n",
" metrics = trainer.evaluate(eval_dataset)\n",
" print(metrics)\n",
"\n",
" return model"
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"***** Running training *****\n",
" Num unique pairs = 768\n",
" Batch size = 16\n",
" Num epochs = 1\n",
" Total optimization steps = 48\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'embedding_loss': 0.1873, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.02}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"***** Running evaluation *****\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'train_runtime': 4.9767, 'train_samples_per_second': 154.318, 'train_steps_per_second': 9.645, 'epoch': 1.0}\n",
"{'accuracy': 0.8333333333333334}\n"
]
}
],
"source": [
"model_topic = train_model(\n",
" model_name=\"TaylorAI/bge-micro-v2\",\n",
" dataset=train_dataset_topic,\n",
" eval_dataset=eval_dataset_topic,\n",
")\n",
"model_topic.save_pretrained(\"topic_classification_model\")\n",
"model_topic = SetFitModel.from_pretrained(\"topic_classification_model\")"
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"***** Running training *****\n",
" Num unique pairs = 144\n",
" Batch size = 16\n",
" Num epochs = 1\n",
" Total optimization steps = 9\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'embedding_loss': 0.2985, 'learning_rate': 2e-05, 'epoch': 0.11}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"***** Running evaluation *****\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'train_runtime': 0.8327, 'train_samples_per_second': 172.931, 'train_steps_per_second': 10.808, 'epoch': 1.0}\n",
"{'accuracy': 0.9090909090909091}\n"
]
}
],
"source": [
"model_fact_opinion = train_model(\n",
" model_name=\"TaylorAI/bge-micro-v2\",\n",
" dataset=train_dataset_fact_opinion,\n",
" eval_dataset=eval_dataset_fact_opinion,\n",
")\n",
"model_fact_opinion.save_pretrained(\"fact_opinion_classification_model\")\n",
"model_fact_opinion = SetFitModel.from_pretrained(\"fact_opinion_classification_model\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Voilà! The models are now trained and ready to be used. You can start making predictions to check the model's performance and add the new label. Optionally, you can continue using distilabel to generate additional data or Argilla to verify the quality of the predictions."
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {},
"outputs": [],
"source": [
"def predict(model, input, labels):\n",
" model.labels = labels\n",
" prediction = model.predict([input])\n",
" return prediction[0]"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Sci/Tech'"
]
},
"execution_count": 130,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict(\n",
" model_topic, \"The new iPhone is expected to be released next month.\", labels_topic\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Opinion-based'"
]
},
"execution_count": 131,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict(\n",
" model_fact_opinion,\n",
" \"The new iPhone is expected to be released next month.\",\n",
" labels_fact_opinion,\n",
")"
]
},
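{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, you can also share the trained classifiers on the Hub with `push_to_hub`. A minimal sketch, assuming you are logged in with a token that has write access (the repo ids are placeholders):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Placeholder repo ids; replace them with your own namespace\n",
"model_topic.push_to_hub(\"[your-owner-name]/topic-classification-model\")\n",
"model_fact_opinion.push_to_hub(\"[your-owner-name]/fact-opinion-classification-model\")"
]
},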
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusions\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this tutorial, we showcased the detailed steps to build a pipeline for generating text classification data using distilabel. You can customize this pipeline for your own use cases and share your datasets with the community through the Hugging Face Hub.\n",
"\n",
"We defined two text classification tasks—a topic classification task and a fact versus opinion classification task—and generated new data using various models via the serverless Hugging Face Inference API. Then, we curated the generated data with Argilla. Finally, we trained the models with SetFit using both the original and synthetic data."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "distilabel-tutorials",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@import url('https://fonts.googleapis.com/css2?family=Inter:wght@100..600&display=swap');
:root {
--md-primary-fg-color: #f2a8ff;
--md-primary-fg-color--light: #f2a8ff;
--md-primary-fg-color--dark: #f2a8ff;
--md-text-font: "Inter";
}
[data-md-color-scheme="default"] {
--md-primary-fg-color: #000000;
--md-typeset-a-color: #9c50c2;
--md-accent-fg-color: #c57fed;
}
[data-md-color-scheme="slate"] {
--md-primary-fg-color: #000000;
--md-typeset-a-color: #ca77d8;
--md-accent-fg-color: #f2a8ff;
}
.md-sidebar__scrollwrap:focus-within, .md-sidebar__scrollwrap:hover {
scrollbar-color: var(--md-default-fg-color--lighter) #0000;
}
# Copyright 2023-present, Argilla, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import re
from typing import Any, Dict, List, Optional, Union
from typing_extensions import override
from distilabel.steps import GlobalStep, StepInput
from distilabel.steps.tasks.base import Task
from distilabel.steps.tasks.typing import ChatType
from distilabel.steps.typing import StepOutput
class ArenaHard(Task):
"""Evaluates two assistant responses using an LLM as judge.
This `Task` is based on the "From Live Data to High-Quality Benchmarks: The
Arena-Hard Pipeline" paper that presents Arena Hard, which is a benchmark for
instruction-tuned LLMs that contains 500 challenging user queries. GPT-4 is used
as the judge to compare the model responses against a baseline model, which defaults
to `gpt-4-0314`.
Note:
Arena-Hard-Auto has the highest correlation and separability to Chatbot Arena
among popular open-ended LLM benchmarks.
Input columns:
- instruction (`str`): The instruction to evaluate the responses.
- generations (`List[str]`): The responses generated by two, and only two, LLMs.
Output columns:
- evaluation (`str`): The evaluation of the responses generated by the LLMs.
- score (`str`): The score extracted from the evaluation.
- model_name (`str`): The model name used to generate the evaluation.
Categories:
- benchmark
References:
- [From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline](https://lmsys.org/blog/2024-04-19-arena-hard/)
- [`arena-hard-auto`](https://github.com/lm-sys/arena-hard-auto/tree/main)
Examples:
Evaluate two assistant responses for a given instruction using Arena Hard prompts:
```python
from distilabel.pipeline import Pipeline
from distilabel.steps import GroupColumns, LoadDataFromDicts
from distilabel.steps.tasks import ArenaHard, TextGeneration
with Pipeline() as pipeline:
load_data = LoadDataFromDicts(
data=[{"instruction": "What is the capital of France?"}],
)
text_generation_a = TextGeneration(
llm=..., # LLM instance
output_mappings={"model_name": "generation_model"},
)
text_generation_b = TextGeneration(
llm=..., # LLM instance
output_mappings={"model_name": "generation_model"},
)
combine = GroupColumns(
columns=["generation", "generation_model"],
output_columns=["generations", "generation_models"],
)
arena_hard = ArenaHard(
llm=..., # LLM instance
)
load_data >> [text_generation_a, text_generation_b] >> combine >> arena_hard
```
"""
@property
def inputs(self) -> List[str]:
"""The inputs required by this task are the `instruction` and the `generations`,
which are the responses generated by two, and only two, LLMs."""
return ["instruction", "generations"]
def format_input(self, input: Dict[str, Any]) -> ChatType:
"""This method formats the input data as a `ChatType` using the prompt defined
by the Arena Hard benchmark, which consists of a `system_prompt` plus a template
for the user's first message that contains the `instruction` and both `generations`.
"""
return [
{
"role": "system",
"content": "Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user prompt displayed below. You will be given assistant A's answer and assistant B's answer. Your job is to evaluate which assistant's answer is better.\n\nBegin your evaluation by generating your own answer to the prompt. You must provide your answers before judging any answers.\n\nWhen evaluating the assistants' answers, compare both assistants' answers with your answer. You must identify and correct any mistakes or inaccurate information.\n\nThen consider if the assistant's answers are helpful, relevant, and concise. Helpful means the answer correctly responds to the prompt or follows the instructions. Note when user prompt has any ambiguity or more than one interpretation, it is more helpful and appropriate to ask for clarifications or more information from the user than providing an answer based on assumptions. Relevant means all parts of the response closely connect or are appropriate to what is being asked. Concise means the response is clear and not verbose or excessive.\n\nThen consider the creativity and novelty of the assistant's answers when needed. Finally, identify any missing important information in the assistants' answers that would be beneficial to include when responding to the user prompt.\n\nAfter providing your explanation, you must output only one of the following choices as your final verdict with a label:\n\n1. Assistant A is significantly better: [[A>>B]]\n2. Assistant A is slightly better: [[A>B]]\n3. Tie, relatively the same: [[A=B]]\n4. Assistant B is slightly better: [[B>A]]\n5. Assistant B is significantly better: [[B>>A]]\n\nExample output: \"My final verdict is tie: [[A=B]]\".",
},
{
"role": "user",
"content": f"<|User Prompt|>\n{input['instruction']}\n\n<|The Start of Assistant A's Answer|>\n{input['generations'][0]}\n<|The End of Assistant A's Answer|>\n\n<|The Start of Assistant B's Answer|>\n{input['generations'][1]}\n<|The End of Assistant B's Answer|>",
},
]
@property
def outputs(self) -> List[str]:
"""The outputs generated by this task are the `evaluation`, the `score` and
the `model_name` (which is automatically injected within the `process` method
of the parent task)."""
return ["evaluation", "score", "model_name"]
def format_output(
self,
output: Union[str, None],
input: Union[Dict[str, Any], None] = None,
) -> Dict[str, Any]:
"""This method formats the output generated by the LLM as a Python dictionary
containing the `evaluation` which is the raw output generated by the LLM (consisting
of the judge LLM alternate generation for the given instruction, plus an explanation
on the evaluation of the given responses; plus the `score` extracted from the output.
Args:
output: the raw output of the LLM.
input: the input to the task. Is provided in case it needs to be used to enrich
the output if needed.
Returns:
A dict with the keys `evaluation` with the raw output which contains the LLM
evaluation and the extracted `score` if possible.
"""
if output is None:
return {"evaluation": None, "score": None}
pattern = re.compile(r"\[\[([AB<>=]+)\]\]")
match = pattern.search(output)
if match is None:
return {"evaluation": output, "score": None}
return {"evaluation": output, "score": match.group(1)}
class ArenaHardResults(GlobalStep):
"""Process Arena Hard results to calculate the ELO scores.
This `Step` is based on the "From Live Data to High-Quality Benchmarks: The
Arena-Hard Pipeline" paper that presents Arena Hard, which is a benchmark for
instruction-tuned LLMs that contains 500 challenging user queries. This step is
a `GlobalStep` that should run right after the `ArenaHard` task to calculate the
ELO scores for the evaluated models.
Note:
Arena-Hard-Auto has the highest correlation and separability to Chatbot Arena
among popular open-ended LLM benchmarks.
Input columns:
- evaluation (`str`): The evaluation of the responses generated by the LLMs.
- score (`str`): The score extracted from the evaluation.
References:
- [From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline](https://lmsys.org/blog/2024-04-19-arena-hard/)
- [`arena-hard-auto`](https://github.com/lm-sys/arena-hard-auto/tree/main)
Examples:
Rate the ELO scores of two assistant responses given an evaluation / comparison between both, using Arena Hard prompts:
```python
from distilabel.pipeline import Pipeline
from distilabel.steps import GroupColumns, LoadDataFromDicts
from distilabel.steps.tasks import ArenaHard, TextGeneration
with Pipeline() as pipeline:
load_data = LoadDataFromDicts(
data=[{"instruction": "What is the capital of France?"}],
)
text_generation_a = TextGeneration(
llm=..., # LLM instance
output_mappings={"model_name": "generation_model"},
)
text_generation_b = TextGeneration(
llm=..., # LLM instance
output_mappings={"model_name": "generation_model"},
)
combine = GroupColumns(
columns=["generation", "generation_model"],
output_columns=["generations", "generation_models"],
)
arena_hard = ArenaHard(
llm=..., # LLM instance
)
arena_hard_results = ArenaHardResults(
custom_model_column="generation_models",
custom_weights={"A>B": 1, "A>>B": 3, "B>A": 1, "B>>A": 3},
)
load_data >> [text_generation_a, text_generation_b] >> combine >> arena_hard >> arena_hard_results
```
"""
custom_model_column: Optional[str] = None
custom_weights: Dict[str, int] = {"A>B": 1, "A>>B": 3, "B>A": 1, "B>>A": 3}
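# Explanatory sketch (comment added for clarity, not in the original code):
# `custom_weights` maps each non-tie verdict to the number of duplicated battle
# rows it contributes in `process`, so a 'significantly better' verdict
# (A>>B / B>>A) weighs three times as much as a 'slightly better' one
# (A>B / B>A) when fitting the ELO model.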
def load(self) -> None:
"""Ensures that the required dependencies are installed."""
super().load()
try:
import numpy as np # noqa: F401
import pandas as pd # noqa: F401
from sklearn.linear_model import LogisticRegression # noqa: F401
except ImportError as e:
raise ImportError(
"In order to run `ArenaHardResults`, the `arena-hard` extra dependencies"
" must be installed i.e. `numpy`, `pandas`, and `scikit-learn`.\n"
"Please install the dependencies by running `pip install distilabel[arena-hard]`."
) from e
# TODO: the `evaluation` is not really required as an input, so it could be removed, since
# only `score` is used / required
@property
def inputs(self) -> List[str]:
"""The inputs required by this step are the `evaluation` and the `score` generated
by the `ArenaHard` task. Since this step does use the identifiers `model_a` and `model_b`,
optionally one can set `custom_model_column` to use the model names if existing within
the input data, ideally this value should be `model_name` if connected from the `ArenaHard`
step."""
columns = ["evaluation", "score"]
if self.custom_model_column:
columns.append(self.custom_model_column)
return columns
@override
def process(self, inputs: StepInput) -> StepOutput: # type: ignore
"""This method processes the inputs generated by the `ArenaHard` task to calculate the
win rates for each of the models to evaluate. Since this step inherits from the `GlobalStep`,
it will wait for all the input batches to be processed, and then the output will be yielded in
case there's a follow up step, since this step won't modify the received inputs.
Args:
inputs: A list of Python dictionaries with the inputs of the task.
Yields:
A list of Python dictionaries with the outputs of the task.
References:
- https://github.com/lm-sys/arena-hard-auto/blob/main/show_result.py
"""
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
models = ["A", "B"]
if self.custom_model_column:
models = inputs[0][self.custom_model_column]
# TODO: the battles are only calculated for the first game, even though the official
# implementation also covers the possibility of a second game (not within the released
# dataset yet)
battles = pd.DataFrame()
for input in inputs:
output = {
# TODO: "question_id": input["question_id"],
"model_a": models[0],
"model_b": models[1],
}
if input["score"] in ["A>B", "A>>B"]:
output["winner"] = models[0]
rows = [output] * self.custom_weights[input["score"]]
elif input["score"] in ["B>A", "B>>A"]:
output["winner"] = models[1]
rows = [output] * self.custom_weights[input["score"]]
elif input["score"] == "A=B":
output["winner"] = "tie"
rows = [output]
else:
continue
battles = pd.concat([battles, pd.DataFrame(rows)])
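# Explanatory sketch (comments added for clarity, not in the original code): the
# block below fits a Bradley-Terry model via logistic regression, mirroring the
# official `show_result.py`. Each battle becomes a row whose winner column holds
# +log(10) and whose loser column holds -log(10), so that
# P(a beats b) = 1 / (1 + 10^(-(beta_a - beta_b))), which matches the ELO win
# probability 1 / (1 + 10^((R_b - R_a) / 400)) once R = 400 * beta + offset.
# Note that `winner` stores the winning model's identifier, so it is compared
# against the `model_a` column rather than the literal string "model_a".
# Duplicating the battles and labelling only the first copy of each tie as a win
# makes every tie count as half a win for each model.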
models = pd.concat([battles["model_a"], battles["model_b"]]).unique()
models = pd.Series(np.arange(len(models)), index=models)
battles = pd.concat([battles, battles], ignore_index=True)
p = len(models.index)
n = battles.shape[0]
X = np.zeros([n, p])
X[np.arange(n), models[battles["model_a"]]] = +np.log(10)
X[np.arange(n), models[battles["model_b"]]] = -np.log(10)
Y = np.zeros(n)
Y[battles["winner"] == "model_a"] = 1.0
tie_idx = battles["winner"] == "tie"
tie_idx[len(tie_idx) // 2 :] = False
Y[tie_idx] = 1.0
lr = LogisticRegression(fit_intercept=False, penalty=None, tol=1e-8) # type: ignore
lr.fit(X, Y)
# The ELO scores are calculated assuming that the reference is `gpt-4-0314`
# with a starting ELO of 1000, so that the evaluated models are compared against
# `gpt-4-0314` only if it's available within the models
elo_scores = 400 * lr.coef_[0] + 1000
# TODO: we could parametrize the reference / anchor model, but left as is to be faithful to the
# original implementation
if "gpt-4-0314" in models.index:
elo_scores += 1000 - elo_scores[models["gpt-4-0314"]]
output = pd.Series(elo_scores, index=models.index).sort_values(ascending=False)
self._logger.info(f"Arena Hard ELO: {output}")
# Yielded only so that, if follow-up steps are connected, the inputs are preserved,
# since this step doesn't modify or generate new inputs
yield inputs
if __name__ == "__main__":
import json
from distilabel.models import InferenceEndpointsLLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import (
GroupColumns,
KeepColumns,
LoadDataFromHub,
StepInput,
step,
)
from distilabel.steps.tasks import TextGeneration
from distilabel.steps.typing import StepOutput
@step(inputs=["turns"], outputs=["system_prompt", "instruction"])
def PrepareForTextGeneration(*inputs: StepInput) -> StepOutput:
for input in inputs:
for item in input:
item["system_prompt"] = "You are a helpful assistant."
item["instruction"] = item["turns"][0]["content"]
yield input
@step(
inputs=["question_id"],
outputs=["generation", "generation_model"],
step_type="global",
)
def LoadReference(*inputs: StepInput) -> StepOutput:
# File downloaded from https://raw.githubusercontent.com/lm-sys/arena-hard-auto/e0a8ea1df42c1df76451a6cd04b14e31ff992b87/data/arena-hard-v0.1/model_answer/gpt-4-0314.jsonl
with open("gpt-4-0314.jsonl", mode="r") as f:
    lines = f.readlines()
for input in inputs:
for item in input:
for line in lines:
data = json.loads(line)
if data["question_id"] == item["question_id"]:
item["generation"] = data["choices"][0]["turns"][0]["content"]
item["generation_model"] = data["model_id"]
break
yield input
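# Illustrative sketch of the expected JSONL schema, inferred from the field
# accesses above (an assumption, not verified against the downloaded file):
# {"question_id": "...", "model_id": "gpt-4-0314",
#  "choices": [{"turns": [{"content": "..."}]}]}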
with Pipeline(name="arena-hard-v0.1") as pipeline:
load_dataset = LoadDataFromHub(
name="load_dataset",
repo_id="alvarobartt/lmsys-arena-hard-v0.1",
split="test",
num_examples=5,
)
load_reference = LoadReference(name="load_reference")
prepare = PrepareForTextGeneration(name="prepare")
text_generation_cohere = TextGeneration(
name="text_generation_cohere",
llm=InferenceEndpointsLLM(
model_id="CohereForAI/c4ai-command-r-plus",
tokenizer_id="CohereForAI/c4ai-command-r-plus",
),
use_system_prompt=True,
input_batch_size=10,
output_mappings={"model_name": "generation_model"},
)
combine_columns = GroupColumns(
name="combine_columns",
columns=["generation", "generation_model"],
output_columns=["generations", "generation_models"],
)
arena_hard = ArenaHard(
name="arena_hard",
llm=OpenAILLM(model="gpt-4-1106-preview"),
output_mappings={"model_name": "evaluation_model"},
)
keep_columns = KeepColumns(
name="keep_columns",
columns=[
"question_id",
"category",
"cluster",
"system_prompt",
"instruction",
"generations",
"generation_models",
"evaluation",
"score",
"evaluation_model",
],
)
win_rates = ArenaHardResults(
name="win_rates", custom_model_column="generation_models"
)
load_dataset >> load_reference # type: ignore
load_dataset >> prepare >> text_generation_cohere # type: ignore
( # type: ignore
[load_reference, text_generation_cohere]
>> combine_columns
>> arena_hard
>> keep_columns
>> win_rates
)
distiset = pipeline.run(
parameters={ # type: ignore
text_generation_cohere.name: {
"llm": {
"generation_kwargs": {
"temperature": 0.7,
"max_new_tokens": 4096,
"stop_sequences": ["<EOS_TOKEN>", "<|END_OF_TURN_TOKEN|>"],
}
}
},
arena_hard.name: {
"llm": {
"generation_kwargs": {
"temperature": 0.0,
"max_new_tokens": 4096,
}
}
},
},
)
if distiset is not None:
distiset.push_to_hub("arena-hard-results")
# Copyright 2023-present, Argilla, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import re
from pathlib import Path
from textwrap import dedent
from typing import Any, Dict, List, Optional, Union
from jinja2 import Template
from pydantic import PrivateAttr
from typing_extensions import override
from distilabel.models import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks.base import Task
from distilabel.steps.tasks.typing import ChatType
_PARSE_DEEPSEEK_PROVER_AUTOFORMAL_REGEX = r"```lean4(.*?)```"
template_deepseek_prover_auto_formalization = """\
Mathematical Problem in Natural Language:
{{ informal_statement }}
{%- if few_shot %}
Please use the following examples to guide you with the answer:
{%- for example in examples %}
- {{ example }}
{%- endfor %}
{% endif -%}"""
class DeepSeekProverAutoFormalization(Task):
"""Task to translate a mathematical problem from natural language to Lean 4.
Note:
A related dataset (MMA from the paper) can be found in Hugging Face:
[casey-martin/multilingual-mathematical-autoformalization](https://huggingface.co/datasets/casey-martin/multilingual-mathematical-autoformalization).
Input columns:
- informal_statement (`str`): The statement to be formalized using Lean 4.
Output columns:
- formal_statement (`str`): The formalized statement using Lean 4, to be analysed.
Categories:
- generation
References:
- [`DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data`](https://arxiv.org/abs/2405.14333).
- [`Lean 4`](https://github.com/leanprover/lean4).
Examples:
Formalize a mathematical problem from natural language to Lean 4:
```python
from distilabel.steps.tasks import DeepSeekProverAutoFormalization
from distilabel.models import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
prover_autoformal = DeepSeekProverAutoFormalization(
llm=InferenceEndpointsLLM(
model_id="deepseek-ai/deepseek-math-7b-instruct",
tokenizer_id="deepseek-ai/deepseek-math-7b-instruct",
),
)
prover_autoformal.load()
result = next(
prover_autoformal.process(
[
{"informal_statement": "If a polynomial g is monic, then the root of g is integral over the ring R."},
]
)
)
# result
# [
# {
# 'informal_statement': 'If a polynomial g is monic, then the root of g is integral over the ring R.',
# 'formal_statement': 'theorem isIntegral_root (hg : g.Monic) : IsIntegral R (root g):=',
# 'distilabel_metadata': {
# 'raw_output_deep_seek_prover_auto_formalization_0': '```lean4\ntheorem isIntegral_root (hg : g.Monic) : IsIntegral R (root g):=\n```'
# },
# 'model_name': 'deepseek-prover'
# }
# ]
```
Use a few-shot setting to formalize a mathematical problem from natural language to Lean 4:
```python
from distilabel.steps.tasks import DeepSeekProverAutoFormalization
from distilabel.models import InferenceEndpointsLLM
# You can gain inspiration from the following examples to create your own few-shot examples:
# https://github.com/yangky11/miniF2F-lean4/blob/main/MiniF2F/Valid.lean
# Consider this as a placeholder for your actual LLM.
prover_autoformal = DeepSeekProverAutoFormalization(
llm=InferenceEndpointsLLM(
model_id="deepseek-ai/deepseek-math-7b-instruct",
tokenizer_id="deepseek-ai/deepseek-math-7b-instruct",
),
examples=[
"theorem amc12a_2019_p21 (z : ℂ) (h₀ : z = (1 + Complex.I) / Real.sqrt 2) :\n\n((∑ k : ℤ in Finset.Icc 1 12, z ^ k ^ 2) * (∑ k : ℤ in Finset.Icc 1 12, 1 / z ^ k ^ 2)) = 36 := by\n\nsorry",
"theorem amc12a_2015_p10 (x y : ℤ) (h₀ : 0 < y) (h₁ : y < x) (h₂ : x + y + x * y = 80) : x = 26 := by\n\nsorry"
]
)
prover_autoformal.load()
result = next(
prover_autoformal.process(
[
{"informal_statement": "If a polynomial g is monic, then the root of g is integral over the ring R."},
]
)
)
# result
# [
# {
# 'informal_statement': 'If a polynomial g is monic, then the root of g is integral over the ring R.',
# 'formal_statement': 'theorem isIntegral_root (hg : g.Monic) : IsIntegral R (root g):=',
# 'distilabel_metadata': {
# 'raw_output_deep_seek_prover_auto_formalization_0': '```lean4\ntheorem isIntegral_root (hg : g.Monic) : IsIntegral R (root g):=\n```'
# },
# 'model_name': 'deepseek-prover'
# }
# ]
```
"""
examples: Optional[List[str]] = None
system_prompt: str = "Translate the problem to Lean 4 (only the core declaration):\n```lean4\nformal statement goes here\n```"
_template: Union[Template, None] = PrivateAttr(...)
_few_shot: bool = PrivateAttr(default=False)
def load(self) -> None:
"""Loads the Jinja2 template."""
super().load()
self._template = Template(template_deepseek_prover_auto_formalization)
@property
def inputs(self) -> List[str]:
"""The input for the task is the `instruction`."""
return ["informal_statement"]
@property
def outputs(self) -> List[str]:
    """The outputs for the task are the `formal_statement` and the `model_name`."""
    return ["formal_statement", "model_name"]
def format_input(self, input: Dict[str, Any]) -> ChatType:  # type: ignore
    """The input is formatted as a `ChatType`, assuming that the instruction
    is the first interaction from the user within a conversation, and the
    `system_prompt` is added as the first message if it exists."""
return [
{
"role": "system",
"content": self.system_prompt,
},
{
"role": "user",
"content": self._template.render(
informal_statement=input[self.inputs[0]],
few_shot=bool(self.examples),
examples=self.examples,
),
},
]
@override
def format_output(  # type: ignore
    self, output: Union[str, None], input: Union[Dict[str, Any], None] = None
) -> Dict[str, Any]:
    """Extracts the formal statement from the Lean 4 output."""
    if output is None:
        return {"formal_statement": None}
    match = re.search(_PARSE_DEEPSEEK_PROVER_AUTOFORMAL_REGEX, output, re.DOTALL)
    if match:
        match = match.group(1).strip()
    return {"formal_statement": match}
template_deepseek_prover_scorer = """\
To evaluate whether a formal Lean4 statement will be of interest to the community, consider the following criteria:
1. Relevance to Current Research: Does the statement address a problem or concept that is actively being researched in mathematics or related fields? Higher relevance scores indicate greater potential interest.
2. Complexity and Depth: Is the statement complex enough to challenge existing theories and methodologies, yet deep enough to provide significant insights or advancements? Complexity and depth showcase Lean4's capabilities and attract interest.
3. Interdisciplinary Potential: Does the statement offer opportunities for interdisciplinary research, connecting mathematics with other fields such as computer science, physics, or biology? Interdisciplinary projects often garner wide interest.
4. Community Needs and Gaps: Does the statement fill an identified need or gap within the Lean4 community or the broader mathematical community? Addressing these needs directly correlates with interest.
5. Innovativeness: How innovative is the statement? Does it propose new methods, concepts, or applications? Innovation drives interest and engagement.
Customize your evaluation for each problem accordingly, assessing it as 'excellent', 'good', 'above average', 'fair' or 'poor'.
You should respond in the following format for each statement:
'''
Natural language: (Detailed explanation of the informal statement, including any relevant background information, assumptions, and definitions.)
Analysis: (Provide a brief justification for each score, highlighting why the statement scored as it did across the criteria.)
Assessment: (Based on the criteria, rate the statement as 'excellent', 'good', 'above average', 'fair' or 'poor'. JUST the Assessment.)
'''"""
class DeepSeekProverScorer(Task):
"""Task to evaluate the quality of a formalized mathematical problem in Lean 4,
inspired by the DeepSeek-Prover task for scoring.
Note:
A related dataset (MMA from the paper) can be found in Hugging Face:
[casey-martin/multilingual-mathematical-autoformalization](https://huggingface.co/datasets/casey-martin/multilingual-mathematical-autoformalization).
Input columns:
- informal_statement (`str`): The statement to be formalized using Lean 4.
- formal_statement (`str`): The formalized statement using Lean 4, to be analysed.
Output columns:
- natural_language (`str`): Explanation for the problem.
- analysis (`str`): Analysis of the different points defined in the prompt.
- assessment (`str`): Result of the assessment.
Categories:
- scorer
- quality
- response
References:
- [`DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data`](https://arxiv.org/abs/2405.14333).
- [`Lean 4`](https://github.com/leanprover/lean4).
Examples:
Analyse a formal statement in Lean 4:
```python
from distilabel.steps.tasks import DeepSeekProverScorer
from distilabel.models import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
prover_scorer = DeepSeekProverScorer(
llm=InferenceEndpointsLLM(
model_id="deepseek-ai/deepseek-math-7b-instruct",
tokenizer_id="deepseek-ai/deepseek-math-7b-instruct",
),
)
prover_scorer.load()
result = next(
prover_scorer.process(
[
{"formal_statement": "theorem isIntegral_root (hg : g.Monic) : IsIntegral R (root g):="},
]
)
)
# result
# [
# {
# 'formal_statement': 'theorem isIntegral_root (hg : g.Monic) : IsIntegral R (root g):=',
# 'informal_statement': 'INFORMAL',
# 'analysis': 'ANALYSIS',
# 'assessment': 'ASSESSMENT',
# 'distilabel_metadata': {
# 'raw_output_deep_seek_prover_scorer_0': 'Natural language:\nINFORMAL\nAnalysis:\nANALYSIS\nAssessment:\nASSESSMENT'
# },
# 'model_name': 'deepseek-prover-scorer'
# }
# ]
```
"""
_template: Union[Template, None] = PrivateAttr(...)
def load(self) -> None:
"""Loads the Jinja2 template."""
super().load()
self._template = Template(template_deepseek_prover_scorer)
@property
def inputs(self) -> List[str]:
"""The input for the task is the `instruction`."""
return ["informal_statement", "formal_statement"]
@property
def outputs(self) -> List[str]:
    """The outputs for the task are the `natural_language` explanation, the `analysis`,
    the `assessment` and the `model_name`."""
    return ["natural_language", "analysis", "assessment", "model_name"]
def format_input(self, input: Dict[str, Any]) -> ChatType:  # type: ignore
    """The input is formatted as a `ChatType`, assuming that the instruction
    is the first interaction from the user within a conversation, and the
    `system_prompt` is added as the first message if it exists."""
return [
{
"role": "system",
"content": self._template.render(),
},
{
"role": "user",
"content": f"## Informal statement:\n{input[self.inputs[0]]}\n\n ## Formal statement:\n{input[self.inputs[1]]}",
},
]
@override
def format_output( # type: ignore
self, output: Union[str, None], input: Union[Dict[str, Any], None] = None
) -> Dict[str, Any]: # type: ignore
"""Analyses the formal statement with Lean 4 output and generates an assessment
and the corresponding informal assessment."""
try:
result = output.split("Natural language:")[1].strip()
natural_language, analysis = result.split("Analysis:")
analysis, assessment = analysis.split("Assessment:")
natural_language = natural_language.strip()
analysis = analysis.strip()
assessment = assessment.strip()
except Exception:
natural_language = analysis = assessment = None
return {
"natural_language": natural_language,
"analysis": analysis,
"assessment": assessment,
}
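# Illustrative note (not part of the original implementation): an output of the
# form 'Natural language:\nX\nAnalysis:\nY\nAssessment:\ngood' parses into
# {"natural_language": "X", "analysis": "Y", "assessment": "good"}; if any of
# the three section headers is missing, all fields fall back to None.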
class DeepSeekProverSolver(Task):
"""Task to generate a proof for a formal statement (theorem) in lean4.
Input columns:
- formal_statement (`str`): The formalized statement using Lean 4.
Output columns:
- proof (`str`): The proof for the formal statement theorem.
Categories:
- generation
References:
- [`DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data`](https://arxiv.org/abs/2405.14333).
"""
system_prompt: str = (
    "You are an expert in proving mathematical theorems formalized in the Lean 4 language. "
    "Your answers consist only of the proof of the given theorem, and nothing else."
)
@property
def inputs(self) -> List[str]:
"""The input for the task is the `formal_statement`."""
return ["formal_statement"]
@property
def outputs(self) -> List[str]:
"""The output for the task is the proof for the formal statement theorem."""
return ["proof"]
def format_input(self, input: Dict[str, Any]) -> ChatType:  # type: ignore
"""The input is formatted as a `ChatType`, with a system prompt to guide our model."""
prompt = dedent("""
Give me a proof for the following theorem:
```lean4
{theorem}
```""")
return [
{
"role": "system",
"content": self.system_prompt,
},
{
"role": "user",
"content": prompt.format(theorem=input["formal_statement"]),
},
]
def format_output(  # type: ignore
    self, output: Union[str, None], input: Union[Dict[str, Any], None] = None
) -> Dict[str, Any]:
    """Extracts the proof from the generated output."""
    if output is None:
        return {"proof": None}
    match = re.search(_PARSE_DEEPSEEK_PROVER_AUTOFORMAL_REGEX, output, re.DOTALL)
    if match:
        match = match.group(1).strip()
    return {"proof": match}
examples = [
dedent("""
## Statement in natural language:
For real numbers k and x:
If x is equal to (13 - √131) / 4, and
If the equation 2x² - 13x + k = 0 is satisfied,
Then k must be equal to 19/4.
## Formalized:
theorem mathd_algebra_116 (k x : ℝ) (h₀ : x = (13 - Real.sqrt 131) / 4)
(h₁ : 2 * x ^ 2 - 13 * x + k = 0) : k = 19 / 4 :="""),
dedent("""
## Statement in natural language:
The greatest common divisor (GCD) of 20 factorial (20!) and 200,000 is equal to 40,000.
## Formalized:
theorem mathd_numbertheory_169 : Nat.gcd (Nat.factorial 20) 200000 = 40000 :="""),
dedent("""
## Statement in natural language:
Given two integers x and y:
If y is positive (greater than 0),
And y is less than x,
And the equation x + y + xy = 80 is true,
Then x must be equal to 26.
## Formalized:
theorem amc12a_2015_p10 (x y : ℤ) (h₀ : 0 < y) (h₁ : y < x)
    (h₂ : x + y + x * y = 80) : x = 26 :="""),
]
with Pipeline(name="test_deepseek_prover") as pipeline:
data_loader = LoadDataFromHub(
repo_id="plaguss/informal-mathematical-statements-tiny",
split="val",
batch_size=8,
)
llm = InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3-70B-Instruct",
)
auto_formalization = DeepSeekProverAutoFormalization(
name="auto_formalization", input_batch_size=8, llm=llm, examples=examples
)
prover_scorer = DeepSeekProverScorer(
name="prover_scorer",
input_batch_size=8,
llm=llm,
)
proof_generator = DeepSeekProverSolver(
name="proof_generator", input_batch_size=8, llm=llm
)
(data_loader >> auto_formalization >> prover_scorer >> proof_generator)
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser()
parser.add_argument(
"-d",
"--dry-run",
action=argparse.BooleanOptionalAction,
help="Do a dry run for testing purposes.",
)
args = parser.parse_args()
pipeline_parameters = {
data_loader.name: {"split": "val"},
auto_formalization.name: {
"llm": {
"generation_kwargs": {
"temperature": 0.6,
"top_p": 0.9,
"max_new_tokens": 512,
}
}
},
prover_scorer.name: {
"llm": {
"generation_kwargs": {
"temperature": 0.6,
"top_p": 0.9,
"max_new_tokens": 512,
}
}
},
}
ds_name = "test_deepseek_prover"
if args.dry_run:
distiset = pipeline.dry_run(batch_size=1, parameters=pipeline_parameters)
distiset.save_to_disk(Path.home() / f"Downloads/{ds_name}")
import pprint
pprint.pprint(distiset["default"]["train"][0])
else:
distiset = pipeline.run(parameters=pipeline_parameters)
distiset.push_to_hub(ds_name, include_script=True)