# Serving an `LLM` for sharing it between several `Task`s
It's very common to want to use the same `LLM` for several `Task`s in a pipeline. To avoid loading the `LLM` as many times as the number of `Task`s and wasting resources, it's recommended to serve the model using solutions like [`text-generation-inference`](https://huggingface.co/docs/text-generation-inference/quicktour#launching-tgi) or [`vLLM`](https://docs.vllm.ai/en/stable/serving/deploying_with_docker.html), and then use an `AsyncLLM`-compatible client such as `InferenceEndpointsLLM` or `OpenAILLM`, respectively, to communicate with the server.
## Serving LLMs using `text-generation-inference`
```bash
model=meta-llama/Meta-Llama-3-8B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
-e HUGGING_FACE_HUB_TOKEN=<secret> \
ghcr.io/huggingface/text-generation-inference:2.0.4 \
--model-id $model
```
!!! NOTE
The bash command above has been copy-pasted from the official docs [text-generation-inference](https://huggingface.co/docs/text-generation-inference/quicktour#launching-tgi). Please refer to the official docs for more information.
And then we can use `InferenceEndpointsLLM` with the `base_url` pointing to our `TGI` local deployment (e.g. `http://localhost:8080`):
```python
from distilabel.models import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration, UltraFeedback
with Pipeline(name="serving-llm") as pipeline:
load_data = LoadDataFromDicts(
data=[{"instruction": "Write a poem about the sun and moon."}]
)
# `base_url` points to the address of the `TGI` serving the LLM
llm = InferenceEndpointsLLM(base_url="http://192.168.1.138:8080")
text_generation = TextGeneration(
llm=llm,
num_generations=3,
group_generations=True,
output_mappings={"generation": "generations"},
)
ultrafeedback = UltraFeedback(aspect="overall-rating", llm=llm)
load_data >> text_generation >> ultrafeedback
```
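With the pipeline defined, running it is no different from any other pipeline. A minimal sketch (the Hub repository name is illustrative):

```python
if __name__ == "__main__":
    distiset = pipeline.run()
    # Push the resulting dataset to the Hub (hypothetical repository name)
    distiset.push_to_hub("my-org/poems-with-feedback")
```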
## Serving LLMs using `vLLM`
```bash
docker run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Meta-Llama-3-8B-Instruct
```
!!! NOTE
The bash command above has been copy-pasted from the official docs [vLLM](https://docs.vllm.ai/en/stable/serving/deploying_with_docker.html). Please refer to the official docs for more information.
And then we can use `OpenAILLM` with the `base_url` pointing to our `vLLM` local deployment (e.g. `http://localhost:8000`):
```python
from distilabel.models import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration, UltraFeedback
with Pipeline(name="serving-llm") as pipeline:
load_data = LoadDataFromDicts(
data=[{"instruction": "Write a poem about the sun and moon."}]
)
# `base_url` points to the address of the `vLLM` serving the LLM
    llm = OpenAILLM(base_url="http://192.168.1.138:8000", model="meta-llama/Meta-Llama-3-8B-Instruct")  # the model name must match the one served by vLLM
text_generation = TextGeneration(
llm=llm,
num_generations=3,
group_generations=True,
output_mappings={"generation": "generations"},
)
ultrafeedback = UltraFeedback(aspect="overall-rating", llm=llm)
load_data >> text_generation >> ultrafeedback
```
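Before wiring the client into a pipeline, it can be useful to sanity-check the deployment by using the `OpenAILLM` standalone. A minimal sketch, assuming the server runs locally; note that the underlying OpenAI client may require an `api_key` to be set even though `vLLM` ignores it:

```python
from distilabel.models import OpenAILLM

llm = OpenAILLM(
    base_url="http://localhost:8000",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    api_key="not-used",  # vLLM ignores the key, but the client expects one
)
llm.load()
llm.generate_outputs(
    inputs=[[{"role": "user", "content": "Say hello!"}]],
)
```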
# Structured data generation
`Distilabel` has integrations with relevant libraries to generate structured text, i.e. to guide the [`LLM`][distilabel.models.llms.LLM] towards generating structured outputs that follow a JSON schema, a regex, etc.
## Outlines
`Distilabel` integrates [`outlines`](https://outlines-dev.github.io/outlines/welcome/) within some [`LLM`][distilabel.models.llms.LLM] subclasses. At the moment, the following `outlines`-integrated LLMs are supported in `distilabel`: [`TransformersLLM`][distilabel.models.llms.TransformersLLM], [`vLLM`][distilabel.models.llms.vLLM] and [`LlamaCppLLM`][distilabel.models.llms.LlamaCppLLM], so that anyone can generate structured outputs in the form of *JSON* or a parseable *regex*.
The [`LLM`][distilabel.models.llms.LLM] has an argument named `structured_output`[^1] that determines how structured outputs are generated. Let's see an example using [`LlamaCppLLM`][distilabel.models.llms.LlamaCppLLM].
!!! Note
For `outlines` integration to work you may need to install the corresponding dependencies:
```bash
pip install distilabel[outlines]
```
### JSON
We will start with a JSON example, where we initially define a `pydantic.BaseModel` schema to guide the generation of the structured output.
!!! NOTE
Take a look at [`StructuredOutputType`][distilabel.typing.models.StructuredOutputType] to see the expected format
of the `structured_output` dict variable.
```python
from pydantic import BaseModel
class User(BaseModel):
name: str
last_name: str
id: int
```
And then we provide that schema to the `structured_output` argument of the LLM.
```python
from distilabel.models import LlamaCppLLM
llm = LlamaCppLLM(
    model_path="./openhermes-2.5-mistral-7b.Q4_K_M.gguf",  # (1)
n_gpu_layers=-1,
n_ctx=1024,
structured_output={"format": "json", "schema": User},
)
llm.load()
```
1. We have previously downloaded a GGUF model (i.e. `llama.cpp`-compatible) from the Hugging Face Hub using curl[^2], but any model can be used as a replacement, as long as the `model_path` argument is updated.
And we are ready to pass our instruction as usual:
```python
import json
result = llm.generate(
[[{"role": "user", "content": "Create a user profile for the following marathon"}]],
max_new_tokens=50
)
data = json.loads(result[0][0])
data
# {'name': 'Kathy', 'last_name': 'Smith', 'id': 4539210}
User(**data)
# User(name='Kathy', last_name='Smith', id=4539210)
```
We get back a JSON string that we can parse using `json.loads`, or validate directly against the `User` class, which is a `pydantic.BaseModel`.
### Regex
The following example shows text generation whose output adheres to a regular expression:
```python
pattern = r"<name>(.*?)</name>.*?<grade>(.*?)</grade>" # the same pattern for re.compile
llm=LlamaCppLLM(
model_path=model_path,
n_gpu_layers=-1,
n_ctx=1024,
structured_output={"format": "regex", "schema": pattern},
)
llm.load()
result = llm.generate(
[
[
{"role": "system", "content": "You are Simpsons' fans who loves assigning grades from A to E, where A is the best and E is the worst."},
{"role": "user", "content": "What's up with Homer Simpson?"}
]
],
max_new_tokens=200
)
```
We can check the output by parsing the content using the same pattern we required from the LLM.
```python
import re
match = re.search(pattern, result[0][0])
if match:
name = match.group(1)
grade = match.group(2)
print(f"Name: {name}, Grade: {grade}")
# Name: Homer Simpson, Grade: C+
```
These were simple examples, but they show the range of options this opens up.
!!! Tip
A full pipeline example can be seen in the following script:
[`examples/structured_generation_with_outlines.py`](../../pipeline_samples/examples/llama_cpp_with_outlines.md)
[^1]:
You can check the variable type by importing it from:
```python
from distilabel.steps.tasks.structured_outputs.outlines import StructuredOutputType
```
[^2]:
Download the model with curl:
```bash
curl -L -o ~/Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf
```
## Instructor
For other LLM providers behind APIs, there's no direct way of accessing the internal logit processor like `outlines` does, but thanks to [`instructor`](https://python.useinstructor.com/) we can generate structured output from LLM providers based on `pydantic.BaseModel` objects. We have integrated `instructor` to deal with the [`AsyncLLM`][distilabel.models.llms.AsyncLLM].
!!! Note
For `instructor` integration to work you may need to install the corresponding dependencies:
```bash
pip install distilabel[instructor]
```
!!! Note
Take a look at [`InstructorStructuredOutputType`][distilabel.typing.models.InstructorStructuredOutputType] to see the expected format
of the `structured_output` dict variable.
The following is the same example as in the `outlines` `JSON` section, for comparison purposes.
```python
from pydantic import BaseModel
class User(BaseModel):
name: str
last_name: str
id: int
```
And then we provide that schema to the `structured_output` argument of the LLM:
!!! NOTE
In this example we are using *Meta Llama 3.1 8B Instruct*; keep in mind that not all models support structured outputs.
```python
from distilabel.models import InferenceEndpointsLLM
llm = InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
structured_output={"schema": User}
)
llm.load()
```
And we are ready to pass our instructions as usual:
```python
import json
result = llm.generate(
[[{"role": "user", "content": "Create a user profile for the following marathon"}]],
max_new_tokens=256
)
data = json.loads(result[0][0])
data
# {'name': 'John', 'last_name': 'Doe', 'id': 12345}
User(**data)
# User(name='John', last_name='Doe', id=12345)
```
We get back a JSON string that we can parse using `json.loads`, or validate directly against the `User` class, which is a `pydantic.BaseModel`.
!!! Tip
A full pipeline example can be seen in the following script:
[`examples/structured_generation_with_instructor.py`](../../pipeline_samples/examples/mistralai_with_instructor.md)
## OpenAI JSON
OpenAI offers a [JSON Mode](https://platform.openai.com/docs/guides/text-generation/json-mode) to deal with structured output via their API; let's see how to make use of it. JSON mode instructs the model to always return a JSON object following the instructions provided.
!!! WARNING
Bear in mind, for this to work, you must instruct the model in some way to generate JSON, either in the `system message` or in the instruction, as can be seen in the [API reference](https://platform.openai.com/docs/guides/text-generation/json-mode).
Contrary to what we get via `outlines`, JSON mode will not guarantee that the output matches any specific schema, only that it is valid JSON and parses without errors. More information can be found in the OpenAI documentation.
Besides instructing the model to generate JSON, to ensure it actually generates parseable JSON we can pass the argument `response_format="json"`[^3]:
```python
from distilabel.models import OpenAILLM
llm = OpenAILLM(model="gpt4-turbo", api_key="api.key")
llm.generate(..., response_format="json")
```
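Since the model must be explicitly told to produce JSON, a minimal sketch combining a JSON-oriented system message with the `response_format` argument (the prompts are illustrative):

```python
result = llm.generate(
    inputs=[
        [
            {"role": "system", "content": "You are a helpful assistant. Always reply with a valid JSON object."},
            {"role": "user", "content": "Give me the name and id of a fictional user."},
        ]
    ],
    response_format="json",
)
```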
[^3]:
Keep in mind that to interact with this `response_format` argument in a pipeline, you will have to pass it via the `generation_kwargs`:
```python
# Assuming a pipeline is already defined, and we have a task using OpenAILLM called `task_with_openai`:
pipeline.run(
parameters={
"task_with_openai": {
"llm": {
"generation_kwargs": {
"response_format": "json"
}
}
}
}
)
```
# Executing Tasks with LLMs
## Working with LLMs
LLM subclasses are designed to be used within a [Task][distilabel.steps.tasks.Task], but they can also be used standalone.
```python
from distilabel.models import InferenceEndpointsLLM
llm = InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct"
)
llm.load()
llm.generate_outputs(
inputs=[
[{"role": "user", "content": "What's the capital of Spain?"}],
],
)
# [
# {
# "generations": [
# "The capital of Spain is Madrid."
# ],
# "statistics": {
# "input_tokens": [
# 43
# ],
# "output_tokens": [
# 8
# ]
# }
# }
# ]
```
!!! Note
Always call the `LLM.load` or `Task.load` method when using LLMs standalone or as part of a `Task`. If using a `Pipeline`, this is done automatically in `Pipeline.run()`.
!!! Tip "New in version 1.5.0"
Since version `1.5.0` the LLM output is a list of dictionaries (one per item in the `inputs`),
each containing `generations`, that reports the text returned by the `LLM`, and a `statistics` field that will store statistics related to the `LLM` generation. Initially, this will include
`input_tokens` and `output_tokens` when available, which will be obtained via the API when available, or if a tokenizer is available for the model used, using the tokenizer for the model.
This data will be moved by the corresponding `Task` during the pipeline processing and moved to `distilabel_metadata` so we can operate on this data if we want, like for example computing the number of tokens per dataset.
To access to the previous result one just has to access to the generations in the resulting dictionary: `result[0]["generations"]`.
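For example, to grab the first generated text and its token count from the result shown above:

```python
result = llm.generate_outputs(
    inputs=[[{"role": "user", "content": "What's the capital of Spain?"}]],
)
result[0]["generations"][0]
# "The capital of Spain is Madrid."
result[0]["statistics"]["output_tokens"]
# [8]
```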
### Offline Batch Generation
By default, all `LLM`s generate text synchronously, i.e. sending inputs via the `generate_outputs` method blocks until the outputs are generated. Some `LLM`s (such as [OpenAILLM][distilabel.models.llms.openai.OpenAILLM]) implement what we denote as _offline batch generation_, which allows sending the inputs to the LLM-as-a-service, which generates the outputs asynchronously and gives us a job id that we can use later to check the status and retrieve the outputs once they are ready. LLM-as-a-service platforms offer this feature as a way to save costs in exchange for waiting for the outputs to be generated.
To use this feature in `distilabel`, the only thing we need to do is set the `use_offline_batch_generation` attribute to `True` when creating the `LLM` instance:
```python
from distilabel.models import OpenAILLM
llm = OpenAILLM(
model="gpt-4o",
use_offline_batch_generation=True,
)
llm.load()
llm.jobs_ids # (1)
# None
llm.generate_outputs( # (2)
inputs=[
[{"role": "user", "content": "What's the capital of Spain?"}],
],
)
# DistilabelOfflineBatchGenerationNotFinishedException: Batch generation with jobs_ids=('batch_OGB4VjKpu2ay9nz3iiFJxt5H',) is not finished
llm.jobs_ids # (3)
# ('batch_OGB4VjKpu2ay9nz3iiFJxt5H',)
llm.generate_outputs( # (4)
inputs=[
[{"role": "user", "content": "What's the capital of Spain?"}],
],
)
# [{'generations': ['The capital of Spain is Madrid.'],
# 'statistics': {'input_tokens': [13], 'output_tokens': [7]}}]
```
1. At first the `jobs_ids` attribute is `None`.
2. The first call to `generate_outputs` will send the inputs to the LLM-as-a-service and raise a `DistilabelOfflineBatchGenerationNotFinishedException`, since the outputs are not ready yet.
3. After the first call to `generate_outputs` the `jobs_ids` attribute will contain the job ids created for generating the outputs.
4. The second and subsequent calls to `generate_outputs` will return the outputs if they are ready, or raise a `DistilabelOfflineBatchGenerationNotFinishedException` if they are not.
The `offline_batch_generation_block_until_done` attribute can be used to make the `generate_outputs` method block until the outputs are ready, polling the platform at the specified interval in seconds.
```python
from distilabel.models import OpenAILLM
llm = OpenAILLM(
model="gpt-4o",
use_offline_batch_generation=True,
offline_batch_generation_block_until_done=5, # poll for results every 5 seconds
)
llm.load()
llm.generate_outputs(
inputs=[
[{"role": "user", "content": "What's the capital of Spain?"}],
],
)
# [{'generations': ['The capital of Spain is Madrid.'],
# 'statistics': {'input_tokens': [13], 'output_tokens': [7]}}]
```
### Within a Task
Pass the LLM as an argument to the [`Task`][distilabel.steps.tasks.Task], and the task will handle the rest.
```python
from distilabel.models import OpenAILLM
from distilabel.steps.tasks import TextGeneration
llm = OpenAILLM(model="gpt-4o-mini")
task = TextGeneration(name="text_generation", llm=llm)
task.load()
next(task.process(inputs=[{"instruction": "What's the capital of Spain?"}]))
# [{'instruction': "What's the capital of Spain?",
# 'generation': 'The capital of Spain is Madrid.',
# 'distilabel_metadata': {'raw_output_text_generation': 'The capital of Spain is Madrid.',
# 'raw_input_text_generation': [{'role': 'user',
# 'content': "What's the capital of Spain?"}],
# 'statistics_text_generation': {'input_tokens': 13, 'output_tokens': 7}},
# 'model_name': 'gpt-4o-mini'}]
```
!!! Note
As mentioned in the *Working with LLMs* section, the generation of an LLM is automatically moved to `distilabel_metadata` to avoid interference with the common workflow, so the addition of the `statistics` is an extra component available for the user, and nothing has to be changed in already defined pipelines.
### Runtime Parameters
LLMs can have runtime parameters, such as `generation_kwargs`, provided via the `Pipeline.run()` method using the `parameters` argument.
!!! Note
Runtime parameters can differ between LLM subclasses, due to the different functionalities offered by the LLM providers.
```python
from distilabel.pipeline import Pipeline
from distilabel.models import OpenAILLM
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
with Pipeline(name="text-generation-pipeline") as pipeline:
load_dataset = LoadDataFromDicts(
name="load_dataset",
data=[{"instruction": "Write a short story about a dragon that saves a princess from a tower."}],
)
text_generation = TextGeneration(
name="text_generation",
llm=OpenAILLM(model="gpt-4o-mini"),
)
load_dataset >> text_generation
if __name__ == "__main__":
pipeline.run(
parameters={
text_generation.name: {"llm": {"generation_kwargs": {"temperature": 0.3}}},
},
)
```
## Creating custom LLMs
To create custom LLMs, subclass either [`LLM`][distilabel.models.llms.LLM] for synchronous or [`AsyncLLM`][distilabel.models.llms.AsyncLLM] for asynchronous LLMs. Implement the following methods:
* `model_name`: A property containing the model's name.
* `generate`: A method that takes a list of prompts and returns generated texts.
* `agenerate`: A method that takes a single prompt and returns generated texts. This method is used within the `generate` method of the `AsyncLLM` class.
* (optional) `get_last_hidden_state`: is a method that will take a list of prompts and return a list of hidden states. This method is optional and will be used by some tasks such as the [`GenerateEmbeddings`][distilabel.steps.tasks.GenerateEmbeddings] task.
=== "Custom LLM"
```python
from typing import Any, List
from pydantic import validate_call
from distilabel.models import LLM
from distilabel.typing import GenerateOutput, HiddenState
from distilabel.typing import ChatType
class CustomLLM(LLM):
@property
def model_name(self) -> str:
return "my-model"
@validate_call
def generate(self, inputs: List[ChatType], num_generations: int = 1, **kwargs: Any) -> List[GenerateOutput]:
for _ in range(num_generations):
...
def get_last_hidden_state(self, inputs: List[ChatType]) -> List[HiddenState]:
...
```
=== "Custom AsyncLLM"
```python
from typing import Any, List
from pydantic import validate_call
from distilabel.models import AsyncLLM
from distilabel.typing import GenerateOutput, HiddenState
from distilabel.typing import ChatType
class CustomAsyncLLM(AsyncLLM):
@property
def model_name(self) -> str:
return "my-model"
@validate_call
async def agenerate(self, input: ChatType, num_generations: int = 1, **kwargs: Any) -> GenerateOutput:
for _ in range(num_generations):
...
def get_last_hidden_state(self, inputs: List[ChatType]) -> List[HiddenState]:
...
```
`generate` and `agenerate` keyword arguments (except `inputs`/`input` and `num_generations`) are considered `RuntimeParameter`s, so a value can be passed to them via the `parameters` argument of the `Pipeline.run` method.
!!! Note
To have the arguments of `generate` and `agenerate` coerced to the expected types, the `validate_call` decorator is used. It automatically coerces the arguments to the expected types, and raises an error if the types are not correct. This is especially useful when providing a value for an argument of `generate` or `agenerate` from the CLI, since the CLI always provides the arguments as strings.
!!! Warning
Additional LLMs created in `distilabel` will have to take into account how the `statistics` are generated to properly include them in the LLM output.
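Putting the pieces together, the following is a minimal, self-contained sketch of a synchronous custom LLM that returns canned generations. It assumes the `generations` + `statistics` output format described in *Working with LLMs* (returning an empty `statistics` dict is an assumption here):

```python
from typing import Any, List

from pydantic import validate_call

from distilabel.models import LLM
from distilabel.typing import ChatType, GenerateOutput


class EchoLLM(LLM):
    """A toy LLM that echoes back the last message of each conversation."""

    @property
    def model_name(self) -> str:
        return "echo-llm"

    @validate_call
    def generate(
        self, inputs: List[ChatType], num_generations: int = 1, **kwargs: Any
    ) -> List[GenerateOutput]:
        # One output entry per input conversation, following the
        # `generations` + `statistics` format introduced in 1.5.0.
        return [
            {
                "generations": [input[-1]["content"]] * num_generations,
                "statistics": {},  # assumption: empty statistics are accepted
            }
            for input in inputs
        ]
```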
## Available LLMs
[Our LLM gallery](../../../../components-gallery/llms/index.md) shows a list of the available LLMs that can be used within the `distilabel` library.
# Execute Steps and Tasks in a Pipeline
## How to create a pipeline
A [`Pipeline`][distilabel.pipeline.Pipeline] organises the Steps and Tasks in a sequence, where the output of one step is the input of the next one.
A [`Pipeline`][distilabel.pipeline.Pipeline] should be created by making use of the context manager along with passing a **name**, and optionally a **description**.
```python
from distilabel.pipeline import Pipeline
with Pipeline("pipe-name", description="My first pipe") as pipeline:
...
```
### Connecting steps with the `Step.connect` method
Now, we can define the steps of our [`Pipeline`][distilabel.pipeline.Pipeline].
!!! NOTE
Steps without predecessors (i.e. root steps), need to be [`GeneratorStep`][distilabel.steps.GeneratorStep]s such as [`LoadDataFromDicts`][distilabel.steps.LoadDataFromDicts] or [`LoadDataFromHub`][distilabel.steps.LoadDataFromHub]. After this, other steps can be defined.
```python
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
with Pipeline("pipe-name", description="My first pipe") as pipeline:
load_dataset = LoadDataFromHub(name="load_dataset")
...
```
!!! Tip "Easily load your datasets"
If you are already used to working with Hugging Face's `Dataset` via `load_dataset` or `pd.DataFrame`, you can create the `GeneratorStep` directly from the dataset (or dataframe) with the help of [`make_generator_step`][distilabel.steps.generators.utils.make_generator_step]:
=== "From a list of dicts"
```python
from distilabel.pipeline import Pipeline
from distilabel.steps import make_generator_step
dataset = [{"instruction": "Tell me a joke."}]
with Pipeline("pipe-name", description="My first pipe") as pipeline:
loader = make_generator_step(dataset, output_mappings={"prompt": "instruction"})
...
```
=== "From `datasets.Dataset`"
```python
from datasets import load_dataset
from distilabel.pipeline import Pipeline
from distilabel.steps import make_generator_step
dataset = load_dataset(
"DIBT/10k_prompts_ranked",
split="train"
).filter(
lambda r: r["avg_rating"]>=4 and r["num_responses"]>=2
).select(range(500))
with Pipeline("pipe-name", description="My first pipe") as pipeline:
loader = make_generator_step(dataset, output_mappings={"prompt": "instruction"})
...
```
=== "From `pd.DataFrame`"
```python
import pandas as pd
from distilabel.pipeline import Pipeline
from distilabel.steps import make_generator_step
dataset = pd.read_csv("path/to/dataset.csv")
with Pipeline("pipe-name", description="My first pipe") as pipeline:
loader = make_generator_step(dataset, output_mappings={"prompt": "instruction"})
...
```
Next, we will use the `prompt` column from the dataset obtained through `LoadDataFromHub` and several `LLM`s to execute a `TextGeneration` task. We will also use the `Task.connect()` method to connect the steps, so the output of one step is the input of the next one.
!!! NOTE
The order of execution of the steps is determined by their connections. In this case, the `TextGeneration` tasks will be executed after the `LoadDataFromHub` step.
```python
from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration
with Pipeline("pipe-name", description="My first pipe") as pipeline:
load_dataset = LoadDataFromHub(name="load_dataset")
for llm in (
OpenAILLM(model="gpt-4-0125-preview"),
MistralLLM(model="mistral-large-2402"),
VertexAILLM(model="gemini-1.5-pro"),
):
task = TextGeneration(name=f"text_generation_with_{llm.model_name}", llm=llm)
task.connect(load_dataset)
...
```
For each row of the dataset, each `TextGeneration` task will generate a text based on the `instruction` column and its `LLM`, and store the result (a single string) in a new column called `generation`. Because we need all the responses in the same column, we will add a `GroupColumns` step to combine them into a single column as a list of strings.
!!! NOTE
In this case, the `GroupColumns` task will be executed after all the `TextGeneration` steps.
```python
from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import GroupColumns, LoadDataFromHub
from distilabel.steps.tasks import TextGeneration
with Pipeline("pipe-name", description="My first pipe") as pipeline:
load_dataset = LoadDataFromHub(name="load_dataset")
combine_generations = GroupColumns(
name="combine_generations",
columns=["generation", "model_name"],
output_columns=["generations", "model_names"],
)
for llm in (
OpenAILLM(model="gpt-4-0125-preview"),
MistralLLM(model="mistral-large-2402"),
VertexAILLM(model="gemini-1.5-pro"),
):
task = TextGeneration(name=f"text_generation_with_{llm.model_name}", llm=llm)
load_dataset.connect(task)
task.connect(combine_generations)
```
### Connecting steps with the `>>` operator
Besides the `Step.connect` method (`step1.connect(step2)`), there's an alternative way using the `>>` operator. We can connect steps in a more readable way, and it's also possible to connect multiple steps at once.
=== "Step per step"
Each call to `step1.connect(step2)` has been replaced by `step1 >> step2` within the loop.
```python
from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import GroupColumns, LoadDataFromHub
from distilabel.steps.tasks import TextGeneration
with Pipeline("pipe-name", description="My first pipe") as pipeline:
load_dataset = LoadDataFromHub(name="load_dataset")
combine_generations = GroupColumns(
name="combine_generations",
columns=["generation", "model_name"],
output_columns=["generations", "model_names"],
)
for llm in (
OpenAILLM(model="gpt-4-0125-preview"),
MistralLLM(model="mistral-large-2402"),
VertexAILLM(model="gemini-1.5-pro"),
):
task = TextGeneration(name=f"text_generation_with_{llm.model_name}", llm=llm)
load_dataset >> task >> combine_generations
```
=== "Multiple steps at once"
Each task is first appended to a list, and then all the connections are made in a single call.
```python
from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import GroupColumns, LoadDataFromHub
from distilabel.steps.tasks import TextGeneration
with Pipeline("pipe-name", description="My first pipe") as pipeline:
load_dataset = LoadDataFromHub(name="load_dataset")
combine_generations = GroupColumns(
name="combine_generations",
columns=["generation", "model_name"],
output_columns=["generations", "model_names"],
)
tasks = []
for llm in (
OpenAILLM(model="gpt-4-0125-preview"),
MistralLLM(model="mistral-large-2402"),
VertexAILLM(model="gemini-1.5-pro"),
):
tasks.append(
TextGeneration(name=f"text_generation_with_{llm.model_name}", llm=llm)
)
load_dataset >> tasks >> combine_generations
```
### Routing batches to specific downstream steps
In some pipelines, you may want to send batches from a single upstream step to specific downstream steps based on certain conditions. To achieve this, you can use a `routing_batch_function`. This function takes a list of downstream steps and returns a list of step names to which each batch should be routed.
Let's update the example above to route the batches loaded by the `LoadDataFromHub` step to just 2 of the `TextGeneration` tasks. First, we will create our custom [`routing_batch_function`][distilabel.pipeline.routing_batch_function.routing_batch_function], and then we will update the pipeline to use it:
```python
import random
from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM
from distilabel.pipeline import Pipeline, routing_batch_function
from distilabel.steps import GroupColumns, LoadDataFromHub
from distilabel.steps.tasks import TextGeneration
@routing_batch_function
def sample_two_steps(steps: list[str]) -> list[str]:
return random.sample(steps, 2)
with Pipeline("pipe-name", description="My first pipe") as pipeline:
load_dataset = LoadDataFromHub(
name="load_dataset",
output_mappings={"prompt": "instruction"},
)
tasks = []
for llm in (
OpenAILLM(model="gpt-4-0125-preview"),
MistralLLM(model="mistral-large-2402"),
VertexAILLM(model="gemini-1.0-pro"),
):
tasks.append(
TextGeneration(name=f"text_generation_with_{llm.model_name}", llm=llm)
)
combine_generations = GroupColumns(
name="combine_generations",
columns=["generation", "model_name"],
output_columns=["generations", "model_names"],
)
load_dataset >> sample_two_steps >> tasks >> combine_generations
```
The `routing_batch_function` that we just built is a common one, so `distilabel` comes with a built-in function that can be used to achieve the same behavior:
```python
from distilabel.pipeline import sample_n_steps
sample_two_steps = sample_n_steps(2)
```
## Running the pipeline
### Pipeline.dry_run
Before running the `Pipeline`, we can check whether it is valid using the `Pipeline.dry_run()` method. It takes the same parameters as the `run` method, which we will discuss in the following section, plus the `batch_size` we want the dry run to use (1 by default).
```python
with Pipeline("pipe-name", description="My first pipe") as pipeline:
...
if __name__ == "__main__":
distiset = pipeline.dry_run(parameters=..., batch_size=1)
```
### Pipeline.run
After testing, we can now execute the full `Pipeline` using the `Pipeline.run()` method.
```python
with Pipeline("pipe-name", description="My first pipe") as pipeline:
...
if __name__ == "__main__":
distiset = pipeline.run(
parameters={
"load_dataset": {
"repo_id": "distilabel-internal-testing/instruction-dataset-mini",
"split": "test",
},
"text_generation_with_gpt-4-0125-preview": {
"llm": {
"generation_kwargs": {
"temperature": 0.7,
"max_new_tokens": 512,
}
}
},
"text_generation_with_mistral-large-2402": {
"llm": {
"generation_kwargs": {
"temperature": 0.7,
"max_new_tokens": 512,
}
}
},
"text_generation_with_gemini-1.0-pro": {
"llm": {
"generation_kwargs": {
"temperature": 0.7,
"max_new_tokens": 512,
}
}
},
},
)
```
But if we run the pipeline above, we will see that the `run` method will fail:
```
ValueError: Step 'text_generation_with_gpt-4-0125-preview' requires inputs ['instruction'], but only the inputs=['prompt', 'completion', 'meta'] are available, which means that the inputs=['instruction'] are missing or not available
when the step gets to be executed in the pipeline. Please make sure previous steps to 'text_generation_with_gpt-4-0125-preview' are generating the required inputs.
```
This is because, before actually running the pipeline, we must ensure each step has the necessary input columns to be executed. In this case, the `TextGeneration` task requires the `instruction` column, but the `LoadDataFromHub` step generates the `prompt` column. To solve this, we can use the `output_mappings` or `input_mappings` arguments of individual `Step`s to map columns from one step to another.
```python
with Pipeline("pipe-name", description="My first pipe") as pipeline:
load_dataset = LoadDataFromHub(
name="load_dataset",
output_mappings={"prompt": "instruction"}
)
...
```
If we execute the pipeline again, it will run successfully and we will have a `Distiset` with the outputs of all the leaf steps of the pipeline which we can push to the Hugging Face Hub.
```python
if __name__ == "__main__":
distiset = pipeline.run(...)
distiset.push_to_hub("distilabel-internal-testing/instruction-dataset-mini-with-generations")
```
#### Pipeline.run with a dataset
Note that, in most cases, if you don't need the extra flexibility that [`GeneratorStep`s][distilabel.steps.base.GeneratorStep] bring you, you can create a dataset as you would normally do and pass it directly to the [Pipeline.run][distilabel.pipeline.base.BasePipeline.run] method. Look at the highlighted lines to see the updated ones:
```python hl_lines="11-14 33 38"
import random
from datasets import load_dataset
from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM
from distilabel.pipeline import Pipeline, routing_batch_function
from distilabel.steps import GroupColumns
from distilabel.steps.tasks import TextGeneration
@routing_batch_function
def sample_two_steps(steps: list[str]) -> list[str]:
return random.sample(steps, 2)
dataset = load_dataset(
"distilabel-internal-testing/instruction-dataset-mini",
split="test"
)
with Pipeline("pipe-name", description="My first pipe") as pipeline:
tasks = []
for llm in (
OpenAILLM(model="gpt-4-0125-preview"),
MistralLLM(model="mistral-large-2402"),
VertexAILLM(model="gemini-1.0-pro"),
):
tasks.append(
TextGeneration(name=f"text_generation_with_{llm.model_name}", llm=llm)
)
combine_generations = GroupColumns(
name="combine_generations",
columns=["generation", "model_name"],
output_columns=["generations", "model_names"],
)
sample_two_steps >> tasks >> combine_generations
if __name__ == "__main__":
distiset = pipeline.run(
dataset=dataset,
parameters=...
)
```
### Stopping the pipeline
In case you want to stop the pipeline while it's running, you can press ++ctrl+c++ or ++cmd+c++ depending on your OS (or send a `SIGINT` to the main process), and the outputs will be stored in the cache. Pressing it an additional time will force the pipeline to stop its execution, but this can lead to losing the generated outputs for certain batches.
## Cache
If for some reason the pipeline execution stops (for example, by pressing `Ctrl+C`), the state of the pipeline and the outputs will be stored in the cache, so we can resume the pipeline execution from the point where it was stopped.
If we want to force the pipeline to run again without using the cache, we can use the `use_cache` argument of the `Pipeline.run()` method:
```python
if __name__ == "__main__":
distiset = pipeline.run(parameters={...}, use_cache=False)
```
!!! NOTE
For more information on caching, we refer the reader to the [caching](../../advanced/caching.md) section.
## Adjusting the batch size for each step
Memory issues can arise when processing large datasets or when using large models. To avoid this, we can use the `input_batch_size` argument of individual tasks. In the example below, the `TextGeneration` task will receive 5 dictionaries per batch, while the `LoadDataFromHub` step will send 10 dictionaries per batch:
```python
from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import GroupColumns, LoadDataFromHub
from distilabel.steps.tasks import TextGeneration
with Pipeline("pipe-name", description="My first pipe") as pipeline:
load_dataset = LoadDataFromHub(
name="load_dataset",
output_mappings={"prompt": "instruction"},
batch_size=10
)
for llm in (
OpenAILLM(model="gpt-4-0125-preview"),
MistralLLM(model="mistral-large-2402"),
VertexAILLM(model="gemini-1.5-pro"),
):
task = TextGeneration(
name=f"text_generation_with_{llm.model_name.replace('.', '-')}",
llm=llm,
input_batch_size=5,
)
...
```
## Serializing the pipeline
Sharing a pipeline with others is very easy, as we can serialize the pipeline object using the `save` method. We can save the pipeline in different formats, such as `yaml` or `json`:
=== "yaml"
```python
if __name__ == "__main__":
pipeline.save("pipeline.yaml", format="yaml")
```
=== "json"
```python
if __name__ == "__main__":
pipeline.save("pipeline.json", format="json")
```
To load the pipeline, we can use the `from_yaml` or `from_json` methods:
=== "yaml"
```python
pipeline = Pipeline.from_yaml("pipeline.yaml")
```
=== "json"
```python
pipeline = Pipeline.from_json("pipeline.json")
```
Serializing the pipeline is very useful when we want to share the pipeline with others, or when we want to store the pipeline for future use. It can even be hosted online, so the pipeline can be executed directly using the [CLI](../../advanced/cli/index.md).
## Visualizing the pipeline
We can visualize the pipeline using the `Pipeline.draw()` method. This will create a `mermaid` graph, and return the path to the image.
```python
path_to_image = pipeline.draw(
top_to_bottom=True,
show_edge_labels=True,
)
```
Within notebooks, we can simply call `pipeline` and the graph will be displayed. Alternatively, we can use the `Pipeline.draw()` method to have more control over the graph visualization and use `IPython` to display it.
```python
from IPython.display import Image, display
display(Image(path_to_image))
```
Let's now see how the pipeline of the [fully working example](#fully-working-example) looks.
![Pipeline](../../../../assets/images/sections/how_to_guides/basic/pipeline.png)
## Fully working example
To sum up, here is the full code of the pipeline we have created in this section. Note that you will need to change the name of the Hugging Face repository where the resulting dataset will be pushed, set the `OPENAI_API_KEY` and `MISTRAL_API_KEY` environment variables, and have `gcloud` installed and configured:
??? Code
```python
from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import GroupColumns, LoadDataFromHub
from distilabel.steps.tasks import TextGeneration
with Pipeline("pipe-name", description="My first pipe") as pipeline:
load_dataset = LoadDataFromHub(
name="load_dataset",
output_mappings={"prompt": "instruction"},
)
combine_generations = GroupColumns(
name="combine_generations",
columns=["generation", "model_name"],
output_columns=["generations", "model_names"],
)
for llm in (
OpenAILLM(model="gpt-4-0125-preview"),
MistralLLM(model="mistral-large-2402"),
VertexAILLM(model="gemini-1.0-pro"),
):
task = TextGeneration(
name=f"text_generation_with_{llm.model_name.replace('.', '-')}", llm=llm
)
load_dataset.connect(task)
task.connect(combine_generations)
if __name__ == "__main__":
distiset = pipeline.run(
parameters={
"load_dataset": {
"repo_id": "distilabel-internal-testing/instruction-dataset-mini",
"split": "test",
},
"text_generation_with_gpt-4-0125-preview": {
"llm": {
"generation_kwargs": {
"temperature": 0.7,
"max_new_tokens": 512,
}
}
},
"text_generation_with_mistral-large-2402": {
"llm": {
"generation_kwargs": {
"temperature": 0.7,
"max_new_tokens": 512,
}
}
},
"text_generation_with_gemini-1.0-pro": {
"llm": {
"generation_kwargs": {
"temperature": 0.7,
"max_new_tokens": 512,
}
}
},
},
)
distiset.push_to_hub(
"distilabel-internal-testing/instruction-dataset-mini-with-generations"
)
```
# GeneratorStep
The [`GeneratorStep`][distilabel.steps.GeneratorStep] is a subclass of [`Step`][distilabel.steps.Step] that is intended to be used as the first step within a [`Pipeline`][distilabel.pipeline.Pipeline], because it doesn't require input and generates data that can be used by other steps. Alternatively, it can also be used as a standalone.
```python
from typing import List, TYPE_CHECKING
from typing_extensions import override
from distilabel.steps import GeneratorStep
if TYPE_CHECKING:
from distilabel.typing import StepColumns, GeneratorStepOutput
class MyGeneratorStep(GeneratorStep):
instructions: List[str]
@override
def process(self, offset: int = 0) -> "GeneratorStepOutput":
if offset:
self.instructions = self.instructions[offset:]
while self.instructions:
batch = [
{
"instruction": instruction
} for instruction in self.instructions[: self.batch_size]
]
self.instructions = self.instructions[self.batch_size :]
yield (
batch,
                len(self.instructions) == 0,  # True when this is the last batch
)
@property
def outputs(self) -> "StepColumns":
return ["instruction"]
```
Then we can use it as follows:
```python
step = MyGeneratorStep(
name="my-generator-step",
instructions=["Tell me a joke.", "Tell me a story."],
batch_size=1,
)
step.load()
next(step.process(offset=0))
# ([{'instruction': 'Tell me a joke.'}], False)
next(step.process(offset=1))
# ([{'instruction': 'Tell me a story.'}], True)
```
!!! NOTE
The `Step.load()` always needs to be executed when being used as a standalone. Within a pipeline, this will be done automatically during pipeline execution.
## Defining custom GeneratorSteps
We can define a custom generator step by creating a new subclass of the [`GeneratorStep`][distilabel.steps.GeneratorStep] and defining the following:
- `outputs`: is a property that returns a list of strings with the names of the output fields, or a dictionary in which the keys are the names of the columns and the values are booleans indicating whether the column is required or not.
- `process`: is a method that yields output data together with a boolean flag indicating whether that's the last batch to be generated.
!!! NOTE
The default signature for the `process` method is `process(self, offset: int = 0) -> GeneratorStepOutput`. The argument `offset` should be respected, no more arguments can be provided, and the type-hints and return type-hints should be respected too, since they are used by the validation and serialization that `distilabel` performs over each [`Step`][distilabel.steps.Step].
!!! WARNING
For the custom [`Step`][distilabel.steps.Step] subclasses to work properly with `distilabel` and with the validation and serialization performed by default over each [`Step`][distilabel.steps.Step] in the [`Pipeline`][distilabel.pipeline.Pipeline], the type-hint for both [`StepInput`][distilabel.steps.StepInput] and [`StepOutput`][distilabel.typing.StepOutput] should be used and not surrounded with double-quotes or imported under `typing.TYPE_CHECKING`, otherwise, the validation and/or serialization will fail.
=== "Inherit from `GeneratorStep`"
We can inherit from the `GeneratorStep` class and define the `outputs` and `process` methods as follows:
```python
from typing import List, TYPE_CHECKING
from typing_extensions import override
from distilabel.steps import GeneratorStep
if TYPE_CHECKING:
from distilabel.typing import StepColumns, GeneratorStepOutput
class MyGeneratorStep(GeneratorStep):
instructions: List[str]
@override
def process(self, offset: int = 0) -> "GeneratorStepOutput":
...
@property
def outputs(self) -> "StepColumns":
...
```
=== "Using the `@step` decorator"
The `@step` decorator will take care of the boilerplate code, and will allow defining the `outputs` and `process` methods in a more straightforward way. One downside is that it won't let you access `self` attributes, nor set them, so if you need to access or set any attribute, you should use the first approach of defining a custom [`GeneratorStep`][distilabel.steps.GeneratorStep] subclass.
```python
from typing import TYPE_CHECKING
from distilabel.steps import step
if TYPE_CHECKING:
from distilabel.typing import GeneratorStepOutput
@step(outputs=[...], step_type="generator")
def CustomGeneratorStep(offset: int = 0) -> "GeneratorStepOutput":
yield (
...,
        offset == 10,  # True when the last batch is reached
)
step = CustomGeneratorStep(name="my-step")
```
# GlobalStep
The [`GlobalStep`][distilabel.steps.GlobalStep] is a subclass of [`Step`][distilabel.steps.Step] used to define a step that requires all the previous steps to be completed before it runs, since it waits until all the input batches are received. It is useful when a step needs all the input data to be available before processing. Alternatively, it can also be used standalone.
## Defining custom GlobalSteps
We can define a custom step by creating a new subclass of the [`GlobalStep`][distilabel.steps.GlobalStep] and defining the following:
- `inputs`: is a property that returns a list of strings with the names of the required input fields, or a dictionary in which the keys are the names of the columns and the values are booleans indicating whether the column is required or not.
- `outputs`: is a property that returns a list of strings with the names of the output fields, or a dictionary in which the keys are the names of the columns and the values are booleans indicating whether the column is required or not.
- `process`: is a method that receives the input data and returns the output data; it should be a generator, meaning that it should `yield` the output data.
!!! NOTE
The default signature for the `process` method is `process(self, *inputs: StepInput) -> StepOutput`. The argument `inputs` should be respected, no more arguments can be provided, and the type-hints and return type-hints should be respected too because it should be able to receive any number of inputs by default i.e. more than one [`Step`][distilabel.steps.Step] at a time could be connected to the current one.
!!! WARNING
For the custom [`GlobalStep`][distilabel.steps.GlobalStep] subclasses to work properly with `distilabel` and with the validation and serialization performed by default over each [`Step`][distilabel.steps.Step] in the [`Pipeline`][distilabel.pipeline.Pipeline], the type-hint for both [`StepInput`][distilabel.steps.StepInput] and [`StepOutput`][distilabel.typing.StepOutput] should be used and not surrounded with double-quotes or imported under `typing.TYPE_CHECKING`, otherwise, the validation and/or serialization will fail.
=== "Inherit from `GlobalStep`"
We can inherit from the `GlobalStep` class and define the `inputs`, `outputs`, and `process` methods as follows:
```python
from typing import TYPE_CHECKING
from distilabel.steps import GlobalStep, StepInput
if TYPE_CHECKING:
from distilabel.typing import StepColumns, StepOutput
class CustomStep(GlobalStep):
@property
def inputs(self) -> "StepColumns":
...
@property
def outputs(self) -> "StepColumns":
...
    def process(self, *inputs: StepInput) -> "StepOutput":
        for upstream_step_inputs in inputs:
            for item in upstream_step_inputs:
                ...
            yield upstream_step_inputs
# When overridden (ideally under the `typing_extensions.override` decorator)
# @typing_extensions.override
# def process(self, inputs: StepInput) -> StepOutput:
# for input in inputs:
# ...
# yield inputs
```
=== "Using the `@step` decorator"
The `@step` decorator will take care of the boilerplate code, and will allow defining the `inputs`, `outputs`, and `process` methods in a more straightforward way. One downside is that it won't let you access `self` attributes, nor set them, so if you need to access or set any attribute, you should use the first approach of defining a custom [`GlobalStep`][distilabel.steps.GlobalStep] subclass.
```python
from typing import TYPE_CHECKING
from distilabel.steps import StepInput, step
if TYPE_CHECKING:
from distilabel.typing import StepOutput
@step(inputs=[...], outputs=[...], step_type="global")
def CustomStep(inputs: StepInput) -> "StepOutput":
for input in inputs:
...
yield inputs
step = CustomStep(name="my-step")
```
# Steps for processing data
## Working with Steps
The [`Step`][distilabel.steps.Step] is intended to be used within the scope of a [`Pipeline`][distilabel.pipeline.Pipeline], which will orchestrate the different steps defined, but it can also be used standalone.
Assuming that we have a [`Step`][distilabel.steps.Step] already defined as follows:
```python
from typing import TYPE_CHECKING
from distilabel.steps import Step, StepInput
if TYPE_CHECKING:
from distilabel.typing import StepColumns, StepOutput
class MyStep(Step):
@property
def inputs(self) -> "StepColumns":
return ["input_field"]
@property
def outputs(self) -> "StepColumns":
return ["output_field"]
def process(self, inputs: StepInput) -> "StepOutput":
for input in inputs:
input["output_field"] = input["input_field"]
yield inputs
```
Then we can use it as follows:
```python
step = MyStep(name="my-step")
step.load()
next(step.process([{"input_field": "value"}]))
# [{'input_field': 'value', 'output_field': 'value'}]
```
!!! NOTE
The `Step.load()` always needs to be executed when being used as a standalone. Within a pipeline, this will be done automatically during pipeline execution.
### Arguments
- `input_mappings` is a dictionary that maps keys from the input dictionaries to the keys expected by the step. For example, `input_mappings={"instruction": "prompt"}` means that the input key `prompt` will be used as the key `instruction` for the current step.
- `output_mappings` is a dictionary that can be used to map the outputs of the step to other names. For example, `output_mappings={"conversation": "prompt"}` means that the output key `conversation` will be renamed to `prompt` for the next step. Both mappings are shown in the sketch after this list.
- `input_batch_size` (50 by default) is independent for every step and determines how many input dictionaries the step will process at once.
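For example, a sketch reusing the `MyStep` defined above, where the incoming data uses the key `prompt` and the downstream step expects the output under `response` (the key names are illustrative):

```python
step = MyStep(
    name="my-step-with-mappings",
    # The incoming key "prompt" will be used as this step's "input_field"...
    input_mappings={"input_field": "prompt"},
    # ...and "output_field" will be exposed as "response" to the next step.
    output_mappings={"output_field": "response"},
)
```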
### Runtime parameters
`Step`s can also have `RuntimeParameter`s: parameters that can only be provided after the pipeline has been initialised, when calling `Pipeline.run`.
```python
from distilabel.mixins.runtime_parameters import RuntimeParameter
class Step(...):
input_batch_size: RuntimeParameter[PositiveInt] = Field(
default=DEFAULT_INPUT_BATCH_SIZE,
description="The number of rows that will contain the batches processed by the"
" step.",
)
```
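Since `input_batch_size` is declared as a `RuntimeParameter`, a value for it can be provided when the pipeline is executed; a short sketch (the step name is illustrative):

```python
pipeline.run(
    parameters={
        "my-step": {"input_batch_size": 100},
    },
)
```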
## Types of Steps
There are two special types of [`Step`][distilabel.steps.Step] in `distilabel`:
* [`GeneratorStep`][distilabel.steps.GeneratorStep]: is a step that only generates data, and it doesn't need any input data from previous steps and normally is the first node in a [`Pipeline`][distilabel.pipeline.Pipeline]. More information: [Components -> Step - GeneratorStep](./generator_step.md).
* [`GlobalStep`][distilabel.steps.GlobalStep]: is a step with the standard interface, i.e. it receives inputs and generates outputs, but it processes all the data at once, and is often the final step in the [`Pipeline`][distilabel.pipeline.Pipeline]. Note that a [`GlobalStep`][distilabel.steps.GlobalStep] requires all the previous steps to finish before being able to start. More information: [Components - Step - GlobalStep](global_step.md).
* [`Task`][distilabel.steps.tasks.Task]: is essentially the same as a default [`Step`][distilabel.steps.Step], but it relies on an [`LLM`][distilabel.models.llms.LLM] as an attribute, and the `process` method is in charge of calling that LLM. More information: [Components - Task](../task/index.md).
## Defining custom Steps
We can define a custom step by creating a new subclass of the [`Step`][distilabel.steps.Step] and defining the following:
- `inputs`: is a property that returns a list of strings with the names of the required input fields, or a dictionary in which the keys are the names of the columns and the values are booleans indicating whether the column is required or not.
- `outputs`: is a property that returns a list of strings with the names of the output fields, or a dictionary in which the keys are the names of the columns and the values are booleans indicating whether the column is required or not.
- `process`: is a method that receives the input data and returns the output data; it should be a generator, meaning that it should `yield` the output data.
!!! NOTE
The default signature for the `process` method is `process(self, *inputs: StepInput) -> StepOutput`. The argument `inputs` should be respected, no more arguments can be provided, and the type-hints and return type-hints should be respected too because it should be able to receive any number of inputs by default i.e. more than one [`Step`][distilabel.steps.Step] at a time could be connected to the current one.
!!! WARNING
For the custom [`Step`][distilabel.steps.Step] subclasses to work properly with `distilabel` and with the validation and serialization performed by default over each [`Step`][distilabel.steps.Step] in the [`Pipeline`][distilabel.pipeline.Pipeline], the type-hint for both [`StepInput`][distilabel.steps.StepInput] and [`StepOutput`][distilabel.typing.StepOutput] should be used and not surrounded with double-quotes or imported under `typing.TYPE_CHECKING`, otherwise, the validation and/or serialization will fail.
=== "Inherit from `Step`"
We can inherit from the `Step` class and define the `inputs`, `outputs`, and `process` methods as follows:
```python
from typing import TYPE_CHECKING
from distilabel.steps import Step, StepInput
if TYPE_CHECKING:
from distilabel.typing import StepColumns, StepOutput
class CustomStep(Step):
@property
def inputs(self) -> "StepColumns":
...
@property
def outputs(self) -> "StepColumns":
...
    def process(self, *inputs: StepInput) -> "StepOutput":
        for upstream_step_inputs in inputs:
            ...
            yield upstream_step_inputs
# When overridden (ideally under the `typing_extensions.override` decorator)
# @typing_extensions.override
# def process(self, inputs: StepInput) -> StepOutput:
# for input in inputs:
# ...
# yield inputs
```
=== "Using the `@step` decorator"
The `@step` decorator will take care of the boilerplate code, and will allow defining the `inputs`, `outputs`, and `process` methods in a more straightforward way. One downside is that it won't let you access `self` attributes, nor set them, so if you need to access or set any attribute, you should use the first approach of defining a custom [`Step`][distilabel.steps.Step] subclass.
```python
from typing import TYPE_CHECKING
from distilabel.steps import StepInput, step
if TYPE_CHECKING:
from distilabel.typing import StepOutput
@step(inputs=[...], outputs=[...])
def CustomStep(inputs: StepInput) -> "StepOutput":
for input in inputs:
...
yield inputs
step = CustomStep(name="my-step")
```
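As a concrete toy example of the decorator approach, here is a sketch of a step that annotates each row with the length of its instruction (the column names are illustrative):

```python
from distilabel.steps import StepInput, step
from distilabel.typing import StepOutput


@step(inputs=["instruction"], outputs=["instruction_length"])
def InstructionLength(inputs: StepInput) -> StepOutput:
    # Add the character length of each instruction to the row.
    for input in inputs:
        input["instruction_length"] = len(input["instruction"])
    yield inputs


length_step = InstructionLength(name="instruction-length")
```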
# GeneratorTask that produces output
## Working with GeneratorTasks
The [`GeneratorTask`][distilabel.steps.tasks.GeneratorTask] is a custom implementation of a [`Task`][distilabel.steps.tasks.Task] based on the [`GeneratorStep`][distilabel.steps.GeneratorStep]. As with a [`Task`][distilabel.steps.tasks.Task], it is normally used within a [`Pipeline`][distilabel.pipeline.Pipeline] but can also be used standalone.
!!! WARNING
This task is still experimental and may be subject to changes in the future.
```python
from typing import Any, Dict, List, Union
from typing_extensions import override
from distilabel.steps.tasks.base import GeneratorTask
from distilabel.typing import GeneratorStepOutput
class MyCustomTask(GeneratorTask):
instruction: str
@override
def process(self, offset: int = 0) -> GeneratorStepOutput:
        generation = self.llm.generate(
            inputs=[
                [
                    {"role": "user", "content": self.instruction},
                ],
            ],
        )
        output = {"model_name": self.llm.model_name}
        output.update(
            self.format_output(output=generation, input=None)
        )
yield output
@property
def outputs(self) -> List[str]:
return ["output_field", "model_name"]
def format_output(
self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, Any]:
return {"output_field": output}
```
We can then use it as follows:
```python
task = MyCustomTask(
name="custom-generation",
instruction="Tell me a joke.",
llm=OpenAILLM(model="gpt-4"),
)
task.load()
next(task.process())
# [{'output_field": "Why did the scarecrow win an award? Because he was outstanding!", "model_name": "gpt-4"}]
```
!!! NOTE
Most of the time you will need to override the default `process` method, as it's suited for the standard [`Task`][distilabel.steps.tasks.Task] and not for the [`GeneratorTask`][distilabel.steps.tasks.GeneratorTask]. But within the context of the `process` function, you can freely use the `llm` to generate data in any way.
!!! NOTE
The `Step.load()` always needs to be executed when being used as a standalone. Within a pipeline, this will be done automatically during pipeline execution.
## Defining custom GeneratorTasks
We can define a custom generator task by creating a new subclass of the [`GeneratorTask`][distilabel.steps.tasks.GeneratorTask] and defining the following:
- `process`: is a method that generates the data based on the [`LLM`][distilabel.models.llms.LLM] and the `instruction` provided within the class instance, and returns a dictionary with the output data formatted as needed i.e. with the values for the columns in `outputs`. Note that the `inputs` argument is not allowed in this function since this is a [`GeneratorTask`][distilabel.steps.tasks.GeneratorTask]. The signature only expects the `offset` argument, which is used to keep track of the current iteration in the generator.
- `outputs`: is a property that returns a list of strings with the names of the output fields, this property should always include `model_name` as one of the outputs since that's automatically injected from the LLM.
- `format_output`: is a method that receives the output from the [`LLM`][distilabel.models.llms.LLM] and optionally also the input data (which may be useful to build the output in some scenarios), and returns a dictionary with the output data formatted as needed i.e. with the values for the columns in `outputs`. Note that there's no need to include the `model_name` in the output.
```python
from typing import Any, Dict, List, Union

from typing_extensions import override

from distilabel.steps.tasks.base import GeneratorTask
from distilabel.typing import GeneratorStepOutput


class MyCustomTask(GeneratorTask):
    @override
    def process(self, offset: int = 0) -> "GeneratorStepOutput":
        generation = self.llm.generate(
            inputs=[
                [{"role": "user", "content": "Tell me a joke."}],
            ],
        )
        # Build the output row without discarding the `LLM` generation
        output = {"model_name": self.llm.model_name}
        output.update(self.format_output(output=generation, input=None))
        yield output

    @property
    def outputs(self) -> List[str]:
        return ["output_field", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        return {"output_field": output}
```
# ImageTask to work with Image Generation Models
## Working with ImageTasks
The [`ImageTask`][distilabel.steps.tasks.ImageTask] is a custom implementation of a [`Task`][distilabel.steps.tasks.Task] designed specifically to work with images. These tasks behave exactly like any other [`Task`][distilabel.steps.tasks.Task], but instead of relying on an [`LLM`][distilabel.models.llms.LLM], they work with an [`ImageGenerationModel`][distilabel.models.image_generation.ImageGenerationModel].
!!! info "New in version 1.5.0"
This task is new and is expected to work with Image Generation Models.
These tasks take an `image_generation_model` attribute instead of the `llm` attribute we would have with a standard `Task`, but everything else remains the same. Let's see an example with [`ImageGeneration`](https://distilabel.argilla.io/dev/components-gallery/tasks/imagegeneration/):
```python
from distilabel.steps.tasks import ImageGeneration
from distilabel.models.image_generation import InferenceEndpointsImageGeneration
task = ImageGeneration(
name="image-generation",
image_generation_model=InferenceEndpointsImageGeneration(model_id="black-forest-labs/FLUX.1-schnell"),
)
task.load()
next(task.process([{"prompt": "a white siamese cat"}]))
# [{'image': 'iVBORw0KGgoAAAANSUhEUgA...', 'model_name': 'black-forest-labs/FLUX.1-schnell'}]
```
!!! info "Visualize the image in a notebook"
If you are testing the `ImageGeneration` task in a notebook, you can do the following
to see the rendered image:
```python
from distilabel.models.image_generation.utils import image_from_str
result = next(task.process([{"prompt": "a white siamese cat"}]))
image_from_str(result[0]["image"]) # Returns a `PIL.Image.Image` that renders directly
```
!!! tip "Running ImageGeneration in a Pipeline"
If you are running a pipeline, this transformation between the image as a string and as a `PIL` object can be applied to the whole dataset by calling the `transform_columns_to_image` method on the final distiset, passing the name (or list of names) of the image column, as shown in the sketch below.
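A minimal sketch, assuming the pipeline's output includes an `image` column:
```python
distiset = pipeline.run(use_cache=False)
# Convert the string-encoded images in the "image" column to `PIL.Image.Image` objects
distiset = distiset.transform_columns_to_image("image")
```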
## Defining custom ImageTasks
We can define a custom image task by creating a new subclass of the [`ImageTask`][distilabel.steps.tasks.ImageTask] and defining the following:
- `process`: is a method that generates the data based on the [`ImageGenerationModel`][distilabel.models.image_generation.ImageGenerationModel] and the `prompt` provided within the class instance, and returns a dictionary with the output data formatted as needed, i.e. with the values for the columns in `outputs`.
- `inputs`: is a property that returns a list of strings with the names of the required input fields, or a dictionary in which the keys are the names of the columns and the values are booleans indicating whether the column is required or not.
- `outputs`: is a property that returns a list of strings with the names of the output fields, or a dictionary in which the keys are the names of the columns and the values are booleans indicating whether the column is required or not. This property should always include `model_name` as one of the outputs, since that's automatically injected from the LLM.
- `format_input`: is a method that receives a dictionary with the input data and returns a *prompt* to be passed to the model.
- `format_output`: is a method that receives the output from the [`ImageGenerationModel`][distilabel.models.image_generation.ImageGenerationModel] and optionally also the input data (which may be useful to build the output in some scenarios), and returns a dictionary with the output data formatted as needed i.e. with the values for the columns in `outputs`.
```python
from typing import TYPE_CHECKING, Any, Dict, Union

from typing_extensions import override

from distilabel.models.image_generation.utils import image_from_str, image_to_str
from distilabel.steps.base import StepInput
from distilabel.steps.tasks.base import ImageTask

if TYPE_CHECKING:
    from distilabel.typing import StepColumns, StepOutput


class MyCustomImageTask(ImageTask):
    @override
    def process(self, inputs: StepInput) -> "StepOutput":
        formatted_inputs = self._format_inputs(inputs)
        outputs = self.llm.generate_outputs(
            inputs=formatted_inputs,
            num_generations=self.num_generations,
            **self.llm.get_generation_kwargs(),
        )

        task_outputs = []
        for input, input_outputs in zip(inputs, outputs):
            formatted_outputs = self._format_outputs(input_outputs, input)
            for formatted_output in formatted_outputs:
                task_outputs.append(
                    {**input, **formatted_output, "model_name": self.llm.model_name}
                )
        yield task_outputs

    @property
    def inputs(self) -> "StepColumns":
        return ["prompt"]

    @property
    def outputs(self) -> "StepColumns":
        return ["image", "model_name"]

    def format_input(self, input: Dict[str, Any]) -> str:
        return input["prompt"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        # Extract/generate/modify the image from the output
        return {"image": ..., "model_name": self.llm.model_name}
```
!!! Warning
Note that in the `process` method we are dealing with the `llm` attribute rather than `image_generation`. This is not a bug but intentional: internally, the `image_generation` attribute is renamed to `llm` so that the code can be reused.
# Tasks for generating and judging with LLMs
## Working with Tasks
The [`Task`][distilabel.steps.tasks.Task] is a special kind of [`Step`][distilabel.steps.Step] that includes the [`LLM`][distilabel.models.llms.LLM] as a mandatory argument. As with a [`Step`][distilabel.steps.Step], it is normally used within a [`Pipeline`][distilabel.pipeline.Pipeline] but can also be used standalone.
For example, the most basic task is the [`TextGeneration`][distilabel.steps.tasks.TextGeneration] task, which generates text based on a given instruction.
```python
from distilabel.models import InferenceEndpointsLLM
from distilabel.steps.tasks import TextGeneration
task = TextGeneration(
name="text-generation",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
),
)
task.load()
next(task.process([{"instruction": "What's the capital of Spain?"}]))
# [
# {
# "instruction": "What's the capital of Spain?",
# "generation": "The capital of Spain is Madrid.",
# "distilabel_metadata": {
# "raw_output_text-generation": "The capital of Spain is Madrid.",
# "raw_input_text-generation": [
# {
# "role": "user",
# "content": "What's the capital of Spain?"
# }
# ],
# "statistics_text-generation": { # (1)
# "input_tokens": 18,
# "output_tokens": 8
# }
# },
#         "model_name": "meta-llama/Meta-Llama-3-70B-Instruct"
# }
# ]
```
1. The `LLMs` will not only return the text but also a `statistics_{STEP_NAME}` field that will contain statistics related to the generation. If available, at least the input and output tokens will be returned.
!!! Note
The `Step.load()` always needs to be executed when being used as a standalone. Within a pipeline, this will be done automatically during pipeline execution.
As shown above, the [`TextGeneration`][distilabel.steps.tasks.TextGeneration] task adds a `generation` based on the `instruction`.
!!! Tip "New in version 1.2.0"
Since version `1.2.0`, we provide some metadata about the LLM call through `distilabel_metadata`. This can be disabled by setting the `add_raw_output` attribute to `False` when creating the task.
Additionally, since version `1.4.0`, the formatted input can also be included, which can be helpful when testing
custom templates (testing the pipeline using the [`dry_run`][distilabel.pipeline.local.Pipeline.dry_run] method).
```python title="disable raw input and output"
task = TextGeneration(
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
),
add_raw_output=False,
add_raw_input=False
)
```
!!! Tip "New in version 1.5.0"
Since version `1.5.0`, `distilabel_metadata` includes a new `statistics` field out of the box. The generation from the LLM will not only contain the text, but also statistics associated with it if available, like the input and output tokens. This field is named `statistics_{STEP_NAME}` to avoid collisions between different steps in the pipeline, similar to how `raw_output_{STEP_NAME}` works.
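For instance, reusing the `TextGeneration` task defined above, the statistics can be read back from the metadata (a small sketch using the keys shown in the previous output):
```python
result = next(task.process([{"instruction": "What's the capital of Spain?"}]))
stats = result[0]["distilabel_metadata"]["statistics_text-generation"]
print(stats["input_tokens"], stats["output_tokens"])  # e.g. 18 8
```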
### Task.print
!!! Info "New in version 1.4.0"
The [`Task.print`][distilabel.steps.tasks.base._Task.print] method is available since version `1.4.0`.
The `Task`s include a handy method to show what the prompt formatted for an `LLM` would look like. Let's see an example with [`UltraFeedback`][distilabel.steps.tasks.ultrafeedback.UltraFeedback], although it applies to any other `Task`.
```python
from distilabel.steps.tasks import UltraFeedback
from distilabel.models import InferenceEndpointsLLM
uf = UltraFeedback(
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
),
)
uf.load()
uf.print()
```
The result will be a rendered prompt, with the system prompt (if the task has one) and the user prompt, rendered with `rich` (it will display exactly the same way in a Jupyter notebook).
![task-print](../../../../assets/images/sections/how_to_guides/tasks/task_print.png)
In case you want to test with a custom input, you can pass an example to the task's `format_input` method (or generate it on your own, depending on the task) and pass it to the `print` method so that it shows your example:
```python
uf.print(
uf.format_input({"instruction": "test", "generations": ["1", "2"]})
)
```
??? "Using a DummyLLM to avoid loading one"
In case you don't want to load an LLM to render the template, you can create a dummy one like the ones we could use for testing.
```python
from typing import TYPE_CHECKING, Any

from distilabel.models import AsyncLLM
from distilabel.models.mixins import MagpieChatTemplateMixin

if TYPE_CHECKING:
    from distilabel.typing import FormattedInput, GenerateOutput


class DummyLLM(AsyncLLM, MagpieChatTemplateMixin):
    structured_output: Any = None
    magpie_pre_query_template: str = "llama3"

    def load(self) -> None:
        pass

    @property
    def model_name(self) -> str:
        return "test"

    async def agenerate(
        self, input: "FormattedInput", num_generations: int = 1
    ) -> "GenerateOutput":
        return ["output" for _ in range(num_generations)]
```
You can use this `LLM` just as any of the other ones to `load` your task and call `print`:
```python
uf = UltraFeedback(llm=DummyLLM())
uf.load()
uf.print()
```
!!! Note
When creating a custom task, the `print` method is available by default, but it is limited to the most common input scenarios. If you test your new task and find it's not working as expected (for example, if your task has an input consisting of a list of texts instead of a single one), you should override the `_sample_input` method, as sketched below. You can inspect the [`UltraFeedback`][distilabel.steps.tasks.ultrafeedback.UltraFeedback] source code for a real example.
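A rough sketch of such an override for a task whose input includes a list of texts (the class name and placeholder values are illustrative; the dummy input must match your task's columns):
```python
from typing import TYPE_CHECKING

from typing_extensions import override

from distilabel.steps.tasks import Task

if TYPE_CHECKING:
    from distilabel.typing import ChatType


class MyTaskWithListInput(Task):
    ...  # `inputs`, `format_input`, `outputs`, `format_output` as usual

    @override
    def _sample_input(self) -> "ChatType":
        # Build a dummy input matching this task's columns so `print` can render the prompt
        return self.format_input(
            {
                "instruction": "<PLACEHOLDER_INSTRUCTION>",
                "generations": ["<PLACEHOLDER_GENERATION_0>", "<PLACEHOLDER_GENERATION_1>"],
            }
        )
```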
## Specifying the number of generations and grouping generations
All the `Task`s have a `num_generations` attribute that allows defining the number of generations that we want to have per input. We can update the example above to generate 3 completions per input:
```python
from distilabel.models import InferenceEndpointsLLM
from distilabel.steps.tasks import TextGeneration
task = TextGeneration(
name="text-generation",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
),
num_generations=3,
)
task.load()
next(task.process([{"instruction": "What's the capital of Spain?"}]))
# [
# {
# 'instruction': "What's the capital of Spain?",
# 'generation': 'The capital of Spain is Madrid.',
# 'distilabel_metadata': {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
# 'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'
# },
# {
# 'instruction': "What's the capital of Spain?",
# 'generation': 'The capital of Spain is Madrid.',
# 'distilabel_metadata': {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
# 'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'
# },
# {
# 'instruction': "What's the capital of Spain?",
# 'generation': 'The capital of Spain is Madrid.',
# 'distilabel_metadata': {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
# 'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'
# }
# ]
```
In addition, we might want to group the generations into a single output row, as a downstream step may expect a single row with multiple generations. We can achieve this by setting the `group_generations` attribute to `True`:
```python
from distilabel.models import InferenceEndpointsLLM
from distilabel.steps.tasks import TextGeneration
task = TextGeneration(
name="text-generation",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
),
num_generations=3,
group_generations=True
)
task.load()
next(task.process([{"instruction": "What's the capital of Spain?"}]))
# [
# {
# 'instruction': "What's the capital of Spain?",
# 'generation': ['The capital of Spain is Madrid.', 'The capital of Spain is Madrid.', 'The capital of Spain is Madrid.'],
# 'distilabel_metadata': [
# {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
# {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
# {'raw_output_text-generation': 'The capital of Spain is Madrid.'}
# ],
# 'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'
# }
# ]
```
## Defining custom Tasks
We can define a custom task by creating a new subclass of the [`Task`][distilabel.steps.tasks.Task] and defining the following:
- `inputs`: is a property that returns a list of strings with the names of the required input fields, or a dictionary in which the keys are the names of the columns and the values are booleans indicating whether the column is required or not.
- `format_input`: is a method that receives a dictionary with the input data and returns a [`ChatType`][distilabel.typing.ChatType] following [the chat-completion OpenAI message formatting](https://platform.openai.com/docs/guides/text-generation).
- `outputs`: is a property that returns a list of strings with the names of the output fields, or a dictionary in which the keys are the names of the columns and the values are booleans indicating whether the column is required or not. This property should always include `model_name` as one of the outputs, since that's automatically injected from the LLM.
- `format_output`: is a method that receives the output from the [`LLM`][distilabel.models.llms.LLM] and optionally also the input data (which may be useful to build the output in some scenarios), and returns a dictionary with the output data formatted as needed, i.e. with the values for the columns in `outputs`. Note that there's no need to include the `model_name` in the output.
=== "Inherit from `Task`"
When creating a custom task by inheriting from the `Task` class, we can also optionally override the `Task.process` method to define more complex processing logic involving an `LLM`, as the default one just calls `LLM.generate` once, formatting the input beforehand and the output afterwards. For example, the [`EvolInstruct`][distilabel.steps.tasks.EvolInstruct] task overrides this method to call `LLM.generate` multiple times (once per evolution).
```python
from typing import TYPE_CHECKING, Any, Dict, List, Union

from distilabel.steps.tasks import Task

if TYPE_CHECKING:
    from distilabel.typing import ChatType, StepColumns


class MyCustomTask(Task):
    @property
    def inputs(self) -> "StepColumns":
        return ["input_field"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        return [
            {
                "role": "user",
                "content": input["input_field"],
            },
        ]

    @property
    def outputs(self) -> "StepColumns":
        return ["output_field", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        return {"output_field": output}
```
=== "Using the `@task` decorator"
If your task just needs a system prompt, a user message template and a way to format the output given by the `LLM`, then you can use the `@task` decorator to avoid writing too much boilerplate code.
```python
from typing import Any, Dict, Union
from distilabel.steps.tasks import task
@task(inputs=["input_field"], outputs=["output_field"])
def MyCustomTask(output: Union[str, None], input: Union[Dict[str, Any], None] = None) -> Dict[str, Any]:
"""
---
system_prompt: |
My custom system prompt
user_message_template: |
My custom user message template: {input_field}
---
"""
# Format the `LLM` output here
return {"output_field": output}
```
!!! Warning
Most `Task`s reuse the `Task.process` method to process the generations, but if a new `Task` defines a custom `process` method, as happens for example with [`Magpie`][distilabel.steps.tasks.magpie.base.Magpie], one has to deal with the `statistics` returned by the `LLM`.
# How-to guides
Welcome to the how-to guides section! Here you will find a collection of guides that will help you get started with Distilabel. We have divided the guides into two categories: basic and advanced. The basic guides will help you get started with the core concepts of Distilabel, while the advanced guides will help you explore more advanced features.
## Basic
<div class="grid cards" markdown>
- __Define Steps for your Pipeline__
---
Steps are the building blocks of your pipeline. They can be used to generate data, evaluate models, manipulate data, or any other general task.
[:octicons-arrow-right-24: Define Steps](basic/step/index.md)
- __Define Tasks that rely on LLMs__
---
Tasks are a specific type of step that rely on Language Models (LLMs) to generate data.
[:octicons-arrow-right-24: Define Tasks](basic/task/index.md)
- __Define LLMs as local or remote models__
---
LLMs are the core of your tasks. They are used to integrate with local models or remote APIs.
[:octicons-arrow-right-24: Define LLMs](basic/llm/index.md)
- __Execute Steps and Tasks in a Pipeline__
---
Pipeline is where you put all your steps and tasks together to create a workflow.
[:octicons-arrow-right-24: Execute Pipeline](basic/pipeline/index.md)
</div>
## Advanced
<div class="grid cards" markdown>
- __Using the Distiset dataset object__
---
Distiset is a dataset object based on the datasets library that can be used to store and manipulate data.
[:octicons-arrow-right-24: Distiset](advanced/distiset.md)
- __Export data to Argilla__
---
Argilla is a platform that can be used to store, search, and apply feedback to datasets.
[:octicons-arrow-right-24: Argilla](advanced/argilla.md)
- __Using a file system to pass data of batches between steps__
---
File system can be used to pass data between steps in a pipeline.
[:octicons-arrow-right-24: File System](advanced/fs_to_pass_data.md)
- __Using CLI to explore and re-run existing Pipelines__
---
CLI can be used to explore and re-run existing pipelines through the command line.
[:octicons-arrow-right-24: CLI](advanced/cli/index.md)
- __Cache and recover pipeline executions__
---
Caching can be used to recover pipeline executions to avoid losing data and precious LLM calls.
[:octicons-arrow-right-24: Caching](advanced/caching.md)
- __Structured data generation__
---
Structured data generation can be used to generate data with a specific structure like JSON, function calls, etc.
[:octicons-arrow-right-24: Structured Generation](advanced/structured_generation.md)
- __Serving an LLM for sharing it between several tasks__
---
Serve an LLM via TGI or vLLM to make requests and connect using a client like `InferenceEndpointsLLM` or `OpenAILLM` to avoid wasting resources.
[:octicons-arrow-right-24: Sharing an LLM across tasks](advanced/serving_an_llm_for_reuse.md)
- __Impose requirements to your pipelines and steps__
---
Add requirements to steps in a pipeline to ensure they are installed and avoid errors.
[:octicons-arrow-right-24: Pipeline requirements](advanced/pipeline_requirements.md)
</div>
---
hide: toc
---
# Benchmarking with `distilabel`
Benchmark LLMs with `distilabel`: reproducing the Arena Hard benchmark.
The script below first defines both the `ArenaHard` and the `ArenaHardResults` tasks: the former generates responses for a given collection of prompts/questions with up to two LLMs, and the latter calculates the results as per the original implementation. The second part of the example builds a `Pipeline` that runs the generation on top of the prompts with `InferenceEndpointsLLM` while streaming the rest of the generations from a pre-computed set of GPT-4 generations, and then uses `OpenAILLM` to evaluate one against the other, generating an alternate response, a comparison between the responses, and a result as A>>B, A>B, B>A, B>>A, or tie.
![Arena Hard](../../../assets/pipelines/arena-hard.png)
To run this example you will first need to install the Arena Hard optional dependencies: `pandas`, `scikit-learn`, and `numpy`.
??? Run
```python
python examples/arena_hard.py
```
```python title="arena_hard.py"
--8<-- "examples/arena_hard.py"
```
---
hide: toc
---
# Create exam questions using structured generation
This example will showcase how to generate exam questions and answers from a text page. In this case, we will use a Wikipedia page as an example, and show how to leverage the prompt to help the model generate the data in the appropriate format.
We are going to use `meta-llama/Meta-Llama-3.1-8B-Instruct` to generate questions and answers for a mock exam from a Wikipedia page. In this case, we are going to use the *Transfer Learning* entry. With the help of structured generation we will guide the model to create structured data that is easy to parse. The structure will be a question, an answer, and distractors (wrong answers).
??? "Click to see the sample results"
Example page [Transfer_learning](https://en.wikipedia.org/wiki/Transfer_learning):
![wiki page](../../../assets/tutorials-assets/wiki_transfer_learning.png)
QA of the page:
```json
{
"exam": [
{
"answer": "A technique in machine learning where knowledge learned from a task is re-used to boost performance on a related task.",
"distractors": ["A type of neural network architecture", "A machine learning algorithm for image classification", "A method for data preprocessing"],
"question": "What is transfer learning?"
},
{
"answer": "1976",
"distractors": ["1981", "1992", "1998"],
"question": "In which year did Bozinovski and Fulgosi publish a paper addressing transfer learning in neural network training?"
},
{
"answer": "Discriminability-based transfer (DBT) algorithm",
"distractors": ["Multi-task learning", "Learning to Learn", "Cost-sensitive machine learning"],
"question": "What algorithm was formulated by Lorien Pratt in 1992?"
},
{
"answer": "A domain consists of a feature space and a marginal probability distribution.",
"distractors": ["A domain consists of a label space and an objective predictive function.", "A domain consists of a task and a learning algorithm.", "A domain consists of a dataset and a model."],
"question": "What is the definition of a domain in the context of transfer learning?"
},
{
"answer": "Transfer learning aims to help improve the learning of the target predictive function in the target domain using the knowledge in the source domain and learning task.",
"distractors": ["Transfer learning aims to learn a new task from scratch.", "Transfer learning aims to improve the learning of the source predictive function in the source domain.", "Transfer learning aims to improve the learning of the target predictive function in the source domain."],
"question": "What is the goal of transfer learning?"
},
{
"answer": "Markov logic networks, Bayesian networks, cancer subtype discovery, building utilization, general game playing, text classification, digit recognition, medical imaging, and spam filtering.",
"distractors": ["Supervised learning, unsupervised learning, reinforcement learning, natural language processing, computer vision, and robotics.", "Image classification, object detection, segmentation, and tracking.", "Speech recognition, sentiment analysis, and topic modeling."],
"question": "What are some applications of transfer learning?"
},
{
"answer": "ADAPT (Python), TLib (Python), Domain-Adaptation-Toolbox (Matlab)",
"distractors": ["TensorFlow, PyTorch, Keras", "Scikit-learn, OpenCV, NumPy", "Matlab, R, Julia"],
"question": "What are some software implementations of transfer learning and domain adaptation algorithms?"
}
]
}
```
## Build the pipeline
Let's see how to build a pipeline to obtain this type of data:
````python
from typing import List
from pathlib import Path
from pydantic import BaseModel, Field
from distilabel.models import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
import wikipedia
page = wikipedia.page(title="Transfer_learning") # (1)
class ExamQuestion(BaseModel):
question: str = Field(..., description="The question to be answered")
answer: str = Field(..., description="The correct answer to the question")
distractors: List[str] = Field(
..., description="A list of incorrect but viable answers to the question"
)
class ExamQuestions(BaseModel): # (2)
exam: List[ExamQuestion]
SYSTEM_PROMPT = """\
You are an exam writer specialized in writing exams for students.
Your goal is to create questions and answers based on the document provided, and a list of distractors, that are incorrect but viable answers to the question.
Your answer must adhere to the following format:
```
[
{
"question": "Your question",
"answer": "The correct answer to the question",
"distractors": ["wrong answer 1", "wrong answer 2", "wrong answer 3"]
},
... (more questions and answers as required)
]
```
""".strip() # (3)
with Pipeline(name="ExamGenerator") as pipeline:
load_dataset = LoadDataFromDicts(
name="load_instructions",
data=[
{
"page": page.content, # (4)
}
],
)
text_generation = TextGeneration( # (5)
name="exam_generation",
system_prompt=SYSTEM_PROMPT,
template="Generate a list of answers and questions about the document. Document:\n\n{{ page }}",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
structured_output={
"schema": ExamQuestions.model_json_schema(),
"format": "json"
},
),
input_batch_size=8,
output_mappings={"model_name": "generation_model"},
)
load_dataset >> text_generation # (6)
````
1. Download a single page for the demo. We could download several pages first, or apply the same procedure to any other type of data. In a real-world use case, we would want to build a dataset from these documents first.
2. Define the structure required for the answer using Pydantic. In this case we want, for each page, a list of questions and answers (additionally, we've added distractors, which can be ignored in this case). So our output will be an `ExamQuestions` model, which is a list of `ExamQuestion`, where each one consists of the `question` and `answer` string fields. The language model will use the field descriptions to generate the values.
3. Use the system prompt to guide the model towards the behaviour we want from it. Independently of the structured output we are forcing the model to produce, it helps if we include the expected format in our prompt.
4. Move the page content from Wikipedia to a row in the dataset.
5. The [`TextGeneration`](https://distilabel.argilla.io/dev/components-gallery/tasks/textgeneration/) task receives the system prompt, and the user prompt by means of the `template` argument, where we help the model generate the questions and answers based on the page content, which will be obtained from the corresponding column of the loaded data.
6. Connect both steps, and we are done.
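A minimal sketch of how to execute it (the repo id is a placeholder; change the username to your own if you plan to push the dataset to the Hub):
```python
if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)
    distiset.push_to_hub("username/exam-questions")  # hypothetical repo id, change the username
```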
## Run the example
To run this example you will first need to install the `wikipedia` dependency to download the sample data: `pip install wikipedia`. *Change the username first in case you want to push the dataset to the Hub using your account*.
??? Run
````python
python examples/exam_questions.py
````
````python title="exam_questions.py"
--8<-- "examples/exam_questions.py"
````
---
hide: toc
---
# Create a social network with FinePersonas
In this example, we'll explore the creation of specialized user personas for social network interactions using the [FinePersonas-v0.1](https://huggingface.co/datasets/argilla/FinePersonas-v0.1) dataset from Hugging Face. The final dataset will be ready to fine-tune a chat model with specific traits and characteristics.
## Introduction
We'll delve into the process of fine-tuning different LoRA (Low-Rank Adaptation) models to imbue these personas with specific traits and characteristics.
This approach draws inspiration from Michael Sayman's work on [SocialAI](https://apps.apple.com/us/app/socialai-ai-social-network/id6670229993) (visit the [profile](https://x.com/michaelsayman) to see some examples), to leverage [FinePersonas-v0.1](https://huggingface.co/datasets/argilla/FinePersonas-v0.1) for building models that can emulate bots with specific behaviour.
By fine-tuning these adapters, we can potentially create AI personas with distinct characteristics, communication styles, and areas of expertise. The result? AI interactions that feel more natural and tailored to specific contexts or user needs. For those interested in the technical aspects of this approach, we recommend the insightful blog post on [Multi-LoRA serving](https://huggingface.co/blog/multi-lora-serving). It provides a clear and comprehensive explanation of the technology behind this innovative method.
Let's jump to the demo.
## Creating our SocialAI Task
Building on the new [`TextGeneration`](https://distilabel.argilla.io/dev/components-gallery/tasks/textgeneration/), creating custom tasks is easier than ever. We will create a `SocialAI` task that will be in charge of generating responses to user interactions, taking into account a given `follower_type`, and using the perspective of a given `persona`:
```python
from typing import Literal

from distilabel.steps.tasks import TextGeneration


class SocialAI(TextGeneration):
    follower_type: Literal["supporter", "troll", "alarmist"] = "supporter"
    system_prompt: str = (
        "You are an AI assistant expert at simulating user interactions. "
        "You must answer as if you were a '{follower_type}'. Be concise and answer with no more than 200 characters, nothing else. "
        "Here are some traits to use for your personality:\n\n"
        "{traits}"
    )  # (1)
    template: str = "You are the following persona:\n\n{{ persona }}\n\nWhat would you say to the following?\n\n {{ post }}"  # (2)
columns: str | list[str] = ["persona", "post"] # (3)
_follower_traits: dict[str, str] = {
"supporter": (
"- Encouraging and positive\n"
"- Tends to prioritize enjoyment and relaxation\n"
"- Focuses on the present moment and short-term pleasure\n"
"- Often uses humor and playful language\n"
"- Wants to help others feel good and have fun\n"
),
"troll": (
"- Provocative and confrontational\n"
"- Enjoys stirring up controversy and conflict\n"
"- Often uses sarcasm, irony, and mocking language\n"
"- Tends to belittle or dismiss others' opinions and feelings\n"
"- Seeks to get a rise out of others and create drama\n"
),
"alarmist": (
"- Anxious and warning-oriented\n"
"- Focuses on potential risks and negative consequences\n"
"- Often uses dramatic or sensational language\n"
"- Tends to be serious and stern in tone\n"
"- Seeks to alert others to potential dangers and protect them from harm (even if it's excessive or unwarranted)\n"
),
}
def load(self) -> None:
super().load()
self.system_prompt = self.system_prompt.format(
follower_type=self.follower_type,
traits=self._follower_traits[self.follower_type]
) # (4)
```
1. We have a custom system prompt that will depend on the `follower_type` we decide for our model.
2. The base template or prompt will answer the `post` we have, from the point of view of a `persona`.
3. We will need our dataset to have both `persona` and `post` columns to populate the prompt.
4. In the load method we place the specific traits for our follower type in the system prompt.
## Data preparation
This is an example, so let's keep it short. We will use 3 posts and 3 different types of personas. While there's potential to enhance this process (perhaps by implementing random persona selection or leveraging semantic similarity), we'll opt for a straightforward method in this demonstration.
Our goal is to create a set of nine examples, each pairing a post with a persona. To achieve this, we'll employ an LLM to respond to each post from the perspective of a specific `persona`, effectively simulating how different characters might engage with the content.
```python
from datasets import load_dataset

posts = [
{
"post": "Hmm, ok now I'm torn: should I go for healthy chicken tacos or unhealthy beef tacos for late night cravings?"
},
{
"post": "I need to develop a training course for my company on communication skills. Need to decide how deliver it remotely."
},
{
"post": "I'm always 10 minutes late to meetups but no one's complained. Could this be annoying to them?"
},
]
personas = (
load_dataset("argilla/FinePersonas-v0.1-clustering-100k", split="train")
.shuffle()
.select(range(3))
.select_columns("persona")
.to_list()
)
data = []
for post in posts:
for persona in personas:
data.append({"post": post["post"], "persona": persona["persona"]})
```
Each row will have the following format:
```python
import json
print(json.dumps(data[0], indent=4))
{
"post": "Hmm, ok now I'm torn: should I go for healthy chicken tacos or unhealthy beef tacos for late night cravings?",
"persona": "A high school or college environmental science teacher or an ecology student specializing in biogeography and ecosystem dynamics."
}
```
This will be our dataset, which we can ingest using the [`LoadDataFromDicts`](https://distilabel.argilla.io/dev/components-gallery/steps/loaddatafromdicts/):
```python
loader = LoadDataFromDicts(data=data)
```
## Simulating from different types of followers
With our data in hand, we're ready to explore the capabilities of our `SocialAI` task. For this demonstration, we'll make use of `meta-llama/Meta-Llama-3.1-70B-Instruct`. While this model has become something of a go-to choice recently, it's worth noting that experimenting with a variety of models could yield even more interesting results:
```python
from distilabel.models import InferenceEndpointsLLM
llm = InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
generation_kwargs={
"temperature": 0.7,
"max_new_tokens": 256,
},
)
follower_type = "supporter"
follower = SocialAI(
llm=llm,
follower_type=follower_type,
name=f"{follower_type}_user",
)
```
This setup simplifies the process: we only need to input the follower type, and the system handles the rest. We could also update this to use a random follower type by default, and simulate a bunch of different personalities.
## Building our Pipeline
The foundation of our pipeline is now in place. At its core is a single, powerful LLM. This versatile model will be repurposed to drive three distinct `SocialAI` tasks, each tailored to a specific `follower_type`, and each one of them will be prepared for Supervised Fine-Tuning using [`FormatTextGenerationSFT`](https://distilabel.argilla.io/dev/components-gallery/steps/formattextgenerationsft/):
```python
from distilabel.models import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import FormatTextGenerationSFT, LoadDataFromDicts

with Pipeline(name="Social AI Personas") as pipeline:
loader = LoadDataFromDicts(data=data, batch_size=1)
llm = InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
generation_kwargs={
"temperature": 0.7,
"max_new_tokens": 256,
},
)
for follower_type in ["supporter", "troll", "alarmist"]:
follower = SocialAI(
llm=llm,
follower_type=follower_type,
name=f"{follower_type}_user", # (1)
output_mappings={
"generation": f"interaction_{follower_type}" # (2)
}
)
format_sft = FormatTextGenerationSFT(
name=f"format_sft_{follower_type}",
input_mappings={
"instruction": "post",
"generation": f"interaction_{follower_type}" # (3)
},
)
loader >> follower >> format_sft # (4)
```
1. We update the name of the step to keep track of it in the pipeline.
2. The `generation` column from each LLM will be mapped to avoid being overridden, as we are reusing the same task.
3. As we have modified the output column from `SocialAI`, we redirect each of the `follower_type` responses.
4. Connect the loader to each one of the follower tasks and `format_sft` to obtain 3 different subsets.
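With the pipeline defined, running it and pushing the resulting subsets to the Hub takes just a couple of lines (a sketch; each `format_sft_{follower_type}` leaf step produces its own subset in the resulting distiset):
```python
if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)
    distiset.push_to_hub("plaguss/FinePersonas-SocialAI-test")  # change to your own repo id
```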
The outcome of this pipeline will be three specialized models, each fine-tuned to a unique `follower type` crafted by the `SocialAI` task. These models will generate SFT-formatted datasets, where each post is paired with its corresponding interaction data for a specific follower type. This setup enables seamless fine-tuning using your preferred framework, such as [TRL](https://huggingface.co/docs/trl/index), or any other training framework of your choice.
## Script and final dataset
All the pieces are in place for our script, the full pipeline can be seen here:
??? Run
```python
python examples/finepersonas_social_ai.py
```
```python title="finepersonas_social_ai.py"
--8<-- "examples/finepersonas_social_ai.py"
```
This is the final toy dataset we obtain: [FinePersonas-SocialAI-test](https://huggingface.co/datasets/plaguss/FinePersonas-SocialAI-test)
You can see examples of how to load each subset of them to fine-tune a model:
```python
from datasets import load_dataset
ds = load_dataset("plaguss/FinePersonas-SocialAI-test", "format_sft_troll")
```
And a sample of the generated field with the corresponding `post` and `persona`:
```json
{
"post": "Hmm, ok now I\u0027m torn: should I go for healthy chicken tacos or unhealthy beef tacos for late night cravings?",
"persona": "A high school or undergraduate physics or chemistry teacher, likely with a focus on experimental instruction.",
"interaction_troll": "\"Late night cravings? More like late night brain drain. Either way, it\u0027s just a collision of molecules in your stomach. Choose the one with more calories, at least that\u0027s some decent kinetic energy.\"",
}
```
There's a lot of room for improvement, but quite a promising start.
---
hide: toc
---
# Image generation with `distilabel`
Create synthetic images using `distilabel`.
This example shows how distilabel can be used to generate image data, either using [`InferenceEndpointsImageGeneration`](https://distilabel.argilla.io/dev/components-gallery/image_generation/inferenceendpointsimagegeneration/) or [`OpenAIImageGeneration`](https://distilabel.argilla.io/dev/components-gallery/image_generation/openaiimagegeneration/), thanks to the [`ImageGeneration`](https://distilabel.argilla.io/dev/components-gallery/task/imagegeneration/) task.
=== "Inference Endpoints - black-forest-labs/FLUX.1-schnell"
```python
from distilabel.pipeline import Pipeline
from distilabel.steps import KeepColumns
from distilabel.models.image_generation import InferenceEndpointsImageGeneration
from distilabel.steps.tasks import ImageGeneration
from datasets import load_dataset
ds = load_dataset("dvilasuero/finepersonas-v0.1-tiny", split="train").select(range(3))
with Pipeline(name="image_generation_pipeline") as pipeline:
ilm = InferenceEndpointsImageGeneration(
model_id="black-forest-labs/FLUX.1-schnell"
)
img_generation = ImageGeneration(
name="flux_schnell",
llm=ilm,
input_mappings={"prompt": "persona"}
)
keep_columns = KeepColumns(columns=["persona", "model_name", "image"])
img_generation >> keep_columns
```
Sample image for the prompt:
> A local art historian and museum professional interested in 19th-century American art and the local cultural heritage of Cincinnati.
![image_ie](https://huggingface.co/datasets/plaguss/test-finepersonas-v0.1-tiny-flux-schnell/resolve/main/artifacts/flux_schnell/images/3333f9870feda32a449994017eb72675.jpeg)
=== "OpenAI - dall-e-3"
```python
from distilabel.pipeline import Pipeline
from distilabel.steps import KeepColumns
from distilabel.models.image_generation import OpenAIImageGeneration
from distilabel.steps.tasks import ImageGeneration
from datasets import load_dataset
ds = load_dataset("dvilasuero/finepersonas-v0.1-tiny", split="train").select(range(3))
with Pipeline(name="image_generation_pipeline") as pipeline:
ilm = OpenAIImageGeneration(
model="dall-e-3",
generation_kwargs={
"size": "1024x1024",
"quality": "standard",
"style": "natural"
}
)
img_generation = ImageGeneration(
        name="dalle-3",
llm=ilm,
input_mappings={"prompt": "persona"}
)
keep_columns = KeepColumns(columns=["persona", "model_name", "image"])
img_generation >> keep_columns
```
Sample image for the prompt:
> A local art historian and museum professional interested in 19th-century American art and the local cultural heritage of Cincinnati.
![image_oai](https://huggingface.co/datasets/plaguss/test-finepersonas-v0.1-tiny-dall-e-3/resolve/main/artifacts/dalle-3/images/3333f9870feda32a449994017eb72675.jpeg)
!!! success "Save the Distiset as an Image Dataset"
Note the call to `Distiset.transform_columns_to_image`, to have the images uploaded directly as an [`Image dataset`](https://huggingface.co/docs/hub/en/datasets-image):
```python
if __name__ == "__main__":
distiset = pipeline.run(use_cache=False, dataset=ds)
# Save the images as `PIL.Image.Image`
distiset = distiset.transform_columns_to_image("image")
distiset.push_to_hub("plaguss/test-finepersonas-v0.1-tiny-flux-schnell")
```
The full pipeline can be run with the following example. Keep in mind you need to install `pillow` first: `pip install distilabel[vision]`.
??? Run
```python
python examples/image_generation.py
```
```python title="image_generation.py"
--8<-- "examples/image_generation.py"
```
---
hide: toc
---
# Structured generation with `outlines`
Generate RPG characters following a `pydantic.BaseModel` with `outlines` in `distilabel`.
This script makes use of [`LlamaCppLLM`][distilabel.models.llms.llamacpp.LlamaCppLLM] and the structured output capabilities thanks to [`outlines`](https://outlines-dev.github.io/outlines/welcome/) to generate RPG characters that adhere to a JSON schema.
![Structured generation with outlines](../../../assets/pipelines/knowledge_graphs.png)
It makes use of a local model which can be downloaded using curl (explained in the script itself), and can be exchanged with other `LLMs` like [`vLLM`][distilabel.models.llms.vllm.vLLM].
??? Run
```python
python examples/structured_generation_with_outlines.py
```
```python title="structured_generation_with_outlines.py"
--8<-- "examples/structured_generation_with_outlines.py"
```
---
hide: toc
---
# Structured generation with `instructor`
Answer instructions with knowledge graphs defined as `pydantic.BaseModel` objects using `instructor` in `distilabel`.
This script makes use of [`MistralLLM`][distilabel.models.llms.mistral.MistralLLM] and the structured output capabilities thanks to [`instructor`](https://python.useinstructor.com/) to generate knowledge graphs from complex topics.
![Knowledge graph figure](../../../assets/pipelines/knowledge_graphs.png)
This example is translated from this [awesome example](https://python.useinstructor.com/examples/knowledge_graph/) from `instructor` cookbook.
??? Run
```python
python examples/structured_generation_with_instructor.py
```
```python title="structured_generation_with_instructor.py"
--8<-- "examples/structured_generation_with_instructor.py"
```
??? "Visualizing the graphs"
Want to see how to visualize the graphs? You can test it using the following script. Generate some samples on your own and take a look:
!!! NOTE
This example uses graphviz to render the graph, you can install with `pip` in the following way:
```console
pip install graphviz
```
```python
python examples/draw_kg.py 2 # You can pass 0,1,2 to visualize each of the samples.
```
![Knowledge graph figure](../../../assets/images/sections/examples/knowledge-graph-example.png)
---
hide: toc
---
# Text generation with images in `distilabel`
Answer questions about images using `distilabel`.
Image-text-to-text models take an image and a text prompt as input and output text. In this example we will use [`InferenceEndpointsLLM`](https://distilabel.argilla.io/dev/components-gallery/llms/inferenceendpointsllm/) with [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct), and [`OpenAILLM`](https://distilabel.argilla.io/dev/components-gallery/llms/openaillm/) with `gpt-4o-mini`, to ask a question about an image. We will ask a simple question to showcase how the [`TextGenerationWithImage`](https://distilabel.argilla.io/dev/components-gallery/tasks/textgenerationwithimage/) task can be used in a pipeline.
=== "Inference Endpoints - meta-llama/Llama-3.2-11B-Vision-Instruct"
```python
from distilabel.models.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks.text_generation_with_image import TextGenerationWithImage
from distilabel.steps import LoadDataFromDicts
with Pipeline(name="vision_generation_pipeline") as pipeline:
loader = LoadDataFromDicts(
data=[
{
"instruction": "What’s in this image?",
"image": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
],
)
llm = InferenceEndpointsLLM(
model_id="meta-llama/Llama-3.2-11B-Vision-Instruct",
)
vision = TextGenerationWithImage(
name="vision_gen",
llm=llm,
image_type="url" # (1)
)
loader >> vision
```
1. The *image_type* can be a URL pointing to the image, the base64 string representation, or a PIL image; take a look at [`TextGenerationWithImage`](https://distilabel.argilla.io/dev/components-gallery/tasks/textgenerationwithimage/) for more information.
Image:
![Image](https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg)
Question:
> What’s in this image?
Response:
> This image depicts a wooden boardwalk weaving its way through a lush meadow, flanked by vibrant green grass that stretches towards the horizon under a calm and inviting sky. The boardwalk runs straight ahead, away from the viewer, forming a clear pathway through the tall, lush green grass, crops or other plant types or an assortment of small trees and shrubs. This meadow is dotted with trees and shrubs, appearing to be healthy and green. The sky above is a beautiful blue with white clouds scattered throughout, adding a sense of tranquility to the scene. While this image appears to be of a natural landscape, because grass is...
=== "OpenAI - gpt-4o-mini"
```python
from distilabel.models.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks.text_generation_with_image import TextGenerationWithImage
from distilabel.steps import LoadDataFromDicts
with Pipeline(name="vision_generation_pipeline") as pipeline:
loader = LoadDataFromDicts(
data=[
{
"instruction": "What’s in this image?",
"image": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
],
)
llm = OpenAILLM(
model="gpt-4o-mini",
)
vision = TextGenerationWithImage(
name="vision_gen",
llm=llm,
image_type="url" # (1)
)
loader >> vision
```
1. The *image_type* can be a URL pointing to the image, the base64 string representation, or a PIL image; take a look at [`TextGenerationWithImage`](https://distilabel.argilla.io/dev/components-gallery/tasks/textgenerationwithimage/) for more information.
Image:
![Image](https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg)
Question:
> What’s in this image?
Response:
> The image depicts a scenic landscape featuring a wooden walkway or path that runs through a lush green marsh or field. The area is surrounded by tall grass and various shrubs, with trees likely visible in the background. The sky is blue with some wispy clouds, suggesting a beautiful day. Overall, it presents a peaceful natural setting, ideal for a stroll or nature observation.
The full pipeline can be run with the following example:
??? Note "Run the full pipeline"
```python
python examples/text_generation_with_image.py
```
```python title="text_generation_with_image.py"
--8<-- "examples/text_generation_with_image.py"
```
A sample dataset can be seen at [plaguss/test-vision-generation-Llama-3.2-11B-Vision-Instruct](https://huggingface.co/datasets/plaguss/test-vision-generation-Llama-3.2-11B-Vision-Instruct).
---
hide: toc
---
# Tutorials
- **End-to-end tutorials** provide detailed step-by-step explanations and the code used for end-to-end workflows.
- **Paper implementations** provide reproductions of fundamental papers in the synthetic data domain.
- **Examples** don't provide explanations but simply show code for different tasks.
## End-to-end tutorials
<div class="grid cards" markdown>
- __Generate a preference dataset__
---
Learn about synthetic data generation for ORPO and DPO.
[:octicons-arrow-right-24: Tutorial](tutorials/generate_preference_dataset.ipynb)
- __Clean an existing preference dataset__
---
Learn about how to provide AI feedback to clean an existing dataset.
[:octicons-arrow-right-24: Tutorial](tutorials/clean_existing_dataset.ipynb)
- __Retrieval and reranking models__
---
Learn about synthetic data generation for fine-tuning custom retrieval and reranking models.
[:octicons-arrow-right-24: Tutorial](tutorials/GenerateSentencePair.ipynb)
- __Generate text classification data__
---
Learn about how synthetic data generation for text classification can help address data imbalance or scarcity.
[:octicons-arrow-right-24: Tutorial](tutorials/generate_textcat_dataset.ipynb)
</div>
## Paper Implementations
<div class="grid cards" markdown>
- __Deepseek Prover__
---
Learn about an approach to generate mathematical proofs for theorems generated from informal math problems.
[:octicons-arrow-right-24: Example](papers/deepseek_prover.md)
- __DEITA__
---
Learn about prompt, response tuning for complexity and quality and LLMs as judges for automatic data selection.
[:octicons-arrow-right-24: Paper](papers/deita.md)
- __Instruction Backtranslation__
---
Learn about automatically labeling human-written text with corresponding instructions.
[:octicons-arrow-right-24: Paper](papers/instruction_backtranslation.md)
- __Prometheus 2__
---
Learn about using open-source models as judges for direct assessment and pair-wise ranking.
[:octicons-arrow-right-24: Paper](papers/prometheus.md)
- __UltraFeedback__
---
Learn about a large-scale, fine-grained, diverse preference dataset, used for training powerful reward and critic models.
[:octicons-arrow-right-24: Paper](papers/ultrafeedback.md)
- __APIGen__
---
Learn how to create verifiable high-quality datasets for function-calling applications.
[:octicons-arrow-right-24: Paper](papers/apigen.md)
- __CLAIR__
---
Learn Contrastive Learning from AI Revisions (CLAIR), a data-creation method which leads to more contrastive preference pairs.
[:octicons-arrow-right-24: Paper](papers/clair.md)
- __Math Shepherd__
---
Learn about Math-Shepherd, a framework to generate datasets to train process reward models (PRMs) which assign reward scores to each step of math problem solutions.
[:octicons-arrow-right-24: Paper](papers/math_shepherd.md)
</div>
## Examples
<div class="grid cards" markdown>
- __Benchmarking with distilabel__
---
Learn about reproducing the Arena Hard benchmark with distilabel.
[:octicons-arrow-right-24: Example](examples/benchmarking_with_distilabel.md)
- __Structured generation with outlines__
---
Learn about generating RPG characters following a pydantic.BaseModel with outlines in distilabel.
[:octicons-arrow-right-24: Example](examples/llama_cpp_with_outlines.md)
- __Structured generation with instructor__
---
Learn about answering instructions with knowledge graphs defined as pydantic.BaseModel objects using instructor in distilabel.
[:octicons-arrow-right-24: Example](examples/mistralai_with_instructor.md)
- __Create a social network with FinePersonas__
---
Learn how to leverage FinePersonas to create a synthetic social network and fine-tune adapters for Multi-LoRA.
[:octicons-arrow-right-24: Example](examples/fine_personas_social_network.md)
- __Create questions and answers for an exam__
---
Learn how to generate questions and answers for an exam, using a raw Wikipedia page and structured generation.
[:octicons-arrow-right-24: Example](examples/exam_questions.md)
- __Image generation with distilabel__
---
Generate synthetic images using distilabel.
[:octicons-arrow-right-24: Example](examples/image_generation.md)
- __Text generation with images in distilabel__
---
Ask questions about images using distilabel.
[:octicons-arrow-right-24: Example](examples/text_generation_with_image.md)
</div>
---
hide: toc
---
# Create Function-Calling datasets with APIGen
This example will introduce [APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets](https://arxiv.org/abs/2406.18518), a data generation pipeline designed to synthesize verifiable high-quality datasets for function-calling applications.
## Replication
The following figure showcases the APIGen framework:
![APIGen framework](../../../assets/tutorials-assets/overview-apigen.jpg)
Now, let's walk through the key steps illustrated in the figure:
- [`DataSampler`](https://distilabel.argilla.io/dev/components-gallery/step/datasampler/): With the help of this step and the original [Salesforce/xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) dataset, we build the *Seed QA Data Sampler* for the prompt template.
- [`APIGenGenerator`](https://distilabel.argilla.io/dev/components-gallery/task/apigengenerator/): This step does the job of the *Query-Answer Generator*, including the format checker from *Stage 1: Format Checker* thanks to the structured output generation.
- [`APIGenExecutionChecker`](https://distilabel.argilla.io/dev/components-gallery/task/apigenexecutionchecker/): This step is in charge of the *Stage 2: Execution Checker*.
- [`APIGenSemanticChecker`](https://distilabel.argilla.io/dev/components-gallery/task/apigensemanticchecker/): This step runs *Stage 3: Semantic Checker*. It can use the same or a different LLM; here we use the same one as in the [`APIGenGenerator`](https://distilabel.argilla.io/dev/components-gallery/task/apigengenerator/) step.
The current implementation hasn't utilized the *Diverse Prompt Library*. To incorporate it, one could either adjust the prompt template within the [`APIGenGenerator`](https://distilabel.argilla.io/dev/components-gallery/task/apigengenerator/) or develop a new sampler specifically for this purpose. As for the *API Sampler*, while no specific data is shared here, we've created illustrative examples to demonstrate the pipeline's functionality. These examples represent a mix of data that could be used to replicate the sampler's output.
## Data preparation
The original paper describes the data used and gives some hints, but nothing was shared. In this example, we will write a handful of examples by hand to showcase how this pipeline can be built.
Assume we have the following function names, and corresponding descriptions of their behaviour:
```python
data = [
{
"func_name": "final_velocity",
"func_desc": "Calculates the final velocity of an object given its initial velocity, acceleration, and time.",
},
{
"func_name": "permutation_count",
"func_desc": "Calculates the number of permutations of k elements from a set of n elements.",
},
{
"func_name": "getdivision",
"func_desc": "Divides two numbers by making an API call to a division service.",
},
{
"func_name": "binary_addition",
"func_desc": "Adds two binary numbers and returns the result as a binary string.",
},
{
"func_name": "swapi_planet_resource",
"func_desc": "get a specific planets resource",
},
{
"func_name": "disney_character",
"func_desc": "Find a specific character using this endpoint",
}
]
```
The original paper refers to both Python functions and APIs, but we will make use of Python functions exclusively for simplicity. In order to execute and check these functions/APIs, we need access to the code, which we have moved to a Python file: [lib_apigen.py](https://github.com/argilla-io/distilabel/blob/main/examples/lib_apigen.py). All these functions are executable, but we also need access to their *tool* representation. For this, we will make use of transformers' *get_json_schema* function[^1].
[^1]: Read this nice blog post for more information on tools and the reasoning behind `get_json_schema`: [Tool Use, Unified](https://huggingface.co/blog/unified-tool-use).
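To make this concrete, here is a minimal sketch of how `get_json_schema` turns a documented Python function into its *tool* representation (the function body below is illustrative; the actual implementations live in `lib_apigen.py`):

```python
from transformers.utils import get_json_schema


def final_velocity(initial_velocity: float, acceleration: float, time: float) -> float:
    """Calculates the final velocity of an object given its initial velocity, acceleration, and time.

    Args:
        initial_velocity: The initial velocity of the object.
        acceleration: The acceleration of the object.
        time: The time elapsed.
    """
    return initial_velocity + acceleration * time


# Returns a dict like:
# {"type": "function", "function": {"name": "final_velocity", "description": ..., "parameters": {...}}}
print(get_json_schema(final_velocity))
```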
We have all the machinery prepared in our libpath, except for the *tool* definitions. With the help of our helper function `load_module_from_path` we will load this Python module, collect all the tools, and add them to each row in our `data` variable.
```python
from pathlib import Path

from distilabel.steps.tasks.apigen.utils import load_module_from_path

# Path to the file with the executable functions; adjust to wherever you saved it
libpath = Path("examples/lib_apigen.py")

libpath_module = load_module_from_path(libpath)
tools = getattr(libpath_module, "get_tools")()  # call get_tools()
for row in data:
    # Ideally the tools would be a mix where both the correct and irrelevant
    # tools are present; for simplicity we only add the matching tool here.
    row.update({"tools": [tools[row["func_name"]]]})
```
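For reference, a plausible sketch of the `get_tools` helper inside `lib_apigen.py` (the actual file in the repository is the source of truth):

```python
from transformers.utils import get_json_schema


def get_tools() -> dict[str, dict]:
    """Map each function name to its JSON-schema *tool* representation.

    Assumes the functions themselves are defined earlier in this same module.
    """
    functions = [
        final_velocity,
        permutation_count,
        getdivision,
        binary_addition,
        swapi_planet_resource,
        disney_character,
    ]
    return {func.__name__: get_json_schema(func) for func in functions}
```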
Now we have all the necessary data for our prompt. Additionally, we will make use of the original dataset as a source of few-shot examples to guide the model:
```python
from datasets import load_dataset

ds_og = (
load_dataset("Salesforce/xlam-function-calling-60k", split="train")
.shuffle(seed=42)
.select(range(500))
.to_list()
)
```
We have just loaded a subset and transformed it into a list of dictionaries, as we will use it in the [`DataSampler`](https://distilabel.argilla.io/dev/components-gallery/step/datasampler/) `GeneratorStep`, grabbing random examples from the original dataset.
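Each sampled row keeps the `query` and `answers` columns from the original dataset, and `PrepareExamples` renders them into the single `examples` string that ends up in the prompt (you can see the resulting format in the example row at the end of this page). A simplified, hypothetical sketch of that formatting:

```python
# Hypothetical simplification of the formatting done by `PrepareExamples`.
def format_examples(rows: list[dict]) -> str:
    return "\n\n".join(
        f"## Query:\n{row['query']}\n## Answers:\n{row['answers']}" for row in rows
    )
```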
## Building the Pipeline
Now that we've walked through each component, it's time to see how it all comes together. Here's the Pipeline code:
```python
from distilabel.models import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import CombineOutputs, DataSampler, LoadDataFromDicts
from distilabel.steps.tasks import (
    APIGenExecutionChecker,
    APIGenGenerator,
    APIGenSemanticChecker,
)
from distilabel.steps.tasks.apigen.utils import PrepareExamples

with Pipeline(name="apigen-example") as pipeline:
loader_seeds = LoadDataFromDicts(data=data) # (1)
sampler = DataSampler( # (2)
data=ds_og,
size=2,
samples=len(data),
batch_size=8,
)
prep_examples = PrepareExamples() # This step will add the 'examples' column
combine_steps = CombineOutputs() # (3)
model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"
    llm = InferenceEndpointsLLM(  # (4)
model_id=model_id,
tokenizer_id=model_id,
generation_kwargs={
"temperature": 0.7,
"max_new_tokens": 2048,
},
)
apigen = APIGenGenerator( # (5)
llm=llm,
use_default_structured_output=True,
)
execution_checker = APIGenExecutionChecker(libpath=str(libpath)) # (6)
semantic_checker = APIGenSemanticChecker(llm=llm) # (7)
sampler >> prep_examples
(
[loader_seeds, prep_examples]
>> combine_steps
>> apigen
>> execution_checker
>> semantic_checker
)
```
1. Load the data seeds we are going to use to generate our function-calling dataset.
2. The `DataSampler` together with `PrepareExamples` will be used to help us create the few-shot examples from the original dataset to be fed into our prompt.
3. Combine both columns to obtain a single stream of data.
4. We reuse the same LLM for the generation and the semantic checks.
5. Creates the `query` and `answers` that will be used together with the `tools` to fine-tune a new model. It generates structured outputs to ensure we have valid JSON-formatted answers.
6. Adds the columns `keep_row_after_execution_check` and `execution_result`.
7. Adds the columns `keep_row_after_semantic_check` and `thought`.
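With everything connected, the pipeline can be run and the two boolean columns used to keep only the verified rows. A minimal sketch, assuming the default `default` configuration and `train` split names in the resulting `Distiset`:

```python
distiset = pipeline.run(use_cache=False)

# Keep only the rows that passed both the execution and the semantic checks.
ds = distiset["default"]["train"]
ds_verified = ds.filter(
    lambda row: row["keep_row_after_execution_check"]
    and row["keep_row_after_semantic_check"]
)
```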
## Script and final dataset
To see all the pieces in place, take a look at the full pipeline script, as well as an example row generated by it.
??? Run

    ```bash
    python examples/pipeline_apigen.py
    ```
```python title="pipeline_apigen.py"
--8<-- "examples/pipeline_apigen.py"
```
Example row:
```json
{
"func_name": "final_velocity",
"func_desc": "Calculates the final velocity of an object given its initial velocity, acceleration, and time.",
"tools": [
{
"function": {
"description": "Calculates the final velocity of an object given its initial velocity, acceleration, and time.",
"name": "final_velocity",
"parameters": {
"properties": {
"acceleration": {
"description": "The acceleration of the object.",
"type": "number"
},
"initial_velocity": {
"description": "The initial velocity of the object.",
"type": "number"
},
"time": {
"description": "The time elapsed.",
"type": "number"
}
},
"required": [
"initial_velocity",
"acceleration",
"time"
],
"type": "object"
}
},
"type": "function"
}
],
"examples": "## Query:\nRetrieve the first 15 comments for post ID '12345' from the Tokapi mobile API.\n## Answers:\n[{\"name\": \"v1_post_post_id_comments\", \"arguments\": {\"post_id\": \"12345\", \"count\": 15}}]\n\n## Query:\nRetrieve the detailed recipe for the cake with ID 'cake101'.\n## Answers:\n[{\"name\": \"detailed_cake_recipe_by_id\", \"arguments\": {\"is_id\": \"cake101\"}}]\n\n## Query:\nWhat are the frequently asked questions and their answers for Coca-Cola Company? Also, what are the suggested tickers based on Coca-Cola Company?\n## Answers:\n[{\"name\": \"symbols_faq\", \"arguments\": {\"ticker_slug\": \"KO\"}}, {\"name\": \"symbols_suggested\", \"arguments\": {\"ticker_slug\": \"KO\"}}]",
"query": "What would be the final velocity of an object that starts at rest and accelerates at 9.8 m/s^2 for 10 seconds.",
"answers": "[{\"arguments\": {\"acceleration\": \"9.8\", \"initial_velocity\": \"0\", \"time\": \"10\"}, \"name\": \"final_velocity\"}]",
"distilabel_metadata": {
"raw_input_a_p_i_gen_generator_0": [
{
"content": "You are a data labeler. Your responsibility is to generate a set of diverse queries and corresponding answers for the given functions in JSON format.\n\nConstruct queries and answers that exemplify how to use these functions in a practical scenario. Include in each query specific, plausible values for each parameter. For instance, if the function requires a date, use a typical and reasonable date.\n\nEnsure the query:\n- Is clear and concise\n- Demonstrates typical use cases\n- Includes all necessary parameters in a meaningful way. For numerical parameters, it could be either numbers or words\n- Across a variety level of difficulties, ranging from beginner and advanced use cases\n- The corresponding result's parameter types and ranges match with the function's descriptions\n\nEnsure the answer:\n- Is a list of function calls in JSON format\n- The length of the answer list should be equal to the number of requests in the query\n- Can solve all the requests in the query effectively",
"role": "system"
},
{
"content": "Here are examples of queries and the corresponding answers for similar functions:\n## Query:\nRetrieve the first 15 comments for post ID '12345' from the Tokapi mobile API.\n## Answers:\n[{\"name\": \"v1_post_post_id_comments\", \"arguments\": {\"post_id\": \"12345\", \"count\": 15}}]\n\n## Query:\nRetrieve the detailed recipe for the cake with ID 'cake101'.\n## Answers:\n[{\"name\": \"detailed_cake_recipe_by_id\", \"arguments\": {\"is_id\": \"cake101\"}}]\n\n## Query:\nWhat are the frequently asked questions and their answers for Coca-Cola Company? Also, what are the suggested tickers based on Coca-Cola Company?\n## Answers:\n[{\"name\": \"symbols_faq\", \"arguments\": {\"ticker_slug\": \"KO\"}}, {\"name\": \"symbols_suggested\", \"arguments\": {\"ticker_slug\": \"KO\"}}]\n\nNote that the query could be interpreted as a combination of several independent requests.\n\nBased on these examples, generate 1 diverse query and answer pairs for the function `final_velocity`.\nThe detailed function description is the following:\nCalculates the final velocity of an object given its initial velocity, acceleration, and time.\n\nThese are the available tools to help you:\n[{'type': 'function', 'function': {'name': 'final_velocity', 'description': 'Calculates the final velocity of an object given its initial velocity, acceleration, and time.', 'parameters': {'type': 'object', 'properties': {'initial_velocity': {'type': 'number', 'description': 'The initial velocity of the object.'}, 'acceleration': {'type': 'number', 'description': 'The acceleration of the object.'}, 'time': {'type': 'number', 'description': 'The time elapsed.'}}, 'required': ['initial_velocity', 'acceleration', 'time']}}}]\n\nThe output MUST strictly adhere to the following JSON format, and NO other text MUST be included:\n```json\n[\n {\n \"query\": \"The generated query.\",\n \"answers\": [\n {\n \"name\": \"api_name\",\n \"arguments\": {\n \"arg_name\": \"value\"\n ... (more arguments as required)\n }\n },\n ... (more API calls as required)\n ]\n }\n]\n```\n\nNow please generate 1 diverse query and answer pairs following the above format.",
"role": "user"
}
],
"raw_input_a_p_i_gen_semantic_checker_0": [
{
"content": "As a data quality evaluator, you must assess the alignment between a user query, corresponding function calls, and their execution results.\nThese function calls and results are generated by other models, and your task is to ensure these results accurately reflect the user\u2019s intentions.\n\nDo not pass if:\n1. The function call does not align with the query\u2019s objective, or the input arguments appear incorrect.\n2. The function call and arguments are not properly chosen from the available functions.\n3. The number of function calls does not correspond to the user\u2019s intentions.\n4. The execution results are irrelevant and do not match the function\u2019s purpose.\n5. The execution results contain errors or reflect that the function calls were not executed successfully.",
"role": "system"
},
{
"content": "Given Information:\n- All Available Functions:\nCalculates the final velocity of an object given its initial velocity, acceleration, and time.\n- User Query: What would be the final velocity of an object that starts at rest and accelerates at 9.8 m/s^2 for 10 seconds.\n- Generated Function Calls: [{\"arguments\": {\"acceleration\": \"9.8\", \"initial_velocity\": \"0\", \"time\": \"10\"}, \"name\": \"final_velocity\"}]\n- Execution Results: ['9.8']\n\nNote: The query may have multiple intentions. Functions may be placeholders, and execution results may be truncated due to length, which is acceptable and should not cause a failure.\n\nThe main decision factor is wheather the function calls accurately reflect the query's intentions and the function descriptions.\nProvide your reasoning in the thought section and decide if the data passes (answer yes or no).\nIf not passing, concisely explain your reasons in the thought section; otherwise, leave this section blank.\n\nYour response MUST strictly adhere to the following JSON format, and NO other text MUST be included.\n```\n{\n \"thought\": \"Concisely describe your reasoning here\",\n \"passes\": \"yes\" or \"no\"\n}\n```\n",
"role": "user"
}
],
"raw_output_a_p_i_gen_generator_0": "{\"pairs\": [\n {\n \"answers\": [\n {\n \"arguments\": {\n \"acceleration\": \"9.8\",\n \"initial_velocity\": \"0\",\n \"time\": \"10\"\n },\n \"name\": \"final_velocity\"\n }\n ],\n \"query\": \"What would be the final velocity of an object that starts at rest and accelerates at 9.8 m/s^2 for 10 seconds.\"\n }\n]}",
"raw_output_a_p_i_gen_semantic_checker_0": "{\n \"thought\": \"\",\n \"passes\": \"yes\"\n}"
},
"model_name": "meta-llama/Meta-Llama-3.1-70B-Instruct",
"keep_row_after_execution_check": true,
"execution_result": [
"9.8"
],
"thought": "",
"keep_row_after_semantic_check": true
}
```