# Serving an `LLM` for sharing it between several `Task`s
It's very common to want to use the same `LLM` for several `Task`s in a pipeline. To avoid loading the `LLM` as many times as there are `Task`s and thus wasting resources, it's recommended to serve the model using solutions like [`text-generation-inference`](https://huggingface.co/docs/text-generation-inference/quicktour#launching-tgi) or [`vLLM`](https://docs.vllm.ai/en/stable/serving/deploying_with_docker.html), and then use an `AsyncLLM`-compatible client like `InferenceEndpointsLLM` or `OpenAILLM` to communicate with the server.
## Serving LLMs using `text-generation-inference`
```bash
model=meta-llama/Meta-Llama-3-8B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id $model
```
The bash command above has been copy-pasted from the official docs [text-generation-inference](https://huggingface.co/docs/text-generation-inference/quicktour#launching-tgi). Please refer to the official docs for more information.
And then we can use `InferenceEndpointsLLM` with `base_url=http://localhost:8080` (pointing to our `TGI` local deployment):
## Serving LLMs using `vLLM`

The Docker command for deploying `vLLM` can be found in the official docs: [vLLM](https://docs.vllm.ai/en/stable/serving/deploying_with_docker.html). Please refer to the official docs for more information.
And then we can use `OpenAILLM` with `base_url=http://localhost:8000` (pointing to our `vLLM` local deployment):
# Structured data generation

`Distilabel` has integrations with relevant libraries to generate structured text, i.e. to guide the [`LLM`][distilabel.models.llms.LLM] towards generating structured outputs that follow a JSON schema, a regex, etc.
## Outlines
`Distilabel` integrates [`outlines`](https://outlines-dev.github.io/outlines/welcome/) within some [`LLM`][distilabel.models.llms.LLM] subclasses. At the moment, the following LLMs integrated with `outlines` are supported in `distilabel`: [`TransformersLLM`][distilabel.models.llms.TransformersLLM], [`vLLM`][distilabel.models.llms.vLLM] and [`LlamaCppLLM`][distilabel.models.llms.LlamaCppLLM], so that anyone can generate structured outputs in the form of *JSON* or a parseable *regex*.
The [`LLM`][distilabel.models.llms.LLM] has an argument named `structured_output`[^1] that determines how we can generate structured outputs with it, let's see an example using [`LlamaCppLLM`][distilabel.models.llms.LlamaCppLLM].
!!! Note

    For the `outlines` integration to work you may need to install the corresponding dependencies:

    ```bash
    pip install "distilabel[outlines]"
    ```
### JSON
We will start with a JSON example, where we initially define a `pydantic.BaseModel` schema to guide the generation of the structured output.
!!! NOTE

    Take a look at [`StructuredOutputType`][distilabel.typing.models.StructuredOutputType] to see the expected format
    of the `structured_output` dict variable.
```python
from pydantic import BaseModel


class User(BaseModel):
    name: str
    last_name: str
    id: int
```
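Before wiring the schema into an `LLM`, it helps to see what a conforming generation looks like. A minimal stdlib sketch (the raw string below is illustrative, standing in for a model generation constrained by the `User` schema):

```python
import json

# A generation constrained by the `User` schema parses into its typed fields.
raw = '{"name": "John", "last_name": "Doe", "id": 11}'
data = json.loads(raw)

# Each field has the type declared in the pydantic model.
assert isinstance(data["name"], str)
assert isinstance(data["last_name"], str)
assert isinstance(data["id"], int)
```

In practice you would validate with `User.model_validate_json(raw)` instead of manual checks, which also coerces and reports field-level errors.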
And then we provide that schema to the `structured_output` argument of the LLM.
1. We have previously downloaded a GGUF model (i.e. `llama.cpp`-compatible) from the Hugging Face Hub using curl[^2], but any model can be used as a replacement, as long as the `model_path` argument is updated.
And we are ready to pass our instruction as usual:
```python
import json

result = llm.generate(
    [[{"role": "user", "content": "Create a user profile for the following marathon"}]],
)
```
We get back a Python dictionary (formatted as a string) that we can parse using `json.loads`, or validate directly using `User`, which is a `pydantic.BaseModel` instance.
### Regex
The following example shows text generation whose output adheres to a regular expression:
```python
pattern = r"<name>(.*?)</name>.*?<grade>(.*?)</grade>"  # the same pattern used with re.compile
```
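Since the generated text is guaranteed to match the pattern, it can always be parsed back with Python's `re` module. A minimal stdlib illustration (the generation string here is made up):

```python
import re

# The same pattern as above; a generation constrained by it will always match.
pattern = re.compile(r"<name>(.*?)</name>.*?<grade>(.*?)</grade>", re.DOTALL)

generation = "<name>Ada Lovelace</name> scored well this term: <grade>A+</grade>"
name, grade = pattern.search(generation).groups()
```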
## Instructor

For other LLM providers behind APIs, there's no direct way of accessing the internal logit processor like `outlines` does, but thanks to [`instructor`](https://python.useinstructor.com/) we can generate structured output from LLM providers based on `pydantic.BaseModel` objects. We have integrated `instructor` to work with the [`AsyncLLM`][distilabel.models.llms.AsyncLLM].
!!! Note

    For the `instructor` integration to work you may need to install the corresponding dependencies:

    ```bash
    pip install "distilabel[instructor]"
    ```
!!! Note

    Take a look at [`InstructorStructuredOutputType`][distilabel.typing.models.InstructorStructuredOutputType] to see the expected format
    of the `structured_output` dict variable.
The following is the same example as in the `outlines` JSON section, for comparison purposes.
```python
from pydantic import BaseModel


class User(BaseModel):
    name: str
    last_name: str
    id: int
```
And then we provide that schema to the `structured_output` argument of the LLM:
!!! NOTE

    In this example we are using *Meta Llama 3.1 8B Instruct*; keep in mind that not all models support structured outputs.
We get back a Python dictionary (formatted as a string) that we can parse using `json.loads`, or validate it directly using the `User`, which is a `pydantic.BaseModel` instance.
!!! Tip
A full pipeline example can be seen in the following script:
## OpenAI JSON mode

OpenAI offers a [JSON Mode](https://platform.openai.com/docs/guides/text-generation/json-mode) to deal with structured output via their API; let's see how to make use of it. JSON mode instructs the model to always return a JSON object following the instruction required.
!!! WARNING

    Bear in mind that, for this to work, you must instruct the model in some way to generate JSON, either in the `system message` or in the instruction, as can be seen in the [API reference](https://platform.openai.com/docs/guides/text-generation/json-mode).
Contrary to what we have via `outlines`, JSON mode will not guarantee the output matches any specific schema, only that it is valid and parses without errors. More information can be found in the OpenAI documentation.
Other than the reference to generating JSON, to ensure the model generates parseable JSON we can pass the argument `response_format="json"`[^3]:
```python
from distilabel.models import OpenAILLM

llm = OpenAILLM(model="gpt-4o", generation_kwargs={"response_format": "json"})  # model name is illustrative
llm.load()

llm.generate_outputs(
    inputs=[
        [{"role": "user", "content": "What's the capital of Spain?"}],
    ],
)
# [
# {
# "generations": [
# "The capital of Spain is Madrid."
# ],
# "statistics": {
# "input_tokens": [
# 43
# ],
# "output_tokens": [
# 8
# ]
# }
# }
# ]
```
!!! Note
Always call the `LLM.load` or `Task.load` method when using LLMs standalone or as part of a `Task`. If using a `Pipeline`, this is done automatically in `Pipeline.run()`.
!!! Tip "New in version 1.5.0"

    Since version `1.5.0` the LLM output is a list of dictionaries (one per item in `inputs`),
    each containing `generations`, the text returned by the `LLM`, and a `statistics` field that stores statistics related to the generation. Initially this includes
    `input_tokens` and `output_tokens`, obtained via the API when available, or computed with the model's tokenizer when one is available.
    This data will be moved by the corresponding `Task` to `distilabel_metadata` during pipeline processing, so we can operate on it if we want, for example to compute the number of tokens per dataset.
To access the previous result, simply access the generations in the resulting dictionary: `result[0]["generations"]`.
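For illustration, extracting the generations and statistics from the structure above is plain indexing:

```python
# The structure returned by the LLM (as in the commented output above).
result = [
    {
        "generations": ["The capital of Spain is Madrid."],
        "statistics": {"input_tokens": [43], "output_tokens": [8]},
    }
]

generations = result[0]["generations"]
total_output_tokens = sum(result[0]["statistics"]["output_tokens"])
```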
### Offline Batch Generation
By default, all `LLM`s generate text synchronously, i.e. sending inputs via the `generate_outputs` method blocks until the outputs are generated. Some `LLM`s (such as [OpenAILLM][distilabel.models.llms.openai.OpenAILLM]) implement what we denote as _offline batch generation_, which allows sending the inputs to the LLM-as-a-service, which will generate the outputs asynchronously and give us a job id that we can use later to check the status and retrieve the generated outputs when they are ready. LLM-as-a-service platforms offer this feature as a way to save costs in exchange for waiting for the outputs to be generated.
To use this feature in `distilabel` the only thing we need to do is to set the `use_offline_batch_generation` attribute to `True` when creating the `LLM` instance:
```python
from distilabel.models import OpenAILLM

llm = OpenAILLM(
    model="gpt-4o",
    use_offline_batch_generation=True,
)

llm.load()

llm.jobs_ids  # (1)
# None

llm.generate_outputs(  # (2)
    inputs=[
        [{"role": "user", "content": "What's the capital of Spain?"}],
    ],
)
# DistilabelOfflineBatchGenerationNotFinishedException: Batch generation with jobs_ids=('batch_OGB4VjKpu2ay9nz3iiFJxt5H',) is not finished

llm.jobs_ids  # (3)
# ('batch_OGB4VjKpu2ay9nz3iiFJxt5H',)

llm.generate_outputs(  # (4)
    inputs=[
        [{"role": "user", "content": "What's the capital of Spain?"}],
    ],
)
# [{'generations': ['The capital of Spain is Madrid.'], ...}]
```

1. At this point no jobs have been created yet, so the `jobs_ids` attribute is `None`.
2. The first call to `generate_outputs` will send the inputs to the LLM-as-a-service and return a `DistilabelOfflineBatchGenerationNotFinishedException` since the outputs are not ready yet.
3. After the first call to `generate_outputs` the `jobs_ids` attribute will contain the job ids created for generating the outputs.
4. The second call or subsequent calls to `generate_outputs` will return the outputs if they are ready or raise a `DistilabelOfflineBatchGenerationNotFinishedException` if they are not ready yet.
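The check-and-retry pattern that the annotations describe can be sketched in plain Python. `NotFinishedError` and `fake_generate_outputs` below are hypothetical stand-ins for the distilabel exception and client call:

```python
import time

class NotFinishedError(Exception):
    """Stand-in for DistilabelOfflineBatchGenerationNotFinishedException."""

def poll_until_done(fetch, interval=0.001, max_tries=10):
    """Keep calling `fetch` until it stops raising NotFinishedError."""
    for _ in range(max_tries):
        try:
            return fetch()
        except NotFinishedError:
            time.sleep(interval)
    raise TimeoutError("offline batch generation did not finish in time")

# Simulate a batch job that becomes ready on the third poll.
calls = {"count": 0}

def fake_generate_outputs():
    calls["count"] += 1
    if calls["count"] < 3:
        raise NotFinishedError
    return [{"generations": ["The capital of Spain is Madrid."]}]

outputs = poll_until_done(fake_generate_outputs)
```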
The `offline_batch_generation_block_until_done` attribute can be used to block the `generate_outputs` method until the outputs are ready, polling the platform every specified number of seconds.
```python
from distilabel.models import OpenAILLM

llm = OpenAILLM(
    model="gpt-4o",
    use_offline_batch_generation=True,
    offline_batch_generation_block_until_done=5,  # poll for results every 5 seconds
)

llm.load()

llm.generate_outputs(
    inputs=[
        [{"role": "user", "content": "What's the capital of Spain?"}],
    ],
)
# [{'generations': ['The capital of Spain is Madrid.'], ...}]
```
As mentioned in the *Working with LLMs* section, the generation of an LLM is automatically moved to `distilabel_metadata` to avoid interfering with the common workflow, so the addition of the `statistics` is an extra component available to the user, and nothing has to be changed in the defined pipelines.
### Runtime Parameters
LLMs can have runtime parameters, such as `generation_kwargs`, provided via the `Pipeline.run()` method using the `params` argument.
!!! Note

    Runtime parameters can differ between LLM subclasses, due to the different functionalities offered by the LLM providers.
## Creating custom LLMs

To create custom LLMs, subclass either [`LLM`][distilabel.models.llms.LLM] for synchronous or [`AsyncLLM`][distilabel.models.llms.AsyncLLM] for asynchronous LLMs. Implement the following methods:
* `model_name`: a property containing the model's name.
* `generate`: a method that takes a list of prompts and returns generated texts.
* `agenerate`: a method that takes a single prompt and returns generated texts. This method is used within the `generate` method of the `AsyncLLM` class.
* (optional) `get_last_hidden_state`: a method that takes a list of prompts and returns a list of hidden states. It is optional and will be used by some tasks such as the [`GenerateEmbeddings`][distilabel.steps.tasks.GenerateEmbeddings] task.
=== "Custom LLM"
    ```python
    from typing import Any, List

    from pydantic import validate_call

    from distilabel.models import LLM
    from distilabel.typing import GenerateOutput, HiddenState


    class CustomLLM(LLM):
        @property
        def model_name(self) -> str:
            return "my-custom-llm"

        @validate_call
        def generate(self, inputs: List[Any], num_generations: int = 1, **kwargs: Any) -> List[GenerateOutput]:
            ...
    ```
The `generate` and `agenerate` keyword arguments (except `input` and `num_generations`) are considered `RuntimeParameter`s, so a value can be passed to them via the `parameters` argument of the `Pipeline.run` method.
!!! Note

    To have the arguments of `generate` and `agenerate` coerced to the expected types, the `validate_call` decorator is used, which will automatically coerce the arguments and raise an error if the types are not correct. This is especially useful when providing a value for an argument of `generate` or `agenerate` from the CLI, since the CLI will always provide the arguments as strings.
!!! Warning

    Additional LLMs created in `distilabel` will have to take into account how the `statistics` are generated to properly include them in the LLM output.
## Available LLMs
[Our LLM gallery](../../../../components-gallery/llms/index.md) shows a list of the available LLMs that can be used within the `distilabel` library.
# Pipeline

A [`Pipeline`][distilabel.pipeline.Pipeline] organises the `Step`s and `Task`s in a sequence, where the output of one step is the input of the next one.
A [`Pipeline`][distilabel.pipeline.Pipeline] should be created by making use of the context manager along with passing a **name**, and optionally a **description**.
```python
from distilabel.pipeline import Pipeline

with Pipeline("pipe-name", description="My first pipe") as pipeline:
    ...
```
### Connecting steps with the `Step.connect` method
Now, we can define the steps of our [`Pipeline`][distilabel.pipeline.Pipeline].
!!! NOTE
Steps without predecessors (i.e. root steps), need to be [`GeneratorStep`][distilabel.steps.GeneratorStep]s such as [`LoadDataFromDicts`][distilabel.steps.LoadDataFromDicts] or [`LoadDataFromHub`][distilabel.steps.LoadDataFromHub]. After this, other steps can be defined.
```python
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub

with Pipeline("pipe-name", description="My first pipe") as pipeline:
    load_dataset = LoadDataFromHub(name="load_dataset")
    ...
```
!!! Tip "Easily load your datasets"
    If you are already used to working with Hugging Face's `Dataset` via `load_dataset` or `pd.DataFrame`, you can create the `GeneratorStep` directly from the dataset (or dataframe) with the help of [`make_generator_step`][distilabel.steps.generators.utils.make_generator_step]:
=== "From a list of dicts"
    ```python
    from distilabel.pipeline import Pipeline
    from distilabel.steps import make_generator_step

    dataset = [{"instruction": "Tell me a joke."}]

    with Pipeline("pipe-name", description="My first pipe") as pipeline:
        loader = make_generator_step(dataset)
        ...
    ```
Next, we will use the `prompt` column from the dataset obtained through `LoadDataFromHub` and several `LLM`s to execute a `TextGeneration` task. We will also use the `Task.connect()` method to connect the steps, so the output of one step is the input of the next one.
!!! NOTE

    The order of execution of the steps is determined by their connections. In this case, the `TextGeneration` tasks will be executed after the `LoadDataFromHub` step.
For each row of the dataset, the `TextGeneration` task will generate a text based on the `instruction` column and the `LLM` model, and store the result (a single string) in a new column called `generation`. Since we want all the responses in the same column, we will add a `GroupColumns` step to combine them all in the same column as a list of strings.
!!! NOTE

    In this case, the `GroupColumns` task will be executed after all the `TextGeneration` steps.
Besides the `Step.connect` method (`step1.connect(step2)`), there's an alternative way: the `>>` operator. It lets us connect steps in a more readable way, and also makes it possible to connect multiple steps at once.
=== "Step per step"
    Each call to `step1.connect(step2)` has been replaced by `step1 >> step2` within the loop.
    ```python
    from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM
    from distilabel.pipeline import Pipeline
    from distilabel.steps import GroupColumns, LoadDataFromHub
    from distilabel.steps.tasks import TextGeneration

    with Pipeline("pipe-name", description="My first pipe") as pipeline:
        ...
    ```
### Routing batches to specific downstream steps

In some pipelines, you may want to send batches from a single upstream step to specific downstream steps based on certain conditions. To achieve this, you can use a `routing_batch_function`. This function takes a list of downstream steps and returns a list of step names to which each batch should be routed.
Let's update the example above to route the batches loaded by the `LoadDataFromHub` step to just 2 of the `TextGeneration` tasks. First, we will create our custom [`routing_batch_function`][distilabel.pipeline.routing_batch_function.routing_batch_function], and then we will update the pipeline to use it:
The `routing_batch_function` that we just built is a common one, so `distilabel` comes with a built-in function that can be used to achieve the same behavior:
```python
from distilabel.pipeline import sample_n_steps

sample_two_steps = sample_n_steps(2)
```
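The behavior of `sample_n_steps` can be illustrated with a small stdlib sketch (this mirrors what the built-in does conceptually, not its actual implementation):

```python
import random

def sample_n_steps_sketch(n):
    """Sketch of a routing batch function: route each batch to n random downstream steps."""
    def route(steps):
        return random.sample(steps, n)
    return route

random.seed(0)
sample_two_steps = sample_n_steps_sketch(2)
routed = sample_two_steps(["text_generation_0", "text_generation_1", "text_generation_2"])
```

Each batch produced by the upstream step would then only be sent to the two step names returned for it.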
## Running the pipeline
### Pipeline.dry_run
Before running the `Pipeline` we can check if the pipeline is valid using the `Pipeline.dry_run()` method. It takes the same parameters as the `run` method which we will discuss in the following section, plus the `batch_size` we want the dry run to use (by default set to 1).
```python
with Pipeline("pipe-name", description="My first pipe") as pipeline:
    ...

if __name__ == "__main__":
    distiset = pipeline.dry_run(batch_size=1)
```
But if we run the pipeline above, we will see that the `run` method will fail:
```
ValueError: Step 'text_generation_with_gpt-4-0125-preview' requires inputs ['instruction'], but only the inputs=['prompt', 'completion', 'meta'] are available, which means that the inputs=['instruction'] are missing or not available
when the step gets to be executed in the pipeline. Please make sure previous steps to 'text_generation_with_gpt-4-0125-preview' are generating the required inputs.
```
This is because, before actually running the pipeline, we must ensure each step has the necessary input columns to be executed. In this case, the `TextGeneration` task requires the `instruction` column, but the `LoadDataFromHub` step generates the `prompt` column. To solve this, we can use the `output_mappings` or `input_mappings` arguments of individual `Step`s to map columns from one step to another.
```python
with Pipeline("pipe-name", description="My first pipe") as pipeline:
    load_dataset = LoadDataFromHub(
        name="load_dataset",
        output_mappings={"prompt": "instruction"}
    )
    ...
```
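What `output_mappings` does can be illustrated with a plain-Python sketch (a hypothetical helper, not the distilabel internals):

```python
def apply_output_mappings(rows, mappings):
    """Rename the columns of each output row before passing it downstream."""
    return [{mappings.get(key, key): value for key, value in row.items()} for row in rows]

rows = [{"prompt": "What is 2+2?", "completion": "4"}]
renamed = apply_output_mappings(rows, {"prompt": "instruction"})
```

After the mapping, downstream steps that require an `instruction` column find it, while unmapped columns pass through unchanged.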
If we execute the pipeline again, it will run successfully and we will have a `Distiset` with the outputs of all the leaf steps of the pipeline which we can push to the Hugging Face Hub.
Note that in most cases, if you don't need the extra flexibility the [`GeneratorStep`s][distilabel.steps.base.GeneratorStep] bring you, you can create a dataset as you would normally do and pass it to the [Pipeline.run][distilabel.pipeline.base.BasePipeline.run] method directly:

```python
import random

from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM
from distilabel.pipeline import Pipeline, routing_batch_function

...
```
In case you want to stop the pipeline while it's running, press ++ctrl+c++ or ++cmd+c++ depending on your OS (or send a `SIGINT` to the main process), and the outputs will be stored in the cache. Pressing it a second time will force the pipeline to stop immediately, but this can lead to losing the generated outputs for certain batches.
## Cache
If for some reason, the pipeline execution stops (for example by pressing `Ctrl+C`), the state of the pipeline and the outputs will be stored in the cache, so we can resume the pipeline execution from the point where it was stopped.
If we want to force the pipeline to run again without using the cache, we can use the `use_cache` argument of the `Pipeline.run()` method.
For more information on caching, we refer the reader to the [caching](../../advanced/caching.md) section.
## Adjusting the batch size for each step
Memory issues can arise when processing large datasets or when using large models. To avoid this, we can use the `input_batch_size` argument of individual tasks. For example, with `input_batch_size=5` the `TextGeneration` task will receive 5 dictionaries per batch, while the `LoadDataFromHub` step with `batch_size=10` will send 10 dictionaries per batch.
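The interplay between an upstream `batch_size` and a downstream `input_batch_size` can be sketched in plain Python (a hypothetical helper, not the distilabel internals):

```python
def rebatch(batches, input_batch_size):
    """Flatten upstream batches and regroup the rows at the downstream batch size."""
    rows = [row for batch in batches for row in batch]
    return [rows[i : i + input_batch_size] for i in range(0, len(rows), input_batch_size)]

upstream = [[{"prompt": f"question {i}"} for i in range(10)]]  # one batch of 10 rows from the loader
downstream = rebatch(upstream, input_batch_size=5)  # the task consumes 5 rows at a time
```

Smaller `input_batch_size` values trade throughput for a lower peak memory footprint in the consuming step.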
## Serializing the pipeline

Sharing a pipeline with others is very easy, as we can serialize the pipeline object using the `save` method. We can save the pipeline in different formats, such as `yaml` or `json`:
=== "yaml"
    ```python
    if __name__ == "__main__":
        pipeline.save("pipeline.yaml", format="yaml")
    ```
=== "json"
    ```python
    if __name__ == "__main__":
        pipeline.save("pipeline.json", format="json")
    ```
To load the pipeline, we can use the `from_yaml` or `from_json` methods:
=== "yaml"
    ```python
    pipeline = Pipeline.from_yaml("pipeline.yaml")
    ```
=== "json"
    ```python
    pipeline = Pipeline.from_json("pipeline.json")
    ```
Serializing the pipeline is very useful when we want to share the pipeline with others, or when we want to store the pipeline for future use. It can even be hosted online, so the pipeline can be executed directly using the [CLI](../../advanced/cli/index.md).
## Visualizing the pipeline
We can visualize the pipeline using the `Pipeline.draw()` method. This will create a `mermaid` graph, and return the path to the image.
```python
path_to_image = pipeline.draw(
top_to_bottom=True,
show_edge_labels=True,
)
```
Within notebooks, we can simply call `pipeline` and the graph will be displayed. Alternatively, we can use the `Pipeline.draw()` method to have more control over the graph visualization and use `IPython` to display it.
```python
from IPython.display import Image, display

display(Image(path_to_image))
```
Let's now see what the pipeline of the [fully working example](#fully-working-example) looks like.
To sum up, here is the full code of the pipeline we have created in this section. Note that you will need to change the name of the Hugging Face repository where the resulting dataset will be pushed, set the `OPENAI_API_KEY` environment variable, set `MISTRAL_API_KEY`, and have `gcloud` installed and configured.
# GeneratorStep

The [`GeneratorStep`][distilabel.steps.GeneratorStep] is a subclass of [`Step`][distilabel.steps.Step] intended to be used as the first step within a [`Pipeline`][distilabel.pipeline.Pipeline], because it doesn't require input and generates data that can be used by other steps. It can also be used standalone.
```python
step = MyGeneratorStep(  # a custom GeneratorStep, defined as in the section below
    instructions=["Tell me a joke.", "Tell me a story."],
batch_size=1,
)
step.load()
next(step.process(offset=0))
# ([{'instruction': 'Tell me a joke.'}], False)
next(step.process(offset=1))
# ([{'instruction': 'Tell me a story.'}], True)
```
!!! NOTE

    The `Step.load()` always needs to be executed when being used as a standalone. Within a pipeline, this will be done automatically during pipeline execution.
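The `(batch, is_last_batch)` protocol shown above can be sketched as a plain Python generator (illustrative only, not the distilabel implementation):

```python
def process(instructions, offset=0, batch_size=1):
    """Sketch of the GeneratorStep protocol: yield (batch, is_last_batch) tuples."""
    while offset < len(instructions):
        batch = [{"instruction": text} for text in instructions[offset : offset + batch_size]]
        offset += batch_size
        yield batch, offset >= len(instructions)

gen = process(["Tell me a joke.", "Tell me a story."])
first = next(gen)   # ([{'instruction': 'Tell me a joke.'}], False)
second = next(gen)  # ([{'instruction': 'Tell me a story.'}], True)
```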
## Defining custom GeneratorSteps
We can define a custom generator step by creating a new subclass of the [`GeneratorStep`][distilabel.steps.GeneratorStep] and defining the following:
- `outputs`: a property that returns a list of strings with the names of the output fields, or a dictionary in which the keys are the names of the columns and the values are booleans indicating whether the column is required or not.
- `process`: a method that yields output data and a boolean flag indicating whether that's the last batch to be generated.
!!! NOTE

    The default signature for the `process` method is `process(self, offset: int = 0) -> GeneratorStepOutput`. The argument `offset` should be respected; no more arguments can be provided, and the type-hints and return type-hints should be respected too.
!!! WARNING

    For the custom [`Step`][distilabel.steps.Step] subclasses to work properly with `distilabel` and with the validation and serialization performed by default over each [`Step`][distilabel.steps.Step] in the [`Pipeline`][distilabel.pipeline.Pipeline], the type-hint for both [`StepInput`][distilabel.steps.StepInput] and [`StepOutput`][distilabel.typing.StepOutput] should be used and not surrounded with double-quotes or imported under `typing.TYPE_CHECKING`, otherwise, the validation and/or serialization will fail.
=== "Inherit from `GeneratorStep`"
    We can inherit from the `GeneratorStep` class and define the `outputs` and `process` methods as follows:
    ```python
    from typing import List, TYPE_CHECKING

    from typing_extensions import override

    from distilabel.steps import GeneratorStep

    if TYPE_CHECKING:
        from distilabel.typing import StepColumns, GeneratorStepOutput


    class MyGeneratorStep(GeneratorStep):
        instructions: List[str]

        @override
        def process(self, offset: int = 0) -> "GeneratorStepOutput":
            ...

        @property
        def outputs(self) -> "StepColumns":
            ...
    ```
=== "Using the `@step` decorator"
    The `@step` decorator will take care of the boilerplate code, and will allow you to define the `outputs` and `process` in a more straightforward way. One downside is that it won't let you access or set `self` attributes, so if you need to do that, you should go with the first approach of defining a custom [`GeneratorStep`][distilabel.steps.GeneratorStep] subclass.
    ```python
    from typing import TYPE_CHECKING

    from distilabel.steps import step

    if TYPE_CHECKING:
        from distilabel.typing import GeneratorStepOutput


    @step(outputs=[...], step_type="generator")
    def CustomGeneratorStep(offset: int = 0) -> "GeneratorStepOutput":
        yield (
            [{"output_field": "some value"}],  # batch of output rows
            True,  # whether this is the last batch
        )
    ```
# GlobalStep

The [`GlobalStep`][distilabel.steps.GlobalStep] is a subclass of [`Step`][distilabel.steps.Step] that waits until all the input batches from the previous steps are received before running. It is useful when you need a step that requires all the input data to be processed before running. It can also be used standalone.
## Defining custom GlobalSteps
We can define a custom step by creating a new subclass of the [`GlobalStep`][distilabel.steps.GlobalStep] and defining the following:
- `inputs`: a property that returns a list of strings with the names of the required input fields, or a dictionary in which the keys are the names of the columns and the values are booleans indicating whether the column is required or not.
- `outputs`: a property that returns a list of strings with the names of the output fields, or a dictionary in which the keys are the names of the columns and the values are booleans indicating whether the column is required or not.
- `process`: a method that receives the input data and returns the output data; it should be a generator, meaning that it should `yield` the output data.
!!! NOTE

    The default signature for the `process` method is `process(self, *inputs: StepInput) -> StepOutput`. The argument `inputs` should be respected, no more arguments can be provided, and the type-hints and return type-hints should be respected too because it should be able to receive any number of inputs by default i.e. more than one [`Step`][distilabel.steps.Step] at a time could be connected to the current one.
!!! WARNING

    For the custom [`GlobalStep`][distilabel.steps.GlobalStep] subclasses to work properly with `distilabel` and with the validation and serialization performed by default over each [`Step`][distilabel.steps.Step] in the [`Pipeline`][distilabel.pipeline.Pipeline], the type-hint for both [`StepInput`][distilabel.steps.StepInput] and [`StepOutput`][distilabel.typing.StepOutput] should be used and not surrounded with double-quotes or imported under `typing.TYPE_CHECKING`, otherwise, the validation and/or serialization will fail.
=== "Inherit from `GlobalStep`"
    We can inherit from the `GlobalStep` class and define the `inputs`, `outputs`, and `process` methods as follows:
    ```python
    from typing import TYPE_CHECKING

    from distilabel.steps import GlobalStep, StepInput

    if TYPE_CHECKING:
        from distilabel.typing import StepColumns, StepOutput


    class CustomGlobalStep(GlobalStep):
        @property
        def inputs(self) -> "StepColumns":
            ...

        @property
        def outputs(self) -> "StepColumns":
            ...

        def process(self, *inputs: StepInput) -> "StepOutput":
            ...
    ```
=== "Using the `@step` decorator"

    The `@step` decorator will take care of the boilerplate code, and will allow you to define the `inputs`, `outputs`, and `process` in a more straightforward way. One downside is that it won't let you access or set `self` attributes, so if you need to do that, you should go with the first approach of defining a custom [`GlobalStep`][distilabel.steps.GlobalStep] subclass.
# Step

The [`Step`][distilabel.steps.Step] is intended to be used within the scope of a [`Pipeline`][distilabel.pipeline.Pipeline], which will orchestrate the different steps defined, but it can also be used standalone.
Assuming that we have a [`Step`][distilabel.steps.Step] already defined as follows:
The `Step.load()` always needs to be executed when being used as a standalone. Within a pipeline, this will be done automatically during pipeline execution.
### Arguments
- `input_mappings`: a dictionary that maps keys from the input dictionaries to the keys expected by the step. For example, `input_mappings={"instruction": "prompt"}` means that the input key `prompt` will be used as the key `instruction` for the current step.
- `output_mappings`: a dictionary that can be used to map the outputs of the step to other names. For example, `output_mappings={"conversation": "prompt"}` means that the output key `conversation` will be renamed to `prompt` for the next step.
- `input_batch_size` (by default set to 50): independent for every step, it determines how many input dictionaries will be processed at once.
### Runtime parameters
`Step`s can also have `RuntimeParameter`s, which are parameters that can only be set after the pipeline initialisation, when calling `Pipeline.run`:
```python
from distilabel.mixins.runtime_parameters import RuntimeParameter

class Step(...):
    input_batch_size: RuntimeParameter[PositiveInt] = Field(
        default=DEFAULT_INPUT_BATCH_SIZE,
        description="The number of rows that will contain the batches processed by the"
        " step.",
    )
```
## Types of Steps
There are two special types of [`Step`][distilabel.steps.Step] in `distilabel`:
* [`GeneratorStep`][distilabel.steps.GeneratorStep]: a step that only generates data and doesn't need any input data from previous steps; it is normally the first node in a [`Pipeline`][distilabel.pipeline.Pipeline]. More information: [Components -> Step - GeneratorStep](./generator_step.md).
* [`GlobalStep`][distilabel.steps.GlobalStep]: a step with the standard interface, i.e. it receives inputs and generates outputs, but it processes all the data at once and is often the final step in the [`Pipeline`][distilabel.pipeline.Pipeline]. A [`GlobalStep`][distilabel.steps.GlobalStep] requires all previous steps to finish before it can start. More information: [Components - Step - GlobalStep](global_step.md).
* [`Task`][distilabel.steps.tasks.Task]: essentially the same as a default [`Step`][distilabel.steps.Step], but it relies on an [`LLM`][distilabel.models.llms.LLM] as an attribute, and the `process` method will be in charge of calling that LLM. More information: [Components - Task](../task/index.md).
## Defining custom Steps
We can define a custom step by creating a new subclass of the [`Step`][distilabel.steps.Step] and defining the following:
- `inputs`: a property that returns a list of strings with the names of the required input fields, or a dictionary in which the keys are the names of the columns and the values are booleans indicating whether the column is required or not.
- `outputs`: a property that returns a list of strings with the names of the output fields, or a dictionary in which the keys are the names of the columns and the values are booleans indicating whether the column is required or not.
- `process`: a method that receives the input data and returns the output data; it should be a generator, meaning that it should `yield` the output data.
!!! NOTE

    The default signature for the `process` method is `process(self, *inputs: StepInput) -> StepOutput`. The argument `inputs` should be respected, no more arguments can be provided, and the type-hints and return type-hints should be respected too because it should be able to receive any number of inputs by default i.e. more than one [`Step`][distilabel.steps.Step] at a time could be connected to the current one.
!!! WARNING

    For the custom [`Step`][distilabel.steps.Step] subclasses to work properly with `distilabel` and with the validation and serialization performed by default over each [`Step`][distilabel.steps.Step] in the [`Pipeline`][distilabel.pipeline.Pipeline], the type-hint for both [`StepInput`][distilabel.steps.StepInput] and [`StepOutput`][distilabel.typing.StepOutput] should be used and not surrounded with double-quotes or imported under `typing.TYPE_CHECKING`, otherwise, the validation and/or serialization will fail.
=== "Inherit from `Step`"
We can inherit from the `Step` class and define the `inputs`, `outputs`, and `process` methods as follows:
```python
from typing import TYPE_CHECKING
from distilabel.steps import Step, StepInput

if TYPE_CHECKING:
    from distilabel.typing import StepColumns, StepOutput

class CustomStep(Step):
    @property
    def inputs(self) -> "StepColumns": ...

    @property
    def outputs(self) -> "StepColumns": ...

    def process(self, *inputs: StepInput) -> "StepOutput":
        for upstream_step_inputs in inputs:
            # process the rows and yield the resulting batch
            ...
            yield upstream_step_inputs
```
The `@step` decorator will take care of the boilerplate code, and will allow you to define the `inputs`, `outputs`, and `process` in a more straightforward way. One downside is that it won't let you access or set any `self` attributes, so if you need to do that, you should go with the first approach of defining a custom [`Step`][distilabel.steps.Step] subclass.
The [`GeneratorTask`][distilabel.steps.tasks.GeneratorTask] is a custom implementation of a [`Task`][distilabel.steps.tasks.Task] based on the [`GeneratorStep`][distilabel.steps.GeneratorStep]. As with a [`Task`][distilabel.steps.tasks.Task], it is normally used within a [`Pipeline`][distilabel.pipeline.Pipeline] but can also be used standalone.
!!! WARNING
This task is still experimental and may be subject to changes in the future.
# [{"output_field": "Why did the scarecrow win an award? Because he was outstanding!", "model_name": "gpt-4"}]
```
!!! NOTE
Most of the time you will need to override the default `process` method, as it is suited for the standard [`Task`][distilabel.steps.tasks.Task] and not for the [`GeneratorTask`][distilabel.steps.tasks.GeneratorTask]. But within the `process` method you can freely use the `llm` to generate data in any way.
!!! NOTE
The `Step.load()` always needs to be executed when being used as a standalone. Within a pipeline, this will be done automatically during pipeline execution.
## Defining custom GeneratorTasks
We can define a custom generator task by creating a new subclass of the [`GeneratorTask`][distilabel.steps.tasks.GeneratorTask] and defining the following:
- `process`: is a method that generates the data based on the [`LLM`][distilabel.models.llms.LLM] and the `instruction` provided within the class instance, and returns a dictionary with the output data formatted as needed, i.e. with the values for the columns in `outputs`. Note that the `inputs` argument is not allowed in this method since this is a [`GeneratorTask`][distilabel.steps.tasks.GeneratorTask]; the signature only expects the `offset` argument, which is used to keep track of the current iteration in the generator.
- `outputs`: is a property that returns a list of strings with the names of the output fields. This property should always include `model_name` as one of the outputs, since that's automatically injected from the LLM.
- `format_output`: is a method that receives the output from the [`LLM`][distilabel.models.llms.LLM] and optionally also the input data (which may be useful to build the output in some scenarios), and returns a dictionary with the output data formatted as needed, i.e. with the values for the columns in `outputs`. Note that there's no need to include the `model_name` in the output.
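To make the `offset` contract concrete, here is a minimal, library-free sketch (names and batch layout are hypothetical, not distilabel internals) of the generator loop a `GeneratorTask.process` typically implements: it yields batches together with a flag marking the last one, resuming from `offset`:

```python
def process(offset: int = 0, total: int = 6, batch_size: int = 2):
    """Yield (batch, is_last_batch) tuples starting from `offset`,
    mirroring the shape of a generator task's `process` method."""
    while offset < total:
        batch = [{"id": i} for i in range(offset, min(offset + batch_size, total))]
        offset += len(batch)
        yield batch, offset >= total

# resuming from offset=2 skips the first two rows
batches = list(process(offset=2))
```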
The [`ImageTask`][distilabel.steps.tasks.ImageTask] is a custom implementation of a [`Task`][distilabel.steps.tasks.Task] designed to deal with images. These tasks behave exactly like any other [`Task`][distilabel.steps.tasks.Task], but instead of relying on an [`LLM`][distilabel.models.llms.LLM], they work with an [`ImageGenerationModel`][distilabel.models.image_generation.ImageGenerationModel].
!!! info "New in version 1.5.0"
This task is new and is expected to work with Image Generation Models.
These tasks take an `image_generation_model` attribute instead of the `llm` we would have with a standard `Task`, but everything else remains the same. Let's see an example with [`ImageGeneration`](https://distilabel.argilla.io/dev/components-gallery/tasks/imagegeneration/):
If you are testing the `ImageGeneration` task in a notebook, you can do the following
to see the rendered image:
```python
from distilabel.models.image_generation.utils import image_from_str
result = next(task.process([{"prompt": "a white siamese cat"}]))
image_from_str(result[0]["image"]) # Returns a `PIL.Image.Image` that renders directly
```
!!! tip "Running ImageGeneration in a Pipeline"
This transformation between the image as a string and as a PIL object can be done for the whole dataset when running a pipeline, by calling the method `transform_columns_to_image` on the final distiset and passing the name (or list of names) of the image columns.
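Under the hood the image travels through the pipeline as a base64-encoded string; a stdlib sketch of that round trip (illustrative helpers, not the distilabel implementation):

```python
import base64

def image_to_str(image_bytes: bytes) -> str:
    # serialize raw image bytes into a plain-text dataset column
    return base64.b64encode(image_bytes).decode()

def image_bytes_from_str(image_str: str) -> bytes:
    # recover the raw bytes; a library like Pillow can then render them
    return base64.b64decode(image_str)

png_magic = b"\x89PNG\r\n\x1a\n"  # the first 8 bytes of any PNG file
roundtrip = image_bytes_from_str(image_to_str(png_magic))
```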
## Defining custom ImageTasks
We can define a custom image task by creating a new subclass of the [`ImageTask`][distilabel.steps.tasks.ImageTask] and defining the following:
- `process`: is a method that generates the data based on the [`ImageGenerationModel`][distilabel.models.image_generation.ImageGenerationModel] and the `prompt` provided within the class instance, and returns a dictionary with the output data formatted as needed, i.e. with the values for the columns in `outputs`.
- `inputs`: is a property that returns a list of strings with the names of the required input fields, or a dictionary in which the keys are the names of the columns and the values are booleans indicating whether the column is required or not.
- `outputs`: is a property that returns a list of strings with the names of the output fields, or a dictionary in which the keys are the names of the columns and the values are booleans indicating whether the column is required or not. This property should always include `model_name` as one of the outputs, since that's automatically injected from the model.
- `format_input`: is a method that receives a dictionary with the input data and returns a *prompt* to be passed to the model.
- `format_output`: is a method that receives the output from the [`ImageGenerationModel`][distilabel.models.image_generation.ImageGenerationModel] and optionally also the input data (which may be useful to build the output in some scenarios), and returns a dictionary with the output data formatted as needed, i.e. with the values for the columns in `outputs`.
Note that in the `process` method we are not dealing with the `image_generation` attribute but with `llm`. This is not a bug but intended: internally, the `image_generation` attribute is renamed to `llm` to reuse the code.
The [`Task`][distilabel.steps.tasks.Task] is a special kind of [`Step`][distilabel.steps.Step] that includes the [`LLM`][distilabel.models.llms.LLM] as a mandatory argument. As with a [`Step`][distilabel.steps.Step], it is normally used within a [`Pipeline`][distilabel.pipeline.Pipeline] but can also be used standalone.
For example, the most basic task is the [`TextGeneration`][distilabel.steps.tasks.TextGeneration] task, which generates text based on a given instruction.
1. The `LLMs` will not only return the text but also a `statistics_{STEP_NAME}` field that will contain statistics related to the generation. If available, at least the input and output tokens will be returned.
!!! Note
The `Step.load()` always needs to be executed when being used as a standalone. Within a pipeline, this will be done automatically during pipeline execution.
As shown above, the [`TextGeneration`][distilabel.steps.tasks.TextGeneration] task adds a `generation` based on the `instruction`.
!!! Tip "New in version 1.2.0"
Since version `1.2.0`, we provide some metadata about the LLM call through `distilabel_metadata`. This can be disabled by setting the `add_raw_output` attribute to `False` when creating the task.
Additionally, since version `1.4.0`, the formatted input can also be included, which can be helpful when testing
custom templates (testing the pipeline using the [`dry_run`][distilabel.pipeline.local.Pipeline.dry_run] method).
Since version `1.5.0`, `distilabel_metadata` includes a new `statistics` field out of the box. The generation from the LLM will not only contain the text, but also statistics associated with it if available, like the input and output tokens. This field will be generated as `statistics_{STEP_NAME}` to avoid collisions between different steps in the pipeline, similar to how `raw_output_{STEP_NAME}` works.
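The suffixing scheme can be sketched as follows (a toy helper for illustration, not a distilabel API):

```python
def metadata_keys(step_name: str) -> dict:
    """Build the per-step `distilabel_metadata` key names; suffixing with
    the step name avoids collisions between steps in the same pipeline."""
    return {
        "raw_output": f"raw_output_{step_name}",
        "statistics": f"statistics_{step_name}",
    }

keys = metadata_keys("text_generation_0")
```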
### Task.print
!!! Info "New in version 1.4.0"
Since version `1.4.0`, the [`Task.print`][distilabel.steps.tasks.base._Task.print] method is available.
The `Tasks` include a handy method to show what the prompt formatted for an `LLM` would look like, let's see an example with [`UltraFeedback`][distilabel.steps.tasks.ultrafeedback.UltraFeedback], but it applies to any other `Task`.
The result will be a rendered prompt, with the system prompt (if the task contains one) and the user prompt, rendered with *rich* (it will look exactly the same in a Jupyter notebook).
In case you want to test with a custom input, you can pass an example to the task's `format_input` method (or generate it on your own depending on the task), and pass it to the `print` method so that it shows your example:
In case you don't want to load an LLM to render the template, you can create a dummy one like the ones we could use for testing.
```python
from typing import TYPE_CHECKING, Any

from distilabel.models import AsyncLLM
from distilabel.models.mixins import MagpieChatTemplateMixin

if TYPE_CHECKING:
    from distilabel.typing import FormattedInput, GenerateOutput

class DummyLLM(AsyncLLM, MagpieChatTemplateMixin):
    structured_output: Any = None
    magpie_pre_query_template: str = "llama3"

    def load(self) -> None:
        pass

    @property
    def model_name(self) -> str:
        return "test"

    async def agenerate(
        self, input: "FormattedInput", num_generations: int = 1
    ) -> "GenerateOutput":
        return ["output" for _ in range(num_generations)]
```
You can use this `LLM` just as any of the other ones to `load` your task and call `print`:
```python
uf = UltraFeedback(llm=DummyLLM())
uf.load()
uf.print()
```
!!! Note
When creating a custom task, the `print` method will be available by default, but it is limited to the most common scenarios for the inputs. If you test your new task and find it's not working as expected (for example, if your task contains one input consisting of a list of texts instead of a single one), you should override the `_sample_input` method. You can inspect the [`UltraFeedback`][distilabel.steps.tasks.ultrafeedback.UltraFeedback] source code for this.
## Specifying the number of generations and grouping generations
All the `Task`s have a `num_generations` attribute that allows defining the number of generations that we want to have per input. We can update the example above to generate 3 completions per input:
```python
from distilabel.models import InferenceEndpointsLLM
```
In addition, we might want to group the generations in a single output row as maybe one downstream step expects a single row with multiple generations. We can achieve this by setting the `group_generations` attribute to `True`:
```python
from distilabel.models import InferenceEndpointsLLM
```
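Conceptually, `group_generations=True` collapses the `num_generations` rows produced per input into a single row whose `generation` column holds a list; a library-free sketch of that reshaping:

```python
def group_generations(rows: list, num_generations: int) -> list:
    """Collapse consecutive per-generation rows into one row per input,
    turning the scalar `generation` field into a list."""
    grouped = []
    for i in range(0, len(rows), num_generations):
        chunk = rows[i : i + num_generations]
        grouped.append({**chunk[0], "generation": [r["generation"] for r in chunk]})
    return grouped

rows = [
    {"instruction": "Hi", "generation": "a"},
    {"instruction": "Hi", "generation": "b"},
    {"instruction": "Hi", "generation": "c"},
]
grouped = group_generations(rows, num_generations=3)
```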
We can define a custom task by creating a new subclass of the [`Task`][distilabel.steps.tasks.Task] and defining the following:
- `inputs`: is a property that returns a list of strings with the names of the required input fields or a dictionary in which the keys are the names of the columns and the values are boolean indicating whether the column is required or not.
- `format_input`: is a method that receives a dictionary with the input data and returns a [`ChatType`][distilabel.typing.models.ChatType] following [the chat-completion OpenAI message formatting](https://platform.openai.com/docs/guides/text-generation).
- `outputs`: is a property that returns a list of strings with the names of the output fields or a dictionary in which the keys are the names of the columns and the values are boolean indicating whether the column is required or not. This property should always include `model_name` as one of the outputs since that's automatically injected from the LLM.
- `format_output`: is a method that receives the output from the [`LLM`][distilabel.models.llms.LLM] and optionally also the input data (which may be useful to build the output in some scenarios), and returns a dictionary with the output data formatted as needed i.e. with the values for the columns in `outputs`. Note that there's no need to include the `model_name` in the output.
=== "Inherit from `Task`"
When using the `Task` class inheritance method for creating a custom task, we can also optionally override the `Task.process` method to define a more complex processing logic involving an `LLM`, as the default one just calls the `LLM.generate` method once previously formatting the input and subsequently formatting the output. For example, [EvolInstruct][distilabel.steps.tasks.EvolInstruct] task overrides this method to call the `LLM.generate` multiple times (one for each evolution).
```python
from typing import Any, Dict, List, Union, TYPE_CHECKING
from distilabel.steps.tasks import Task

if TYPE_CHECKING:
    from distilabel.typing import StepColumns, ChatType

class CustomTask(Task):
    @property
    def inputs(self) -> "StepColumns": ...

    def format_input(self, input: Dict[str, Any]) -> "ChatType": ...

    @property
    def outputs(self) -> "StepColumns": ...

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]: ...
```
If your task just needs a system prompt, a user message template and a way to format the output given by the `LLM`, then you can use the `@task` decorator to avoid writing too much boilerplate code.
Most `Tasks` reuse the `Task.process` method to process the generations, but if a new `Task` defines a custom `process` method, as happens for example with [`Magpie`][distilabel.steps.tasks.magpie.base.Magpie], one has to deal with the `statistics` returned by the `LLM`.
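The multiple-call pattern mentioned above can be sketched without the library; here `generate` stands in for `LLM.generate`, and each evolution feeds the previous output back in:

```python
def evolve(instruction: str, generate, num_evolutions: int = 3) -> list:
    """Call the model once per evolution, chaining outputs --
    the shape of the loop an overridden `process` can implement."""
    evolved = instruction
    history = []
    for _ in range(num_evolutions):
        evolved = generate(f"Rewrite this instruction to be harder: {evolved}")
        history.append(evolved)
    return history

# a stand-in "model" that just uppercases its prompt
history = evolve("What is 2+2?", generate=lambda prompt: prompt.upper(), num_evolutions=2)
```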
Welcome to the how-to guides section! Here you will find a collection of guides that will help you get started with Distilabel. We have divided the guides into two categories: basic and advanced. The basic guides will help you get started with the core concepts of Distilabel, while the advanced guides will help you explore more advanced features.
## Basic
<div class="grid cards" markdown>
- __Define Steps for your Pipeline__
---
Steps are the building blocks of your pipeline. They can be used to generate data, evaluate models, manipulate data, or any other general task.
Benchmark LLMs with `distilabel`: reproducing the Arena Hard benchmark.
The script below first defines both the `ArenaHard` and the `ArenaHardResults` tasks: the former generates responses for a given collection of prompts/questions with up to two LLMs, and the latter calculates the results as per the original implementation. The second part of the example builds a `Pipeline` to run the generation on top of the prompts with `InferenceEndpointsLLM` while streaming the rest of the generations from a pre-computed set of GPT-4 generations, and then evaluates one against the other with `OpenAILLM`, generating an alternate response, a comparison between the responses, and a result as A>>B, A>B, B>A, B>>A, or tie.
# Create exam questions using structured generation
This example will showcase how to generate exam questions and answers from a text page. In this case, we will use a Wikipedia page as an example, and show how to leverage the prompt to help the model generate the data in the appropriate format.
We are going to use `meta-llama/Meta-Llama-3.1-8B-Instruct` to generate questions and answers for a mock exam from a Wikipedia page. In this case, we are going to use the *Transfer Learning* entry. With the help of structured generation we will guide the model to create structured data for us that is easy to parse. The structure will be question, answer, and distractors (wrong answers).
??? "Click to see the sample results"
Example page [Transfer_learning](https://en.wikipedia.org/wiki/Transfer_learning):
```json
{
    "exam": [
        {
            "answer": "A technique in machine learning where knowledge learned from a task is re-used to boost performance on a related task.",
            "distractors": ["A type of neural network architecture", "A machine learning algorithm for image classification", "A method for data preprocessing"],
            "question": "What is transfer learning?"
        },
        {
            "answer": "1976",
            "distractors": ["1981", "1992", "1998"],
            "question": "In which year did Bozinovski and Fulgosi publish a paper addressing transfer learning in neural network training?"
        },
        {
            "answer": "Discriminability-based transfer (DBT) algorithm",
            "distractors": ["Multi-task learning", "Learning to Learn", "Cost-sensitive machine learning"],
            "question": "What algorithm was formulated by Lorien Pratt in 1992?"
        },
        {
            "answer": "A domain consists of a feature space and a marginal probability distribution.",
            "distractors": ["A domain consists of a label space and an objective predictive function.", "A domain consists of a task and a learning algorithm.", "A domain consists of a dataset and a model."],
            "question": "What is the definition of a domain in the context of transfer learning?"
        },
        {
            "answer": "Transfer learning aims to help improve the learning of the target predictive function in the target domain using the knowledge in the source domain and learning task.",
            "distractors": ["Transfer learning aims to learn a new task from scratch.", "Transfer learning aims to improve the learning of the source predictive function in the source domain.", "Transfer learning aims to improve the learning of the target predictive function in the source domain."],
            "question": "What is the goal of transfer learning?"
        },
        {
            "answer": "Markov logic networks, Bayesian networks, cancer subtype discovery, building utilization, general game playing, text classification, digit recognition, medical imaging, and spam filtering.",
            "distractors": ["Supervised learning, unsupervised learning, reinforcement learning, natural language processing, computer vision, and robotics.", "Image classification, object detection, segmentation, and tracking.", "Speech recognition, sentiment analysis, and topic modeling."],
            "question": "What are some applications of transfer learning?"
        }
    ]
}
```
```python
from typing import List

from pydantic import BaseModel, Field

class ExamQuestion(BaseModel):
    question: str = Field(..., description="The question to be answered")
    answer: str = Field(..., description="The correct answer to the question")
    distractors: List[str] = Field(
        ..., description="A list of incorrect but viable answers to the question"
    )

class ExamQuestions(BaseModel):  # (2)
    exam: List[ExamQuestion]

SYSTEM_PROMPT = """\
You are an exam writer specialized in writing exams for students.
Your goal is to create questions and answers based on the document provided, and a list of distractors, that are incorrect but viable answers to the question.
"""
```
1. Download a single page for the demo. We could download the pages first, or apply the same procedure to any type of data we want. In a real-world use case, we would want to build a dataset from these documents first.
2. Define the structure required for the answer using Pydantic. In this case we want, for each page, a list of questions and answers (we've additionally added distractors, which can be ignored for this case). So our output will be an `ExamQuestions` model, which is a list of `ExamQuestion`, where each one consists of the `question` and `answer` string fields. The language model will use the field descriptions to generate the values.
3. Use the system prompt to guide the model towards the behaviour we want from it. Independently of the structured output we are enforcing, it helps if we state the expected format in the prompt.
4. Move the page content from Wikipedia to a row in the dataset.
5. The [`TextGeneration`](https://distilabel.argilla.io/dev/components-gallery/tasks/textgeneration/) task gets the system prompt, and the user prompt by means of the `template` argument, where we help the model generate the questions and answers based on the page content, obtained from the corresponding column of the loaded data.
6. Connect both steps, and we are done.
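The validation Pydantic performs on the structured output can be sketched with the stdlib: parse the model's JSON output and check that every question carries the fields the schema requires (a simplified stand-in, not what distilabel runs internally):

```python
import json

def parse_exam(raw: str) -> list:
    """Parse the model output and ensure each question has the three
    fields the ExamQuestion schema enforces."""
    data = json.loads(raw)
    required = {"question", "answer", "distractors"}
    for item in data["exam"]:
        missing = required - item.keys()
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
    return data["exam"]

raw = json.dumps({"exam": [{
    "question": "What is transfer learning?",
    "answer": "Reusing knowledge from one task on a related task.",
    "distractors": ["A neural network architecture"],
}]})
exam = parse_exam(raw)
```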
## Run the example
To run this example you will first need to install the wikipedia dependency to download the sample data: `pip install wikipedia`. *Change the username first in case you want to push the dataset to the Hub using your account*.
In this example, we'll explore the creation of specialized user personas for social network interactions using the [FinePersonas-v0.1](https://huggingface.co/datasets/argilla/FinePersonas-v0.1) dataset from Hugging Face. The final dataset will be ready to fine-tune a chat model with specific traits and characteristics.
## Introduction
We'll delve into the process of fine-tuning different LoRA (Low-Rank Adaptation) models to imbue these personas with specific traits and characteristics.
This approach draws inspiration from Michael Sayman's work on [SocialAI](https://apps.apple.com/us/app/socialai-ai-social-network/id6670229993) (visit the [profile](https://x.com/michaelsayman) to see some examples), leveraging [FinePersonas-v0.1](https://huggingface.co/datasets/argilla/FinePersonas-v0.1) to build models that can emulate bots with specific behaviour.
By fine-tuning these adapters, we can potentially create AI personas with distinct characteristics, communication styles, and areas of expertise. The result? AI interactions that feel more natural and tailored to specific contexts or user needs. For those interested in the technical aspects of this approach, we recommend the insightful blog post on [Multi-LoRA serving](https://huggingface.co/blog/multi-lora-serving). It provides a clear and comprehensive explanation of the technology behind this innovative method.
Let's jump to the demo.
## Creating our SocialAI Task
Building on the new [`TextGeneration`](https://distilabel.argilla.io/dev/components-gallery/tasks/textgeneration/), creating custom tasks is easier than ever before. This powerful tool opens up a world of possibilities for creating tailored text-based content with ease and precision. We will create a `SocialAI` task that will be in charge of generating responses to user interactions, taking into account a given `follower_type`, and use the perspective from a given `persona`:
```python
from typing import Literal

from distilabel.steps.tasks import TextGeneration

class SocialAI(TextGeneration):
    follower_type: Literal["supporter", "troll", "alarmist"] = "supporter"
    system_prompt: str = (
        "You are an AI assistant expert at simulating user interactions. "
        "You must answer as if you were a '{follower_type}', be concise and answer with no more than 200 characters, nothing else."
        "Here are some traits to use for your personality:\n\n"
        "{traits}"
    )  # (1)
    template: str = "You are the following persona:\n\n{{ persona }}\n\nWhat would you say to the following?\n\n {{ post }}"  # (2)
    columns: str | list[str] = ["persona", "post"]  # (3)

    _follower_traits: dict[str, str] = {
        "supporter": (
            "- Encouraging and positive\n"
            "- Tends to prioritize enjoyment and relaxation\n"
            "- Focuses on the present moment and short-term pleasure\n"
            "- Often uses humor and playful language\n"
            "- Wants to help others feel good and have fun\n"
        ),
        "troll": (
            "- Provocative and confrontational\n"
            "- Enjoys stirring up controversy and conflict\n"
            "- Often uses sarcasm, irony, and mocking language\n"
            "- Tends to belittle or dismiss others' opinions and feelings\n"
            "- Seeks to get a rise out of others and create drama\n"
        ),
        "alarmist": (
            "- Anxious and warning-oriented\n"
            "- Focuses on potential risks and negative consequences\n"
            "- Often uses dramatic or sensational language\n"
            "- Tends to be serious and stern in tone\n"
            "- Seeks to alert others to potential dangers and protect them from harm (even if it's excessive or unwarranted)\n"
        ),
    }

    def load(self) -> None:
        super().load()
        self.system_prompt = self.system_prompt.format(
            follower_type=self.follower_type,
            traits=self._follower_traits[self.follower_type]
        )  # (4)
```
1. We have a custom system prompt that will depend on the `follower_type` we decide for our model.
2. The base template or prompt will answer the `post` we have, from the point of view of a `persona`.
3. We will need our dataset to have both `persona` and `post` columns to populate the prompt.
4. In the load method we place the specific traits for our follower type in the system prompt.
## Data preparation
This is an example, so let's keep it short. We will use 3 posts, and 3 different types of personas. While there's potential to enhance this process (perhaps by implementing random persona selection or leveraging semantic similarity) we'll opt for a straightforward method in this demonstration.
Our goal is to create a set of nine examples, each pairing a post with a persona. To achieve this, we'll employ an LLM to respond to each post from the perspective of a specific `persona`, effectively simulating how different characters might engage with the content.
```python
posts = [
    {
        "post": "Hmm, ok now I'm torn: should I go for healthy chicken tacos or unhealthy beef tacos for late night cravings?"
    },
    {
        "post": "I need to develop a training course for my company on communication skills. Need to decide how deliver it remotely."
    },
    {
        "post": "I'm always 10 minutes late to meetups but no one's complained. Could this be annoying to them?"
    },
]

# Each post is paired with a persona; a combined record looks like this:
{
    "post": "Hmm, ok now I'm torn: should I go for healthy chicken tacos or unhealthy beef tacos for late night cravings?",
    "persona": "A high school or college environmental science teacher or an ecology student specializing in biogeography and ecosystem dynamics."
}
```
This will be our dataset, that we can ingest using the [`LoadDataFromDicts`](https://distilabel.argilla.io/dev/components-gallery/steps/loaddatafromdicts/):
```python
loader = LoadDataFromDicts(data=data)
```
## Simulating from different types of followers
With our data in hand, we're ready to explore the capabilities of our SocialAI task. For this demonstration, we'll make use of `meta-llama/Meta-Llama-3.1-70B-Instruct`.
While this model has become something of a go-to choice recently, it's worth noting that experimenting with a variety of models could yield even more interesting results:
This setup simplifies the process: we only need to input the follower type, and the system handles the rest. We could also update this to pick a random follower type by default and simulate a range of different personalities.
## Building our Pipeline
The foundation of our pipeline is now in place. At its core is a single, powerful LLM. This versatile model will be repurposed to drive three distinct `SocialAI` Tasks, each tailored to a specific `TextGeneration` task, and each one of them will be prepared for Supervised Fine Tuning using [`FormatTextGenerationSFT`](https://distilabel.argilla.io/dev/components-gallery/steps/formattextgenerationsft/):
```python
with Pipeline(name="Social AI Personas") as pipeline:
    ...
```
1. We update the name of the step to keep track in the pipeline.
2. The `generation` column from each LLM will be remapped to avoid being overridden, as we are reusing the same task.
3. As we have modified the output column from `SocialAI`, we redirect each one of the "follower_type" responses.
4. Connect the loader to each one of the follower tasks and `format_sft` to obtain 3 different subsets.
The outcome of this pipeline will be three specialized models, each fine-tuned to a unique `follower type` crafted by the `SocialAI` task. These models will generate SFT-formatted datasets, where each post is paired with its corresponding interaction data for a specific follower type. This setup enables seamless fine-tuning using your preferred framework, such as [TRL](https://huggingface.co/docs/trl/index), or any other training framework of your choice.
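The shape of the rows `FormatTextGenerationSFT` produces can be sketched as follows (column names assumed from the standard instruction/generation pair, not copied from the library):

```python
def format_sft(row: dict) -> dict:
    """Pair an instruction with its generation as chat-style `messages`,
    the format most SFT trainers (e.g. TRL) consume."""
    return {
        **row,
        "messages": [
            {"role": "user", "content": row["instruction"]},
            {"role": "assistant", "content": row["generation"]},
        ],
    }

row = format_sft({"instruction": "Say hi", "generation": "Hi!"})
```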
## Script and final dataset
All the pieces are in place for our script; the full pipeline can be seen here:
??? Note "Run the full pipeline"
```python
python examples/finepersonas_social_ai.py
```
```python title="finepersonas_social_ai.py"
--8<-- "examples/finepersonas_social_ai.py"
```
This is the final toy dataset we obtain: [FinePersonas-SocialAI-test](https://huggingface.co/datasets/plaguss/FinePersonas-SocialAI-test)
You can see examples of how to load each subset of them to fine-tune a model:
And a sample of the generated field with the corresponding `post` and `persona`:
```json
{
    "post": "Hmm, ok now I'm torn: should I go for healthy chicken tacos or unhealthy beef tacos for late night cravings?",
    "persona": "A high school or undergraduate physics or chemistry teacher, likely with a focus on experimental instruction.",
    "interaction_troll": "\"Late night cravings? More like late night brain drain. Either way, it's just a collision of molecules in your stomach. Choose the one with more calories, at least that's some decent kinetic energy.\""
}
```
There's a lot of room for improvement, but it's quite a promising start.
This example shows how distilabel can be used to generate image data, either using [`InferenceEndpointsImageGeneration`](https://distilabel.argilla.io/dev/components-gallery/image_generation/inferenceendpointsimagegeneration/) or [`OpenAIImageGeneration`](https://distilabel.argilla.io/dev/components-gallery/image_generation/openaiimagegeneration/), thanks to the [`ImageGeneration`](https://distilabel.argilla.io/dev/components-gallery/task/imagegeneration/) task.
!!! success "Save the Distiset as an Image Dataset"
Note the call to `Distiset.transform_columns_to_image`, to have the images uploaded directly as an [`Image dataset`](https://huggingface.co/docs/hub/en/datasets-image):
Generate RPG characters following a `pydantic.BaseModel` with `outlines` in `distilabel`.
This script makes use of [`LlamaCppLLM`][distilabel.models.llms.llamacpp.LlamaCppLLM] and the structured output capabilities thanks to [`outlines`](https://outlines-dev.github.io/outlines/welcome/) to generate RPG characters that adhere to a JSON schema.
It makes use of a local model which can be downloaded using curl (explained in the script itself), and can be exchanged with other `LLMs` like [`vLLM`][distilabel.models.llms.vllm.vLLM].
Answer instructions with knowledge graphs defined as `pydantic.BaseModel` objects using `instructor` in `distilabel`.
This script makes use of [`MistralLLM`][distilabel.models.llms.mistral.MistralLLM] and the structured output capabilities thanks to [`instructor`](https://python.useinstructor.com/) to generate knowledge graphs from complex topics.
Image-text-to-text models take in an image and a text prompt and output text. In this example we will use [`InferenceEndpointsLLM`](https://distilabel.argilla.io/dev/components-gallery/llms/inferenceendpointsllm/) with [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct), and [`OpenAILLM`](https://distilabel.argilla.io/dev/components-gallery/llms/openaillm/) with `gpt-4o-mini`, to ask a question about an image. We will ask a simple question to showcase how the [`TextGenerationWithImage`](https://distilabel.argilla.io/dev/components-gallery/tasks/textgenerationwithimage/) task can be used in a pipeline.
1. The *image_type* can be a URL pointing to the image, a base64 string representation, or a PIL image; take a look at [`TextGenerationWithImage`](https://distilabel.argilla.io/dev/components-gallery/tasks/textgenerationwithimage/) for more information.
> This image depicts a wooden boardwalk weaving its way through a lush meadow, flanked by vibrant green grass that stretches towards the horizon under a calm and inviting sky. The boardwalk runs straight ahead, away from the viewer, forming a clear pathway through the tall, lush green grass, crops or other plant types or an assortment of small trees and shrubs. This meadow is dotted with trees and shrubs, appearing to be healthy and green. The sky above is a beautiful blue with white clouds scattered throughout, adding a sense of tranquility to the scene. While this image appears to be of a natural landscape, because grass is...
=== "OpenAI - gpt-4o-mini"
```python
from distilabel.models.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks.text_generation_with_image import TextGenerationWithImage
from distilabel.steps import LoadDataFromDicts
with Pipeline(name="vision_generation_pipeline") as pipeline:
1. The *image_type* can be a url pointing to the image, the base64 string representation, or a PIL image, take a look at the [`VisionGeneration`](https://distilabel.argilla.io/dev/components-gallery/tasks/visiongeneration/) for more information.
> The image depicts a scenic landscape featuring a wooden walkway or path that runs through a lush green marsh or field. The area is surrounded by tall grass and various shrubs, with trees likely visible in the background. The sky is blue with some wispy clouds, suggesting a beautiful day. Overall, it presents a peaceful natural setting, ideal for a stroll or nature observation.
The full pipeline can be run with the following example:
??? Note "Run the full pipeline"
```python
python examples/text_generation_with_image.py
```
```python title="text_generation_with_image.py"
--8<-- "examples/text_generation_with_image.py"
```
A sample dataset can be seen at [plaguss/test-vision-generation-Llama-3.2-11B-Vision-Instruct](https://huggingface.co/datasets/plaguss/test-vision-generation-Llama-3.2-11B-Vision-Instruct).
Learn about Math-Shepherd, a framework to generate datasets to train process reward models (PRMs), which assign reward scores to each step of a math problem's solution.
This example will introduce [APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets](https://arxiv.org/abs/2406.18518), a data generation pipeline designed to synthesize verifiable high-quality datasets for function-calling applications.
## Replication
The following figure showcases the APIGen framework:
Now, let's walk through the key steps illustrated in the figure:
- [`DataSampler`](https://distilabel.argilla.io/dev/components-gallery/steps/datasampler/): With the help of this step and the original [Salesforce/xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) dataset, we build the *Seed QA Data Sampler* for the prompt template.
- [`APIGenGenerator`](https://distilabel.argilla.io/dev/components-gallery/tasks/apigengenerator/): This step does the job of the *Query-Answer Generator*, including the format checker from *Stage 1: Format Checker* thanks to structured output generation.
- [`APIGenExecutionChecker`](https://distilabel.argilla.io/dev/components-gallery/tasks/apigenexecutionchecker/): This step is in charge of *Stage 2: Execution Checker*.
- [`APIGenSemanticChecker`](https://distilabel.argilla.io/dev/components-gallery/tasks/apigensemanticchecker/): This step runs *Stage 3: Semantic Checker*. It can use the same or a different LLM; here we use the same one as in the [`APIGenGenerator`](https://distilabel.argilla.io/dev/components-gallery/tasks/apigengenerator/) step.
The current implementation hasn't utilized the *Diverse Prompt Library*. To incorporate it, one could either adjust the prompt template within the [`APIGenGenerator`](https://distilabel.argilla.io/dev/components-gallery/tasks/apigengenerator/) or develop a new sampler specifically for this purpose. As for the *API Sampler*, while no specific data is shared here, we've created illustrative examples to demonstrate the pipeline's functionality. These examples represent a mix of data that could be used to replicate the sampler's output.
## Data preparation
The original paper describes the data used and gives some hints, but the data itself was not shared. In this example, we will write a handful of examples by hand to showcase how this pipeline can be built.
Assume we have the following function names and corresponding descriptions of their behaviour:
```python
data = [
    {
        "func_name": "final_velocity",
        "func_desc": "Calculates the final velocity of an object given its initial velocity, acceleration, and time.",
    },
    {
        "func_name": "permutation_count",
        "func_desc": "Calculates the number of permutations of k elements from a set of n elements.",
    },
    {
        "func_name": "getdivision",
        "func_desc": "Divides two numbers by making an API call to a division service.",
    },
    {
        "func_name": "binary_addition",
        "func_desc": "Adds two binary numbers and returns the result as a binary string.",
    },
    {
        "func_name": "swapi_planet_resource",
        "func_desc": "get a specific planets resource",
    },
    {
        "func_name": "disney_character",
        "func_desc": "Find a specific character using this endpoint",
    }
]
```
The original paper refers to both Python functions and APIs, but we will make use of Python functions exclusively for simplicity. In order to execute and check these functions/APIs, we need access to the code, which we have moved to a Python file: [lib_apigen.py](https://github.com/argilla-io/distilabel/blob/main/examples/lib_apigen.py). All these functions are executable, but we also need access to their *tool* representation. For this, we will make use of transformers' *get_json_schema* function[^1].
[^1]: Read this nice blog post for more information on tools and the reasoning behind `get_json_schema`: [Tool Use, Unified](https://huggingface.co/blog/unified-tool-use).
We have all the machinery prepared in our libpath, except for the *tool* definition. With the help of our helper function `load_module_from_path`, we will load this Python module, collect all the tools, and add them to each row in our `data` variable.
We have just loaded a subset and transformed it to a list of dictionaries, as we will use it in the [`DataSampler`](https://distilabel.argilla.io/dev/components-gallery/steps/datasampler/) `GeneratorStep` to grab random examples from the original dataset.
## Building the Pipeline
Now that we've walked through each component, it's time to see how it all comes together. Here's the `Pipeline` code:
```python
with Pipeline(name="apigen-example") as pipeline:
    loader_seeds = LoadDataFromDicts(data=data)  # (1)

    sampler = DataSampler(  # (2)
        data=ds_og,
        size=2,
        samples=len(data),
        batch_size=8,
    )

    prep_examples = PrepareExamples()  # This step will add the 'examples' column
```
1. Load the data seeds we are going to use to generate our function calling dataset.
2. The `DataSampler`, together with `PrepareExamples`, will be used to create the few-shot examples from the original dataset to be fed into our prompt.
3. Combine both columns to obtain a single stream of data.
4. We will reuse the same LLM for the generation and the semantic checks.
5. Creates the `query` and `answers` that will be used together with the `tools` to fine-tune a new model. It will generate the structured outputs to ensure we have valid JSON formatted answers.
6. Adds the columns `keep_row_after_execution_check` and `execution_result`.
7. Adds the columns `keep_row_after_semantic_check` and `thought`.
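Once the pipeline has run, the boolean columns added by the two checkers can be used to keep only the verified rows. A minimal sketch over a plain list of dicts (hypothetical rows, not real pipeline output):

```python
# Hypothetical rows, mimicking the columns added by the execution and semantic checkers.
rows = [
    {"query": "q1", "keep_row_after_execution_check": True, "keep_row_after_semantic_check": True},
    {"query": "q2", "keep_row_after_execution_check": False, "keep_row_after_semantic_check": True},
    {"query": "q3", "keep_row_after_execution_check": True, "keep_row_after_semantic_check": False},
]

# Keep only the rows that passed both checks.
verified = [
    row for row in rows
    if row["keep_row_after_execution_check"] and row["keep_row_after_semantic_check"]
]
```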
## Script and final dataset
To see all the pieces in place, take a look at the full pipeline, as well as an example row that would be generated from this pipeline.
??? Note "Run the full pipeline"

    ```bash
    python examples/pipeline_apigen.py
    ```
```python title="pipeline_apigen.py"
--8<-- "examples/pipeline_apigen.py"
```
Example row:
```json
{
"func_name":"final_velocity",
"func_desc":"Calculates the final velocity of an object given its initial velocity, acceleration, and time.",
"tools":[
{
"function":{
"description":"Calculates the final velocity of an object given its initial velocity, acceleration, and time.",
"name":"final_velocity",
"parameters":{
"properties":{
"acceleration":{
"description":"The acceleration of the object.",
"type":"number"
},
"initial_velocity":{
"description":"The initial velocity of the object.",
"type":"number"
},
"time":{
"description":"The time elapsed.",
"type":"number"
}
},
"required":[
"initial_velocity",
"acceleration",
"time"
],
"type":"object"
}
},
"type":"function"
}
],
"examples":"## Query:\nRetrieve the first 15 comments for post ID '12345' from the Tokapi mobile API.\n## Answers:\n[{\"name\": \"v1_post_post_id_comments\", \"arguments\": {\"post_id\": \"12345\", \"count\": 15}}]\n\n## Query:\nRetrieve the detailed recipe for the cake with ID 'cake101'.\n## Answers:\n[{\"name\": \"detailed_cake_recipe_by_id\", \"arguments\": {\"is_id\": \"cake101\"}}]\n\n## Query:\nWhat are the frequently asked questions and their answers for Coca-Cola Company? Also, what are the suggested tickers based on Coca-Cola Company?\n## Answers:\n[{\"name\": \"symbols_faq\", \"arguments\": {\"ticker_slug\": \"KO\"}}, {\"name\": \"symbols_suggested\", \"arguments\": {\"ticker_slug\": \"KO\"}}]",
"query":"What would be the final velocity of an object that starts at rest and accelerates at 9.8 m/s^2 for 10 seconds.",
"content":"You are a data labeler. Your responsibility is to generate a set of diverse queries and corresponding answers for the given functions in JSON format.\n\nConstruct queries and answers that exemplify how to use these functions in a practical scenario. Include in each query specific, plausible values for each parameter. For instance, if the function requires a date, use a typical and reasonable date.\n\nEnsure the query:\n- Is clear and concise\n- Demonstrates typical use cases\n- Includes all necessary parameters in a meaningful way. For numerical parameters, it could be either numbers or words\n- Across a variety level of difficulties, ranging from beginner and advanced use cases\n- The corresponding result's parameter types and ranges match with the function's descriptions\n\nEnsure the answer:\n- Is a list of function calls in JSON format\n- The length of the answer list should be equal to the number of requests in the query\n- Can solve all the requests in the query effectively",
"role":"system"
},
{
"content":"Here are examples of queries and the corresponding answers for similar functions:\n## Query:\nRetrieve the first 15 comments for post ID '12345' from the Tokapi mobile API.\n## Answers:\n[{\"name\": \"v1_post_post_id_comments\", \"arguments\": {\"post_id\": \"12345\", \"count\": 15}}]\n\n## Query:\nRetrieve the detailed recipe for the cake with ID 'cake101'.\n## Answers:\n[{\"name\": \"detailed_cake_recipe_by_id\", \"arguments\": {\"is_id\": \"cake101\"}}]\n\n## Query:\nWhat are the frequently asked questions and their answers for Coca-Cola Company? Also, what are the suggested tickers based on Coca-Cola Company?\n## Answers:\n[{\"name\": \"symbols_faq\", \"arguments\": {\"ticker_slug\": \"KO\"}}, {\"name\": \"symbols_suggested\", \"arguments\": {\"ticker_slug\": \"KO\"}}]\n\nNote that the query could be interpreted as a combination of several independent requests.\n\nBased on these examples, generate 1 diverse query and answer pairs for the function `final_velocity`.\nThe detailed function description is the following:\nCalculates the final velocity of an object given its initial velocity, acceleration, and time.\n\nThese are the available tools to help you:\n[{'type': 'function', 'function': {'name': 'final_velocity', 'description': 'Calculates the final velocity of an object given its initial velocity, acceleration, and time.', 'parameters': {'type': 'object', 'properties': {'initial_velocity': {'type': 'number', 'description': 'The initial velocity of the object.'}, 'acceleration': {'type': 'number', 'description': 'The acceleration of the object.'}, 'time': {'type': 'number', 'description': 'The time elapsed.'}}, 'required': ['initial_velocity', 'acceleration', 'time']}}}]\n\nThe output MUST strictly adhere to the following JSON format, and NO other text MUST be included:\n```json\n[\n {\n\"query\": \"The generated query.\",\n\"answers\": [\n {\n\"name\": \"api_name\",\n\"arguments\": {\n\"arg_name\": \"value\"\n ... 
(more arguments as required)\n }\n },\n ... (more API calls as required)\n ]\n }\n]\n```\n\nNow please generate 1 diverse query and answer pairs following the above format.",
"role":"user"
}
],
"raw_input_a_p_i_gen_semantic_checker_0":[
{
"content":"As a data quality evaluator, you must assess the alignment between a user query, corresponding function calls, and their execution results.\nThese function calls and results are generated by other models, and your task is to ensure these results accurately reflect the user\u2019s intentions.\n\nDo not pass if:\n1. The function call does not align with the query\u2019s objective, or the input arguments appear incorrect.\n2. The function call and arguments are not properly chosen from the available functions.\n3. The number of function calls does not correspond to the user\u2019s intentions.\n4. The execution results are irrelevant and do not match the function\u2019s purpose.\n5. The execution results contain errors or reflect that the function calls were not executed successfully.",
"role":"system"
},
{
"content":"Given Information:\n- All Available Functions:\nCalculates the final velocity of an object given its initial velocity, acceleration, and time.\n- User Query: What would be the final velocity of an object that starts at rest and accelerates at 9.8 m/s^2 for 10 seconds.\n- Generated Function Calls: [{\"arguments\": {\"acceleration\": \"9.8\", \"initial_velocity\": \"0\", \"time\": \"10\"}, \"name\": \"final_velocity\"}]\n- Execution Results: ['9.8']\n\nNote: The query may have multiple intentions. Functions may be placeholders, and execution results may be truncated due to length, which is acceptable and should not cause a failure.\n\nThe main decision factor is wheather the function calls accurately reflect the query's intentions and the function descriptions.\nProvide your reasoning in the thought section and decide if the data passes (answer yes or no).\nIf not passing, concisely explain your reasons in the thought section; otherwise, leave this section blank.\n\nYour response MUST strictly adhere to the following JSON format, and NO other text MUST be included.\n```\n{\n\"thought\": \"Concisely describe your reasoning here\",\n\"passes\": \"yes\" or \"no\"\n}\n```\n",
"role":"user"
}
],
"raw_output_a_p_i_gen_generator_0":"{\"pairs\": [\n {\n\"answers\": [\n {\n\"arguments\": {\n\"acceleration\": \"9.8\",\n\"initial_velocity\": \"0\",\n\"time\": \"10\"\n },\n\"name\": \"final_velocity\"\n }\n ],\n\"query\": \"What would be the final velocity of an object that starts at rest and accelerates at 9.8 m/s^2 for 10 seconds.\"\n }\n]}",