Distilabel is the framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
<div class="grid cards" markdown>

- __Get started in 5 minutes!__

    ---

    Install distilabel with `pip` and run your first `Pipeline` to generate and evaluate synthetic data.
Distilabel can be used for generating synthetic data and AI feedback for a wide variety of projects, including traditional predictive NLP (classification, extraction, etc.) and generative or large language model scenarios (instruction following, dialogue generation, judging, etc.). Distilabel's programmatic approach allows you to build scalable pipelines for data generation and AI feedback. The goal of distilabel is to accelerate your AI development by quickly generating high-quality, diverse datasets based on verified research methodologies for generating and judging with AI feedback.
<p style="font-size:20px">Improve your AI output quality through data quality</p>
Compute is expensive and output quality is important. We help you **focus on data quality**, which tackles the root cause of both of these problems at once. Distilabel helps you to synthesize and judge data to let you spend your valuable time **achieving and keeping high-quality standards for your synthetic data**.
<p style="font-size:20px">Take control of your data and models</p>
**Ownership of data for fine-tuning your own LLMs** is not easy but distilabel can help you to get started. We integrate **AI feedback from any LLM provider out there** using one unified API.
<p style="font-size:20px">Improve efficiency by quickly iterating on the right data and models</p>
Synthesize and judge data with **latest research papers** while ensuring **flexibility, scalability and fault tolerance**. So you can focus on improving your data and training your models.
## What do people build with distilabel?
The Argilla community uses distilabel to create amazing [datasets](https://huggingface.co/datasets?other=distilabel) and [models](https://huggingface.co/models?other=distilabel).
- The [OpenHermesPreferences](https://huggingface.co/datasets/argilla/OpenHermesPreferences) dataset contains ~1 million AI preferences derived from teknium/OpenHermes-2.5. It shows how we can use Distilabel to **synthesize data on an immense scale**.
- Our [distilabeled Intel Orca DPO dataset](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs) and the [improved OpenHermes model](https://huggingface.co/argilla/distilabeled-OpenHermes-2.5-Mistral-7B), show how we **improve model performance by filtering out 50%** of the original dataset through **AI feedback**.
- The [haiku DPO data](https://github.com/davanstrien/haiku-dpo) outlines how anyone can create a **dataset for a specific task** using **the latest research papers** to improve the quality of the dataset.
---
description: This is a step-by-step guide to help you contribute to the distilabel project. We are excited to have you on board! 🚀
hide:
  - footer
---
Thank you for investing your time in contributing to the project! Any contribution you make will be reflected in the most recent version of distilabel 🤩.
??? Question "New to contributing in general?"
    If you're a new contributor, read the [README](https://github.com/argilla-io/distilabel/blob/develop/README.md) to get an overview of the project. In addition, here are some resources to help you get started with open-source contributions:

    * **Discord**: You are welcome to join the [distilabel Discord community](http://hf.co/join/discord), where you can keep in touch with other users, contributors and the distilabel team. In the following [section](#first-contact-in-discord), you can find more information on how to get started in Discord.
    * **Git**: This is a very useful tool to keep track of the changes in your files. Using the command-line interface (CLI), you can make your contributions easily. For that, you need to have it [installed and updated](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) on your computer.
    * **GitHub**: It is a platform and cloud-based service that uses git and allows developers to collaborate on projects. To contribute to distilabel, you'll need to create an account. Check the [Contributor Workflow with Git and Github](#contributor-workflow-with-git-and-github) for more info.
    * **Developer Documentation**: To collaborate, you'll need to set up an efficient environment. Check the [Installation](../getting_started/installation.md) guide to know how to do it.
## First Contact in Discord
Discord is a handy tool for more casual conversations and to answer day-to-day questions. As part of Hugging Face, we have set up some distilabel channels on the server. Click [here](http://hf.co/join/discord) to join the Hugging Face Discord community effortlessly.
When part of the Hugging Face Discord, you can select "Channels & roles" and select "Argilla" along with any of the other groups that are interesting to you. "Argilla" will cover anything about argilla and distilabel. You can join the following channels:
* **#argilla-distilabel-general**: 💬 For general discussions.
* **#argilla-distilabel-help**: 🙋‍♀️ Need assistance? We're always here to help. Select the appropriate label (argilla or distilabel) for your issue and post it.
So now there is only one thing left to do: introduce yourself and talk to the community. You'll always be welcome! 🤗👋
## Contributor Workflow with Git and GitHub
If you're working with distilabel and suddenly a new idea comes to your mind or you find an issue that can be improved, it's time to actively participate and contribute to the project!
### Report an issue
If you spot a problem, [search if an issue already exists](https://github.com/argilla-io/distilabel/issues?q=is%3Aissue), you can use the `Label` filter. If that is the case, participate in the conversation. If it does not exist, create an issue by clicking on `New Issue`. This will show various templates; choose the one that best suits your issue. Once you choose one, you will need to fill it in following the guidelines. Try to be as clear as possible. In addition, you can assign yourself to the issue and add or choose the right labels. Finally, click on `Submit new issue`.
### Work with a fork
#### Fork the distilabel repository
After having reported the issue, you can start working on it. For that, you will need to create a fork of the project. To do that, click on the `Fork` button. Now, fill in the information. Remember to uncheck the `Copy develop branch only` if you are going to work in or from another branch (for instance, to fix documentation, the `main` branch is used). Then, click on `Create fork`.
You will be redirected to your fork. You can see that you are in your fork because the name of the repository will be your `username/distilabel`, and it will indicate `forked from argilla-io/distilabel`.
#### Clone your forked repository
In order to make the required adjustments, clone the forked repository to your local machine. Choose the destination folder and run the following command:
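```sh
git clone https://github.com/[your-github-username]/distilabel.git
```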
#### Create a new branch

For each issue you're addressing, it's advisable to create a new branch. GitHub offers a straightforward method to streamline this process.
> ⚠️ Never work directly on the `main` or `develop` branch. Always create a new branch for your changes.
Navigate to your issue, and on the right column, select `Create a branch`.

After the new window pops up, the branch will be named after the issue and include a prefix such as `feature/`, `bug/`, or `docs/` to facilitate quick recognition of the issue type. In the `Repository destination`, pick your fork (`[your-github-username]/distilabel`), and then select `Change branch source` to specify the source branch for creating the new one. Complete the process by clicking `Create branch`.
> 🤔 Remember that the `main` branch is only used to work with the documentation. For any other changes, use the `develop` branch.
Now, locally, change to the new branch you just created.
```sh
git fetch origin
git checkout [branch-name]
```
### Make changes and push them
Make the changes you want in your local repository, and test that everything works and you are following the guidelines.
Once you have finished, you can check the status of your repository and synchronize it with the upstream repo with the following commands:
```sh
# Check the status of your repository
git status
# Synchronize with the upstream repo
git checkout [branch-name]
git rebase [default-branch]
```
If everything is right, we need to commit and push the changes to your fork. For that, run the following commands:
```sh
# Add the changes to the staging area
git add filename
# Commit the changes by writing a proper message
git commit -m"commit-message"
# Push the changes to your fork
git push origin [branch-name]
```
When pushing, you will be asked to enter your GitHub login credentials. Once the push is complete, all local commits will be on your GitHub repository.
### Create a pull request
Come back to GitHub, navigate to the original repository where you created your fork, and click on `Compare & pull request`.
First, click on `compare across forks` and select the right repositories and branches.
> In the base repository, keep in mind that you should select either `main` or `develop` based on the modifications made. In the head repository, indicate your forked repository and the branch corresponding to the issue.
Then, fill in the pull request template. You should add a prefix to the PR name, as we did with the branch above. If you are working on a new feature, you can name your PR as `feat: TITLE`. If your PR consists of a solution for a bug, you can name your PR as `bug: TITLE`. And, if your work is for improving the documentation, you can name your PR as `docs: TITLE`.
In addition, on the right side, you can select a reviewer (for instance, if you discussed the issue with a member of the team) and assign the pull request to yourself. It is highly advisable to add labels to the PR as well. You can do this in the labels section on the right of the screen. For instance, if you are addressing a bug, add the `bug` label, or if the PR is related to the documentation, add the `documentation` label. This way, PRs can be easily filtered.
Finally, fill in the template carefully and follow the guidelines. Remember to link the original issue and enable the checkbox to allow maintainer edits so the branch can be updated for a merge. Then, click on `Create pull request`.
For the PR body, ensure you give a description of what the PR contains, and add examples if possible (and if they apply to the contribution) to help with the review process. You can take a look at [PR #974](https://github.com/argilla-io/distilabel/pull/974) or [PR #983](https://github.com/argilla-io/distilabel/pull/983) for examples of typical PRs.
### Review your pull request
Once you submit your PR, a team member will review your proposal. We may ask questions, request additional information, or ask for changes to be made before a PR can be merged, either using [suggested changes](https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/incorporating-feedback-in-your-pull-request) or pull request comments.
You can apply the changes directly through the UI (check the files changed and click on the right-corner three dots; see image below) or from your fork, and then commit them to your branch. The PR will be updated automatically, and the suggestions will appear as `outdated`.
> If you run into any merge issues, check out this [git tutorial](https://github.com/skills/resolve-merge-conflicts) to help you resolve merge conflicts and other issues.
### Your PR is merged!
Congratulations 🎉🎊 We thank you 🤩
Once your PR is merged, your contributions will be publicly visible on the [distilabel GitHub](https://github.com/argilla-io/distilabel#contributors).
Additionally, we will include your changes in the next release based on our [development branch](https://github.com/argilla-io/distilabel/tree/develop).
## Additional resources
Here are some helpful resources for your reference.
* [Configuring Discord](https://support.discord.com/hc/en-us/categories/115000217151), a guide to learning how to get started with Discord.
* [Pro Git](https://git-scm.com/book/en/v2), a book to learn Git.
* [Git in VSCode](https://code.visualstudio.com/docs/sourcecontrol/overview), a guide to learning how to easily use Git in VSCode.
* [GitHub Skills](https://skills.github.com/), an interactive course for learning GitHub.
---
description: This is a step-by-step guide to help you develop distilabel.
hide:
  - footer
---
Thank you for investing your time in contributing to the project!
If you don't have the repository locally or need any help, go to the [contributor guide](../community/contributor.md) and read the contributor workflow with Git and GitHub first.
## Set up the Python environment
To work on `distilabel`, you must install the package on your system.
!!! Tip
    This guide will use `uv`, but `pip` and `venv` can be used as well; the steps are quite similar with both options.
In your terminal, move to the root folder of the cloned distilabel repository:
```bash
cd distilabel
```
### Create a virtual environment
The first step is to create a virtual environment to keep the dependencies isolated. Here we choose `python 3.11` ([uv venv](https://docs.astral.sh/uv/pip/environments/) documentation) and then activate it:
```bash
uv venv .venv --python 3.11
source .venv/bin/activate
```
### Install the project
Install the package locally in editable mode (we are using [`uv pip`](https://docs.astral.sh/uv/pip/packages/)):
```bash
uv pip install -e .
```
Extra dependencies are available under their own names; depending on the part you are working on, you may want to install some of them (take a look at `pyproject.toml` in the repo to see all the extra dependencies):
```bash
uv pip install-e".[vllm,outlines]"
```
### Linting and formatting
To maintain a consistent code format, install the pre-commit hooks to run before each commit automatically (we rely heavily on [`ruff`](https://docs.astral.sh/ruff/)):
```bash
uv pip install-e".[dev]"
pre-commit install
```
### Running tests
All the changes you add to the codebase should come with tests, either `unit` or `integration` tests, depending on the type of change, which are placed under `tests/unit` and `tests/integration` respectively.
Start by installing the tests dependencies:
```bash
uv pip install".[tests]"
```
Running the whole test suite may take some time, and you will need all the dependencies installed, so just run the tests related to your changes; the whole test suite will be run for you in the CI:
```bash
# Run specific tests
pytest tests/unit/steps/generators/test_data.py
```
## Set up the documentation
To contribute to the documentation and generate it locally, ensure you have installed the documentation dependencies:
```bash
uv pip install-e".[docs]"
```
And run the following command to create the development server with `mkdocs`:
```bash
mkdocs serve
```
### Documentation guidelines
As mentioned, we use mkdocs to build the documentation. You can write the documentation in `markdown` format, and it will automatically be converted to HTML. In addition, you can include elements such as tables, tabs, images, and others, as shown in this guide. We recommend following these guidelines:
- Use clear and concise language: Ensure the documentation is easy to understand for all users by using straightforward language and including meaningful examples. Images are not easy to maintain, so use them only when necessary and place them in the appropriate folder within the `docs/assets/images` directory.
- Verify code snippets: Double-check that all code snippets are correct and runnable.
- Review spelling and grammar: Check the spelling and grammar of the documentation.
- Update the table of contents: If you add a new page, include it in the relevant index.md or the mkdocs.yml file.
### Components gallery
The components gallery section of the documentation is automatically generated by a custom plugin that runs when `mkdocs serve` is called. This gallery helps us visualize each step, along with examples of its use.
!!! Note
    Changes done to the docstrings of `Steps/Tasks/LLMs` won't appear in the components gallery automatically; you will have to stop the `mkdocs` server and run it again to see the changes. Everything else is reloaded automatically.
We are an open-source community-driven project not only focused on building a great product but also on building a great community, where you can get support, share your experiences, and contribute to the project! We would love to hear from you and help you get started with distilabel.
<divclass="grid cards"markdown>
- __Discord__
---
In our Discord channels (#argilla-general and #argilla-help), you can get direct support from the community.
If you build something cool with `distilabel`, consider adding one of these badges to your dataset or model card.
[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)
[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)
[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-dark.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)
[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-dark.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)
## Contribute
To directly contribute with `distilabel`, check our [good first issues](https://github.com/argilla-io/distilabel/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) or [open a new one](https://github.com/argilla-io/distilabel/issues/new/choose).
---
description: Distilabel is an AI Feedback (AIF) framework for building datasets with and for LLMs.
hide:
  - toc
---
# Frequently Asked Questions (FAQ)
??? faq "How can I rename the columns in a batch?"
    Every [`Step`][distilabel.steps.base.Step] has both `input_mappings` and `output_mappings` attributes that can be used to rename the columns in each batch.

    But `input_mappings` will only map, meaning that if you have a batch with the column `A` and you want to rename it to `B`, you should use `input_mappings={"A": "B"}`, but that will only be applied to that specific [`Step`][distilabel.steps.base.Step], meaning that the next step in the pipeline will still have the column `A` instead of `B`.

    While `output_mappings` will indeed apply the rename, meaning that if the [`Step`][distilabel.steps.base.Step] produces the column `A` and you want to rename it to `B`, you should use `output_mappings={"A": "B"}`, and that will be applied to the next [`Step`][distilabel.steps.base.Step] in the pipeline.
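    A minimal sketch combining both, following the `{"A": "B"}` convention described above (the column names and model are illustrative):

    ```python
    from distilabel.models import OpenAILLM
    from distilabel.steps.tasks import TextGeneration

    task = TextGeneration(
        llm=OpenAILLM(model="gpt-4"),
        # Read the incoming "prompt" column as "instruction" within this step only.
        input_mappings={"prompt": "instruction"},
        # Rename the produced "generation" column to "response" for downstream steps.
        output_mappings={"generation": "response"},
    )
    ```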
??? faq "Will the API Keys be exposed when sharing the pipeline?"
    No, those will be masked out using `pydantic.SecretStr`, meaning that they won't be exposed when sharing the pipeline.

    This also means that if you want to re-run your own pipeline and the API keys have not been provided via environment variable, but rather via an attribute or runtime parameter, you will need to provide them again.
??? faq "Does it work for Windows?"
    Yes, but you may need to set the `multiprocessing` context in advance to ensure that the `spawn` method is used, since the default method `fork` is not available on Windows.

    ```python
    import multiprocessing as mp

    mp.set_start_method("spawn")
    ```
??? faq "Will the custom Steps / Tasks / LLMs be serialized too?"
    No. At the moment, only the references to the classes within the `distilabel` library will be serialized, meaning that if you define a custom class used within the pipeline, the serialization won't break, but the deserialization will fail since the class won't be available unless it is used from the same file.
??? faq "What happens if `Pipeline.run` fails? Do I lose all the data?"
    No. We're using a cache mechanism that stores all the intermediate results on disk, so if a [`Step`][distilabel.steps.base.Step] fails, the pipeline can be re-run from that point without losing the data, as long as nothing is changed in the `Pipeline`.

    All the data will be stored in `.cache/distilabel`, but the only data that will persist at the end of the `Pipeline.run` execution is the data from the leaf step/s, so bear that in mind.

    For more information on the caching mechanism in `distilabel`, you can check the [Learn - Advanced - Caching](../how_to_guides/advanced/caching.md) section.

    Also, note that when running a [`Step`][distilabel.steps.base.Step] or a [`Task`][distilabel.steps.tasks.Task] standalone, the cache mechanism won't be used, so if you want to use that, you should use the `Pipeline` context manager.
??? faq "How can I use the same `LLM` across several tasks without having to load it several times?"
    You can serve the LLM using a solution like TGI or vLLM, and then connect to it using an `AsyncLLM` client like `InferenceEndpointsLLM` or `OpenAILLM`. Please refer to the [Serving LLMs guide](../how_to_guides/advanced/serving_an_llm_for_reuse.md) for more information.
??? faq "Can `distilabel` be used with [OpenAI Batch API](https://platform.openai.com/docs/guides/batch)?"
    Yes, `distilabel` is integrated with OpenAI Batch API via [OpenAILLM][distilabel.models.llms.openai.OpenAILLM]. Check [LLMs - Offline Batch Generation](../how_to_guides/basic/llm/index.md#offline-batch-generation) for a small example on how to use it and [Advanced - Offline Batch Generation](../how_to_guides/advanced/offline_batch_generation.md) for a more detailed guide.
??? faq "Prevent overloads on [Free Serverless Endpoints][distilabel.models.llms.huggingface.InferenceEndpointsLLM]"
    When running a task using the [InferenceEndpointsLLM][distilabel.models.llms.huggingface.InferenceEndpointsLLM] with Free Serverless Endpoints, you may face some errors such as `Model is overloaded` if you leave the batch size at the default (set at 50). To fix the issue, lower the value or, even better, set `input_batch_size=1` in your task. It may take longer to finish, but please remember this is a free service.
    ```python
    from distilabel.models import InferenceEndpointsLLM
    from distilabel.steps.tasks import TextGeneration

    # A sketch with an illustrative model id: lower `input_batch_size`
    # to avoid overloading the free serverless endpoint.
    text_generation = TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
        ),
        input_batch_size=1,
    )
    ```
We are installing from `develop` since that's the branch we use to collect all the features, bug fixes, and improvements that will be part of the next release. If you want to install from a specific branch, you can replace `develop` with the branch name.
## Extras
Additionally, some extra dependencies are available as part of `distilabel`, mainly to add support for some of the LLM integrations we support. Here's a list of the available extras:
### LLMs
- `anthropic`: for using models available in [Anthropic API](https://www.anthropic.com/api) via the `AnthropicLLM` integration.
- `argilla`: for exporting the generated datasets to [Argilla](https://argilla.io/).
- `cohere`: for using models available in [Cohere](https://cohere.ai/) via the `CohereLLM` integration.
- `groq`: for using models available in [Groq](https://groq.com/) using the [`groq`](https://github.com/groq/groq-python) Python client via the `GroqLLM` integration.
- `hf-inference-endpoints`: for using the [Hugging Face Inference Endpoints](https://huggingface.co/inference-endpoints) via the `InferenceEndpointsLLM` integration.
- `hf-transformers`: for using models available in the [transformers](https://github.com/huggingface/transformers) package via the `TransformersLLM` integration.
- `litellm`: for using [`LiteLLM`](https://github.com/BerriAI/litellm) to call any LLM using the OpenAI format via the `LiteLLM` integration.
- `llama-cpp`: for using the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) Python bindings for `llama.cpp` via the `LlamaCppLLM` integration.
- `mistralai`: for using models available in [Mistral AI API](https://mistral.ai/news/la-plateforme/) via the `MistralAILLM` integration.
- `ollama`: for using [Ollama](https://ollama.com/) and their available models via the `OllamaLLM` integration.
- `openai`: for using [OpenAI API](https://openai.com/blog/openai-api) models via the `OpenAILLM` integration, or the rest of the integrations based on OpenAI and relying on its client, such as `AnyscaleLLM`, `AzureOpenAILLM`, and `TogetherLLM`.
- `vertexai`: for using [Google Vertex AI](https://cloud.google.com/vertex-ai) proprietary models via the `VertexAILLM` integration.
- `vllm`: for using the [vllm](https://github.com/vllm-project/vllm) serving engine via the `vLLM` integration.
- `sentence-transformers`: for generating sentence embeddings using [sentence-transformers](https://github.com/UKPLab/sentence-transformers).
- `mlx`: for using [MLX](https://github.com/ml-explore/mlx) models via the `MlxLLM` integration.
### Data processing
- `ray`: for scaling and distributing a pipeline with [Ray](https://github.com/ray-project/ray).
- `faiss-cpu` and `faiss-gpu`: for generating sentence embeddings using [faiss](https://github.com/facebookresearch/faiss).
- `minhash`: for using minhash for duplicate detection with [datasketch](https://github.com/datasketch/datasketch) and [nltk](https://github.com/nltk/nltk).
- `text-clustering`: for using text clustering with [UMAP](https://github.com/lmcinnes/umap) and [Scikit-learn](https://github.com/scikit-learn/scikit-learn).
### Structured generation
- `outlines`: for using structured generation of LLMs with [outlines](https://github.com/outlines-dev/outlines).
- `instructor`: for using structured generation of LLMs with [Instructor](https://github.com/jxnl/instructor/).
## Recommendations / Notes
The [`mistralai`](https://github.com/mistralai/client-python) dependency requires Python 3.9 or higher, so if you want to use the `distilabel.models.llms.MistralLLM` implementation, you will need Python 3.9 or higher.
In some cases like [`transformers`](https://github.com/huggingface/transformers) and [`vllm`](https://github.com/vllm-project/vllm), the installation of [`flash-attn`](https://github.com/Dao-AILab/flash-attention) is recommended if you are using a GPU accelerator since it will speed up the inference process, but the installation needs to be done separately, as it's not included in the `distilabel` dependencies.
```sh
pip install flash-attn --no-build-isolation
```
Also, if you want to use the [`llama-cpp-python`](https://github.com/abetlen/llama-cpp-python) integration for running local LLMs, note that the installation process may get a bit trickier depending on which OS you are using, so we recommend reading through the [Installation](https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#installation) section in their docs.
# Quickstart
Distilabel provides all the tools you need to build scalable and reliable pipelines for synthetic data generation and AI feedback. Pipelines are used to generate data, evaluate models, manipulate data, or any other general task. They are made up of different components: Steps, Tasks and LLMs, which are chained together in a directed acyclic graph (DAG).
- **Steps**: These are the building blocks of your pipeline. Normal steps are used for basic executions like loading data, applying some transformations, or any other general task.
- **Tasks**: These are steps that rely on LLMs and prompts to perform generative tasks. For example, they can be used to generate data, evaluate models or manipulate data.
- **LLMs**: These are the models that will perform the task. They can be local or remote models, and open-source or commercial models.
Pipelines are designed to be scalable and reliable. They can be executed in a distributed manner, and they can be cached and recovered. This is useful when dealing with large datasets or when you want to ensure that your pipeline is reproducible.
Besides that, pipelines are designed to be modular and flexible. You can easily add new steps, tasks, or LLMs to your pipeline, and you can also easily modify or remove them. An example architecture of a pipeline to generate a dataset of preferences is the following:
## Installation
To install the latest release with `hf-inference-endpoints` extra of the package from PyPI you can use the following command:
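```bash
pip install distilabel[hf-inference-endpoints] --upgrade
```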
## Use a generic pipeline

To use a generic pipeline for an ML task, you can use the `InstructionResponsePipeline` class. This class is a generic pipeline that can be used to generate data for supervised fine-tuning tasks. It uses the `InferenceEndpointsLLM` class to generate data based on the input data and the model.
The `InstructionResponsePipeline` class will use the `InferenceEndpointsLLM` class with the model `meta-llama/Meta-Llama-3.1-8B-Instruct` to generate data based on the system prompt. The output data will be a dataset with the columns `instruction` and `response`. The class uses a generic system prompt, but you can customize it by passing the `system_prompt` parameter to the class.
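A minimal sketch of how that might look (assuming the class is exposed at `distilabel.pipeline`):

```python
from distilabel.pipeline import InstructionResponsePipeline

# Runs the generic instruction/response generation pipeline with its defaults.
pipeline = InstructionResponsePipeline()
distiset = pipeline.run()
```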
!!! note
    We're actively working on building more pipelines for different tasks. If you have any suggestions or requests, please let us know! We're currently working on pipelines for classification, Direct Preference Optimization, and Information Retrieval tasks.
## Define a Custom pipeline
In this guide we will walk you through the process of creating a simple pipeline that uses the [InferenceEndpointsLLM][distilabel.models.llms.InferenceEndpointsLLM] class to generate text. The [Pipeline][distilabel.pipeline.Pipeline] will process a dataset loaded directly using the Hugging Face `datasets` library and use the [InferenceEndpointsLLM][distilabel.models.llms.InferenceEndpointsLLM] class to generate text using the [TextGeneration][distilabel.steps.tasks.text_generation.TextGeneration] task.
> You can check the available models in the [Hugging Face Model Hub](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending) and filter by `Inference status`.
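A sketch of such a pipeline is shown below; the model id, generation parameters, and dataset repository are the ones referenced in the numbered notes that follow:

```python
from datasets import load_dataset

from distilabel.models import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration

with Pipeline() as pipeline:  # (1)
    TextGeneration(  # (2)
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            generation_kwargs={"temperature": 0.7, "max_new_tokens": 512},
        ),
    )

if __name__ == "__main__":
    dataset = load_dataset("distilabel-internal-testing/instructions", split="test")  # (3)
    distiset = pipeline.run(dataset=dataset)
    distiset.push_to_hub(repo_id="distilabel-example")  # (4)
```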
1. We define a [Pipeline][distilabel.pipeline.Pipeline] using its context manager. Any [Step][distilabel.steps.base.Step] subclass defined within the context manager will be automatically added to the pipeline.
2. We define a [TextGeneration][distilabel.steps.tasks.text_generation.TextGeneration] task that uses the [InferenceEndpointsLLM][distilabel.models.llms.InferenceEndpointsLLM] class with the model `Meta-Llama-3.1-8B-Instruct`. The generation parameters are set directly in the LLM configuration with a temperature of 0.7 and a maximum of 512 new tokens.
3. We load the dataset directly using the Hugging Face `datasets` library from the repository `distilabel-internal-testing/instructions` using the `test` split.
4. Optionally, we can push the generated [Distiset][distilabel.distiset.Distiset] to the Hugging Face Hub repository `distilabel-example`. This will allow you to share the generated dataset with others and use it in other pipelines.
Being able to export the generated synthetic datasets to Argilla is a core feature within `distilabel`. We believe in the potential of synthetic data, but without removing the impact a human annotator or group of annotators can bring. To that end, the Argilla integration makes it straightforward to push a dataset to Argilla while the [`Pipeline`][distilabel.pipeline.Pipeline] is running, so you can follow along the generation process in Argilla's UI, as well as annotate the records on the fly. One can include a [`Step`][distilabel.steps.Step] within the [`Pipeline`][distilabel.pipeline.Pipeline] to easily export the datasets to Argilla with a pre-defined configuration, suiting the annotation purposes.
Before using any of the steps described below, you should first have an Argilla instance up and running, so that you can successfully upload the data to it. The easiest and most straightforward way to deploy Argilla is via the [Argilla Template in Hugging Face Spaces](https://huggingface.co/docs/hub/en/spaces-sdks-docker-argilla), simply following the steps there.
### Text Generation

For text generation scenarios, i.e. when the [`Pipeline`][distilabel.pipeline.Pipeline] contains a single [`TextGeneration`][distilabel.steps.tasks.TextGeneration] step, we have designed the task [`TextGenerationToArgilla`][distilabel.steps.TextGenerationToArgilla], which will seamlessly push the generated data to Argilla, and allow the annotator to review the records.
The dataset will be pushed with the following configuration:
- **Fields**: `instruction` and `generation`, both being fields of type `argilla.TextField`, plus the automatically generated `id` for the given `instruction` to be able to search for other records with the same `instruction` in the dataset. The field `instruction` must always be a string, while the field `generation` can either be a single string or a list of strings (useful when there are multiple parent nodes of type [`TextGeneration`][distilabel.steps.tasks.TextGeneration]); even though each record will always contain at most one `instruction`-`generation` pair.
- **Questions**: `quality` will be the only question for the annotators to answer, i.e., to annotate, and it will be an `argilla.LabelQuestion` referring to the quality of the provided generation for the given instruction. It can be annotated as either 👎 (bad) or 👍 (good).
!!! NOTE
    The [`TextGenerationToArgilla`][distilabel.steps.TextGenerationToArgilla] step will only work as is if the [`Pipeline`][distilabel.pipeline.Pipeline] contains one or multiple [`TextGeneration`][distilabel.steps.tasks.TextGeneration] steps, or if the columns `instruction` and `generation` are available within the batch data. Otherwise, the variable `input_mappings` will need to be set so that either both or one of `instruction` and `generation` are mapped to one of the existing columns in the batch data.
"instruction":"Write a short story about a dragon that saves a princess from a tower.",
},
],
)
text_generation=TextGeneration(
name="text_generation",
llm=OpenAILLM(model="gpt-4"),
)
to_argilla=TextGenerationToArgilla(
dataset_name="my-dataset",
dataset_workspace="admin",
api_url="<ARGILLA_API_URL>",
api_key="<ARGILLA_API_KEY>",
)
load_dataset>>text_generation>>to_argilla
pipeline.run()
```

### Preference
For preference scenarios, i.e. when the [`Pipeline`][distilabel.pipeline.Pipeline] contains multiple [`TextGeneration`][distilabel.steps.tasks.TextGeneration] steps, we have designed the task [`PreferenceToArgilla`][distilabel.steps.PreferenceToArgilla], which will seamlessly push the generated data to Argilla, and allow the annotator to review the records.
The dataset will be pushed with the following configuration:
- **Fields**: `instruction` and `generations`, both being fields of type `argilla.TextField`, plus the automatically generated `id` for the given `instruction` to be able to search for other records with the same `instruction` in the dataset. The field `instruction` must always be a string, while the field `generations` must be a list of strings containing the generated texts for the given `instruction`, so that there are at least two generations to compare. Other than that, the number of `generation` fields within each record in Argilla will be defined by the value of the variable `num_generations` to be provided in the [`PreferenceToArgilla`][distilabel.steps.PreferenceToArgilla] step.
- **Questions**: `rating` and `rationale` will be the pairs of questions to be defined per each generation, i.e. per each value within the range from 0 to `num_generations`, and those will be of types `argilla.RatingQuestion` and `argilla.TextQuestion`, respectively. Note that only the first pair of questions will be mandatory, since only one generation is ensured to be within the batch data. Additionally, note that the provided ratings will range from 1 to 5, since Argilla only supports values above 0.
!!! NOTE
    The [`PreferenceToArgilla`][distilabel.steps.PreferenceToArgilla] step will only work if the [`Pipeline`][distilabel.pipeline.Pipeline] contains multiple [`TextGeneration`][distilabel.steps.tasks.TextGeneration] steps, or if the columns `instruction` and `generations` are available within the batch data. Otherwise, the variable `input_mappings` will need to be set so that either both or one of `instruction` and `generations` are mapped to one of the existing columns in the batch data.
!!! NOTE
    Additionally, if the [`Pipeline`][distilabel.pipeline.Pipeline] contains an [`UltraFeedback`][distilabel.steps.tasks.UltraFeedback] step, the `ratings` and `rationales` will also be available and will be automatically injected as suggestions to the existing dataset.
When dealing with complex pipelines that get executed in a distributed environment with abundant resources (CPUs and GPUs), sometimes it's necessary to allocate these resources judiciously among the `Step`s. This is why `distilabel` allows specifying the number of `replicas`, `cpus` and `gpus` for each `Step`. Let's see that with an example:
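What follows is a sketch of such a definition (assuming `StepResources` is importable from `distilabel.steps.base`, as referenced below; the `PrometheusEval` arguments are trimmed to the essentials):

```python
from distilabel.models import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.base import StepResources
from distilabel.steps.tasks import PrometheusEval

with Pipeline(name="resources-example") as pipeline:
    prometheus = PrometheusEval(
        name="prometheus",
        llm=vLLM(model="prometheus-eval/prometheus-7b-v2.0"),
        mode="absolute",  # illustrative task configuration
        rubric="factual-validity",
        # 1 GPU and 1 CPU per replica, and 2 replicas of the task.
        resources=StepResources(replicas=2, cpus=1, gpus=1),
    )
```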
In the example above, we're creating a `PrometheusEval` task (remember that `Task`s are `Step`s) that will use `vLLM` to serve the `prometheus-eval/prometheus-7b-v2.0` model. This task is resource-intensive as it requires an LLM, which in turn requires a GPU to run fast. With that in mind, we have specified the `resources` required for the task using the [`StepResources`][distilabel.steps.base.StepResources] class, and we have defined that we need `1` GPU and `1` CPU per replica of the task. We have also defined that we need `2` replicas, i.e. we will run two instances of the task so the computation for the whole dataset runs faster. Finally, `StepResources` uses the [RuntimeParametersMixin][distilabel.mixins.runtime_parameters.RuntimeParametersMixin], so we can also specify the resources for each step when running the pipeline:
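A sketch of overriding those resources at run time, keyed by the step name used above:

```python
distiset = pipeline.run(
    parameters={
        "prometheus": {  # the name given to the PrometheusEval step
            "resources": {"replicas": 2, "cpus": 1, "gpus": 1},
        },
    },
)
```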
`distilabel` will automatically save all the intermediate outputs generated by each [`Step`][distilabel.steps.base.Step] of a [`Pipeline`][distilabel.pipeline.local.Pipeline], so these outputs can be reused to recover the state of a pipeline execution that was stopped before finishing or to not have to re-execute steps from a pipeline after adding a new downstream step.
## How to enable/disable the cache
The use of the cache can be toggled using the `use_cache` parameter of the [`Pipeline.run`][distilabel.pipeline.base.BasePipeline.run] method. If `True`, then `distilabel` will reuse the outputs of previous executions for the new execution. If `False`, then `distilabel` will re-execute all the steps of the pipeline to generate new outputs for all the steps.
```python
withPipeline(name="my-pipeline")aspipeline:
...
if__name__=="__main__":
distiset=pipeline.run(use_cache=False)# (1)
```
1. Pipeline cache is disabled
In addition, the cache can be enabled/disabled at [`Step`][distilabel.steps.base.Step] level using its `use_cache` attribute. If `True`, then the outputs of the step will be reused in the new pipeline execution. If `False`, then the step will be re-executed to generate new outputs. If the cache of one step is disabled and the outputs have to be regenerated, then the outputs of the steps that depend on this step will also be regenerated.
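At the step level it looks like this sketch (the task and `LLM` are illustrative):

```python
from distilabel.models import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="my-pipeline") as pipeline:
    text_generation = TextGeneration(
        llm=OpenAILLM(model="gpt-4"),
        use_cache=False  # (1)
    )
```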
1. Step cache is disabled and every time the pipeline is executed, this step will be re-executed
## How a cache hit is triggered
`distilabel` groups the information and data generated by a `Pipeline` using the name of the pipeline, so the first factor that triggers a cache hit is the name of the pipeline. The second factor is the [`Pipeline.signature`][distilabel.pipeline.local.Pipeline.signature] property. This property returns a hash that is generated using the names of the steps used in the pipeline and their connections. The third factor is the [`Pipeline.aggregated_steps_signature`][distilabel.pipeline.local.Pipeline.aggregated_steps_signature] property, which is used to determine if the new pipeline execution is exactly the same as one of the previous ones, i.e. the pipeline contains exactly the same steps, with exactly the same connections, and the steps are using exactly the same parameters. If these three factors are met, then the cache hit is triggered and the pipeline won't get re-executed; instead, the function [`create_distiset`][distilabel.distiset.create_distiset] will be used to create the resulting [`Distiset`][distilabel.distiset.Distiset] using the outputs of the previous execution, as can be seen in the following image:
If the new pipeline execution has a different `Pipeline.aggregated_steps_signature`, i.e. at least one step has changed its parameters, `distilabel` will reuse the outputs of the steps that have not changed and re-execute the steps that have changed, as can be seen in the following image:

The same pipeline from above gets executed a third time, but this time the last step `text_generation_1` changed, so it needs to be re-executed. The other steps, as they have not been changed, don't need to be re-executed and their outputs are reused.
`Distilabel` offers a [`CLI`][distilabel.cli.pipeline.utils] to _explore_ and _re-run_ existing [`Pipeline`][distilabel.pipeline.Pipeline] dumps, meaning that an existing dump can be explored to see the steps, how they are connected, and the runtime parameters used, and it can also be re-run with the same or different runtime parameters.
## Available commands
The only available command as of the current version of `distilabel` is `distilabel pipeline`.
The `distilabel pipeline` command has two subcommands: `info` and `run`, as described below. Note that for testing purposes we will be using the following [dataset](https://huggingface.co/datasets/distilabel-internal-testing/ultrafeedback-mini).
### `distilabel pipeline info`

As we can see from the help message, we need to pass either a `Path` or a `URL`. This second option comes in handy for datasets stored in the Hugging Face Hub, for example:
```bash
distilabel pipeline info --config "https://huggingface.co/datasets/distilabel-internal-testing/instruction-dataset-mini-with-generations/raw/main/pipeline.yaml"
```
The pipeline information includes the steps used in the `Pipeline` along with the `Runtime Parameters` that were used, as well as a description of each of them, and also the connections between these steps. This can be helpful to explore the Pipeline locally.
### `distilabel pipeline run`
We can also run a `Pipeline` from the CLI, just pointing to the same `pipeline.yaml` file or a URL pointing to it and calling `distilabel pipeline run`. Alternatively, a URL pointing to a Python script containing a distilabel pipeline can be used:

Using the `--config` option, we must pass a path with a `pipeline.yaml` file.
To specify the runtime parameters of the steps we will need to use the `--param` option and the value of the parameter in the following format:
```bash
distilabel pipeline run --config "https://huggingface.co/datasets/distilabel-internal-testing/instruction-dataset-mini-with-generations/raw/main/pipeline.yaml" \
    --param "<step-name>.<parameter-name>=<value>"
```
A [`Pipeline`][distilabel.pipeline.Pipeline] in `distilabel` returns a special type of Hugging Face [`datasets.DatasetDict`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict) which is called [`Distiset`][distilabel.distiset.Distiset].
The [`Distiset`][distilabel.distiset.Distiset] is a dictionary-like object that contains the different configurations generated by the [`Pipeline`][distilabel.pipeline.Pipeline], where each configuration corresponds to a leaf step in the DAG built by the [`Pipeline`][distilabel.pipeline.Pipeline]. Each configuration corresponds to a different subset of the dataset. This is a concept taken from 🤗 `datasets` that lets you upload different configurations of the same dataset within the same repository, each potentially containing different columns, which can be seamlessly pushed to the Hugging Face Hub.
Below you can find an example of how to create a [`Distiset`][distilabel.distiset.Distiset] object that resembles a [`datasets.DatasetDict`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict):
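```python
from datasets import Dataset

from distilabel.distiset import Distiset

# A sketch mirroring the train/test split example below: two leaf
# steps, each becoming a configuration (subset) of the Distiset.
distiset = Distiset(
    {
        "leaf_step_1": Dataset.from_dict({"instruction": [1, 2, 3]}),
        "leaf_step_2": Dataset.from_dict(
            {"instruction": [1, 2, 3, 4], "generation": [5, 6, 7, 8]}
        ),
    }
)
```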
If there's only one leaf node, i.e., only one step at the end of the [`Pipeline`][distilabel.pipeline.Pipeline], then the configuration name won't be the name of the last step, but it will be set to "default" instead, as that's more aligned with standard datasets within the Hugging Face Hub.
## Distiset methods
We can interact with the different pieces generated by the [`Pipeline`][distilabel.pipeline.Pipeline] and treat them as different [`configurations`](https://huggingface.co/docs/datasets-server/configs_and_splits#configurations). The [`Distiset`][distilabel.distiset.Distiset] provides the following methods:
### Train/Test split
Create a train/test split partition of the dataset for the different configurations or subsets.
```python
>>> distiset.train_test_split(train_size=0.9)
Distiset({
    leaf_step_1: DatasetDict({
        train: Dataset({
            features: ['instruction'],
            num_rows: 2
        })
        test: Dataset({
            features: ['instruction'],
            num_rows: 1
        })
    })
    leaf_step_2: DatasetDict({
        train: Dataset({
            features: ['instruction', 'generation'],
            num_rows: 3
        })
        test: Dataset({
            features: ['instruction', 'generation'],
            num_rows: 1
        })
    })
})
```
### Push to Hugging Face Hub
Push the [`Distiset`][distilabel.distiset.Distiset] to a Hugging Face repository, where each one of the subsets will correspond to a different configuration:
```python
import os

distiset.push_to_hub(
"my-org/my-dataset",
commit_message="Initial commit",
private=False,
token=os.getenv("HF_TOKEN"),
generate_card=True,
include_script=False
)
```
!!! info "New since version 1.3.0"
    Since version `1.3.0` you can automatically push the script that created your pipeline to the same repository. For example, assuming you have a file like the following:

    ``` py title="sample_pipe.py"
    with Pipeline() as pipe:
        ...
    distiset = pipe.run()
    distiset.push_to_hub(
        "my-org/my-dataset",
        include_script=True
    )
    ```

    After running the command, you can visit the repository, where the file `sample_pipe.py` will be stored to simplify sharing your pipeline with the community.
#### Custom Docstrings
`distilabel` contains a custom plugin to automatically generate a gallery for the different components. The information is extracted by parsing the `Step`'s docstrings. You can take a look at the docstrings in the source code of [UltraFeedback][distilabel.steps.tasks.ultrafeedback.UltraFeedback], and at the corresponding entry in the components gallery, to see an example of how the docstrings are rendered.
If you create your own components and want the `Citations` automatically rendered in the README card (in case you are sharing your final distiset in the Hugging Face Hub), you may want to add the citation section. This is an example for the [MagpieGenerator][distilabel.steps.tasks.magpie.generator.MagpieGenerator] Task:
```python
class MagpieGenerator(GeneratorTask, MagpieBase):
    r"""Generator task that generates instructions or conversations using Magpie.

    ...

    Citations:
        ```
        @misc{xu2024magpiealignmentdatasynthesis,
            title={Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing},
            author={Zhangchen Xu and Fengqing Jiang and Luyao Niu and Yuntian Deng and Radha Poovendran and Yejin Choi and Bill Yuchen Lin},
            year={2024},
            eprint={2406.08464},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2406.08464},
        }
        ```
    """
```
The `Citations` section can include any number of bibtex references. To define them, you can add as many elements as needed, just like in the example: each citation will be a block of the form ` ```@misc{...}``` `. This information will be automatically used in the README of your `Distiset` if you decide to call `distiset.push_to_hub`. Alternatively, if no `Citations` section is found but the `References` section contains any URLs pointing to `https://arxiv.org/`, we will try to obtain the `Bibtex` equivalent automatically. This way, Hugging Face can automatically track the paper for you, and it's easier to find other datasets citing the same paper or to visit the paper page directly.
#### Image Datasets
!!! info "Keep reading if you are interested in Image datasets"
    The `Distiset` object has a new method `transform_columns_to_image` specifically to transform the images to `PIL.Image.Image` before pushing the dataset to the Hugging Face Hub.
Since version `1.5.0` we have the [`ImageGeneration`](https://distilabel.argilla.io/dev/components-gallery/task/imagegeneration/) task, which is able to generate images from text. By default, the whole process works internally with a string representation of the images; this is done for simplicity while processing. But to take advantage of the Hugging Face Hub functionalities when the generated dataset is going to be stored there, a proper Image object may be preferable, so that we can see the images in the dataset viewer, for example. Let's take a look at the following pipeline, extracted from "examples/image_generation.py" at the root of the repository, to see how we can do it:
```diff
# Assume all the imports are already done, we are only interested
# in the part where the Distiset is transformed after the run.
with Pipeline(name="image_generation_pipeline") as pipeline:
    ...

  distiset = pipeline.run()
+ distiset = distiset.transform_columns_to_image("image")
```
We call [`transform_columns_to_image`][distilabel.distiset.Distiset.transform_columns_to_image] on the image columns we may have generated (in this case we only want to transform the `image` column, but a list can be passed). This will apply to any leaf nodes we have in the pipeline, meaning that if we have different subsets, the `image` column will be transformed in all of them.
### Save and load from disk
Take into account that these methods work like `datasets.load_from_disk` and `datasets.Dataset.save_to_disk`, so the arguments are directly passed to those methods. This means you can also make use of the `storage_options` argument to save your [`Distiset`][distilabel.distiset.Distiset] in your cloud provider, including the distilabel artifacts (`pipeline.yaml`, `pipeline.log` and the `README.md` with the dataset card). You can read more in the `datasets` documentation [here](https://huggingface.co/docs/datasets/filesystems#saving-serialized-datasets).
=== "Save to disk"
    Save the [`Distiset`][distilabel.distiset.Distiset] to disk, and optionally (done by default) save the dataset card, the pipeline config file and logs:

    ```python
    distiset.save_to_disk(
        "my-dataset",
        save_card=True,
        save_pipeline_config=True,
        save_pipeline_log=True
    )
    ```
=== "Load from disk (local)"
    Load a [`Distiset`][distilabel.distiset.Distiset] that was saved using [`Distiset.save_to_disk`][distilabel.distiset.Distiset.save_to_disk] just the same way:

    ```python
    distiset = Distiset.load_from_disk("my-dataset")
    ```
=== "Load from disk (cloud)"
    Load a [`Distiset`][distilabel.distiset.Distiset] from a remote location, like S3 or GCS. You can pass the `storage_options` argument to authenticate with the cloud provider:

    ```python
    distiset = Distiset.load_from_disk(
        "s3://path/to/my_dataset",  # gcs:// or any filesystem tolerated by fsspec
        storage_options={
            "key": os.environ["S3_ACCESS_KEY"],
            "secret": os.environ["S3_SECRET_KEY"],
            ...
        }
    )
    ```
Take a look at the remaining arguments at [`Distiset.save_to_disk`][distilabel.distiset.Distiset.save_to_disk] and [`Distiset.load_from_disk`][distilabel.distiset.Distiset.load_from_disk].
## Dataset card
Having this special type of dataset comes with an added advantage when calling [`Distiset.push_to_hub`][distilabel.distiset.Distiset.push_to_hub], which is the automatically generated dataset card in the Hugging Face Hub. Note that it is enabled by default, but can be disabled by setting `generate_card=False`:
We will have an automatic dataset card (an example can be seen [here](https://huggingface.co/datasets/distilabel-internal-testing/deita)) with some handy information like reproducing the [`Pipeline`][distilabel.pipeline.Pipeline] with the `CLI`, or examples of the records from the different subsets.
## create_distiset helper
Lastly, we presented the [`create_distiset`][distilabel.distiset.create_distiset] function in the [caching](./caching.md) section; take a look there to see how a [`Distiset`][distilabel.distiset.Distiset] can be created from the cache folder, using the helper function to automatically include all the relevant data.
# Using a file system to pass data of batches between steps
In some situations, it can happen that the batches contain so much data that it is faster to write it to disk and read it back in the next step, instead of passing it using the queue. To solve this issue, `distilabel` uses [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/), allowing you to provide a file system configuration and indicate whether this file system should be used to pass data between steps in the `run` method of the `distilabel` pipelines:
!!! WARNING
    In order to use a specific file system/cloud storage, you will need to install the specific package providing the `fsspec` implementation for that file system. For instance, to use Google Cloud Storage you will need to install `gcsfs`:

    ```bash
    pip install gcsfs
    ```

    Check the available implementations: [fsspec - Other known implementations](https://filesystem-spec.readthedocs.io/en/latest/api.html#other-known-implementations)
```python
from distilabel.pipeline import Pipeline

with Pipeline(name="my-pipeline") as pipeline:
    ...

if __name__ == "__main__":
    distiset = pipeline.run(
        ...,
        storage_parameters={"path": "gcs://my-bucket"},
        use_fs_to_pass_data=True
    )
```
The code above sets up a file system (in this case Google Cloud Storage) and sets the flag `use_fs_to_pass_data` to specify that the data of the batches should be passed to the steps using the file system. The `storage_parameters` argument is optional, and if it's not provided but `use_fs_to_pass_data==True`, `distilabel` will use the local file system.
!!! NOTE
    As `GlobalStep`s receive all the data from the previous steps in one single batch, accumulating all the data, it's very likely that the data of the batch will be too big to be passed using the queue. In this case, and even if `use_fs_to_pass_data==False`, `distilabel` will use the file system to pass the data to the `GlobalStep`.
By default, the `distilabel` architecture loads all steps of a pipeline at the same time, as they are all supposed to process batches of data in parallel. However, loading all steps at once can waste resources in two scenarios: when using `GlobalStep`s that must wait for upstream steps to complete before processing data, or when running on machines with limited resources that cannot execute all steps simultaneously. In these cases, steps need to be loaded and executed in distinct **load stages**.
## Load stages
A load stage represents a point in the pipeline execution where a group of steps are loaded at the same time to process batches in parallel. These stages are required because:
1. There are some kinds of steps, like the `GlobalStep`s, that need to receive all the data at once from their upstream steps, i.e. they need their upstream steps to have finished their execution. It would be wasteful to load a `GlobalStep` at the same time as other steps of the pipeline, as that would take resources (from the machine or cluster running the pipeline) that wouldn't be used until the upstream steps have finished.
2. When running on machines or clusters with limited resources, it may not be possible to load and execute all steps simultaneously, as they would need to access the same limited resources (memory, CPU, GPU, etc.).
Having said that, the first element that will create a load stage when executing a pipeline is the [`GlobalStep`][distilabel.steps.base.GlobalStep], as it marks and divides a pipeline into three stages: one stage with the upstream steps of the global step, one stage with the global step, and one final stage with the downstream steps of the global step. For example, the following pipeline will contain three stages:
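A sketch along these lines (the custom steps, built with the `step` decorator, are illustrative):

```python
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts, StepInput, step

# Illustrative global step: receives all the data from upstream at once.
@step(inputs=["instruction"], outputs=["instruction"], step_type="global")
def GlobalFilter(inputs: StepInput):
    yield inputs

# Illustrative normal step that just passes the batch through.
@step(inputs=["instruction"], outputs=["instruction"])
def Passthrough(inputs: StepInput):
    yield inputs

with Pipeline(name="load-stages-example") as pipeline:
    load_data = LoadDataFromDicts(data=[{"instruction": "Hello"}])
    global_filter = GlobalFilter()
    passthrough = Passthrough()
    # Stages: [load_data], [global_filter], [passthrough]
    load_data >> global_filter >> passthrough
```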
As we can see, the `GlobalStep` divided the pipeline execution into three stages.
## Load groups
While `GlobalStep`s automatically divide pipeline execution into stages, we may need fine-grained control over how steps are loaded and executed within each stage. This is where **load groups** come in.

Load groups allow specifying which steps of the pipeline have to be loaded together within a stage. This is particularly useful when running on resource-constrained environments where not all the steps can be executed in parallel.
In this example, we're working with a machine that has a single GPU, but the pipeline includes two instances of `TextGeneration` tasks, both using `vLLM` and requesting 1 GPU, so we cannot execute both steps in parallel. To fix that, we specify in the `run` method, using the `load_groups` argument, that the `text_generation_0` step has to be executed in isolation in its own stage, as shown in the sketch below. This way, we can run the pipeline on a single-GPU machine by executing the steps in different stages (sequentially) instead of in parallel.
Some key points about load groups:
1. Load groups are specified as a list of lists, where each inner list represents a group of steps that should be loaded together.
2. Same as with `GlobalStep`s, load groups create a new load stage, dividing the pipeline into three stages: one for the upstream steps, one for the steps in the load group, and one for the downstream steps.
### Load groups modes
In addition, `distilabel` allows passing certain modes to the `load_groups` argument that handle the creation of the load groups:
-`"sequential_step_execution"`: when passed, it will create a load group for each step i.e. the execution of the steps of the pipeline will be sequential.
The [offline batch generation](../basic/llm/index.md#offline-batch-generation) is a feature that some `LLM`s implemented in `distilabel` offer, allowing you to send the inputs to an LLM-as-a-service platform and wait for the outputs asynchronously. LLM-as-a-service platforms offer this feature as it allows them to gather many inputs and create batches as big as the hardware allows, maximizing hardware utilization and reducing the cost of the service. In exchange, the user has to wait a certain time for the outputs to be ready, but the cost per token is usually much lower.
`distilabel` pipelines are able to handle `LLM`s that offer this feature in the following way:
* The first time the pipeline gets executed, the `LLM` will send the inputs to the platform. The platform will return job ids that can be used later to check the status of the jobs and retrieve the results. The `LLM` will save these job ids in its `jobs_ids` attribute and raise a special exception [DistilabelOfflineBatchGenerationNotFinishedException][distilabel.exceptions.DistilabelOfflineBatchGenerationNotFinishedException] that will be handled by the `Pipeline`. The job ids will be saved in the pipeline cache, so they can be used in subsequent calls.
* On the second and subsequent calls, the pipeline execution will be recovered and the `LLM` won't send the inputs to the platform again. This time, as it has the `jobs_ids`, it will check whether the jobs have finished; if they have, it will retrieve the results and return the outputs. If they haven't finished, it will raise `DistilabelOfflineBatchGenerationNotFinishedException` again.
* In addition, `LLM`s with offline batch generation can be configured to poll until the jobs have finished, blocking the pipeline until they are done. If for some reason the polling needs to be stopped, one can press ++ctrl+c++ or ++cmd+c++ depending on the OS (or send a `SIGINT` to the main process), which will stop the polling and raise `DistilabelOfflineBatchGenerationNotFinishedException`, handled by the pipeline as described above.
!!! WARNING
    In order to recover the pipeline execution and retrieve the results, the pipeline cache must be enabled. If the pipeline cache is disabled, the inputs will be sent again, creating different jobs and incurring extra costs.
## Example pipeline using `OpenAILLM` with offline batch generation
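The example itself isn't reproduced in this excerpt, so here is a minimal sketch of what such a pipeline could look like. The dataset is illustrative, `use_offline_batch_generation` is assumed to be available on `OpenAILLM` as in recent `distilabel` versions, and the `OpenAILLM` import path may be `distilabel.models` in newer releases:

```python
from distilabel.llms import OpenAILLM  # `distilabel.models` in newer versions
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="offline-batch-generation-pipeline") as pipeline:
    load_data = LoadDataFromHub(output_mappings={"prompt": "instruction"})

    text_generation = TextGeneration(
        llm=OpenAILLM(
            model="gpt-4o-mini",
            use_offline_batch_generation=True,  # use the platform's batch API
        )
    )

    load_data >> text_generation

if __name__ == "__main__":
    distiset = pipeline.run(
        parameters={
            load_data.name: {
                "repo_id": "HuggingFaceH4/instruction-dataset",
                "split": "test",
            },
        }
    )
```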
When sharing a `Pipeline` that contains custom `Step`s or `Task`s, you may want to add the specific requirements that are needed to run them. `distilabel` will take this list of requirements and warn the user if any are missing.
Let's see how we can add additional requirements with an example. The first thing we're going to do is add requirements for our `CustomStep`. To do so, we use the `requirements` decorator to specify that the step has `nltk>=3.8` as a dependency (we can use [version specifiers](https://peps.python.org/pep-0440/#version-specifiers)). In addition, we're going to specify at the `Pipeline` level that we need `distilabel>=1.3.0` to run it.
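A sketch of this setup follows; the `requirements` decorator import path is an assumption that may vary across versions, and the step body is illustrative:

```python
from typing import List

from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts, Step, StepInput, StepOutput
from distilabel.utils.requirements import requirements  # assumed import path


@requirements(["nltk>=3.8"])  # requirement declared at the `Step` level
class CustomStep(Step):
    @property
    def inputs(self) -> List[str]:
        return ["instruction"]

    @property
    def outputs(self) -> List[str]:
        return ["response"]

    def process(self, inputs: StepInput) -> "StepOutput":
        import nltk  # only imported when the step actually runs

        for input in inputs:
            input["response"] = " ".join(nltk.word_tokenize(input["instruction"]))
        yield inputs


with Pipeline(
    name="pipeline-with-requirements",
    requirements=["distilabel>=1.3.0"],  # requirement declared at the `Pipeline` level
) as pipeline:
    load_data = LoadDataFromDicts(data=[{"instruction": "Tell me a joke."}])
    load_data >> CustomStep()
```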
Once we call `pipeline.run()`, if any of the requirements declared at the `Step` or `Pipeline` level is missing, a `ValueError` will be raised telling us that we should install the listed dependencies.
Some `Step`s might need to produce an auxiliary artifact that is not a result of the computation, but is needed for the computation. For example, [`FaissNearestNeighbour`](../../../components-gallery/steps/faissnearestneighbour.md) needs to create a Faiss index to compute the output of the step, which is the top `k` nearest neighbours for each input. Generating the Faiss index takes time, and it could potentially be reused outside of the `distilabel` pipeline, so it would be a shame not to save it.
For this reason, `Step`s have a method called `save_artifact` that allows saving artifacts that will be included along with the outputs of the pipeline in the generated [`Distiset`][distilabel.distiset.Distiset]. The generated artifacts will be uploaded or saved when using `Distiset.push_to_hub` or `Distiset.save_to_disk`, respectively. Let's see how to use it with a simple example.
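The step itself isn't included in this excerpt, so the following is a reconstruction consistent with the artifact names referenced later in this section (step `count_text_characters_0`, artifact `text_character_count_distribution`):

```python
from typing import TYPE_CHECKING, List

import matplotlib.pyplot as plt

from distilabel.steps import GlobalStep, StepInput

if TYPE_CHECKING:
    from distilabel.steps import StepOutput


class CountTextCharacters(GlobalStep):
    @property
    def inputs(self) -> List[str]:
        return ["text"]

    @property
    def outputs(self) -> List[str]:
        return ["text_character_count"]

    def process(self, inputs: StepInput) -> "StepOutput":
        character_counts = []

        for input in inputs:
            input["text_character_count"] = len(input["text"])
            character_counts.append(input["text_character_count"])

        # Generate a histogram with the distribution of text character counts
        plt.hist(character_counts, bins=25)
        plt.title("Distribution of text character counts")
        plt.xlabel("Character count")
        plt.ylabel("Frequency")

        # Save the histogram as an artifact of the step
        self.save_artifact(
            name="text_character_count_distribution",
            write_function=lambda path: plt.savefig(path / "figure.png"),
            metadata={"type": "image", "library": "matplotlib"},
        )

        plt.close()

        yield inputs
```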
As can be seen in the example above, we have created a simple step that counts the number of characters in each input text and generates a histogram with the distribution of the character counts. We save the histogram as an artifact of the step using the `save_artifact` method. The method takes three arguments:
- `name`: The name we want to give to the artifact.
- `write_function`: A function that writes the artifact to the desired path. The function will receive a `path` argument, which is a `pathlib.Path` object pointing to the directory where the artifact should be saved.
- `metadata`: A dictionary with metadata about the artifact. This metadata will be saved along with the artifact.
Let's execute the step with a simple pipeline and push the resulting `Distiset` to the Hugging Face Hub:
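A sketch of that execution follows; the source dataset and column renaming are illustrative, while the repository id matches the one referenced below:

```python
from datasets import load_dataset

from distilabel.pipeline import Pipeline

with Pipeline() as pipeline:
    count_text_characters = CountTextCharacters()

if __name__ == "__main__":
    distiset = pipeline.run(
        dataset=load_dataset(
            "HuggingFaceH4/instruction-dataset", split="test"
        ).rename_column("prompt", "text"),
    )

    distiset.push_to_hub("distilabel-internal-testing/distilabel-artifacts-example")
```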
??? "Example full code"
```python
from typing import TYPE_CHECKING, List
import matplotlib.pyplot as plt
from datasets import load_dataset
from distilabel.pipeline import Pipeline
from distilabel.steps import GlobalStep, StepInput, StepOutput
The generated [distilabel-internal-testing/distilabel-artifacts-example](https://huggingface.co/datasets/distilabel-internal-testing/distilabel-artifacts-example) dataset repository has a section in its card [describing the artifacts generated by the pipeline](https://huggingface.co/datasets/distilabel-internal-testing/distilabel-artifacts-example#artifacts) and the generated plot can be seen [here](https://huggingface.co/datasets/distilabel-internal-testing/distilabel-artifacts-example/blob/main/artifacts/count_text_characters_0/text_character_count_distribution/figure.png).
Although the local [Pipeline][distilabel.pipeline.local.Pipeline] based on [`multiprocessing`](https://docs.python.org/3/library/multiprocessing.html) + [serving LLMs with an external service](serving_an_llm_for_reuse.md) is enough for executing most of the pipelines used to create SFT and preference datasets, there are scenarios where you might need to scale your pipeline across multiple machines. In such cases, distilabel leverages [Ray](https://www.ray.io/) to distribute the workload efficiently. This allows you to generate larger datasets, reduce execution time, and maximize resource utilization across a cluster of machines, without needing to change a single line of code.
## Relation between distilabel steps and Ray Actors
A `distilabel` pipeline consists of several [`Step`][distilabel.steps.base.Step]s. A `Step` is a class that defines a basic life-cycle:
1. It will load or create the resources (LLMs, clients, etc) required to run its logic.
2. It will run a loop waiting for incoming batches received using a queue. Once it receives one batch, it will process it and put the processed batch into an output queue.
3. When it finishes processing the final batch, or receives a special signal, the loop will finish and the unload logic will be executed.
So a `Step` needs to maintain a minimal state, and the best way to do that with Ray is using [actors](https://docs.ray.io/en/latest/ray-core/actors.html).
``` mermaid
graph TD
A[Step] -->|has| B[Multiple Replicas]
B -->|wrapped in| C[Ray Actor]
C -->|maintains| D[Step Replica State]
C -->|executes| E[Step Lifecycle]
E -->|1. Load/Create Resources| F[LLMs, Clients, etc.]
E -->|2. Process batches from| G[Input Queue]
E -->|3. Processed batches are put in| H[Output Queue]
E -->|4. Unload| I[Cleanup]
```
## Executing a pipeline with Ray
The recommended way to execute a `distilabel` pipeline using Ray is using the [Ray Jobs API](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html#ray-jobs-api).
Before jumping on the explanation, let's first install the prerequisites:
```bash
pip install distilabel[ray]
```
!!! tip
    It's recommended to create a virtual environment.
For the purpose of explaining how to execute a pipeline with Ray, we'll use the following pipeline throughout the examples:
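The pipeline isn't reproduced in this excerpt, so here is a sketch of what it could look like. The model name and dataset are illustrative, `StepResources` is assumed to be importable from `distilabel.steps`, and the `(1)` and `(2)` markers refer to the annotations below:

```python
from distilabel.llms import vLLM  # `distilabel.models` in newer versions
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub, StepResources
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="text-generation-ray-pipeline") as pipeline:
    load_data_from_hub = LoadDataFromHub(output_mappings={"prompt": "instruction"})

    text_generation = TextGeneration(
        llm=vLLM(
            model="meta-llama/Meta-Llama-3-8B-Instruct",
            tokenizer="meta-llama/Meta-Llama-3-8B-Instruct",
        ),
        resources=StepResources(replicas=2, gpus=1),  # (1)
    )

    load_data_from_hub >> text_generation

if __name__ == "__main__":
    distiset = pipeline.run(
        parameters={
            load_data_from_hub.name: {
                "repo_id": "HuggingFaceH4/instruction-dataset",
                "split": "test",
            },
        }
    )

    distiset.push_to_hub("<YOUR_HF_USER_OR_ORG>/text-generation-distilabel-ray")  # (2)
```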
1. We're setting [resources](assigning_resources_to_step.md) for the `text_generation` step and defining that we want two replicas and one GPU per replica. `distilabel` will create two replicas of the step, i.e. two actors in the Ray cluster, and each actor will request to be allocated on a node of the cluster that has at least one GPU. You can read more about how Ray manages the resources [here](https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#resources).
2. You should modify this and add your user or organization on the Hugging Face Hub.
It's a basic pipeline with just two steps: one to load a dataset from the Hub with an `instruction` column and one to generate a `response` for that instruction using Llama 3 8B Instruct with [vLLM](../../../components-gallery/llms/vllm.md). Simple but enough to demonstrate how to distribute and scale the workload using a Ray cluster!
### Using Ray Jobs API
If you don't know the Ray Jobs API then it's recommended to read [Ray Jobs Overview](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html#ray-jobs-overview). Quick summary: Ray Jobs is the recommended way to execute a job in a Ray cluster as it will handle packaging, deploying and managing the Ray application.
To execute the pipeline above, we first need to create a directory (kind of a package) with the pipeline script (or scripts)
that we will submit to the Ray cluster:
```bash
mkdir ray-pipeline
```
The content of the directory `ray-pipeline` should be:
```
ray-pipeline/
├── pipeline.py
└── runtime_env.yaml
```
The first file contains the code of the pipeline, while the second one (`runtime_env.yaml`) is a specific Ray file containing the [environment dependencies](https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#environment-dependencies) required to run the job:
```yaml
pip:
  - distilabel[ray,vllm] >= 1.3.0

env_vars:
  HF_TOKEN: <YOUR_HF_TOKEN>
```
With this file we're basically informing the Ray cluster that it will have to install `distilabel` with the `vllm` and `ray` extra dependencies to be able to run the job. In addition, we're defining the `HF_TOKEN` environment variable that will be used (by the `push_to_hub` method) to upload the resulting dataset to the Hugging Face Hub.
After that, we can proceed to execute the `ray` command that will submit the job to the Ray cluster:
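For example (the dashboard address is an assumption; adjust it to wherever your Ray cluster's dashboard is listening):

```bash
ray job submit \
    --address http://localhost:8265 \
    --working-dir ray-pipeline \
    --runtime-env ray-pipeline/runtime_env.yaml \
    -- python pipeline.py
```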
What this will do is basically upload the `--working-dir` to the Ray cluster, install the dependencies, and then execute the `python pipeline.py` command from the head node.
## File system requirements
As described in [Using a file system to pass data to steps](fs_to_pass_data.md), `distilabel` relies on the file system to pass the data to the `GlobalStep`s. So if the pipeline to be executed in the Ray cluster has any `GlobalStep`, or you want to set `use_fs_to_pass_data=True` in the [run][distilabel.pipeline.local.Pipeline.run] method, then you will need to set up a file system to which all the nodes of the Ray cluster have access:
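For example, using a path mounted on every node (the `(1)` marker refers to the note below):

```python
if __name__ == "__main__":
    distiset = pipeline.run(
        ...,
        storage_parameters={"path": "file:///mnt/data"},  # (1)
        use_fs_to_pass_data=True,
    )
```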
1. All the nodes of the Ray cluster should have access to `/mnt/data`.
## Executing a `RayPipeline` in a cluster with Slurm
If you have access to an HPC cluster, then you're probably also a user of [Slurm](https://slurm.schedmd.com/), a workload manager typically used on HPCs. We can create a Slurm job that takes some nodes and deploys a Ray cluster to run a distributed `distilabel` pipeline:
```bash
#!/bin/bash
#SBATCH --job-name=distilabel-ray-text-generation
#SBATCH --partition=your-partition
#SBATCH --qos=normal
#SBATCH --nodes=2 # (1)
#SBATCH --exclusive
#SBATCH --ntasks-per-node=1 # (2)
#SBATCH --gpus-per-node=1 # (3)
#SBATCH --time=0:30:00
set -ex

echo "SLURM_JOB_ID: $SLURM_JOB_ID"
echo "SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST"
# Activate virtual environment
source /path/to/virtualenv/.venv/bin/activate
# Getting the node names
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
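nodes_array=($nodes)

# NOTE: what follows is a sketch of the usual Slurm + Ray boilerplate;
# adapt the port, paths and options to your cluster.

# Get the IP address of the head node
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# Start the Ray head node
port=6379
ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"

echo "Starting HEAD at $head_node"
OUTLINES_CACHE_DIR="/tmp/.outlines" srun --nodes=1 --ntasks=1 -w "$head_node" \
    ray start --head --node-ip-address="$head_node_ip" --port=$port --block &  # (4)

sleep 10

# Start the Ray worker nodes
worker_num=$((SLURM_JOB_NUM_NODES - 1))

for ((i = 1; i <= worker_num; i++)); do
    node_i=${nodes_array[$i]}
    echo "Starting WORKER $i at $node_i"
    OUTLINES_CACHE_DIR="/tmp/.outlines" srun --nodes=1 --ntasks=1 -w "$node_i" \
        ray start --address "$ip_head" --block &  # (4)
    sleep 5
done

# Finally, run the distilabel pipeline from the head node
srun --nodes=1 --ntasks=1 -w "$head_node" python pipeline.py
```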
1. In this case, we just want two nodes: one to run the Ray head node and one to run a worker.
2. We just want to run one task per node, i.e. the Ray command that starts the head/worker node.
3. We have selected 1 GPU per node, but we could have selected more depending on the pipeline.
4. We need to set the environment variable `OUTLINES_CACHE_DIR` to `/tmp/.outlines` to avoid issues with the nodes trying to read/write the same `outlines` cache files, which is not possible.
## `vLLM` and `tensor_parallel_size`
In order to use the multi-GPU and multi-node capabilities of `vLLM` with `ray`, we need to make a few changes to the example pipeline above. The first change is to specify a value for `tensor_parallel_size`, aka "in how many GPUs do I want you to load the model"; the second is to set `ray` as the `distributed_executor_backend`, as the default in `vLLM` is to use `multiprocessing`:
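A sketch of the modified task follows; the model and GPU count are illustrative, and `extra_kwargs` is assumed to be forwarded to the underlying `vLLM` engine as in recent `distilabel` versions:

```python
from distilabel.llms import vLLM  # `distilabel.models` in newer versions
from distilabel.steps import StepResources
from distilabel.steps.tasks import TextGeneration

text_generation = TextGeneration(
    llm=vLLM(
        model="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer="meta-llama/Meta-Llama-3-70B-Instruct",
        extra_kwargs={
            "tensor_parallel_size": 8,  # number of GPUs to shard the model across
            "distributed_executor_backend": "ray",  # instead of the default `multiprocessing`
        },
    ),
    resources=StepResources(gpus=8),
)
```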
More information about distributed inference with `vLLM` can be found here: [vLLM - Distributed Serving](https://docs.vllm.ai/en/latest/serving/distributed_serving.html)