Metadata-Version: 2.4
Name: open-dataflow
Version: 1.0.7
Summary: Modern Data Centric AI system for Large Language Models
Author-email: Hao Liang , Xiaochen Ma
License: Apache-2.0
Project-URL: Github, https://github.com/Open-DataFlow/DataFlow
Project-URL: Documentation, https://open-dataflow.github.io/DataFlow-Doc/
Project-URL: Bug Reports, https://github.com/Open-DataFlow/DataFlow/issues
Keywords: AI,artificial intelligence
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: Free For Educational Use
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3 :: Only
Requires-Python: <4,>=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy<2.0.0
Requires-Dist: datasets
Requires-Dist: scipy
Requires-Dist: tqdm
Requires-Dist: transformers<4.54.0
Requires-Dist: math_verify
Requires-Dist: word2number
Requires-Dist: accelerate
Requires-Dist: rapidfuzz
Requires-Dist: colorlog
Requires-Dist: appdirs
Requires-Dist: datasketch
Requires-Dist: modelscope
Requires-Dist: addict
Requires-Dist: pytest
Requires-Dist: rich
Requires-Dist: docstring_parser
Requires-Dist: pydantic
Requires-Dist: nltk
Requires-Dist: colorama
Requires-Dist: gradio>5
Requires-Dist: json5
Requires-Dist: tiktoken
Requires-Dist: func_timeout
Requires-Dist: sqlglot
Requires-Dist: pymysql
Requires-Dist: fasttext-wheel
Requires-Dist: langkit
Requires-Dist: openai
Requires-Dist: sentencepiece
Requires-Dist: datasketch
Requires-Dist: presidio_analyzer[transformers]
Requires-Dist: presidio_anonymizer
Requires-Dist: vendi-score==0.0.3
Requires-Dist: google-api-core
Requires-Dist: google-api-python-client
Requires-Dist: evaluate
Requires-Dist: contractions
Requires-Dist: symspellpy
Requires-Dist: simhash
Requires-Dist: chonkie
Requires-Dist: trafilatura
Requires-Dist: lxml_html_clean
Requires-Dist: pymupdf
Requires-Dist: httpx[socks]
Requires-Dist: cloudpickle
Requires-Dist: fastapi
Requires-Dist: httpx
Requires-Dist: pandas
Requires-Dist: psutil
Requires-Dist: pyfiglet
Requires-Dist: pyyaml
Requires-Dist: requests
Requires-Dist: termcolor
Requires-Dist: uvicorn
Requires-Dist: sseclient-py
Requires-Dist: librosa
Requires-Dist: soundfile
Requires-Dist: google-cloud-aiplatform>=1.55
Requires-Dist: google-cloud-bigquery
Requires-Dist: google-genai
Requires-Dist: gcsfs
Provides-Extra: vllm
Requires-Dist: vllm<=0.9.2,>=0.7.0; extra == "vllm"
Requires-Dist: numpy<2.0.0; extra == "vllm"
Provides-Extra: vllm07
Requires-Dist: vllm<0.8; extra == "vllm07"
Requires-Dist: numpy<2.0.0; extra == "vllm07"
Provides-Extra: vllm08
Requires-Dist: vllm<0.9; extra == "vllm08"
Provides-Extra: kbc
Requires-Dist: vllm==0.6.3; extra == "kbc"
Requires-Dist: mineru[pipeline]==2.0.6; extra == "kbc"
Provides-Extra: mineru
Requires-Dist: mineru[all]; extra == "mineru"
Requires-Dist: numpy<2.0.0,>=1.24; extra == "mineru"
Requires-Dist: sglang[all]>=0.4.8; extra == "mineru"
Requires-Dist: pypdf; extra == "mineru"
Requires-Dist: reportlab; extra == "mineru"
Provides-Extra: myscale
Requires-Dist: clickhouse-driver; extra == "myscale"
Provides-Extra: sglang
Requires-Dist: sglang[all]; extra == "sglang"
Provides-Extra: litellm
Requires-Dist: litellm<2.0.0,>=1.70.0; extra == "litellm"
Provides-Extra: audio
Requires-Dist: librosa; extra == "audio"
Requires-Dist: soundfile; extra == "audio"
Provides-Extra: vectorsql
Requires-Dist: sqlite-vec; extra == "vectorsql"
Requires-Dist: sqlite-lembed; extra == "vectorsql"
Requires-Dist: sentence_transformers; extra == "vectorsql"
Provides-Extra: pdf2model
Requires-Dist: llamafactory[metrics,torch]>=0.9.0; extra == "pdf2model"
Requires-Dist: vllm<0.9.2,>=0.7.0; extra == "pdf2model"
Requires-Dist: numpy<2.0.0,>=1.24; extra == "pdf2model"
Requires-Dist: mineru[pipeline]; extra == "pdf2model"
Requires-Dist: mineru-vl-utils; extra == "pdf2model"
Provides-Extra: eval
Requires-Dist: vllm<0.9.2,>=0.7.0; extra == "eval"
Provides-Extra: rag
Requires-Dist: lightrag-hku; extra == "rag"
Requires-Dist: asyncio; extra == "rag"
Dynamic: license-file

# DataFlow

[![Documents](https://img.shields.io/badge/Documents-Click_here-brightgreen?logo=read-the-docs)](https://OpenDCAI.github.io/DataFlow-Doc/)
[![](https://img.shields.io/github/license/OpenDCAI/DataFlow)](https://github.com/OpenDCAI/DataFlow/blob/main/LICENSE)
[![](https://img.shields.io/github/stars/OpenDCAI/DataFlow?style=social)](https://github.com/OpenDCAI/DataFlow)
[![](https://img.shields.io/github/contributors/OpenDCAI/DataFlow)](https://github.com/OpenDCAI/DataFlow/graphs/contributors)
[![](https://img.shields.io/github/repo-size/OpenDCAI/DataFlow?color=green)](https://github.com/OpenDCAI/DataFlow)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/OpenDCAI/DataFlow)

🎉 If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.

**Beginner-friendly learning resources (continuously updated)**: 🎬 [DataFlow Video Tutorials](https://space.bilibili.com/3546929239689711?spm_id_from=333.337.0.0); 📚 [DataFlow Written Tutorials](https://wcny4qa9krto.feishu.cn/wiki/I9tbw2qnBi0lEakmmAGclTysnFd)

[简体中文](./README-zh.md) | English

## 📰 1. News

- **[2025-11-20] Introducing New Data Agents for DataFlow!** 🤖 You can try them out now and follow the tutorial on Bilibili for a quick start.
- [2025-06-28] 🎉 We’re excited to announce that DataFlow, our data-centric AI system, is now released! Stay tuned for future updates.

## 🔍 2. Overview

![dataflow_framework](https://github.com/user-attachments/assets/b44db630-754a-44a8-bec7-6d350bf5ed61)

DataFlow is a data preparation and training system designed to **parse, generate, process, and evaluate** high-quality data from noisy sources (PDF, plain text, low-quality QA). The resulting data improves the performance of large language models (LLMs) in specific domains, either through targeted training (pre-training, supervised fine-tuning, RL training) or through RAG over a cleaned knowledge base. **DataFlow has been empirically validated to improve domain-oriented LLMs' performance in fields such as healthcare, finance, and law.**

Specifically, we construct diverse `operators` that leverage rule-based methods, deep learning models, LLMs, and LLM APIs. These operators are systematically integrated into distinct `pipelines`, which together form the `DataFlow system`. Additionally, we develop an intelligent `DataFlow-agent` capable of dynamically assembling new `pipelines` by recombining existing `operators` on demand.

## 🛠️ 3. Operators Functionality

### 🔧 3.1 How Operators Work

DataFlow adopts a modular operator design, building flexible data processing pipelines by combining different types of operators. As the basic unit of data processing, an operator receives structured input data (for example in json/jsonl/csv format), processes it, and outputs higher-quality data. For a detailed guide on using operators, please refer to the [Operator Documentation](https://opendcai.github.io/DataFlow-Doc/en/guide/text_evaluation_operators/).
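
The operator idea described above can be illustrated with a minimal sketch in plain Python. This is not DataFlow's actual API: the names `load_jsonl`, `min_length_filter`, `add_length_score`, and `run_pipeline` are hypothetical and only show how small record-in, record-out units can be chained into a pipeline over jsonl data.

```python
import json
from typing import Callable, Dict, Iterable, List

# Hypothetical types for illustration only; DataFlow's real operator
# interface is described in its Operator Documentation.
Record = Dict[str, object]
Operator = Callable[[List[Record]], List[Record]]


def load_jsonl(text: str) -> List[Record]:
    """Parse jsonl text (one JSON object per line) into records."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def min_length_filter(min_chars: int) -> Operator:
    """Build an operator that drops records whose 'text' is too short."""
    def run(records: List[Record]) -> List[Record]:
        return [r for r in records if len(str(r.get("text", ""))) >= min_chars]
    return run


def add_length_score(records: List[Record]) -> List[Record]:
    """Operator that annotates each record with a toy 'length_score'."""
    return [{**r, "length_score": len(str(r.get("text", "")))} for r in records]


def run_pipeline(records: List[Record], operators: Iterable[Operator]) -> List[Record]:
    """Apply operators in order; each consumes the previous one's output."""
    for op in operators:
        records = op(records)
    return records


raw = '{"text": "short"}\n{"text": "a much longer and more useful passage"}\n'
cleaned = run_pipeline(load_jsonl(raw), [min_length_filter(10), add_length_score])
print(cleaned)
```

Keeping every operator as a function from records to records is what makes recombination cheap: a new pipeline is just a different list of operators.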

![dataflow_operator](https://github.com/user-attachments/assets/d79a0d8b-09ef-457e-af8b-85af0d03b73d)

### 📊 3.2 Operator Classification System

In the DataFlow framework, operators are divided into three core categories based on their functional characteristics:

| Operator Type | Quantity | Main Function |
|---|---|---|
| **Generic Operators** | 80+ | Covers general functions for text evaluation, processing, and synthesis |
| **Domain-Specific Operators** | 40+ | Specialized processing for specific domains (e.g., medical, financial, legal) |
| **Evaluation Operators** | 20+ | Comprehensively evaluates data quality from 6 dimensions |

## 🛠️ 4. Pipelines Functionality

### 🔧 4.1 Ready-to-Use Pipelines

The pipelines currently available in DataFlow are as follows:

- [📝 **Text Pipeline**](https://opendcai.github.io/DataFlow-Doc/en/guide/textpipeline): Mine question-answer pairs from large-scale plain-text data (mostly crawled from the Internet) for use in SFT and RL training.
  - ![dataflow_text_pipeline](https://github.com/user-attachments/assets/34e3aef2-ba4f-4997-9127-9d21fdb2dede)
  - [[HuggingFace🤗 demo input & output for **Text Pipeline**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Text)
- [🧠 **Reasoning Pipeline**](https://opendcai.github.io/DataFlow-Doc/en/guide/reasoningpipeline/#_2-question-handling): Enhances existing question–answer pairs with (1) extended chain-of-thought, (2) category classification, and (3) difficulty estimation.
  - ![dataflow_reasoning_pipeline](https://github.com/user-attachments/assets/fef5829b-3991-4dcb-99ad-d61d95c982ea)
  - [[HuggingFace🤗 demo input & output for **Reasoning Pipeline**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Reasonning)
- [🗃️ **Text2SQL Pipeline**](https://opendcai.github.io/DataFlow-Doc/en/guide/text2sqlpipeline/): Translates natural language questions into SQL queries, supplemented with explanations, chain-of-thought reasoning, and contextual schema information.
  - ![dataflow_text2sql_pipeline](https://github.com/user-attachments/assets/bae9914e-851b-4502-8696-291d6c1b8824)
  - [[HuggingFace🤗 demo input & output for **Text2SQL Pipeline**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Text2SQL)
- [📚 **Knowledge Base Cleaning Pipeline**](https://opendcai.github.io/DataFlow-Doc/en/guide/r51ooua8/): Extract and structure knowledge from unorganized sources like tables, PDFs, and Word documents into usable entries for downstream RAG or QA pair generation.
  - ![dataflow_KnowledgeBaseClean_pipeline](https://github.com/user-attachments/assets/6f21e682-ec10-42af-b5e2-8fec2929eeae)
- [🤖 **Agentic RAG Pipeline**](https://opendcai.github.io/DataFlow-Doc/en/guide/agenticrag_pipeline/): Identify and extract QA pairs from existing QA datasets or knowledge bases that require external knowledge to answer, for use in downstream training of Agentic RAG tasks.
  - ![dataflow_agenticRAG_pipeline](https://github.com/user-attachments/assets/65e80dca-f286-495b-abb7-804b3fc34a53)

### ⚙️ 4.2 Flexible Operator Pipelines

In this framework, operators are categorized into Fundamental Operators, Generic Operators, Domain-Specific Operators, Evaluation Operators, and others, supporting both data processing and evaluation. Please refer to the [documentation](https://OpenDCAI.github.io/DataFlow-Doc/) for details.

### 🤖 4.3 Agent Guided Pipelines

- **DataFlow Agent**: An intelligent assistant that performs data analysis, writes custom `operators`, and automatically orchestrates them into `pipelines` based on specific task objectives.
  - ![dataflow_agent_pipeline](https://github.com/user-attachments/assets/fe0776fa-55bd-49cd-bfe6-06ad377f62bb)
  - [[HuggingFace🤗 demo input & output for **DataFlow Agent**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Agent)

## ⚡ 5. Quick Start

### 🛠️ 5.1 Environment Setup and Installation

Please use the following commands for environment setup and installation👇

```shell
conda create -n dataflow python=3.10
conda activate dataflow

pip install open-dataflow
```

If you want to use your own GPU for local inference, please use:

```shell
pip install open-dataflow[vllm]
```

> DataFlow supports Python>=3.10 environments.

After installation, you can use the following command to check whether DataFlow has been installed correctly:

```shell
dataflow -v
```

If installed correctly, you should see:

```log
open-dataflow codebase version: 1.0.0
Checking for updates...
Local version: 1.0.0
PyPI newest version: 1.0.0
You are using the latest version: 1.0.0.
```

#### 🐳 5.1.1 Docker Installation (Alternative)

We also provide a **Dockerfile** for easy deployment and a **pre-built Docker image** for immediate use.

##### Option 1: Use Pre-built Docker Image

You can directly pull and use our pre-built Docker image:

```shell
# Pull the pre-built image
docker pull molyheci/dataflow:cu124

# Run the container with GPU support
docker run --gpus all -it molyheci/dataflow:cu124

# Inside the container, verify installation
dataflow -v
```

##### Option 2: Build from Dockerfile

Alternatively, you can build the Docker image from the provided Dockerfile:

```shell
# Clone the repository (HTTPS)
git clone https://github.com/OpenDCAI/DataFlow.git
# Or use SSH
# git clone git@github.com:OpenDCAI/DataFlow.git

cd DataFlow

# Build the Docker image
docker build -t dataflow:custom .

# Run the container
docker run --gpus all -it dataflow:custom

# Inside the container, verify installation
dataflow -v
```

> **Note**: The Docker image includes CUDA 12.4.1 support and comes with vLLM pre-installed for GPU acceleration. Make sure you have [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) installed to use GPU features.
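
As a complement to `dataflow -v`, the installed version can also be checked from Python, inside or outside a container. The sketch below uses only the standard library's `importlib.metadata` (Python 3.8+); the helper name `installed_version` is hypothetical and not part of DataFlow.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "open-dataflow") -> Optional[str]:
    """Return the installed version of a distribution, or None if absent.

    `installed_version` is an illustrative helper, not a DataFlow function.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("open-dataflow is not installed in this environment")
else:
    print(f"open-dataflow {version} is installed")
```

This reads the package metadata recorded by `pip`, so it reports what is installed in the active environment without importing the package itself.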

### 📖 5.2 Reference Project Documentation

For detailed **usage instructions** and a **getting started guide**, please visit our [Documentation](https://OpenDCAI.github.io/DataFlow-Doc/).

## 🧪 6. Experimental Results

For detailed experimental settings, please visit our documentation.

### 📝 6.1 Text Pipeline

#### 6.1.1 Pre-training data filter pipeline

The `pre-training data processing pipeline` was applied to randomly sampled data from the RedPajama dataset, resulting in a final data retention rate of 13.65%. The analysis results using `QuratingScorer` are shown in the figure. As can be seen, the filtered pre-training data significantly outperforms the original data across four scoring dimensions: writing style, requirement for expert knowledge, factual content, and educational value. This demonstrates the effectiveness of DataFlow's pre-training data processing.

#### 6.1.2 SFT data filter pipeline

We filtered 3k records from the `alpaca` dataset and compared them with 3k randomly selected records from the same dataset by training Qwen2.5-7B on each subset. Results are:

### 🧠 6.2 Reasoning Pipeline

We verified the Reasoning Pipeline by running SFT on Qwen2.5-32B-Instruct with data synthesized by the pipeline. We generated 1k and 5k SFT data pairs. Results are:

### 🗃️ 6.3 Text2SQL Pipeline

We fine-tuned the Qwen2.5-Coder-7B-Instruct model using both Supervised Fine-tuning (SFT) and Reinforcement Learning (RL), with data constructed via the DataFlow-Text2SQL Pipeline. Results are:

## 📄 7. Publications

Our team has published the following papers that form core components of the DataFlow system:

| Paper Title | DataFlow Component | Venue | Year |
|-------------|-------------------|-------|------|
| [MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification](https://arxiv.org/pdf/2502.13383) | Multimodal reasoning verification framework for data processing and evaluation | ACL | 2025 |
| [Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration](https://arxiv.org/pdf/2410.08102) | Multi-actor collaborative data selection mechanism for enhanced data filtering and processing | ACL | 2025 |

**Contributing Institutions**: PKU, HKUST, CAS, Shanghai AI Lab, Baichuan, Ant Group

## 🏆 8. Awards & Achievements

We are honored to have received **first-place awards** in two major international AI competitions, recognizing the excellence and robustness of DataFlow and its reasoning capabilities:

| Competition | Track | Award | Organizer | Date |
| --- | --- | --- | --- | --- |
| **ICML 2025 Challenges on Automated Math Reasoning and Extensions** | Track 2: *Physics Reasoning with Diagrams and Expressions* | 🥇 **First Place Winner** | ICML AI for Math Workshop & AWS Codabench | July 18, 2025 |
| **2025 Language and Intelligence Challenge (LIC)** | Track 2: *Beijing Academy of Artificial Intelligence* | 🥇 **First Prize** | Beijing Academy of Artificial Intelligence (BAAI) & Baidu | August 10, 2025 |

*Certificates: ICML 2025 Automated Math Reasoning Challenge (First Place Winner); BAAI Language & Intelligence Challenge 2025 (First Prize).*

## 💐 9. Acknowledgements

We sincerely appreciate [MinerU](https://github.com/opendatalab/MinerU)'s outstanding contribution, particularly its robust text extraction capabilities from PDFs and documents, which greatly facilitate data loading.

## 🤝 10. Community & Support

Join the DataFlow open-source community to ask questions, share ideas, and collaborate with other developers!

- 📮 [GitHub Issues](../../issues): Report bugs or suggest features
- 🔧 [GitHub Pull Requests](../../pulls): Contribute code improvements
- 💬 Join our community groups to connect with us and other contributors!

## 📜 11. Citation

If you use DataFlow in your research, please consider citing it:

```bibtex
@misc{dataflow2025,
  author       = {DataFlow Develop Team},
  title        = {DataFlow: A Unified Framework for Data-Centric AI},
  year         = {2025},
  howpublished = {\url{https://github.com/OpenDCAI/DataFlow}},
  note         = {Accessed: 2025-07-08}
}
```

## 📊 12. Statistics
Star History Chart
---
Connect with the PKU-DCAI Research Team on Xiaohongshu: 26133106768