# XuanYuan
XuanYuan is a series of large language models released by Du Xiaoman, with continuing contributions to the open-source ecosystem.
## Paper
- [CGCE: A Chinese Generative Chat Evaluation Benchmark for General and Financial Domains](https://arxiv.org/abs/2305.14471)
## Model Architecture
Following the 176B and 70B releases, the finance-focused open-source "XuanYuan" series adds a smaller-parameter version, XuanYuan-13B. It keeps strong capabilities while using a smaller parameter budget and focuses on practical performance across different scenarios. We also open-source 4-bit and 8-bit quantized versions of the XuanYuan-13B-Chat model, lowering hardware requirements and making deployment on a wider range of devices easier.
<div align="center">
<img src="./imgs/transformer.jpg"/>
</div>
## Algorithm
During pre-training, the team dynamically adjusted the mix of languages and domain knowledge and incorporated a large amount of professional financial corpora. During instruction fine-tuning, the previously proposed Self-QA and hybrid-training methods were applied, which significantly improves conversational performance. XuanYuan-13B was further aligned with human preferences through reinforcement learning (RLHF); compared with the base model, the aligned model shows clear gains in writing, content generation, instruction understanding and following, and safety.
<div align=center>
<img src="./imgs/transformer.png"/>
</div>
## Environment Setup
### Docker (Option 1)
Pull the Docker image from the [SourceFind (光源)](https://www.sourcefind.cn/#/service-details) registry and use it as follows:
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
docker run -it -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro --shm-size=64G --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name xuanyuan <your imageID> bash
cd /path/your_code_data/
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
```
### Dockerfile (Option 2)
```
cd /path/your_code_data/docker
docker build --no-cache -t xuanyuan:latest .
docker run --shm-size=64G --name xuanyuan -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v /path/your_code_data/:/path/your_code_data/ -it xuanyuan bash
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
```
### Anaconda (Option 3)
The DCU-specific deep learning libraries required by this project can be downloaded from the [HPC developer community (光合)](https://developer.hpccube.com/tool/).
```
DTK driver: dtk24.04
python: 3.10
torch: 2.1
torchvision: 0.16.0
deepspeed: 0.12.3
```
`Tips: the versions of the DTK driver, python, torch, and the other DCU-related tools listed above must match one another exactly.`
```
conda create -n xuanyuan python=3.10
conda activate xuanyuan
cd /path/your_code_data/
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
```
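After the environment is ready (with any of the three options), a quick check such as the one below can confirm that the DCU-enabled PyTorch build is importable and sees the accelerators; on the DTK (ROCm-based) build the devices are exposed through the standard `torch.cuda` API. This snippet is only a sanity check and is not part of the original repository:
```
import torch

# The DTK (ROCm-based) build of PyTorch exposes DCUs through the torch.cuda API.
print("torch version:", torch.__version__)
print("accelerator available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device count:", torch.cuda.device_count())
    print("device 0:", torch.cuda.get_device_name(0))
```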
## Dataset
Mini dataset: [fingpt_sentiment](./LLaMA-Factory/data/fingpt_sentiment.json)
For training, prepare your data as a JSON file containing a list of samples. Each sample is a dictionary with the fields shown in the example below; a full dataset for regular training should follow the same structure:
```
{
"instruction": "描述原子的结构。",
"input": "",
"output": "原子是物质的基本单位,它由三种基本粒子组成:质子、中子和电子。质子和中子形成原子核,位于原子中心,核外的电子围绕着原子核运动。\n\n原子结构具有层次性。原子核中,质子带正电,中子不带电(中性)。原子核非常小且致密,占据了原子总质量的绝大部分。电子带负电,通常围绕核运动,形成若干层次,称为壳层或电子层。电子数量与质子数量相等,使原子呈电中性。\n\n电子在每个壳层中都呈规律分布,并且不同壳层所能容纳的电子数也不同。在最里面的壳层一般只能容纳2个电子,其次一层最多可容纳8个电子,再往外的壳层可容纳的电子数逐层递增。\n\n原子核主要受到两种相互作用力的影响:强力和电磁力。强力的作用范围非常小,主要限制在原子核内,具有极强的吸引作用,使核子(质子和中子)紧密结合在一起。电磁力的作用范围较大,主要通过核外的电子与原子核相互作用,发挥作用。\n\n这就是原子的基本结构。原子内部结构复杂多样,不同元素的原子核中质子、中子数量不同,核外电子排布分布也不同,形成了丰富多彩的化学世界。"
},
```
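If you build your own data, a small script like the one below (file names are illustrative, not part of this repo) can assemble samples into this layout and check the required keys before handing the file to LLaMA-Factory:
```
import json

# Illustrative only: assemble samples into the list-of-dicts layout shown above.
samples = [
    {
        "instruction": "描述原子的结构。",        # user instruction (required)
        "input": "",                              # optional additional input
        "output": "原子由原子核和核外电子组成……",  # expected model response (required)
    },
]

for sample in samples:
    assert {"instruction", "input", "output"} <= sample.keys(), "missing required field"

with open("my_dataset.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```
Remember to register the new file in `LLaMA-Factory/data/dataset_info.json` (see the data README below) before referencing it with `--dataset`.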
## Training
Training uses the LLaMA-Factory package. Replace the `data` directory of the cloned LLaMA-Factory with the `data` directory provided in this repository, and then work only inside the cloned LLaMA-Factory folder.
Adjust the weight-related paths in the script to match your environment.
### Single node, multiple cards
```
sh ds_zero3_work_dtk.sh
```
## Inference
### Single node, single card
```
sh Xuanyuan_inference.sh
```
## Results
### Q&A
<div align=center>
<img src="./imgs/result.png"/>
</div>
### Accuracy
Test data: [fingpt_sentiment](./LLaMA-Factory/data/fingpt_sentiment.json); accelerator: K100.
| device | train_loss | eval_loss |
| :------: | :------: | :------: |
| K100 | 0.7087 | 0.1019 |
## Application Scenarios
### Algorithm Category
`Q&A`
### Key Application Industries
`Finance, education`
## Pre-trained Weights
- [Duxiaoman-DI/XuanYuan-13B-Chat](https://modelscope.cn/models/Duxiaoman-DI/XuanYuan-13B-Chat/files)
Fast download center for pre-trained weights: [SCNet AIModels](http://113.200.138.88:18080/aimodels)
The pre-trained weights used in this project can also be fetched through the fast download channel: [XuanYuan-13B-Chat](http://113.200.138.88:18080/aimodels/XuanYuan-13B-Chat)
## Source Repository and Issue Feedback
- https://developer.hpccube.com/codes/modelzoo/xuanyuan_pytorch
## References
- [XuanYuan on ModelScope](https://modelscope.cn/models/Duxiaoman-DI/XuanYuan-13B-Chat/summary)
- [XuanYuan on GitHub](https://github.com/Duxiaoman-DI/XuanYuan)
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
model_name_or_path = "/home/wanglch/projects/XuanYuan/XuanYuan-13B-Chat"
tokenizer = LlamaTokenizer.from_pretrained(model_name_or_path, use_fast=False, legacy=True, trust_remote_code=True)
model = LlamaForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=True)
model.eval()
seps = [" ", "</s>"]
roles = ["Human", "Assistant"]
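# XuanYuan-Chat formats each turn as "Human: {question} Assistant: {answer}" and closes
# assistant turns with "</s>" in multi-turn prompts (see conversation.py); only a single
# turn is assembled below.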
content = "互联网金融机构如何确认该笔贷款是由本人申请的?"
prompt = "Human: " + content + " Assistant:"
print(f"输入: {content}")
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.95)
outputs = tokenizer.decode(outputs.cpu()[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(f"输出: {outputs}")
CUDA_VISIBLE_DEVICES=4,5 python Xuanyuan_inference.py
# -*- coding: utf-8 -*-
import argparse
from conversation import get_conv_template
try:
    from vllm import LLM, SamplingParams
    is_vllm_available = True
    print("use vllm.generate to infer...")
except ImportError:
    from transformers import LlamaForCausalLM, LlamaTokenizer
    is_vllm_available = False
    print("use transformers.generate to infer...")
def infer_vllm(llm, sampling_params, prompt):
    """Generate a single completion with vLLM."""
    assert llm is not None
    assert sampling_params is not None
    generation = llm.generate(prompt, sampling_params, use_tqdm=False)
    outputs = generation[0].outputs[0].text.strip()
    return outputs


def infer(model, tokenizer, prompt):
    """Generate a single completion with transformers; sampling options come from the global args."""
    assert model is not None
    assert tokenizer is not None
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=args.max_new_tokens,
        do_sample=True,
        temperature=args.temperature,
        top_p=args.top_p
    )
    # Decode only the newly generated tokens, skipping the prompt.
    outputs = tokenizer.decode(outputs.cpu()[0][len(inputs.input_ids[0]):], skip_special_tokens=True).strip()
    return outputs
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Chat with a XuanYuan chat model (vLLM if available, otherwise transformers)")
    parser.add_argument("-c", "--checkpoint_path", type=str, help="Checkpoint path")
    parser.add_argument("-n", "--max_new_tokens", type=int, default=1000)
    parser.add_argument("-t", "--temperature", type=float, default=0.95)
    parser.add_argument("-p", "--top_p", type=float, default=0.95)
    args = parser.parse_args()

    llm = None
    sampling_params = None
    model = None
    tokenizer = None
    if is_vllm_available:
        print("loading weight with vLLM...")
        sampling_params = SamplingParams(
            temperature=args.temperature,
            top_p=args.top_p,
            stop=list(["</s>"]),
            max_tokens=args.max_new_tokens
        )
        # tensor_parallel_size=8 shards the model across 8 cards; adjust it to the number available.
        llm = LLM(args.checkpoint_path, tensor_parallel_size=8)
    else:
        print("loading weight with transformers ...")
        tokenizer = LlamaTokenizer.from_pretrained(args.checkpoint_path, use_fast=False, legacy=True)
        model = LlamaForCausalLM.from_pretrained(args.checkpoint_path, device_map="auto")

    conv = get_conv_template("XuanYuan-Chat")
    print("########")
    print("输入为: EXIT!! 表示退出")
    print("输入为: CLEAR!! 表示清空上下文")
    print("########")
    while True:
        content = input("输入: ")
        if content.strip() == "EXIT!!":
            print("exit....")
            break
        if content.strip() == "CLEAR!!":
            conv = get_conv_template("XuanYuan-Chat")
            print("clear...")
            continue
        conv.append_message(conv.roles[0], content.strip())
        conv.append_message(conv.roles[1], None)
        prompt = conv.get_prompt()
        if is_vllm_available:
            outputs = infer_vllm(llm, sampling_params, prompt)
        else:
            outputs = infer(model, tokenizer, prompt)
        print(f"输出: {outputs}")
        conv.update_last_message(outputs)
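# Example launch (the script and checkpoint names below are illustrative; tensor_parallel_size=8 above assumes 8 cards):
#   python chat_cli.py -c /path/to/XuanYuan-13B-Chat -n 512 -t 0.95 -p 0.95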
"""
refer: https://github.com/lm-sys/FastChat/tree/main/fastchat
"""
import dataclasses
from enum import auto, IntEnum
from typing import List, Dict
class SeparatorStyle(IntEnum):
    """Separator styles."""

    ADD_COLON_TWO = auto()
@dataclasses.dataclass
class Conversation:
    """A class that manages prompt templates and keeps all conversation history."""

    # The name of this template
    name: str
    # The template of the system prompt
    system_template: str = "{system_message}"
    # The system message
    system_message: str = ""
    # The names of the two roles
    roles: List[str] = ("USER", "ASSISTANT")
    # All messages; each item is [role, message]
    messages: List[List[str]] = ()
    # The number of few-shot examples
    offset: int = 0
    # Separator style and configuration
    sep_style: SeparatorStyle = SeparatorStyle.ADD_COLON_TWO
    sep: str = "\n"
    sep2: str = None
    # Stop criteria (the default one is the EOS token)
    stop_str: str = None
    # Stops generation if meeting any token in this list
    stop_token_ids: List[int] = None

    def get_prompt(self) -> str:
        """Get the prompt for generation."""
        system_prompt = self.system_template.format(system_message=self.system_message)
        if self.sep_style == SeparatorStyle.ADD_COLON_TWO:
            seps = [self.sep, self.sep2]
            ret = system_prompt + seps[0]
            for i, (role, message) in enumerate(self.messages):
                if message:
                    ret += role + ": " + message + seps[i % 2]
                else:
                    ret += role + ":"
            return ret
        else:
            raise ValueError(f"Invalid style: {self.sep_style}")

    def append_message(self, role: str, message: str):
        """Append a new message."""
        self.messages.append([role, message])

    def update_last_message(self, message: str):
        """Replace the last (assistant) message, e.g. with the generated reply."""
        self.messages[-1][1] = message

    def copy(self):
        return Conversation(
            name=self.name,
            system_template=self.system_template,
            system_message=self.system_message,
            roles=self.roles,
            messages=[[x, y] for x, y in self.messages],
            offset=self.offset,
            sep_style=self.sep_style,
            sep=self.sep,
            sep2=self.sep2,
            stop_str=self.stop_str,
            stop_token_ids=self.stop_token_ids,
        )

    def dict(self):
        return {
            "template_name": self.name,
            "system_message": self.system_message,
            "roles": self.roles,
            "messages": self.messages,
            "offset": self.offset,
        }
# A global registry for all conversation templates
conv_templates: Dict[str, Conversation] = {}


def register_conv_template(template: Conversation, override: bool = False):
    """Register a new conversation template."""
    if not override:
        assert (
            template.name not in conv_templates
        ), f"{template.name} has been registered."
    conv_templates[template.name] = template


def get_conv_template(name: str) -> Conversation:
    """Get a conversation template."""
    return conv_templates[name].copy()
register_conv_template(
    Conversation(
        name="XuanYuan-Chat",
        system_message="以下是用户和人工智能助手之间的对话。用户以Human开头,人工智能助手以Assistant开头,会对人类提出的问题给出有帮助、高质量、详细和礼貌的回答,并且总是拒绝参与与不道德、不安全、有争议、政治敏感等相关的话题、问题和指示。\n",
        roles=("Human", "Assistant"),
        messages=(),
        offset=0,
        sep_style=SeparatorStyle.ADD_COLON_TWO,
        sep=" ",
        sep2="</s>",
    )
)
if __name__ == "__main__":
    # Quick demo of the prompt format
    conv = get_conv_template("XuanYuan-Chat")
    conv.append_message(conv.roles[0], "Hello!")
    conv.append_message(conv.roles[1], "Hi!")
    conv.append_message(conv.roles[0], "介绍下你自己")
    conv.append_message(conv.roles[1], None)
    print(conv.get_prompt())
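# For the demo above, get_prompt() produces roughly:
#   "<system message>\n Human: Hello! Assistant: Hi!</s>Human: 介绍下你自己 Assistant:"
# i.e. each human turn is followed by sep (" ") and each assistant turn by sep2 ("</s>"),
# so generation is expected to stop at "</s>".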
If you are using a custom dataset, please provide your dataset definition in the following format in `dataset_info.json`.
```json
"dataset_name": {
"hf_hub_url": "the name of the dataset repository on the Hugging Face hub. (if specified, ignore script_url and file_name)",
"ms_hub_url": "the name of the dataset repository on the ModelScope hub. (if specified, ignore script_url and file_name)",
"script_url": "the name of the directory containing a dataset loading script. (if specified, ignore file_name)",
"file_name": "the name of the dataset file in this directory. (required if above are not specified)",
"file_sha1": "the SHA-1 hash value of the dataset file. (optional, does not affect training)",
"subset": "the name of the subset. (optional, default: None)",
"folder": "the name of the folder of the dataset repository on the Hugging Face hub. (optional, default: None)",
"ranking": "whether the dataset is a preference dataset or not. (default: false)",
"formatting": "the format of the dataset. (optional, default: alpaca, can be chosen from {alpaca, sharegpt})",
"columns (optional)": {
"prompt": "the column name in the dataset containing the prompts. (default: instruction)",
"query": "the column name in the dataset containing the queries. (default: input)",
"response": "the column name in the dataset containing the responses. (default: output)",
"history": "the column name in the dataset containing the histories. (default: None)",
"messages": "the column name in the dataset containing the messages. (default: conversations)",
"system": "the column name in the dataset containing the system prompts. (default: None)",
"tools": "the column name in the dataset containing the tool description. (default: None)"
},
"tags (optional, used for the sharegpt format)": {
"role_tag": "the key in the message represents the identity. (default: from)",
"content_tag": "the key in the message represents the content. (default: value)",
"user_tag": "the value of the role_tag represents the user. (default: human)",
"assistant_tag": "the value of the role_tag represents the assistant. (default: gpt)",
"observation_tag": "the value of the role_tag represents the tool results. (default: observation)",
"function_tag": "the value of the role_tag represents the function call. (default: function_call)",
"system_tag": "the value of the role_tag represents the system prompt. (default: system, can override system column)"
}
}
```
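One field worth noting is `file_sha1`, which pins the expected content of a local data file; a tiny helper like this (not part of LLaMA-Factory) computes it for a new file:
```python
import hashlib

def file_sha1(path: str) -> str:
    """Return the SHA-1 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

print(file_sha1("data/my_dataset.json"))  # path is illustrative
```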
Given the above, you can use the custom dataset by specifying `--dataset dataset_name`.
Currently we support datasets in the **alpaca** or **sharegpt** format. A dataset in the alpaca format should follow the format below:
```json
[
{
"instruction": "user instruction (required)",
"input": "user input (optional)",
"output": "model response (required)",
"system": "system prompt (optional)",
"history": [
["user instruction in the first round (optional)", "model response in the first round (optional)"],
["user instruction in the second round (optional)", "model response in the second round (optional)"]
]
}
]
```
Regarding the above dataset, the `columns` in `dataset_info.json` should be:
```json
"dataset_name": {
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"system": "system",
"history": "history"
}
}
```
The `query` column will be concatenated with the `prompt` column and used as the user prompt, i.e. the user prompt becomes `prompt\nquery`. The `response` column holds the model response.
The `system` column will be used as the system prompt. The `history` column is a list of string tuples representing prompt-response pairs from earlier rounds. Note that the responses in the history **will also be used for training**.
For pre-training datasets, only the `prompt` column is used for training.
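As a rough sketch of the concatenation rule above (illustrative, not LLaMA-Factory's actual implementation):
```python
def build_user_prompt(example: dict) -> str:
    # prompt (instruction) and query (input) are joined as "prompt\nquery";
    # an empty query leaves just the instruction.
    prompt, query = example["instruction"], example.get("input", "")
    return f"{prompt}\n{query}" if query else prompt

print(build_user_prompt({"instruction": "Translate to English:", "input": "你好"}))
```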
For preference datasets, the `response` column should be a string list of length 2, with the preferred answer appearing first, for example:
```json
{
"instruction": "user instruction",
"input": "user input",
"output": [
"chosen answer",
"rejected answer"
]
}
```
A dataset in the sharegpt format should be organized as follows:
```json
[
{
"conversations": [
{
"from": "human",
"value": "user instruction"
},
{
"from": "gpt",
"value": "model response"
}
],
"system": "system prompt (optional)",
"tools": "tool description (optional)"
}
]
```
Regarding the above dataset, the `columns` in `dataset_info.json` should be:
```json
"dataset_name": {
"columns": {
"messages": "conversations",
"system": "system",
"tools": "tools"
},
"tags": {
"role_tag": "from",
"content_tag": "value",
"user_tag": "human",
"assistant_tag": "gpt"
}
}
```
where the `messages` column should be a list of messages following the `u/a/u/a/u/a` (alternating user/assistant) order.
Pre-training datasets and preference datasets are not yet compatible with the sharegpt format.
import json
import datasets
_DESCRIPTION = "BELLE multiturn chat dataset."
_CITATION = """\
@article{belle2023exploring,
title={Exploring the Impact of Instruction Data Scaling on Large Language Models: An Empirical Study on Real-World Use Cases},
author={Yunjie Ji, Yong Deng, Yan Gong, Yiping Peng, Qiang Niu, Lei Zhang, Baochang Ma, Xiangang Li},
journal={arXiv preprint arXiv:2303.14742},
year={2023}
}
"""
_HOMEPAGE = "https://huggingface.co/datasets/BelleGroup/multiturn_chat_0.8M"
_LICENSE = "gpl-3.0"
_URL = "https://huggingface.co/datasets/BelleGroup/multiturn_chat_0.8M/resolve/main/multiturn_chat_0.8M.json"
class BelleMultiturn(datasets.GeneratorBasedBuilder):
    VERSION = datasets.Version("0.0.0")

    def _info(self):
        features = datasets.Features({
            "conversations": [{"from": datasets.Value("string"), "value": datasets.Value("string")}]
        })
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=features,
            homepage=_HOMEPAGE,
            license=_LICENSE,
            citation=_CITATION
        )

    def _split_generators(self, dl_manager: datasets.DownloadManager):
        file_path = dl_manager.download(_URL)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "filepath": file_path
                }
            )
        ]

    def _generate_examples(self, filepath: str):
        with open(filepath, "r", encoding="utf-8") as f:
            for key, row in enumerate(f):
                data = json.loads(row)
                conversations = []
                prompt = data["instruction"].strip()
                response = data["output"].strip()

                # The last turn: everything after the final "Human:" up to "Assistant:".
                assist_idx = prompt.rfind("Assistant:")
                human_idx = prompt.rfind("Human:")
                query = prompt[human_idx+6:assist_idx].strip()
                prompt = prompt[:human_idx].strip()
                conversations.insert(0, {"from": "gpt", "value": response})
                conversations.insert(0, {"from": "human", "value": query})

                # Walk backwards through the earlier "Human:/Assistant:" pairs.
                while prompt.rfind("Assistant:") != -1:
                    assist_idx = prompt.rfind("Assistant:")
                    human_idx = prompt.rfind("Human:")
                    if human_idx != -1:
                        old_query = prompt[human_idx+6:assist_idx].strip()
                        old_resp = prompt[assist_idx+10:].strip()
                        conversations.insert(0, {"from": "gpt", "value": old_resp})
                        conversations.insert(0, {"from": "human", "value": old_query})
                    else:
                        break
                    prompt = prompt[:human_idx].strip()

                yield key, {"conversations": conversations}
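# Each yielded example is a sharegpt-style record, e.g.
#   {"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]},
# matching the "belle_multiturn" entry in dataset_info.json ("formatting": "sharegpt").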
{
"fingpt_sentiment": {
"file_name": "fingpt_sentiment.json",
"file_sha1": "7670f5c174b849e4908d9d0f4e3e61d8755c0142"
},
"alpaca_en": {
"file_name": "alpaca_data_en_52k.json",
"file_sha1": "607f94a7f581341e59685aef32f531095232cf23"
},
"alpaca_zh": {
"file_name": "alpaca_data_zh_51k.json",
"file_sha1": "2ba9827122c158dc256668d42bd1bcb8bc6b786e"
},
"alpaca_gpt4_en": {
"file_name": "alpaca_gpt4_data_en.json",
"file_sha1": "647f4ad447bd993e4b6b6223d1be15208bab694a"
},
"alpaca_gpt4_zh": {
"file_name": "alpaca_gpt4_data_zh.json",
"file_sha1": "3eaa3bda364ccdd59925d7448a698256c31ef845"
},
"identity": {
"file_name": "identity.json",
"file_sha1": "ffe3ecb58ab642da33fbb514d5e6188f1469ad40"
},
"oaast_sft": {
"file_name": "oaast_sft.json",
"file_sha1": "7baf5d43e67a91f9bbdf4e400dbe033b87e9757e",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"history": "history"
}
},
"oaast_sft_zh": {
"file_name": "oaast_sft_zh.json",
"file_sha1": "a6a91f18f80f37b10ded9cf633fb50c033bf7b9f",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"history": "history"
}
},
"lima": {
"file_name": "lima.json",
"file_sha1": "9db59f6b7007dc4b17529fc63379b9cd61640f37",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"history": "history"
}
},
"glaive_toolcall": {
"file_name": "glaive_toolcall_10k.json",
"file_sha1": "a6917b85d209df98d31fdecb253c79ebc440f6f3",
"formatting": "sharegpt",
"columns": {
"messages": "conversations",
"tools": "tools"
}
},
"mllm_demo": {
"file_name": "mllm_demo.json",
"file_sha1": "b6709b23657d5c42a701f1c5574f3a6edaa40a20",
"formatting": "sharegpt",
"columns": {
"messages": "messages",
"images": "images"
},
"tags": {
"role_tag": "role",
"content_tag": "content",
"user_tag": "user",
"assistant_tag": "assistant"
}
},
"example": {
"script_url": "example_dataset",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"history": "history"
}
},
"guanaco": {
"hf_hub_url": "JosephusCheung/GuanacoDataset",
"ms_hub_url": "AI-ModelScope/GuanacoDataset"
},
"belle_2m": {
"hf_hub_url": "BelleGroup/train_2M_CN",
"ms_hub_url": "AI-ModelScope/train_2M_CN"
},
"belle_1m": {
"hf_hub_url": "BelleGroup/train_1M_CN",
"ms_hub_url": "AI-ModelScope/train_1M_CN"
},
"belle_0.5m": {
"hf_hub_url": "BelleGroup/train_0.5M_CN",
"ms_hub_url": "AI-ModelScope/train_0.5M_CN"
},
"belle_dialog": {
"hf_hub_url": "BelleGroup/generated_chat_0.4M",
"ms_hub_url": "AI-ModelScope/generated_chat_0.4M"
},
"belle_math": {
"hf_hub_url": "BelleGroup/school_math_0.25M",
"ms_hub_url": "AI-ModelScope/school_math_0.25M"
},
"belle_multiturn": {
"script_url": "belle_multiturn",
"formatting": "sharegpt"
},
"ultra_chat": {
"script_url": "ultra_chat",
"formatting": "sharegpt"
},
"open_platypus": {
"hf_hub_url": "garage-bAInd/Open-Platypus",
"ms_hub_url": "AI-ModelScope/Open-Platypus"
},
"codealpaca": {
"hf_hub_url": "sahil2801/CodeAlpaca-20k",
"ms_hub_url": "AI-ModelScope/CodeAlpaca-20k"
},
"alpaca_cot": {
"hf_hub_url": "QingyiSi/Alpaca-CoT",
"ms_hub_url": "AI-ModelScope/Alpaca-CoT"
},
"openorca": {
"hf_hub_url": "Open-Orca/OpenOrca",
"ms_hub_url": "AI-ModelScope/OpenOrca",
"columns": {
"prompt": "question",
"response": "response",
"system": "system_prompt"
}
},
"slimorca": {
"hf_hub_url": "Open-Orca/SlimOrca",
"formatting": "sharegpt"
},
"mathinstruct": {
"hf_hub_url": "TIGER-Lab/MathInstruct",
"ms_hub_url": "AI-ModelScope/MathInstruct",
"columns": {
"prompt": "instruction",
"response": "output"
}
},
"firefly": {
"hf_hub_url": "YeungNLP/firefly-train-1.1M",
"columns": {
"prompt": "input",
"response": "target"
}
},
"wikiqa": {
"hf_hub_url": "wiki_qa",
"columns": {
"prompt": "question",
"response": "answer"
}
},
"webqa": {
"hf_hub_url": "suolyer/webqa",
"ms_hub_url": "AI-ModelScope/webqa",
"columns": {
"prompt": "input",
"response": "output"
}
},
"webnovel": {
"hf_hub_url": "zxbsmk/webnovel_cn",
"ms_hub_url": "AI-ModelScope/webnovel_cn"
},
"nectar_sft": {
"hf_hub_url": "mlinmg/SFT-Nectar",
"ms_hub_url": "AI-ModelScope/SFT-Nectar"
},
"deepctrl": {
"ms_hub_url": "deepctrl/deepctrl-sft-data"
},
"adgen": {
"hf_hub_url": "HasturOfficial/adgen",
"ms_hub_url": "AI-ModelScope/adgen",
"columns": {
"prompt": "content",
"response": "summary"
}
},
"sharegpt_hyper": {
"hf_hub_url": "totally-not-an-llm/sharegpt-hyperfiltered-3k",
"formatting": "sharegpt"
},
"sharegpt4": {
"hf_hub_url": "shibing624/sharegpt_gpt4",
"ms_hub_url": "AI-ModelScope/sharegpt_gpt4",
"formatting": "sharegpt"
},
"ultrachat_200k": {
"hf_hub_url": "HuggingFaceH4/ultrachat_200k",
"ms_hub_url": "AI-ModelScope/ultrachat_200k",
"formatting": "sharegpt",
"columns": {
"messages": "messages"
},
"tags": {
"role_tag": "role",
"content_tag": "content",
"user_tag": "user",
"assistant_tag": "assistant"
}
},
"agent_instruct": {
"hf_hub_url": "THUDM/AgentInstruct",
"ms_hub_url": "ZhipuAI/AgentInstruct",
"formatting": "sharegpt"
},
"lmsys_chat": {
"hf_hub_url": "lmsys/lmsys-chat-1m",
"ms_hub_url": "AI-ModelScope/lmsys-chat-1m",
"formatting": "sharegpt",
"columns": {
"messages": "conversation"
},
"tags": {
"role_tag": "role",
"content_tag": "content",
"user_tag": "human",
"assistant_tag": "assistant"
}
},
"evol_instruct": {
"hf_hub_url": "WizardLM/WizardLM_evol_instruct_V2_196k",
"ms_hub_url": "AI-ModelScope/WizardLM_evol_instruct_V2_196k",
"formatting": "sharegpt"
},
"glaive_toolcall_100k": {
"hf_hub_url": "hiyouga/glaive-function-calling-v2-sharegpt",
"formatting": "sharegpt",
"columns": {
"messages": "conversations",
"tools": "tools"
}
},
"cosmopedia": {
"hf_hub_url": "HuggingFaceTB/cosmopedia",
"columns": {
"prompt": "prompt",
"response": "text"
}
},
"oasst_de": {
"hf_hub_url": "mayflowergmbh/oasst_de"
},
"dolly_15k_de": {
"hf_hub_url": "mayflowergmbh/dolly-15k_de"
},
"alpaca-gpt4_de": {
"hf_hub_url": "mayflowergmbh/alpaca-gpt4_de"
},
"openschnabeltier_de": {
"hf_hub_url": "mayflowergmbh/openschnabeltier_de"
},
"evol_instruct_de": {
"hf_hub_url": "mayflowergmbh/evol-instruct_de"
},
"dolphin_de": {
"hf_hub_url": "mayflowergmbh/dolphin_de"
},
"booksum_de": {
"hf_hub_url": "mayflowergmbh/booksum_de"
},
"airoboros_de": {
"hf_hub_url": "mayflowergmbh/airoboros-3.0_de"
},
"ultrachat_de": {
"hf_hub_url": "mayflowergmbh/ultra-chat_de"
},
"hh_rlhf_en": {
"script_url": "hh_rlhf_en",
"columns": {
"prompt": "instruction",
"response": "output",
"history": "history"
},
"ranking": true
},
"oaast_rm": {
"file_name": "oaast_rm.json",
"file_sha1": "622d420e9b70003b210618253bd3d9d2891d86cb",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"history": "history"
},
"ranking": true
},
"oaast_rm_zh": {
"file_name": "oaast_rm_zh.json",
"file_sha1": "1065af1f3784dd61be5e79713a35f427b713a232",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"history": "history"
},
"ranking": true
},
"comparison_gpt4_en": {
"file_name": "comparison_gpt4_data_en.json",
"file_sha1": "96fa18313544e22444fe20eead7754b17da452ae",
"ranking": true
},
"comparison_gpt4_zh": {
"file_name": "comparison_gpt4_data_zh.json",
"file_sha1": "515b18ed497199131ddcc1af950345c11dc5c7fd",
"ranking": true
},
"orca_rlhf": {
"file_name": "orca_rlhf.json",
"file_sha1": "acc8f74d16fd1fc4f68e7d86eaa781c2c3f5ba8e",
"ranking": true,
"columns": {
"prompt": "question",
"response": "answer",
"system": "system"
}
},
"nectar_rm": {
"hf_hub_url": "mlinmg/RLAIF-Nectar",
"ms_hub_url": "AI-ModelScope/RLAIF-Nectar",
"ranking": true
},
"dpo_mix_en": {
"hf_hub_url": "hiyouga/DPO-En-Zh-20k",
"subset": "en",
"ranking": true,
"columns": {
"prompt": "prompt",
"response": "answer",
"system": "system",
"history": "history"
}
},
"dpo_mix_zh": {
"hf_hub_url": "hiyouga/DPO-En-Zh-20k",
"subset": "zh",
"ranking": true,
"columns": {
"prompt": "prompt",
"response": "answer",
"system": "system",
"history": "history"
}
},
"orca_dpo_de": {
"hf_hub_url": "mayflowergmbh/intel_orca_dpo_pairs_de",
"ranking": true
},
"wiki_demo": {
"file_name": "wiki_demo.txt",
"file_sha1": "e70375e28eda542a90c68213640cc371898ce181",
"columns": {
"prompt": "text"
}
},
"c4_demo": {
"file_name": "c4_demo.json",
"file_sha1": "a5a0c86759732f9a5238e447fecd74f28a66cca8",
"columns": {
"prompt": "text"
}
},
"refinedweb": {
"hf_hub_url": "tiiuae/falcon-refinedweb",
"columns": {
"prompt": "content"
}
},
"redpajama_v2": {
"hf_hub_url": "togethercomputer/RedPajama-Data-V2",
"columns": {
"prompt": "raw_content"
},
"subset": "default"
},
"wikipedia_en": {
"hf_hub_url": "olm/olm-wikipedia-20221220",
"ms_hub_url": "AI-ModelScope/olm-wikipedia-20221220",
"columns": {
"prompt": "text"
}
},
"wikipedia_zh": {
"hf_hub_url": "pleisto/wikipedia-cn-20230720-filtered",
"ms_hub_url": "AI-ModelScope/wikipedia-cn-20230720-filtered",
"columns": {
"prompt": "completion"
}
},
"pile": {
"hf_hub_url": "monology/pile-uncopyrighted",
"ms_hub_url": "AI-ModelScope/pile",
"columns": {
"prompt": "text"
}
},
"skypile": {
"hf_hub_url": "Skywork/SkyPile-150B",
"ms_hub_url": "AI-ModelScope/SkyPile-150B",
"columns": {
"prompt": "text"
}
},
"the_stack": {
"hf_hub_url": "bigcode/the-stack",
"ms_hub_url": "AI-ModelScope/the-stack",
"columns": {
"prompt": "content"
}
},
"starcoder_python": {
"hf_hub_url": "bigcode/starcoderdata",
"ms_hub_url": "AI-ModelScope/starcoderdata",
"columns": {
"prompt": "content"
},
"folder": "python"
}
}
import json
import datasets
from typing import Any, Dict, List
_DESCRIPTION = "An example of dataset."
_CITATION = ""
_HOMEPAGE = ""
_LICENSE = ""
_URL = "examples.json"
class ExampleDataset(datasets.GeneratorBasedBuilder):
    VERSION = datasets.Version("0.0.0")

    def _info(self) -> datasets.DatasetInfo:
        features = datasets.Features({
            "instruction": datasets.Value("string"),
            "input": datasets.Value("string"),
            "output": datasets.Value("string"),
            "history": datasets.Sequence(datasets.Sequence(datasets.Value("string")))
        })
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=features,
            homepage=_HOMEPAGE,
            license=_LICENSE,
            citation=_CITATION
        )

    def _split_generators(self, dl_manager: datasets.DownloadManager) -> List[datasets.SplitGenerator]:
        file_path = dl_manager.download(_URL)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "filepath": file_path
                }
            )
        ]

    def _generate_examples(self, filepath: str) -> Dict[int, Dict[str, Any]]:
        example_dataset = json.load(open(filepath, "r", encoding="utf-8"))
        for key, example in enumerate(example_dataset):
            yield key, example
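# This loader is registered in dataset_info.json as "example" via "script_url": "example_dataset"
# and expects an examples.json file (alpaca-style records with an optional "history") alongside it.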
[
{
"instruction": "听起来很不错。人工智能可能在哪些方面面临挑战呢?",
"input": "",
"output": "人工智能面临的挑战包括数据隐私、安全和道德方面的问题,以及影响就业机会的自动化等问题。",
"history": [
["你好,你能帮我解答一个问题吗?", "当然,请问有什么问题?"],
["我想了解人工智能的未来发展方向,你有什么想法吗?", "人工智能在未来的发展方向可能包括更强大的机器学习算法,更先进的自然语言处理技术,以及更加智能的机器人。"]
]
},
{
"instruction": "好的,谢谢你!",
"input": "",
"output": "不客气,有其他需要帮忙的地方可以继续问我。",
"history": [
["你好,能告诉我今天天气怎么样吗?", "当然可以,请问您所在的城市是哪里?"],
["我在纽约。", "纽约今天晴间多云,气温最高约26摄氏度,最低约18摄氏度,记得注意保暖喔。"]
]
}
]
import json
import datasets
from typing import List
_DESCRIPTION = "Human preference data about helpfulness and harmlessness."
_CITATION = ""
_HOMEPAGE = "https://huggingface.co/datasets/Anthropic/hh-rlhf"
_LICENSE = "mit"
_URL = "https://huggingface.co/datasets/Anthropic/hh-rlhf/resolve/main/"
_URLS = {
"train": [
_URL + "harmless-base/train.jsonl.gz",
_URL + "helpful-base/train.jsonl.gz",
_URL + "helpful-online/train.jsonl.gz",
_URL + "helpful-rejection-sampled/train.jsonl.gz"
],
"test": [
_URL + "harmless-base/test.jsonl.gz",
_URL + "helpful-base/test.jsonl.gz",
_URL + "helpful-online/test.jsonl.gz",
_URL + "helpful-rejection-sampled/test.jsonl.gz"
]
}
class HhRlhfEn(datasets.GeneratorBasedBuilder):
    VERSION = datasets.Version("0.0.0")

    def _info(self) -> datasets.DatasetInfo:
        features = datasets.Features({
            "instruction": datasets.Value("string"),
            "output": datasets.Sequence(datasets.Value("string")),
            "history": datasets.Sequence(datasets.Sequence(datasets.Value("string")))
        })
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=features,
            homepage=_HOMEPAGE,
            license=_LICENSE,
            citation=_CITATION
        )

    def _split_generators(self, dl_manager: datasets.DownloadManager):
        file_path = dl_manager.download_and_extract(_URLS)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "filepaths": file_path["train"]
                }
            ),
            datasets.SplitGenerator(
                name=datasets.Split.TEST,
                gen_kwargs={
                    "filepaths": file_path["test"]
                }
            )
        ]

    def _generate_examples(self, filepaths: List[str]):
        key = 0
        for filepath in filepaths:
            with open(filepath, "r", encoding="utf-8") as f:
                for row in f:
                    data = json.loads(row)
                    chosen = data["chosen"]
                    rejected = data["rejected"]

                    # Extract the final responses from the chosen and rejected transcripts.
                    assist_idx = rejected.rfind("\n\nAssistant: ")
                    r_reject = rejected[assist_idx+13:].strip()
                    assist_idx = chosen.rfind("\n\nAssistant: ")
                    r_accept = chosen[assist_idx+13:].strip()

                    # The final human query and the preceding conversation.
                    human_idx = chosen.rfind("\n\nHuman: ")
                    query = chosen[human_idx+9:assist_idx].strip()
                    prompt = chosen[:human_idx]
                    history = []

                    # Walk backwards through the earlier Human/Assistant pairs.
                    while prompt.rfind("\n\nAssistant: ") != -1:
                        assist_idx = prompt.rfind("\n\nAssistant: ")
                        human_idx = prompt.rfind("\n\nHuman: ")
                        if human_idx != -1:
                            old_query = prompt[human_idx+9:assist_idx].strip()
                            old_resp = prompt[assist_idx+13:].strip()
                            history.insert(0, (old_query, old_resp))
                        else:
                            break
                        prompt = prompt[:human_idx]

                    yield key, {
                        "instruction": query,
                        "output": [r_accept, r_reject],
                        "history": history
                    }
                    key += 1
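# Each yielded example follows the preference format described in the data README:
#   output = [chosen answer, rejected answer], with earlier turns kept in "history".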
import json
import datasets
from typing import List
_DESCRIPTION = "UltraChat: Large-scale, Informative, and Diverse Multi-round Dialogue Data."
_CITATION = """\
@misc{UltraChat,
author = {Ding, Ning and Chen, Yulin and Xu, Bokai and Hu, Shengding and Qin, Yujia and Liu, Zhiyuan and Sun, Maosong and Zhou, Bowen},
title = {UltraChat: A Large-scale Auto-generated Multi-round Dialogue Data},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\\url{https://github.com/thunlp/ultrachat}},
}
"""
_HOMEPAGE = "https://huggingface.co/datasets/stingning/ultrachat"
_LICENSE = "cc-by-nc-4.0"
_BASE_DATA_URL = "https://huggingface.co/datasets/stingning/ultrachat/resolve/main/train_{idx}.jsonl"
class UltraChat(datasets.GeneratorBasedBuilder):
    VERSION = datasets.Version("0.0.0")

    def _info(self):
        features = datasets.Features({
            "conversations": [{"from": datasets.Value("string"), "value": datasets.Value("string")}]
        })
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=features,
            homepage=_HOMEPAGE,
            license=_LICENSE,
            citation=_CITATION
        )

    def _split_generators(self, dl_manager: datasets.DownloadManager):
        file_paths = [dl_manager.download(_BASE_DATA_URL.format(idx=idx)) for idx in range(10)]  # multiple shards
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "filepaths": file_paths
                }
            )
        ]

    def _generate_examples(self, filepaths: List[str]):
        for filepath in filepaths:
            with open(filepath, "r", encoding="utf-8") as f:
                for row in f:
                    try:
                        data = json.loads(row)
                    except Exception:
                        continue
                    key: int = data["id"]
                    content: List[str] = data["data"]
                    # Keep an even number of turns (human/gpt pairs); drop a trailing human turn.
                    if len(content) % 2 == 1:
                        content.pop(-1)
                    if len(content) < 2:
                        continue
                    conversations = [{
                        "from": "human" if i % 2 == 0 else "gpt",
                        "value": content[i]
                    } for i in range(len(content))]
                    yield key, {"conversations": conversations}
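# Downloads the train_0 ... train_9 shards and converts each dialogue into alternating
# human/gpt turns; dialogues with an odd trailing turn are truncated to complete pairs.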
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
ENV DEBIAN_FRONTEND=noninteractive
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com