# XuanYuan
XuanYuan is a series of large language models released by Du Xiaoman, with continuing contributions to the open-source ecosystem.
## Paper
- [CGCE: A Chinese Generative Chat Evaluation Benchmark for General and Financial Domains](https://arxiv.org/abs/2305.14471)
## Model Architecture
Following the 176B and 70B releases, the finance-focused open-source "XuanYuan" series adds a smaller-parameter version, XuanYuan-13B. It keeps strong capabilities while using a smaller parameter budget and focuses on practical performance across different scenarios. We also open-source 4-bit and 8-bit quantized versions of the XuanYuan-13B-Chat model, lowering hardware requirements and making deployment on a wider range of devices easier.
<div align="center">
<img src="./imgs/transformer.jpg"/>
</div>
## Algorithm
During pre-training, the team dynamically adjusted the mix of languages and domain knowledge and incorporated a large amount of professional financial corpora. During instruction fine-tuning, the previously proposed Self-QA and hybrid-training methods were applied, which significantly improves conversational performance. XuanYuan-13B was further aligned with human preferences through reinforcement learning (RLHF); compared with the base model, the aligned model shows clear gains in writing, content generation, instruction understanding and following, and safety.
<div align=center>
<img src="./imgs/transformer.png"/>
</div>
## Environment Setup
### Docker (Option 1)
Pull the Docker image from the [SourceFind (光源)](https://www.sourcefind.cn/#/service-details) registry and use it as follows:
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
docker run -it -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro --shm-size=64G --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name xuanyuan <your imageID> bash
cd /path/your_code_data/
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
```
### Dockerfile (Option 2)
```
cd /path/your_code_data/docker
docker build --no-cache -t xuanyuan:latest .
docker run --shm-size=64G --name xuanyuan -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v /path/your_code_data/:/path/your_code_data/ -it xuanyuan bash
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
```
### Anaconda (Option 3)
The DCU-specific deep learning libraries required by this project can be downloaded from the [HPC developer community (光合)](https://developer.hpccube.com/tool/).
```
DTK driver: dtk24.04
python: 3.10
torch: 2.1
torchvision: 0.16.0
deepspeed: 0.12.3
```
`Tips: the versions of the DTK driver, python, torch, and the other DCU-related tools listed above must match one another exactly.`
```
conda create -n xuanyuan python=3.10
conda activate xuanyuan
cd /path/your_code_data/
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
```
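After the environment is ready (with any of the three options), a quick check such as the one below can confirm that the DCU-enabled PyTorch build is importable and sees the accelerators; on the DTK (ROCm-based) build the devices are exposed through the standard `torch.cuda` API. This snippet is only a sanity check and is not part of the original repository:
```
import torch

# The DTK (ROCm-based) build of PyTorch exposes DCUs through the torch.cuda API.
print("torch version:", torch.__version__)
print("accelerator available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device count:", torch.cuda.device_count())
    print("device 0:", torch.cuda.get_device_name(0))
```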
## Dataset
Mini dataset: [fingpt_sentiment](./LLaMA-Factory/data/fingpt_sentiment.json)
For training, prepare your data as a JSON file containing a list of samples. Each sample is a dictionary with the fields shown in the example below; a full dataset for regular training should follow the same structure:
```
{
"instruction": "描述原子的结构。",
"input": "",
"output": "原子是物质的基本单位,它由三种基本粒子组成:质子、中子和电子。质子和中子形成原子核,位于原子中心,核外的电子围绕着原子核运动。\n\n原子结构具有层次性。原子核中,质子带正电,中子不带电(中性)。原子核非常小且致密,占据了原子总质量的绝大部分。电子带负电,通常围绕核运动,形成若干层次,称为壳层或电子层。电子数量与质子数量相等,使原子呈电中性。\n\n电子在每个壳层中都呈规律分布,并且不同壳层所能容纳的电子数也不同。在最里面的壳层一般只能容纳2个电子,其次一层最多可容纳8个电子,再往外的壳层可容纳的电子数逐层递增。\n\n原子核主要受到两种相互作用力的影响:强力和电磁力。强力的作用范围非常小,主要限制在原子核内,具有极强的吸引作用,使核子(质子和中子)紧密结合在一起。电磁力的作用范围较大,主要通过核外的电子与原子核相互作用,发挥作用。\n\n这就是原子的基本结构。原子内部结构复杂多样,不同元素的原子核中质子、中子数量不同,核外电子排布分布也不同,形成了丰富多彩的化学世界。"
},
```
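If you build your own data, a small script like the one below (file names are illustrative, not part of this repo) can assemble samples into this layout and check the required keys before handing the file to LLaMA-Factory:
```
import json

# Illustrative only: assemble samples into the list-of-dicts layout shown above.
samples = [
    {
        "instruction": "描述原子的结构。",        # user instruction (required)
        "input": "",                              # optional additional input
        "output": "原子由原子核和核外电子组成……",  # expected model response (required)
    },
]

for sample in samples:
    assert {"instruction", "input", "output"} <= sample.keys(), "missing required field"

with open("my_dataset.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```
Remember to register the new file in `LLaMA-Factory/data/dataset_info.json` (see the data README below) before referencing it with `--dataset`.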
## Training
Training uses the LLaMA-Factory package. Replace the `data` directory of the cloned LLaMA-Factory with the `data` directory provided in this repository, and then work only inside the cloned LLaMA-Factory folder.
Adjust the weight-related paths in the script to match your environment.
### Single node, multiple cards
```
sh ds_zero3_work_dtk.sh
```
## Inference
### Single node, single card
```
sh Xuanyuan_inference.sh
```
## Results
### Q&A
<div align=center>
<img src="./imgs/result.png"/>
</div>
### Accuracy
Test data: [fingpt_sentiment](./LLaMA-Factory/data/fingpt_sentiment.json); accelerator: K100.
| device | train_loss | eval_loss |
| :------: | :------: | :------: |
| K100 | 0.7087 | 0.1019 |
## Application Scenarios
### Algorithm Category
`Q&A`
### Key Application Industries
`Finance, education`
## Pre-trained Weights
- [Duxiaoman-DI/XuanYuan-13B-Chat](https://modelscope.cn/models/Duxiaoman-DI/XuanYuan-13B-Chat/files)
Fast download center for pre-trained weights: [SCNet AIModels](http://113.200.138.88:18080/aimodels)
The pre-trained weights used in this project can also be fetched through the fast download channel: [XuanYuan-13B-Chat](http://113.200.138.88:18080/aimodels/XuanYuan-13B-Chat)
## Source Repository and Issue Feedback
- https://developer.hpccube.com/codes/modelzoo/xuanyuan_pytorch
## References
- [XuanYuan on ModelScope](https://modelscope.cn/models/Duxiaoman-DI/XuanYuan-13B-Chat/summary)
- [XuanYuan on GitHub](https://github.com/Duxiaoman-DI/XuanYuan)
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
model_name_or_path = "/home/wanglch/projects/XuanYuan/XuanYuan-13B-Chat"
tokenizer = LlamaTokenizer.from_pretrained(model_name_or_path, use_fast=False, legacy=True, trust_remote_code=True)
model = LlamaForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=True)
model.eval()
seps = [" ", "</s>"]
roles = ["Human", "Assistant"]
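# XuanYuan-Chat formats each turn as "Human: {question} Assistant: {answer}" and closes
# assistant turns with "</s>" in multi-turn prompts (see conversation.py); only a single
# turn is assembled below.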
content = "互联网金融机构如何确认该笔贷款是由本人申请的?"
prompt = "Human: " + content + " Assistant:"
print(f"输入: {content}")
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.95)
outputs = tokenizer.decode(outputs.cpu()[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(f"输出: {outputs}")
CUDA_VISIBLE_DEVICES=4,5 python Xuanyuan_inference.py
# -*- coding: utf-8 -*-
import argparse
from conversation import get_conv_template
try:
    from vllm import LLM, SamplingParams
    is_vllm_available = True
    print("use vllm.generate to infer...")
except ImportError:
    from transformers import LlamaForCausalLM, LlamaTokenizer
    is_vllm_available = False
    print("use transformers.generate to infer...")
def infer_vllm(llm, sampling_params, prompt):
    """Generate a single completion with vLLM."""
    assert llm is not None
    assert sampling_params is not None
    generation = llm.generate(prompt, sampling_params, use_tqdm=False)
    outputs = generation[0].outputs[0].text.strip()
    return outputs


def infer(model, tokenizer, prompt):
    """Generate a single completion with transformers; sampling options come from the global args."""
    assert model is not None
    assert tokenizer is not None
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=args.max_new_tokens,
        do_sample=True,
        temperature=args.temperature,
        top_p=args.top_p
    )
    # Decode only the newly generated tokens, skipping the prompt.
    outputs = tokenizer.decode(outputs.cpu()[0][len(inputs.input_ids[0]):], skip_special_tokens=True).strip()
    return outputs
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Chat with a XuanYuan chat model (vLLM if available, otherwise transformers)")
    parser.add_argument("-c", "--checkpoint_path", type=str, help="Checkpoint path")
    parser.add_argument("-n", "--max_new_tokens", type=int, default=1000)
    parser.add_argument("-t", "--temperature", type=float, default=0.95)
    parser.add_argument("-p", "--top_p", type=float, default=0.95)
    args = parser.parse_args()

    llm = None
    sampling_params = None
    model = None
    tokenizer = None
    if is_vllm_available:
        print("loading weight with vLLM...")
        sampling_params = SamplingParams(
            temperature=args.temperature,
            top_p=args.top_p,
            stop=list(["</s>"]),
            max_tokens=args.max_new_tokens
        )
        # tensor_parallel_size=8 shards the model across 8 cards; adjust it to the number available.
        llm = LLM(args.checkpoint_path, tensor_parallel_size=8)
    else:
        print("loading weight with transformers ...")
        tokenizer = LlamaTokenizer.from_pretrained(args.checkpoint_path, use_fast=False, legacy=True)
        model = LlamaForCausalLM.from_pretrained(args.checkpoint_path, device_map="auto")

    conv = get_conv_template("XuanYuan-Chat")
    print("########")
    print("输入为: EXIT!! 表示退出")
    print("输入为: CLEAR!! 表示清空上下文")
    print("########")
    while True:
        content = input("输入: ")
        if content.strip() == "EXIT!!":
            print("exit....")
            break
        if content.strip() == "CLEAR!!":
            conv = get_conv_template("XuanYuan-Chat")
            print("clear...")
            continue
        conv.append_message(conv.roles[0], content.strip())
        conv.append_message(conv.roles[1], None)
        prompt = conv.get_prompt()
        if is_vllm_available:
            outputs = infer_vllm(llm, sampling_params, prompt)
        else:
            outputs = infer(model, tokenizer, prompt)
        print(f"输出: {outputs}")
        conv.update_last_message(outputs)
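# Example launch (the script and checkpoint names below are illustrative; tensor_parallel_size=8 above assumes 8 cards):
#   python chat_cli.py -c /path/to/XuanYuan-13B-Chat -n 512 -t 0.95 -p 0.95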
"""
refer: https://github.com/lm-sys/FastChat/tree/main/fastchat
"""
import dataclasses
from enum import auto, IntEnum
from typing import List, Dict
class SeparatorStyle(IntEnum):
    """Separator styles."""

    ADD_COLON_TWO = auto()
@dataclasses.dataclass
class Conversation:
    """A class that manages prompt templates and keeps all conversation history."""

    # The name of this template
    name: str
    # The template of the system prompt
    system_template: str = "{system_message}"
    # The system message
    system_message: str = ""
    # The names of the two roles
    roles: List[str] = ("USER", "ASSISTANT")
    # All messages; each item is [role, message]
    messages: List[List[str]] = ()
    # The number of few-shot examples
    offset: int = 0
    # Separator style and configuration
    sep_style: SeparatorStyle = SeparatorStyle.ADD_COLON_TWO
    sep: str = "\n"
    sep2: str = None
    # Stop criteria (the default one is the EOS token)
    stop_str: str = None
    # Stops generation if meeting any token in this list
    stop_token_ids: List[int] = None

    def get_prompt(self) -> str:
        """Get the prompt for generation."""
        system_prompt = self.system_template.format(system_message=self.system_message)
        if self.sep_style == SeparatorStyle.ADD_COLON_TWO:
            seps = [self.sep, self.sep2]
            ret = system_prompt + seps[0]
            for i, (role, message) in enumerate(self.messages):
                if message:
                    ret += role + ": " + message + seps[i % 2]
                else:
                    ret += role + ":"
            return ret
        else:
            raise ValueError(f"Invalid style: {self.sep_style}")

    def append_message(self, role: str, message: str):
        """Append a new message."""
        self.messages.append([role, message])

    def update_last_message(self, message: str):
        """Replace the last (assistant) message, e.g. with the generated reply."""
        self.messages[-1][1] = message

    def copy(self):
        return Conversation(
            name=self.name,
            system_template=self.system_template,
            system_message=self.system_message,
            roles=self.roles,
            messages=[[x, y] for x, y in self.messages],
            offset=self.offset,
            sep_style=self.sep_style,
            sep=self.sep,
            sep2=self.sep2,
            stop_str=self.stop_str,
            stop_token_ids=self.stop_token_ids,
        )

    def dict(self):
        return {
            "template_name": self.name,
            "system_message": self.system_message,
            "roles": self.roles,
            "messages": self.messages,
            "offset": self.offset,
        }
# A global registry for all conversation templates
conv_templates: Dict[str, Conversation] = {}


def register_conv_template(template: Conversation, override: bool = False):
    """Register a new conversation template."""
    if not override:
        assert (
            template.name not in conv_templates
        ), f"{template.name} has been registered."
    conv_templates[template.name] = template


def get_conv_template(name: str) -> Conversation:
    """Get a conversation template."""
    return conv_templates[name].copy()
register_conv_template(
    Conversation(
        name="XuanYuan-Chat",
        system_message="以下是用户和人工智能助手之间的对话。用户以Human开头,人工智能助手以Assistant开头,会对人类提出的问题给出有帮助、高质量、详细和礼貌的回答,并且总是拒绝参与与不道德、不安全、有争议、政治敏感等相关的话题、问题和指示。\n",
        roles=("Human", "Assistant"),
        messages=(),
        offset=0,
        sep_style=SeparatorStyle.ADD_COLON_TWO,
        sep=" ",
        sep2="</s>",
    )
)
if __name__ == "__main__":
    # Quick demo of the prompt format
    conv = get_conv_template("XuanYuan-Chat")
    conv.append_message(conv.roles[0], "Hello!")
    conv.append_message(conv.roles[1], "Hi!")
    conv.append_message(conv.roles[0], "介绍下你自己")
    conv.append_message(conv.roles[1], None)
    print(conv.get_prompt())
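# For the demo above, get_prompt() produces roughly:
#   "<system message>\n Human: Hello! Assistant: Hi!</s>Human: 介绍下你自己 Assistant:"
# i.e. each human turn is followed by sep (" ") and each assistant turn by sep2 ("</s>"),
# so generation is expected to stop at "</s>".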
If you are using a custom dataset, please provide your dataset definition in the following format in `dataset_info.json`.
```json
"dataset_name": {
"hf_hub_url": "the name of the dataset repository on the Hugging Face hub. (if specified, ignore script_url and file_name)",
"ms_hub_url": "the name of the dataset repository on the ModelScope hub. (if specified, ignore script_url and file_name)",
"script_url": "the name of the directory containing a dataset loading script. (if specified, ignore file_name)",
"file_name": "the name of the dataset file in this directory. (required if above are not specified)",
"file_sha1": "the SHA-1 hash value of the dataset file. (optional, does not affect training)",
"subset": "the name of the subset. (optional, default: None)",
"folder": "the name of the folder of the dataset repository on the Hugging Face hub. (optional, default: None)",
"ranking": "whether the dataset is a preference dataset or not. (default: false)",
"formatting": "the format of the dataset. (optional, default: alpaca, can be chosen from {alpaca, sharegpt})",
"columns (optional)": {
"prompt": "the column name in the dataset containing the prompts. (default: instruction)",
"query": "the column name in the dataset containing the queries. (default: input)",
"response": "the column name in the dataset containing the responses. (default: output)",
"history": "the column name in the dataset containing the histories. (default: None)",
"messages": "the column name in the dataset containing the messages. (default: conversations)",
"system": "the column name in the dataset containing the system prompts. (default: None)",
"tools": "the column name in the dataset containing the tool description. (default: None)"
},
"tags (optional, used for the sharegpt format)": {
"role_tag": "the key in the message represents the identity. (default: from)",
"content_tag": "the key in the message represents the content. (default: value)",
"user_tag": "the value of the role_tag represents the user. (default: human)",
"assistant_tag": "the value of the role_tag represents the assistant. (default: gpt)",
"observation_tag": "the value of the role_tag represents the tool results. (default: observation)",
"function_tag": "the value of the role_tag represents the function call. (default: function_call)",
"system_tag": "the value of the role_tag represents the system prompt. (default: system, can override system column)"
}
}
```
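One field worth noting is `file_sha1`, which pins the expected content of a local data file; a tiny helper like this (not part of LLaMA-Factory) computes it for a new file:
```python
import hashlib

def file_sha1(path: str) -> str:
    """Return the SHA-1 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

print(file_sha1("data/my_dataset.json"))  # path is illustrative
```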
Given the above, you can use the custom dataset by specifying `--dataset dataset_name`.
Currently we support datasets in the **alpaca** or **sharegpt** format. A dataset in the alpaca format should follow the format below:
```json
[
{
"instruction": "user instruction (required)",
"input": "user input (optional)",
"output": "model response (required)",
"system": "system prompt (optional)",
"history": [
["user instruction in the first round (optional)", "model response in the first round (optional)"],
["user instruction in the second round (optional)", "model response in the second round (optional)"]
]
}
]
```
Regarding the above dataset, the `columns` in `dataset_info.json` should be:
```json
"dataset_name": {
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"system": "system",
"history": "history"
}
}
```
The `query` column will be concatenated with the `prompt` column and used as the user prompt, i.e. the user prompt becomes `prompt\nquery`. The `response` column holds the model response.
The `system` column will be used as the system prompt. The `history` column is a list of string tuples representing prompt-response pairs from earlier rounds. Note that the responses in the history **will also be used for training**.
For pre-training datasets, only the `prompt` column is used for training.
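As a rough sketch of the concatenation rule above (illustrative, not LLaMA-Factory's actual implementation):
```python
def build_user_prompt(example: dict) -> str:
    # prompt (instruction) and query (input) are joined as "prompt\nquery";
    # an empty query leaves just the instruction.
    prompt, query = example["instruction"], example.get("input", "")
    return f"{prompt}\n{query}" if query else prompt

print(build_user_prompt({"instruction": "Translate to English:", "input": "你好"}))
```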
For preference datasets, the `response` column should be a string list of length 2, with the preferred answer appearing first, for example:
```json
{
"instruction": "user instruction",
"input": "user input",
"output": [
"chosen answer",
"rejected answer"
]
}
```
A dataset in the sharegpt format should be organized as follows:
```json
[
{
"conversations": [
{
"from": "human",
"value": "user instruction"
},
{
"from": "gpt",
"value": "model response"
}
],
"system": "system prompt (optional)",
"tools": "tool description (optional)"
}
]
```
Regarding the above dataset, the `columns` in `dataset_info.json` should be:
```json
"dataset_name": {
"columns": {
"messages": "conversations",
"system": "system",
"tools": "tools"
},
"tags": {
"role_tag": "from",
"content_tag": "value",
"user_tag": "human",
"assistant_tag": "gpt"
}
}
```
where the `messages` column should be a list of messages following the `u/a/u/a/u/a` (alternating user/assistant) order.
Pre-training datasets and preference datasets are not yet compatible with the sharegpt format.
import json
import datasets
_DESCRIPTION = "BELLE multiturn chat dataset."
_CITATION = """\
@article{belle2023exploring,
title={Exploring the Impact of Instruction Data Scaling on Large Language Models: An Empirical Study on Real-World Use Cases},
author={Yunjie Ji, Yong Deng, Yan Gong, Yiping Peng, Qiang Niu, Lei Zhang, Baochang Ma, Xiangang Li},
journal={arXiv preprint arXiv:2303.14742},
year={2023}
}
"""
_HOMEPAGE = "https://huggingface.co/datasets/BelleGroup/multiturn_chat_0.8M"
_LICENSE = "gpl-3.0"
_URL = "https://huggingface.co/datasets/BelleGroup/multiturn_chat_0.8M/resolve/main/multiturn_chat_0.8M.json"
class BelleMultiturn(datasets.GeneratorBasedBuilder):
    VERSION = datasets.Version("0.0.0")

    def _info(self):
        features = datasets.Features({
            "conversations": [{"from": datasets.Value("string"), "value": datasets.Value("string")}]
        })
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=features,
            homepage=_HOMEPAGE,
            license=_LICENSE,
            citation=_CITATION
        )

    def _split_generators(self, dl_manager: datasets.DownloadManager):
        file_path = dl_manager.download(_URL)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "filepath": file_path
                }
            )
        ]

    def _generate_examples(self, filepath: str):
        with open(filepath, "r", encoding="utf-8") as f:
            for key, row in enumerate(f):
                data = json.loads(row)
                conversations = []
                prompt = data["instruction"].strip()
                response = data["output"].strip()

                # The last turn: everything after the final "Human:" up to "Assistant:".
                assist_idx = prompt.rfind("Assistant:")
                human_idx = prompt.rfind("Human:")
                query = prompt[human_idx+6:assist_idx].strip()
                prompt = prompt[:human_idx].strip()
                conversations.insert(0, {"from": "gpt", "value": response})
                conversations.insert(0, {"from": "human", "value": query})

                # Walk backwards through the earlier "Human:/Assistant:" pairs.
                while prompt.rfind("Assistant:") != -1:
                    assist_idx = prompt.rfind("Assistant:")
                    human_idx = prompt.rfind("Human:")
                    if human_idx != -1:
                        old_query = prompt[human_idx+6:assist_idx].strip()
                        old_resp = prompt[assist_idx+10:].strip()
                        conversations.insert(0, {"from": "gpt", "value": old_resp})
                        conversations.insert(0, {"from": "human", "value": old_query})
                    else:
                        break
                    prompt = prompt[:human_idx].strip()

                yield key, {"conversations": conversations}
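# Each yielded example is a sharegpt-style record, e.g.
#   {"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]},
# matching the "belle_multiturn" entry in dataset_info.json ("formatting": "sharegpt").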
{
"fingpt_sentiment": {
"file_name": "fingpt_sentiment.json",
"file_sha1": "7670f5c174b849e4908d9d0f4e3e61d8755c0142"
},
"alpaca_en": {
"file_name": "alpaca_data_en_52k.json",
"file_sha1": "607f94a7f581341e59685aef32f531095232cf23"
},
"alpaca_zh": {
"file_name": "alpaca_data_zh_51k.json",
"file_sha1": "2ba9827122c158dc256668d42bd1bcb8bc6b786e"
},
"alpaca_gpt4_en": {
"file_name": "alpaca_gpt4_data_en.json",
"file_sha1": "647f4ad447bd993e4b6b6223d1be15208bab694a"
},
"alpaca_gpt4_zh": {
"file_name": "alpaca_gpt4_data_zh.json",
"file_sha1": "3eaa3bda364ccdd59925d7448a698256c31ef845"
},
"identity": {
"file_name": "identity.json",
"file_sha1": "ffe3ecb58ab642da33fbb514d5e6188f1469ad40"
},
"oaast_sft": {
"file_name": "oaast_sft.json",
"file_sha1": "7baf5d43e67a91f9bbdf4e400dbe033b87e9757e",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"history": "history"
}
},
"oaast_sft_zh": {
"file_name": "oaast_sft_zh.json",
"file_sha1": "a6a91f18f80f37b10ded9cf633fb50c033bf7b9f",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"history": "history"
}
},
"lima": {
"file_name": "lima.json",
"file_sha1": "9db59f6b7007dc4b17529fc63379b9cd61640f37",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"history": "history"
}
},
"glaive_toolcall": {
"file_name": "glaive_toolcall_10k.json",
"file_sha1": "a6917b85d209df98d31fdecb253c79ebc440f6f3",
"formatting": "sharegpt",
"columns": {
"messages": "conversations",
"tools": "tools"
}
},
"mllm_demo": {
"file_name": "mllm_demo.json",
"file_sha1": "b6709b23657d5c42a701f1c5574f3a6edaa40a20",
"formatting": "sharegpt",
"columns": {
"messages": "messages",
"images": "images"
},
"tags": {
"role_tag": "role",
"content_tag": "content",
"user_tag": "user",
"assistant_tag": "assistant"
}
},
"example": {
"script_url": "example_dataset",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"history": "history"
}
},
"guanaco": {
"hf_hub_url": "JosephusCheung/GuanacoDataset",
"ms_hub_url": "AI-ModelScope/GuanacoDataset"
},
"belle_2m": {
"hf_hub_url": "BelleGroup/train_2M_CN",
"ms_hub_url": "AI-ModelScope/train_2M_CN"
},
"belle_1m": {
"hf_hub_url": "BelleGroup/train_1M_CN",
"ms_hub_url": "AI-ModelScope/train_1M_CN"
},
"belle_0.5m": {
"hf_hub_url": "BelleGroup/train_0.5M_CN",
"ms_hub_url": "AI-ModelScope/train_0.5M_CN"
},
"belle_dialog": {
"hf_hub_url": "BelleGroup/generated_chat_0.4M",
"ms_hub_url": "AI-ModelScope/generated_chat_0.4M"
},
"belle_math": {
"hf_hub_url": "BelleGroup/school_math_0.25M",
"ms_hub_url": "AI-ModelScope/school_math_0.25M"
},
"belle_multiturn": {
"script_url": "belle_multiturn",
"formatting": "sharegpt"
},
"ultra_chat": {
"script_url": "ultra_chat",
"formatting": "sharegpt"
},
"open_platypus": {
"hf_hub_url": "garage-bAInd/Open-Platypus",
"ms_hub_url": "AI-ModelScope/Open-Platypus"
},
"codealpaca": {
"hf_hub_url": "sahil2801/CodeAlpaca-20k",
"ms_hub_url": "AI-ModelScope/CodeAlpaca-20k"
},
"alpaca_cot": {
"hf_hub_url": "QingyiSi/Alpaca-CoT",
"ms_hub_url": "AI-ModelScope/Alpaca-CoT"
},
"openorca": {
"hf_hub_url": "Open-Orca/OpenOrca",
"ms_hub_url": "AI-ModelScope/OpenOrca",
"columns": {
"prompt": "question",
"response": "response",
"system": "system_prompt"
}
},
"slimorca": {
"hf_hub_url": "Open-Orca/SlimOrca",
"formatting": "sharegpt"
},
"mathinstruct": {
"hf_hub_url": "TIGER-Lab/MathInstruct",
"ms_hub_url": "AI-ModelScope/MathInstruct",
"columns": {
"prompt": "instruction",
"response": "output"
}
},
"firefly": {
"hf_hub_url": "YeungNLP/firefly-train-1.1M",
"columns": {
"prompt": "input",
"response": "target"
}
},
"wikiqa": {
"hf_hub_url": "wiki_qa",
"columns": {
"prompt": "question",
"response": "answer"
}
},
"webqa": {
"hf_hub_url": "suolyer/webqa",
"ms_hub_url": "AI-ModelScope/webqa",
"columns": {
"prompt": "input",
"response": "output"
}
},
"webnovel": {
"hf_hub_url": "zxbsmk/webnovel_cn",
"ms_hub_url": "AI-ModelScope/webnovel_cn"
},
"nectar_sft": {
"hf_hub_url": "mlinmg/SFT-Nectar",
"ms_hub_url": "AI-ModelScope/SFT-Nectar"
},
"deepctrl": {
"ms_hub_url": "deepctrl/deepctrl-sft-data"
},
"adgen": {
"hf_hub_url": "HasturOfficial/adgen",
"ms_hub_url": "AI-ModelScope/adgen",
"columns": {
"prompt": "content",
"response": "summary"
}
},
"sharegpt_hyper": {
"hf_hub_url": "totally-not-an-llm/sharegpt-hyperfiltered-3k",
"formatting": "sharegpt"
},
"sharegpt4": {
"hf_hub_url": "shibing624/sharegpt_gpt4",
"ms_hub_url": "AI-ModelScope/sharegpt_gpt4",
"formatting": "sharegpt"
},
"ultrachat_200k": {
"hf_hub_url": "HuggingFaceH4/ultrachat_200k",
"ms_hub_url": "AI-ModelScope/ultrachat_200k",
"formatting": "sharegpt",
"columns": {
"messages": "messages"
},
"tags": {
"role_tag": "role",
"content_tag": "content",
"user_tag": "user",
"assistant_tag": "assistant"
}
},
"agent_instruct": {
"hf_hub_url": "THUDM/AgentInstruct",
"ms_hub_url": "ZhipuAI/AgentInstruct",
"formatting": "sharegpt"
},
"lmsys_chat": {
"hf_hub_url": "lmsys/lmsys-chat-1m",
"ms_hub_url": "AI-ModelScope/lmsys-chat-1m",
"formatting": "sharegpt",
"columns": {
"messages": "conversation"
},
"tags": {
"role_tag": "role",
"content_tag": "content",
"user_tag": "human",
"assistant_tag": "assistant"
}
},
"evol_instruct": {
"hf_hub_url": "WizardLM/WizardLM_evol_instruct_V2_196k",
"ms_hub_url": "AI-ModelScope/WizardLM_evol_instruct_V2_196k",
"formatting": "sharegpt"
},
"glaive_toolcall_100k": {
"hf_hub_url": "hiyouga/glaive-function-calling-v2-sharegpt",
"formatting": "sharegpt",
"columns": {
"messages": "conversations",
"tools": "tools"
}
},
"cosmopedia": {
"hf_hub_url": "HuggingFaceTB/cosmopedia",
"columns": {
"prompt": "prompt",
"response": "text"
}
},
"oasst_de": {
"hf_hub_url": "mayflowergmbh/oasst_de"
},
"dolly_15k_de": {
"hf_hub_url": "mayflowergmbh/dolly-15k_de"
},
"alpaca-gpt4_de": {
"hf_hub_url": "mayflowergmbh/alpaca-gpt4_de"
},
"openschnabeltier_de": {
"hf_hub_url": "mayflowergmbh/openschnabeltier_de"
},
"evol_instruct_de": {
"hf_hub_url": "mayflowergmbh/evol-instruct_de"
},
"dolphin_de": {
"hf_hub_url": "mayflowergmbh/dolphin_de"
},
"booksum_de": {
"hf_hub_url": "mayflowergmbh/booksum_de"
},
"airoboros_de": {
"hf_hub_url": "mayflowergmbh/airoboros-3.0_de"
},
"ultrachat_de": {
"hf_hub_url": "mayflowergmbh/ultra-chat_de"
},
"hh_rlhf_en": {
"script_url": "hh_rlhf_en",
"columns": {
"prompt": "instruction",
"response": "output",
"history": "history"
},
"ranking": true
},
"oaast_rm": {
"file_name": "oaast_rm.json",
"file_sha1": "622d420e9b70003b210618253bd3d9d2891d86cb",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"history": "history"
},
"ranking": true
},
"oaast_rm_zh": {
"file_name": "oaast_rm_zh.json",
"file_sha1": "1065af1f3784dd61be5e79713a35f427b713a232",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"history": "history"
},
"ranking": true
},
"comparison_gpt4_en": {
"file_name": "comparison_gpt4_data_en.json",
"file_sha1": "96fa18313544e22444fe20eead7754b17da452ae",
"ranking": true
},
"comparison_gpt4_zh": {
"file_name": "comparison_gpt4_data_zh.json",
"file_sha1": "515b18ed497199131ddcc1af950345c11dc5c7fd",
"ranking": true
},
"orca_rlhf": {
"file_name": "orca_rlhf.json",
"file_sha1": "acc8f74d16fd1fc4f68e7d86eaa781c2c3f5ba8e",
"ranking": true,
"columns": {
"prompt": "question",
"response": "answer",
"system": "system"
}
},
"nectar_rm": {
"hf_hub_url": "mlinmg/RLAIF-Nectar",
"ms_hub_url": "AI-ModelScope/RLAIF-Nectar",
"ranking": true
},
"dpo_mix_en": {
"hf_hub_url": "hiyouga/DPO-En-Zh-20k",
"subset": "en",
"ranking": true,
"columns": {
"prompt": "prompt",
"response": "answer",
"system": "system",
"history": "history"
}
},
"dpo_mix_zh": {
"hf_hub_url": "hiyouga/DPO-En-Zh-20k",
"subset": "zh",
"ranking": true,
"columns": {
"prompt": "prompt",
"response": "answer",
"system": "system",
"history": "history"
}
},
"orca_dpo_de": {
"hf_hub_url": "mayflowergmbh/intel_orca_dpo_pairs_de",
"ranking": true
},
"wiki_demo": {
"file_name": "wiki_demo.txt",
"file_sha1": "e70375e28eda542a90c68213640cc371898ce181",
"columns": {
"prompt": "text"
}
},
"c4_demo": {
"file_name": "c4_demo.json",
"file_sha1": "a5a0c86759732f9a5238e447fecd74f28a66cca8",
"columns": {
"prompt": "text"
}
},
"refinedweb": {
"hf_hub_url": "tiiuae/falcon-refinedweb",
"columns": {
"prompt": "content"
}
},
"redpajama_v2": {
"hf_hub_url": "togethercomputer/RedPajama-Data-V2",
"columns": {
"prompt": "raw_content"
},
"subset": "default"
},
"wikipedia_en": {
"hf_hub_url": "olm/olm-wikipedia-20221220",
"ms_hub_url": "AI-ModelScope/olm-wikipedia-20221220",
"columns": {
"prompt": "text"
}
},
"wikipedia_zh": {
"hf_hub_url": "pleisto/wikipedia-cn-20230720-filtered",
"ms_hub_url": "AI-ModelScope/wikipedia-cn-20230720-filtered",
"columns": {
"prompt": "completion"
}
},
"pile": {
"hf_hub_url": "monology/pile-uncopyrighted",
"ms_hub_url": "AI-ModelScope/pile",
"columns": {
"prompt": "text"
}
},
"skypile": {
"hf_hub_url": "Skywork/SkyPile-150B",
"ms_hub_url": "AI-ModelScope/SkyPile-150B",
"columns": {
"prompt": "text"
}
},
"the_stack": {
"hf_hub_url": "bigcode/the-stack",
"ms_hub_url": "AI-ModelScope/the-stack",
"columns": {
"prompt": "content"
}
},
"starcoder_python": {
"hf_hub_url": "bigcode/starcoderdata",
"ms_hub_url": "AI-ModelScope/starcoderdata",
"columns": {
"prompt": "content"
},
"folder": "python"
}
}
import json
import datasets
from typing import Any, Dict, List
_DESCRIPTION = "An example of dataset."
_CITATION = ""
_HOMEPAGE = ""
_LICENSE = ""
_URL = "examples.json"
class ExampleDataset(datasets.GeneratorBasedBuilder):
    VERSION = datasets.Version("0.0.0")

    def _info(self) -> datasets.DatasetInfo:
        features = datasets.Features({
            "instruction": datasets.Value("string"),
            "input": datasets.Value("string"),
            "output": datasets.Value("string"),
            "history": datasets.Sequence(datasets.Sequence(datasets.Value("string")))
        })
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=features,
            homepage=_HOMEPAGE,
            license=_LICENSE,
            citation=_CITATION
        )

    def _split_generators(self, dl_manager: datasets.DownloadManager) -> List[datasets.SplitGenerator]:
        file_path = dl_manager.download(_URL)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "filepath": file_path
                }
            )
        ]

    def _generate_examples(self, filepath: str) -> Dict[int, Dict[str, Any]]:
        example_dataset = json.load(open(filepath, "r", encoding="utf-8"))
        for key, example in enumerate(example_dataset):
            yield key, example
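# This loader is registered in dataset_info.json as "example" via "script_url": "example_dataset"
# and expects an examples.json file (alpaca-style records with an optional "history") alongside it.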
[
{
"instruction": "听起来很不错。人工智能可能在哪些方面面临挑战呢?",
"input": "",
"output": "人工智能面临的挑战包括数据隐私、安全和道德方面的问题,以及影响就业机会的自动化等问题。",
"history": [
["你好,你能帮我解答一个问题吗?", "当然,请问有什么问题?"],
["我想了解人工智能的未来发展方向,你有什么想法吗?", "人工智能在未来的发展方向可能包括更强大的机器学习算法,更先进的自然语言处理技术,以及更加智能的机器人。"]
]
},
{
"instruction": "好的,谢谢你!",
"input": "",
"output": "不客气,有其他需要帮忙的地方可以继续问我。",
"history": [
["你好,能告诉我今天天气怎么样吗?", "当然可以,请问您所在的城市是哪里?"],
["我在纽约。", "纽约今天晴间多云,气温最高约26摄氏度,最低约18摄氏度,记得注意保暖喔。"]
]
}
]
import json
import datasets
from typing import List
_DESCRIPTION = "Human preference data about helpfulness and harmlessness."
_CITATION = ""
_HOMEPAGE = "https://huggingface.co/datasets/Anthropic/hh-rlhf"
_LICENSE = "mit"
_URL = "https://huggingface.co/datasets/Anthropic/hh-rlhf/resolve/main/"
_URLS = {
"train": [
_URL + "harmless-base/train.jsonl.gz",
_URL + "helpful-base/train.jsonl.gz",
_URL + "helpful-online/train.jsonl.gz",
_URL + "helpful-rejection-sampled/train.jsonl.gz"
],
"test": [
_URL + "harmless-base/test.jsonl.gz",
_URL + "helpful-base/test.jsonl.gz",
_URL + "helpful-online/test.jsonl.gz",
_URL + "helpful-rejection-sampled/test.jsonl.gz"
]
}
class HhRlhfEn(datasets.GeneratorBasedBuilder):
    VERSION = datasets.Version("0.0.0")

    def _info(self) -> datasets.DatasetInfo:
        features = datasets.Features({
            "instruction": datasets.Value("string"),
            "output": datasets.Sequence(datasets.Value("string")),
            "history": datasets.Sequence(datasets.Sequence(datasets.Value("string")))
        })
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=features,
            homepage=_HOMEPAGE,
            license=_LICENSE,
            citation=_CITATION
        )

    def _split_generators(self, dl_manager: datasets.DownloadManager):
        file_path = dl_manager.download_and_extract(_URLS)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "filepaths": file_path["train"]
                }
            ),
            datasets.SplitGenerator(
                name=datasets.Split.TEST,
                gen_kwargs={
                    "filepaths": file_path["test"]
                }
            )
        ]

    def _generate_examples(self, filepaths: List[str]):
        key = 0
        for filepath in filepaths:
            with open(filepath, "r", encoding="utf-8") as f:
                for row in f:
                    data = json.loads(row)
                    chosen = data["chosen"]
                    rejected = data["rejected"]

                    # Extract the final responses from the chosen and rejected transcripts.
                    assist_idx = rejected.rfind("\n\nAssistant: ")
                    r_reject = rejected[assist_idx+13:].strip()
                    assist_idx = chosen.rfind("\n\nAssistant: ")
                    r_accept = chosen[assist_idx+13:].strip()

                    # The final human query and the preceding conversation.
                    human_idx = chosen.rfind("\n\nHuman: ")
                    query = chosen[human_idx+9:assist_idx].strip()
                    prompt = chosen[:human_idx]
                    history = []

                    # Walk backwards through the earlier Human/Assistant pairs.
                    while prompt.rfind("\n\nAssistant: ") != -1:
                        assist_idx = prompt.rfind("\n\nAssistant: ")
                        human_idx = prompt.rfind("\n\nHuman: ")
                        if human_idx != -1:
                            old_query = prompt[human_idx+9:assist_idx].strip()
                            old_resp = prompt[assist_idx+13:].strip()
                            history.insert(0, (old_query, old_resp))
                        else:
                            break
                        prompt = prompt[:human_idx]

                    yield key, {
                        "instruction": query,
                        "output": [r_accept, r_reject],
                        "history": history
                    }
                    key += 1
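# Each yielded example follows the preference format described in the data README:
#   output = [chosen answer, rejected answer], with earlier turns kept in "history".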
import json
import datasets
from typing import List
_DESCRIPTION = "UltraChat: Large-scale, Informative, and Diverse Multi-round Dialogue Data."
_CITATION = """\
@misc{UltraChat,
author = {Ding, Ning and Chen, Yulin and Xu, Bokai and Hu, Shengding and Qin, Yujia and Liu, Zhiyuan and Sun, Maosong and Zhou, Bowen},
title = {UltraChat: A Large-scale Auto-generated Multi-round Dialogue Data},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\\url{https://github.com/thunlp/ultrachat}},
}
"""
_HOMEPAGE = "https://huggingface.co/datasets/stingning/ultrachat"
_LICENSE = "cc-by-nc-4.0"
_BASE_DATA_URL = "https://huggingface.co/datasets/stingning/ultrachat/resolve/main/train_{idx}.jsonl"
class UltraChat(datasets.GeneratorBasedBuilder):
    VERSION = datasets.Version("0.0.0")

    def _info(self):
        features = datasets.Features({
            "conversations": [{"from": datasets.Value("string"), "value": datasets.Value("string")}]
        })
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=features,
            homepage=_HOMEPAGE,
            license=_LICENSE,
            citation=_CITATION
        )

    def _split_generators(self, dl_manager: datasets.DownloadManager):
        file_paths = [dl_manager.download(_BASE_DATA_URL.format(idx=idx)) for idx in range(10)]  # multiple shards
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "filepaths": file_paths
                }
            )
        ]

    def _generate_examples(self, filepaths: List[str]):
        for filepath in filepaths:
            with open(filepath, "r", encoding="utf-8") as f:
                for row in f:
                    try:
                        data = json.loads(row)
                    except Exception:
                        continue
                    key: int = data["id"]
                    content: List[str] = data["data"]
                    # Keep an even number of turns (human/gpt pairs); drop a trailing human turn.
                    if len(content) % 2 == 1:
                        content.pop(-1)
                    if len(content) < 2:
                        continue
                    conversations = [{
                        "from": "human" if i % 2 == 0 else "gpt",
                        "value": content[i]
                    } for i in range(len(content))]
                    yield key, {"conversations": conversations}
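# Downloads the train_0 ... train_9 shards and converts each dialogue into alternating
# human/gpt turns; dialogues with an odd trailing turn are truncated to complete pairs.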
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
ENV DEBIAN_FRONTEND=noninteractive
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com