# DISC-FinLLM

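**DISC-FinLLM is a large language model for the financial domain, designed to provide users with professional, intelligent, and comprehensive financial consulting services. It is developed and open-sourced by the [Fudan University Data Intelligence and Social Computing Lab (Fudan-DISC)](http://fudan-disc.com).**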

## Paper

- [DISC-FinLLM: A Chinese Financial Large Language Model based on Multiple Experts Fine-tuning](https://arxiv.org/abs/2310.15205)

## Model Architecture

DISC-FinLLM is a financial large language model obtained by LoRA instruction fine-tuning of the general-domain Chinese large language model Baichuan-13B-Chat on DISC-Fin-SFT, a high-quality financial dataset we constructed. Baichuan as a whole is based on the standard Transformer architecture.

<div align="center">
    <img src="./images/transformer.jpg"/>
</div>

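## Algorithm Principles

DISC-FinLLM uses Baichuan-13B as its base model and trains four LoRA expert modules, one on each of the four parts of the dataset, as shown in the figure below. At deployment, users only need to swap the LoRA parameters on the current base model to switch functions. Users can therefore activate or deactivate different modules according to their needs without reloading the entire model.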

<div align=center>
    <img src="./images/lora_en.png"/>
</div>
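
The adapter-switching idea can be illustrated with a toy, dependency-free sketch. It is not the project's implementation: the class name `LoRALinear`, the expert names, and all numbers are hypothetical, and a real deployment would load adapter weights through a LoRA library. The sketch only shows that the base weight matrix stays fixed while a small low-rank update `B(Ax)` is switched on, swapped, or switched off.

```python
# Toy illustration of swapping LoRA adapters on a frozen base layer.
# All names and numbers are hypothetical, not taken from the project code.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """A linear layer y = Wx + B(Ax) with a frozen W and swappable (B, A) pairs."""

    def __init__(self, base_weight):
        self.base_weight = base_weight  # shared by every expert, never modified
        self.adapters = {}              # adapter name -> (B, A) low-rank factors
        self.active = None              # name of the adapter in use, or None

    def add_adapter(self, name, b, a):
        self.adapters[name] = (b, a)

    def set_adapter(self, name):
        """Activate an adapter by name; pass None to use the base model only."""
        if name is not None and name not in self.adapters:
            raise KeyError(f"unknown adapter: {name}")
        self.active = name

    def forward(self, x):
        y = matvec(self.base_weight, x)
        if self.active is not None:
            b, a = self.adapters[self.active]
            delta = matvec(b, matvec(a, x))  # rank-r update B(Ax)
            y = [yi + di for yi, di in zip(y, delta)]
        return y


# A 2x2 identity base layer with two rank-1 adapters.
layer = LoRALinear([[1.0, 0.0], [0.0, 1.0]])
layer.add_adapter("consulting", b=[[1.0], [0.0]], a=[[0.0, 1.0]])
layer.add_adapter("computing", b=[[0.0], [1.0]], a=[[1.0, 0.0]])

x = [1.0, 2.0]
base_out = layer.forward(x)          # no adapter active
layer.set_adapter("consulting")
consulting_out = layer.forward(x)    # base output plus the consulting update
layer.set_adapter("computing")
computing_out = layer.forward(x)     # same base weights, different update
```

Switching experts only changes which `(B, A)` pair is read in `forward`, so the base weights never need to be reloaded.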


## Environment Setup
### Docker (Method 1)
Pull the Docker image from [光源](https://www.sourcefind.cn/#/service-details) and use it as follows:
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-ubuntu22.04-dtk24.04.1-py3.10

docker run -it -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro --shm-size=64G --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name DISC-FinLLM <your imageID> bash

cd /path/your_code_data/

pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
pip install transformers==4.40.1
```

### Dockerfile (Method 2)
```
cd /path/your_code_data/docker

docker build --no-cache -t disc-finllm:latest .

docker run --shm-size=64G --name disc-finllm -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v /path/your_code_data/:/path/your_code_data/ -it disc-finllm:latest bash

cd /path/your_code_data/
pip install transformers==4.40.1
```

### Anaconda (Method 3)

The special deep learning libraries this project requires for DCU cards can be downloaded and installed from the [光合](https://developer.hpccube.com/tool/) developer community.
```
DTK driver: dtk24.04.1
python: python3.10
torch: 2.1
torchvision: 0.16.0
apex: 1.1.0
deepspeed: 0.12.3
```
`Tips: the versions of the DTK driver, python, pytorch, and other DCU-related tools listed above must correspond to one another exactly.`
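
Because the versions must line up exactly, a quick check of the installed packages can catch mismatches early. The following is a minimal sketch, not part of the project: the helper names are hypothetical, and it assumes the packages are installed under the distribution names `torch`, `torchvision`, `apex`, and `deepspeed`. Local build suffixes (anything after `+`) are ignored, since vendor builds often carry one.

```python
from importlib import metadata

# Versions from the list above; the distribution names are an assumption.
EXPECTED = {
    "torch": "2.1",
    "torchvision": "0.16.0",
    "apex": "1.1.0",
    "deepspeed": "0.12.3",
}


def version_matches(installed, expected):
    """Compare only as many dot-separated components as `expected` specifies."""
    installed_parts = installed.split("+")[0].split(".")  # drop local build suffix
    expected_parts = expected.split(".")
    return installed_parts[:len(expected_parts)] == expected_parts


def check_versions(expected=EXPECTED, get_version=metadata.version):
    """Return {package: (installed_version_or_None, matches)} for each package."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            report[package] = (None, False)
            continue
        report[package] = (installed, version_matches(installed, wanted))
    return report
```

Calling `check_versions()` inside the activated environment returns one entry per package; any entry whose second element is `False` points to a missing or mismatched install.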

```
conda create -n DISC-FinLLM python=3.10
conda activate DISC-FinLLM

cd /path/your_code_data/

pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple --trusted-host mirrors.aliyun.com

pip install transformers==4.40.1
```

## Dataset

**You can view the datasets for the [data analysis evaluation](https://github.com/FudanDISC/DISC-FinLLM/tree/main/eval/computing_eval.json) and the [current affairs analysis evaluation](https://github.com/FudanDISC/DISC-FinLLM/tree/main/eval/retriever_eval.json) here.**

### Custom Data Processing Code
See `data_processor.py`.

```
import json

# Input: one JSON object per line (JSONL). Output: a single JSON array.
jsonl_file_path = '.../data/dataset_new.jsonl'
json_file_path = '../data/dataset_new.json'
data = []
with open(jsonl_file_path, 'r', encoding='utf-8') as file:
    for line in file:
        jsonl_data = json.loads(line)
        # "context" holds "Instruction: ..." on its first line and "Input: ..." on its second.
        context_lines = jsonl_data.get("context").split('\n')
        json_data = {
            "instruction": context_lines[0].replace('Instruction: ', ''),
            "input": context_lines[1].replace('Input: ', ''),
            "output": jsonl_data.get("target")
        }
        data.append(json_data)

# Write all records as one indented JSON array, keeping non-ASCII text readable.
with open(json_file_path, 'w', encoding='utf-8') as file:
    json.dump(data, file, ensure_ascii=False, indent=4)

print(data)
```

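A mini dataset for trial training is provided with the project. The training data directory is structured as follows; prepare the full dataset for regular training using the same structure:
```
data
├── computing_part.json
├── consulting_part.json
├── retrieval_part.json
└── task_part.json
```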
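
Before starting a training run, it can help to confirm that a prepared dataset follows this layout. The sketch below is an illustration only: `validate_data_dir` is a hypothetical helper, and it assumes each file is a JSON array of records with `instruction`, `input`, and `output` keys, matching the output of the conversion script above.

```python
import json
from pathlib import Path

# File names from the directory listing above.
EXPECTED_FILES = (
    "computing_part.json",
    "consulting_part.json",
    "retrieval_part.json",
    "task_part.json",
)
# Assumed record schema, based on the conversion script's output.
REQUIRED_KEYS = {"instruction", "input", "output"}


def validate_data_dir(data_dir):
    """Return a list of problems found; an empty list means the layout looks fine."""
    problems = []
    for name in EXPECTED_FILES:
        path = Path(data_dir) / name
        if not path.is_file():
            problems.append(f"missing file: {name}")
            continue
        try:
            with open(path, encoding="utf-8") as f:
                records = json.load(f)
        except json.JSONDecodeError as exc:
            problems.append(f"{name} is not valid JSON: {exc}")
            continue
        if not isinstance(records, list):
            problems.append(f"{name} should contain a JSON array")
            continue
        for index, record in enumerate(records):
            # Non-dict entries are reported as lacking every required key.
            keys = set(record) if isinstance(record, dict) else set()
            missing = REQUIRED_KEYS - keys
            if missing:
                problems.append(f"{name}[{index}] lacks keys: {sorted(missing)}")
    return problems
```

For example, `validate_data_dir("data")` run from the repository root returns an empty list when all four files are present and well-formed under the stated assumptions.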

## Training

### Single Node, Multiple Cards
```
bash multi_dcu_train.sh
```

### Single Node, Single Card
```
bash sft_work_dtk.sh
```

## Inference
**Before running the inference code, replace the corresponding files in the downloaded local model with the files from the repository's FinLLM folder, and change the model path in cli_demo.py to the local model path.**

### Single Node, Single Card
Set **model_path** in **cli_demo.py** to the local model path after the model files have been replaced.
```
python cli_demo.py
```

## Results

<div align=center>
    <img src="./images/result.png"/>
</div>

### Accuracy
Test data: [retrieval_part](data/retrieval_part.json). Accelerator cards used: V100S/K100.

Test results:

| device | train_loss | eval_loss |
| :------: | :------: | :------: |
| V100S | 0.6173 | 0.6276 |
| K100 | 0.6193 | 0.6269 |
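
To put the two rows side by side, the relative gap between the cards can be computed directly from the table. This is a small worked calculation on the reported numbers only; the helper name `relative_gap` is hypothetical, and V100S is taken as the reference.

```python
# Losses copied from the table above.
results = {
    "V100S": {"train_loss": 0.6173, "eval_loss": 0.6276},
    "K100": {"train_loss": 0.6193, "eval_loss": 0.6269},
}


def relative_gap(reference, value):
    """Absolute difference between value and reference, as a fraction of reference."""
    return abs(value - reference) / reference


gaps = {
    metric: relative_gap(results["V100S"][metric], results["K100"][metric])
    for metric in ("train_loss", "eval_loss")
}

for metric, gap in gaps.items():
    print(f"{metric}: K100 differs from V100S by {gap:.2%}")
```

Both gaps come out below 0.5% of the V100S value.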



## Application Scenarios
### Algorithm Category
`Text analysis`

### Key Application Industries
`Finance, education, government, scientific research`

## Pretrained Weights

- Download the full-parameter model weights from [Hugging Face Go4miii/DISC-FinLLM](https://huggingface.co/Go4miii/DISC-FinLLM).

Fast download center for pretrained weights: [SCNet AIModels](http://113.200.138.88:18080/aimodels)

The pretrained weights used in this project can be downloaded through the fast download channel: [disc-finllm](http://113.200.138.88:18080/aimodels/disc-finllm)

## Source Repository and Issue Reporting

- http://developer.hpccube.com/codes/modelzoo/disc-finllm_pytorch.git

## References
- https://github.com/FudanDISC/DISC-FinLLM