# DISC-FinLLM

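**DISC-FinLLM is a large language model for the financial domain, designed to provide users with professional, intelligent, and comprehensive financial consulting services. It was developed and open-sourced by the [Fudan University Data Intelligence and Social Computing Lab (Fudan-DISC)](http://fudan-disc.com).**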

## Paper

- [DISC-FinLLM: A Chinese Financial Large Language Model based on Multiple Experts Fine-tuning](https://arxiv.org/abs/2310.15205)

## Model Architecture

Baichuan-13B is a 13-billion-parameter model developed by Baichuan Intelligent Technology (百川智能) as the successor to Baichuan-7B. It was trained on 1.4 trillion tokens of high-quality corpus, 40% more than LLaMA-13B.
Baichuan 2 is the new generation of open-source large language models released by Baichuan Intelligent Technology, trained on 2.6 trillion tokens of high-quality corpus.

<div align="center">
    <img src="./images/transformer.jpg"/>
</div>

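## Algorithm
DISC-FinLLM is a financial large language model obtained by LoRA instruction fine-tuning of the general-domain Chinese model Baichuan-13B-Chat on DISC-Fin-SFT, a high-quality financial dataset that we built.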
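
As a rough illustration of what LoRA fine-tuning does (a minimal sketch, not the training code of this project): the pretrained weight matrix `W` stays frozen and only a low-rank update `B @ A` is trained, scaled by `alpha / r`. The function names, matrix sizes, and values below are made up for illustration.

```python
# Minimal pure-Python sketch of the LoRA idea.
# All names, shapes, and numbers here are illustrative assumptions.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, lora_b, lora_a, alpha):
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank."""
    r = len(lora_a)  # A has shape (r, in_features)
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # shape (out_features, in_features)
    return [[wi + scale * di for wi, di in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(out_features, in_features, r):
    """Return (LoRA parameter count, full fine-tuning parameter count) for one matrix."""
    return r * (out_features + in_features), out_features * in_features


# Frozen 2x3 weight with a rank-1 adapter (B is 2x1, A is 1x3).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
B = [[1.0], [2.0]]
A = [[0.5, 0.0, 1.0]]
W_eff = lora_effective_weight(W, B, A, alpha=2.0)
print(W_eff)
```

For a single 5120 x 5120 projection with rank 8 (sizes chosen only as an example), `trainable_params(5120, 5120, 8)` gives 81,920 adapter parameters against 26,214,400 for the full matrix, which is why LoRA is much cheaper than full fine-tuning.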

<div align=center>
    <img src="./images/transformer.png"/>
</div>


## Environment Setup
### Docker (Method 1)
Address for pulling the Docker image from [光源 (SourceFind)](https://www.sourcefind.cn/#/service-details), and usage steps:
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu22.04-dtk23.10.1-py310

docker run -it -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro --shm-size=64G --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name DISC-FinLLM <your imageID> bash

cd /path/your_code_data/

pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

pip install bitsandbytes-0.43.0-py3-none-any.whl

pip install deepspeed-0.12.3+gitfe61783.abi0.dtk2310.torch2.1.0a0-cp310-cp310-manylinux2014_x86_64.whl
```

### Dockerfile (Method 2)
```
cd /path/your_code_data/docker

docker build --no-cache -t disc-finllm:latest .

docker run --shm-size=64G --name disc-finllm -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v /path/your_code_data/:/path/your_code_data/ -it disc-finllm:latest bash

cd /path/your_code_data/

pip install bitsandbytes-0.43.0-py3-none-any.whl

pip install deepspeed-0.12.3+gitfe61783.abi0.dtk2310.torch2.1.0a0-cp310-cp310-manylinux2014_x86_64.whl
```

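### Anaconda (Method 3)

The special deep learning libraries that this project needs for DCU cards can be downloaded and installed from the [光合](https://developer.hpccube.com/tool/) developer community.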
```
DTK driver: dtk23.10
python: python3.10
torch: 2.1
torchvision: 0.16.0
apex: 1.1.0
deepspeed: 0.12.3
bitsandbytes: 0.43.0
```
`Tips: the versions of the DTK driver, python, torch, and the other DCU-related tools listed above must match one another exactly.`
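
To check an existing environment against the list above, a small script along the following lines may help. This is only a sketch: the pip distribution names in `EXPECTED` are assumptions and may differ for DCU builds, and the comparison is a prefix match because DCU wheels carry local suffixes (for example `0.12.3+gitfe61783...` for deepspeed).

```python
from importlib import metadata

# Versions from the list above; the distribution names are assumptions.
EXPECTED = {
    "torch": "2.1",
    "torchvision": "0.16.0",
    "apex": "1.1.0",
    "deepspeed": "0.12.3",
    "bitsandbytes": "0.43.0",
}


def check_versions(expected=EXPECTED):
    """Return {package: (installed_version_or_None, expected_prefix, matches)}."""
    report = {}
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None  # package is not installed in this environment
        report[name] = (have, want, have is not None and have.startswith(want))
    return report


if __name__ == "__main__":
    for name, (have, want, ok) in check_versions().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: installed={have} expected={want} {status}")
```

The DTK driver itself is not a Python package, so it is not covered by this check.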

```
conda create -n DISC-FinLLM python=3.10
conda activate DISC-FinLLM

cd /path/your_code_data/

pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

pip install bitsandbytes-0.43.0-py3-none-any.whl

pip install deepspeed-0.12.3+gitfe61783.abi0.dtk2310.torch2.1.0a0-cp310-cp310-manylinux2014_x86_64.whl
```

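## Dataset

**The datasets for the [data analysis evaluation](https://github.com/FudanDISC/DISC-FinLLM/tree/main/eval/computing_eval.json) and the [current affairs analysis evaluation](https://github.com/FudanDISC/DISC-FinLLM/tree/main/eval/retriever_eval.json) are available at these links.**

### Custom data processing code
See data_processor.py.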

```
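import json

# Input (JSON Lines) and output (JSON array) paths; adjust them to your data location.
jsonl_file_path = '.../data/dataset_new.jsonl'
json_file_path = '../data/dataset_new.json'

data = []
with open(jsonl_file_path, 'r', encoding='utf-8') as file:
    for line in file:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        jsonl_data = json.loads(line)
        # "context" holds "Instruction: ..." on its first line and "Input: ..." on its second.
        context_lines = jsonl_data.get("context", "").split('\n')
        json_data = {
            "instruction": context_lines[0].replace('Instruction: ', ''),
            "input": context_lines[1].replace('Input: ', '') if len(context_lines) > 1 else "",
            "output": jsonl_data.get("target")
        }
        data.append(json_data)

# Write all records as one JSON array, keeping non-ASCII text readable.
with open(json_file_path, 'w', encoding='utf-8') as file:
    json.dump(data, file, ensure_ascii=False, indent=4)

print(data)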
```
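
To make the expected input format concrete, here is a small sketch of the same conversion applied to a single record. The sample record is invented for illustration; only the `context` / `target` keys and the `Instruction: ` / `Input: ` prefixes come from the script above, and `convert_record` is a hypothetical helper name.

```python
import json


def convert_record(jsonl_line):
    """Convert one JSONL line with "context"/"target" keys into an instruction record."""
    record = json.loads(jsonl_line)
    context_lines = record.get("context", "").split("\n")
    return {
        "instruction": context_lines[0].replace("Instruction: ", ""),
        "input": context_lines[1].replace("Input: ", "") if len(context_lines) > 1 else "",
        "output": record.get("target"),
    }


# Hypothetical sample line, not taken from the real dataset.
sample = json.dumps({
    "context": "Instruction: Classify the sentiment.\nInput: Profits rose 10%.",
    "target": "positive",
})
converted = convert_record(sample)
print(json.dumps(converted, ensure_ascii=False, indent=4))
```

Each converted record has the three fields `instruction`, `input`, and `output`, which is the shape written to the output JSON file.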

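The project ships a mini dataset for trial training. The training data directory is structured as follows; prepare the full dataset for regular training with the same layout: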
```
data
├── computing_part.json
├── consulting_part.json
├── retrieval_part.json
└── task_part.json
```
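
Before starting a training run, it can be useful to confirm that the data directory matches this layout. The following is a minimal sketch; `missing_dataset_files` is a hypothetical helper, and the relative path `data` is an assumption about where the script is run from.

```python
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_FILES = [
    "computing_part.json",
    "consulting_part.json",
    "retrieval_part.json",
    "task_part.json",
]


def missing_dataset_files(data_dir):
    """Return the expected dataset files that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_dataset_files("data")  # assumed location of the data directory
    print("all files present" if not missing else f"missing: {missing}")
```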

## Training

### Single node, multiple cards
```
bash multi_dcu_train.sh
```

### Single node, single card
```
bash sft_work_dtk.sh
```

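## Inference
**Before running the inference code, replace the corresponding files in the downloaded local FinLLM model directory with the files from the FinLLM model folder, and change the model path in cli_demo.py to the local model path.**

### Single node, single card
Set **model_path** in **cli_demo.py** to the local model path obtained after replacing the model files.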
```
python cli_demo.py
```

## Result

<div align=center>
    <img src="./images/result.png"/>
</div>


## Application Scenarios
### Algorithm category
`Text analysis`

### Key application industries
`Finance, education, government, scientific research`

## Pretrained Weights

- Download the full-parameter model weights from [Hugging Face Go4miii/DISC-FinLLM](https://huggingface.co/Go4miii/DISC-FinLLM).

Fast download center for pretrained weights: [SCNet AIModels](http://113.200.138.88:18080/aimodels)

The pretrained weights used in this project can be downloaded through the fast download channel: [DocOwl1.5-Omni](http://113.200.138.88:18080/aimodels/mplug-doclcal_1.5)

## Source Repository and Issue Feedback
- http://developer.hpccube.com/codes/modelzoo/disc-finllm_pytorch.git

## References
- https://github.com/FudanDISC/DISC-FinLLM