# Multilingual E5
## Paper
`Multilingual E5 Text Embeddings: A Technical Report`
- https://arxiv.org/abs/2402.05672
## Model Architecture
The multilingual E5 models are built on multilingual MiniLM and XLM-RoBERTa and are trained with contrastive pre-training followed by supervised fine-tuning. Small, base, large, and instruction-tuned variants are available, targeting multilingual information retrieval and semantic similarity tasks. A minimal encoding sketch is given after the figure below.

<div align=center>
    <img src="./doc/RoBERTa.png"/>
</div>
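
The following is a minimal, illustrative sketch of how text can be encoded with the XLM-RoBERTa-based backbone through `transformers`, using average pooling over the last hidden states followed by L2 normalization; the example texts and batch size are assumptions for illustration only, not code from this repository.

```python
# Illustrative encoding with the XLM-RoBERTa backbone and mean pooling (not repo code).
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")
model = AutoModel.from_pretrained("intfloat/multilingual-e5-large")

# E5 models expect a "query: " or "passage: " prefix on each input text.
texts = ["query: 你好世界", "passage: Hello world"]
batch = tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    last_hidden = model(**batch).last_hidden_state        # (batch, seq_len, 1024)

# Average pooling over non-padding tokens, then L2 normalization.
mask = batch["attention_mask"].unsqueeze(-1)
embeddings = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)
embeddings = F.normalize(embeddings, dim=-1)
print(embeddings.shape)  # (2, 1024)
```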

## Algorithm Principle
The multilingual E5 models are trained in two stages. First, weakly supervised contrastive pre-training with the InfoNCE loss on roughly one billion multilingual text pairs teaches the model general semantic representations. Then, supervised fine-tuning on about 1.6 million high-quality labeled examples, combined with hard-negative mining and knowledge distillation from a cross-encoder, refines the embedding space for semantic similarity and multilingual retrieval. A minimal sketch of the InfoNCE objective is given after the figure below.

<div align=center>
    <img src="./doc/algorithm_principle.png"/>
</div>
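
Below is a minimal, illustrative sketch of the InfoNCE objective with in-batch negatives, written against the PyTorch version listed in the environment section; the function name, temperature value, and batch shapes are assumptions, not the official E5 training code.

```python
# Minimal InfoNCE with in-batch negatives (illustrative only).
import torch
import torch.nn.functional as F

def infonce_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor,
                 temperature: float = 0.05) -> torch.Tensor:
    """The i-th query should match the i-th passage; every other passage in the
    batch serves as a negative. `temperature` is an assumed hyperparameter."""
    query_emb = F.normalize(query_emb, dim=-1)
    passage_emb = F.normalize(passage_emb, dim=-1)
    logits = query_emb @ passage_emb.T / temperature        # (batch, batch) cosine similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random vectors of dimension 1024 (the multilingual-e5-large embedding size).
q, p = torch.randn(8, 1024), torch.randn(8, 1024)
print(infonce_loss(q, p).item())
```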

## Environment Setup
### Hardware Requirements
DCU model: K100_AI; number of nodes: 1; number of cards: 4.
### Docker (Method 1)
```bash
docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.8.5-ubuntu22.04-dtk25.04-rc7-das1.5-py3.10-20250612-fixpy-rocblas0611-rc2

docker run -it --shm-size 200g --network=host --name {docker_name} --privileged --device=/dev/kfd --device=/dev/dri --device=/dev/mkfd --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro {imageID} bash

cd /your_code_path/multilingual-e5-large_pytorch
pip install "transformers>=4.51.0"
pip install "sentence-transformers>=4.1.0"
```
### Dockerfile (Method 2)
How to use the provided Dockerfile:
```bash
cd docker
docker build --no-cache -t multilingual-e5:latest .
docker run -it --shm-size 200g --network=host --name {docker_name} --privileged --device=/dev/kfd --device=/dev/dri --device=/dev/mkfd --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro {imageID} bash

cd /your_code_path/multilingual-e5_pytorch
pip install "transformers>=4.51.0"
pip install "sentence-transformers>=2.7.0"
```
### Anaconda (Method 3)
The DCU-specific deep-learning libraries required by this project can be downloaded and installed from the [光合](https://developer.sourcefind.cn/tool/) developer community.
```bash
DTK: 25.04
python: 3.10
vllm: 0.8.5
torch: 2.4.1+das.opt2.dtk2504
deepspeed: 0.14.2+das.opt2.dtk2504
```
`Tip: the DTK driver, Python, torch, and other DCU-related tool versions listed above must match one another exactly.`

Other non-deep-learning libraries can be installed as follows:
```bash
pip install "transformers>=4.51.0"
pip install "sentence-transformers>=2.7.0"
```
## Dataset
None at present.
## Training
None at present.
## Inference
### vllm inference
```bash
## The HF_ENDPOINT environment variable must be set
export HF_ENDPOINT=https://hf-mirror.com
## model_name_or_path: path to the model weights
python ./infer/infer_vllm.py --model /path/your_model_path/
```
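
For reference, here is a hypothetical sketch of what an embedding script such as `infer_vllm.py` might do, assuming vLLM's offline embedding API (`LLM(..., task="embed")` and `llm.embed()`); the argument handling, prompts, and output path are assumptions and may differ from the actual script in this repository.

```python
# Hypothetical embedding script using vLLM's offline embedding API (illustrative only).
import argparse
import numpy as np
from vllm import LLM

parser = argparse.ArgumentParser()
parser.add_argument("--model", required=True, help="path to the multilingual-e5-large weights")
args = parser.parse_args()

# E5 models expect a "query: " or "passage: " prefix on each input text.
prompts = ["query: 你好,我的名字是", "query: 法国的首都是"]

llm = LLM(model=args.model, task="embed")
outputs = llm.embed(prompts)
embeddings = np.array([o.outputs.embedding for o in outputs], dtype=np.float32)

print(embeddings.shape)                             # (num_prompts, 1024) for the large model
np.save("./infer/embeddings_dcu.npy", embeddings)   # assumed output path
```
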
### sentence-transformers inference
```bash
## The HF_ENDPOINT environment variable must be set
export HF_ENDPOINT=https://hf-mirror.com
## model_name_or_path: path to the model weights
python ./infer/infer_sentence_transformers.py
```
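
Similarly, a minimal sketch of what `infer_sentence_transformers.py` might do with the `sentence-transformers` API; the model identifier, prompts, and the `"query: "` prefix convention used for E5 models are assumptions here, and the actual script may differ.

```python
# Hypothetical sentence-transformers inference (illustrative only).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")  # or a local model path
sentences = ["query: 你好,我的名字是", "query: 人工智能的未来是"]  # "query: " prefix recommended for E5
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)  # (2, 1024)
```
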
## Results
```
Prompt: '你好,我的名字是' | Embedding: [0.018951416015625, -0.0121612548828125, -0.042022705078125, -0.03936767578125, 0.007015228271484375, -0.040130615234375, -0.0189361572265625, 0.04925537109375, 0.037322998046875, -0.01776123046875, 0.035614013671875, 0.01861572265625, -0.048248291015625, -0.015716552734375, -0.032745361328125, -0.01061248779296875, ...] (size=1024)

Prompt: '美国总统是' | Embedding: [0.034271240234375, 0.0015573501586914062, -0.04266357421875, -0.0291290283203125, 0.01983642578125, -0.0435791015625, 0.02117919921875, 0.0745849609375, 0.062255859375, -0.002933502197265625, 0.0333251953125, 0.037200927734375, -0.0291748046875, -0.034210205078125, -0.01837158203125, -0.02392578125, ...] (size=1024)

Prompt: '法国的首都是' | Embedding: [0.051971435546875, 0.0068359375, -0.021087646484375, -0.0528564453125, 0.0175018310546875, -0.0198211669921875, 0.0147552490234375, 0.051300048828125, 0.057861328125, -0.017242431640625, 0.0195159912109375, 0.0260162353515625, -0.0477294921875, -0.0278167724609375, -0.04351806640625, -0.0135498046875, ...] (size=1024)

Prompt: '人工智能的未来是' | Embedding: [0.016876220703125, 0.0059814453125, -0.0308074951171875, -0.05712890625, 0.01332855224609375, -0.00024700164794921875, -0.00913238525390625, 0.08123779296875, 0.049835205078125, -0.026123046875, 0.039398193359375, -0.00975799560546875, -0.0128326416015625, -0.021697998046875, -0.033447265625, -0.0147857666015625, ...] (size=1024)

All embeddings saved to: ./infer/embeddings_A800.npy
```

### Accuracy
```
# Before running acc.py, run infer_vllm.py separately on the DCU and on a GPU to produce their respective embedding files
python ./infer/acc.py --gpu_embeddings /path/embeddings_A800.npy --dcu_embeddings /path/embeddings_dcu.npy
```
Result:
```
abs_diff:[[1.52587891e-05 1.52587891e-05 3.05175781e-05 ... 2.67028809e-05
  1.22070312e-04 1.06811523e-04]
 [3.05175781e-05 1.33514404e-05 6.10351562e-05 ... 2.28881836e-05
  1.22070312e-04 3.05175781e-05]
 [3.05175781e-05 3.43322754e-05 9.15527344e-05 ... 0.00000000e+00
  3.05175781e-05 1.22070312e-04]
 [1.52587891e-05 3.05175781e-05 0.00000000e+00 ... 5.34057617e-05
  7.62939453e-05 3.81469727e-05]]
mean_abs_diff:[3.93284135e-05 4.01343568e-05 3.79525591e-05 5.03971823e-05]
```
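
For context, a hypothetical sketch of the comparison `acc.py` might perform: load the two `.npy` files and report the element-wise and per-prompt mean absolute differences. The argument names follow the command above; the actual script may differ.

```python
# Hypothetical DCU-vs-GPU embedding comparison (illustrative only).
import argparse
import numpy as np

parser = argparse.ArgumentParser()
parser.add_argument("--gpu_embeddings", required=True)
parser.add_argument("--dcu_embeddings", required=True)
args = parser.parse_args()

gpu = np.load(args.gpu_embeddings)   # embeddings produced on the GPU (e.g. A800)
dcu = np.load(args.dcu_embeddings)   # embeddings produced on the DCU

abs_diff = np.abs(gpu - dcu)                     # element-wise absolute difference
print("abs_diff:", abs_diff)
print("mean_abs_diff:", abs_diff.mean(axis=1))   # mean absolute difference per prompt
```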

DCU and GPU results are consistent in precision; inference framework: vllm.
## Application Scenarios
### Algorithm Category
`Text understanding`
### Key Application Industries
`Manufacturing, retail, internet`
## Pre-trained Weights
- [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)
## Source Repository and Issue Feedback
- https://developer.sourcefind.cn/codes/modelzoo/multilingual-e5_pytorch
## References
- https://github.com/microsoft/unilm/tree/master/e5