# Jina-Embeddings-V3
## Paper
`jina-embeddings-v3: Multilingual Embeddings With Task LoRA`
- https://arxiv.org/abs/2409.10173
## Model Architecture
jina-embeddings-v3 is built on the XLM-RoBERTa architecture. It integrates Rotary Position Embedding (RoPE) to support long inputs, and attaches task-specific Low-Rank Adaptation (LoRA) adapters for retrieval, classification, and other tasks to produce task-specific embedding vectors.
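For illustration, selecting an adapter at inference time looks roughly like the sketch below, using the Hugging Face `transformers` interface described on the model card (a minimal sketch; the `task` values are the adapter names published there, and extra dependencies such as `einops` may be required):
```python
# Minimal sketch: choosing a task-specific LoRA adapter at encode time
# (based on the jinaai/jina-embeddings-v3 model card; requires trust_remote_code)
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)

# task selects the LoRA adapter, e.g. "retrieval.query", "retrieval.passage",
# "classification", "separation", or "text-matching"
embeddings = model.encode(["Follow the white rabbit."], task="text-matching")
print(embeddings.shape)  # (1, 1024)
```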
## Algorithm
The model is trained in three stages: multilingual masked language model (MLM) pre-training; contrastive fine-tuning on a large corpus of text pairs; and, with the backbone frozen, independent training of each LoRA adapter on task-specific data with task-specific loss functions (e.g., InfoNCE, CoSent), using synthetic data to patch specific retrieval failure cases.
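As a concrete reference point, the InfoNCE objective used in the contrastive stage can be sketched as follows (an illustrative PyTorch snippet with in-batch negatives, not the authors' training code; the temperature value is an assumption):
```python
# Illustrative InfoNCE loss for contrastive fine-tuning on text pairs
# (sketch only; not the authors' training code)
import torch
import torch.nn.functional as F

def info_nce(query_emb, pos_emb, temperature=0.05):
    """query_emb, pos_emb: (batch, dim); other rows serve as in-batch negatives."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature                      # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)   # diagonal entries are positives
    return F.cross_entropy(logits, labels)
```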
## Environment Setup
### Hardware Requirements
DCU model: K100_AI; number of nodes: 1; number of cards: 1.
### Docker (Option 1)
```bash
docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.8.5-ubuntu22.04-dtk25.04-rc7-das1.5-py3.10-20250612-fixpy-rocblas0611-rc2
docker run -it --shm-size 200g --network=host --name {docker_name} --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro {imageID} bash
cd /your_code_path/jina-embeddings-v3_vllm
```
### Dockerfile (Option 2)
Build and run the image from the provided Dockerfile:
```bash
cd docker
docker build --no-cache -t jina-embeddings-v3:latest .
docker run -it --shm-size 200g --network=host --name {docker_name} --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro {imageID} bash
cd /your_code_path/jina-embeddings-v3_vllm
```
### Anaconda (Option 3)
The DCU-specific deep learning libraries required by this project can be downloaded from the [光合](https://developer.sourcefind.cn/tool/) developer community.
```bash
DTK: 25.04
python: 3.10
vllm: 0.8.5
torch: 2.4.1+das.opt2.dtk2504
```
`Tip: the DTK driver, PyTorch, and the other DCU-related tool versions listed above must correspond strictly one-to-one.`
Other (non deep learning) dependencies can be installed as follows:
```bash
# quote the version spec so ">=" is not parsed as shell redirection
pip install "transformers>=4.51.1"
```
## Dataset
Not applicable.
## Training
Not applicable.
## Inference
### vLLM Inference
```bash
# the HF_ENDPOINT environment variable must be set
export HF_ENDPOINT=https://hf-mirror.com
# --model: path to the model weights (model_name_or_path)
python ./infer/infer_vllm.py --model /path/your_model_path/
```
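For reference, the core of an offline embedding run with vLLM looks roughly like the sketch below (an assumption about what `./infer/infer_vllm.py` does, based on vLLM's offline embedding API; the script in this repository is authoritative):
```python
# Minimal sketch of offline embedding with vLLM
# (assumed to mirror ./infer/infer_vllm.py; this repo's script is authoritative)
import argparse
import numpy as np
from vllm import LLM

parser = argparse.ArgumentParser()
parser.add_argument("--model", required=True, help="path to jina-embeddings-v3 weights")
args = parser.parse_args()

prompts = ["Follow the white rabbit.", "Sigue al conejo blanco."]

# task="embed" selects vLLM's pooling/embedding runner;
# jina-embeddings-v3 ships custom code, hence trust_remote_code=True
llm = LLM(model=args.model, task="embed", trust_remote_code=True)
outputs = llm.embed(prompts)

embeddings = np.array([o.outputs.embedding for o in outputs], dtype=np.float32)
np.save("./infer/embeddings_K100_AI.npy", embeddings)
print(embeddings.shape)  # (num_prompts, 1024)
```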
## Result
```
Generated Outputs:
Only text matching task is supported for now. See #16120
------------------------------------------------------------
Prompt: 'Follow the white rabbit.'
Embeddings for text matching: [-0.142578125, -0.050537109375, 0.01336669921875, 0.046142578125, 0.0810546875, 0.03564453125, -0.00091552734375, 0.058837890625, -0.04833984375, -0.032958984375, -0.07275390625, 0.0625, -0.08154296875, 0.0634765625, -0.0849609375, -0.02685546875, ...] (size=1024)
------------------------------------------------------------
Prompt: 'Sigue al conejo blanco.'
Embeddings for text matching: [-0.048828125, -0.04833984375, -0.045166015625, -0.0255126953125, 0.1357421875, -0.0267333984375, -0.0021209716796875, 0.052734375, -0.08837890625, 0.006561279296875, -0.02978515625, 0.0017242431640625, -0.03955078125, 0.08544921875, -0.1181640625, 0.0634765625, ...] (size=1024)
------------------------------------------------------------
Prompt: 'Suis le lapin blanc.'
Embeddings for text matching: [-0.1328125, -0.0458984375, -0.08154296875, 0.0162353515625, 0.07421875, 0.01019287109375, 0.054931640625, 0.031005859375, -0.08837890625, 0.043212890625, 0.0439453125, 0.08154296875, -0.1318359375, -0.0167236328125, -0.0927734375, -0.0111083984375, ...] (size=1024)
------------------------------------------------------------
Prompt: '跟着白兔走。'
Embeddings for text matching: [-0.0213623046875, -0.146484375, 0.0128173828125, 0.0194091796875, 0.138671875, -0.04931640625, -0.10400390625, 0.0849609375, -0.08203125, 0.017578125, -0.030029296875, 0.134765625, -0.0908203125, -0.047119140625, -0.0625, 0.033203125, ...] (size=1024)
------------------------------------------------------------
Prompt: 'اتبع الأرنب الأبيض.'
Embeddings for text matching: [-0.095703125, -0.0478515625, -0.055419921875, 0.020263671875, 0.0712890625, -0.0086669921875, 0.04541015625, 0.038818359375, 0.021484375, 0.034423828125, -0.01019287109375, 0.00885009765625, -0.1015625, 0.04541015625, -0.11474609375, 0.02099609375, ...] (size=1024)
------------------------------------------------------------
Prompt: 'Folge dem weißen Kaninchen.'
Embeddings for text matching: [-0.07421875, -0.06787109375, -0.006988525390625, 0.00023555755615234375, 0.1455078125, 0.00689697265625, 0.0007781982421875, 0.0712890625, -0.138671875, 0.01513671875, -0.055908203125, 0.055908203125, -0.060546875, 0.08984375, -0.10107421875, 0.008544921875, ...] (size=1024)
All embeddings saved to: ./infer/embeddings_K100_AI.npy
```
### Accuracy
```
# before running acc.py, run infer_vllm.py on both the DCU and the GPU to produce their respective embedding files
python ./infer/acc.py --gpu_embeddings ./infer/embeddings_A800.npy --dcu_embeddings ./infer/embeddings_K100_AI.npy
```
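A minimal sketch of the comparison `./infer/acc.py` presumably performs, matching the output below: the element-wise absolute difference between the two embedding matrices and its per-prompt mean (argument names follow the command above; the actual script is authoritative):
```python
# Sketch of the accuracy comparison (assumption: acc.py computes element-wise
# absolute differences between GPU and DCU embeddings and their per-row means)
import argparse
import numpy as np

parser = argparse.ArgumentParser()
parser.add_argument("--gpu_embeddings", required=True)
parser.add_argument("--dcu_embeddings", required=True)
args = parser.parse_args()

gpu = np.load(args.gpu_embeddings)   # shape: (num_prompts, 1024)
dcu = np.load(args.dcu_embeddings)   # shape: (num_prompts, 1024)

abs_diff = np.abs(gpu - dcu)
print("abs_diff:")
print(abs_diff)
print("mean_abs_diff:")
print(abs_diff.mean(axis=1))         # one mean per prompt
```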
Result:
```
abs_diff:
[[0.00097656 0.00048828 0.00036621 ... 0.00024414 0.00079346 0. ]
[0. 0.00024414 0.00024414 ... 0.00036049 0.00012207 0. ]
[0.00097656 0. 0. ... 0.00024414 0.00061035 0.00012207]
[0.00158691 0.00097656 0.00073242 ... 0.00036621 0.00061035 0. ]
[0. 0. 0. ... 0. 0.00018311 0.00097656]
[0. 0.00097656 0.00057983 ... 0.00015259 0. 0.00061035]]
mean_abs_diff:
[0.00028698 0.00033706 0.00036549 0.00031435 0.00039574 0.00033835]
```
The DCU results match the GPU results in accuracy; inference framework: vLLM.
## Application Scenarios
### Algorithm Category
`Text Understanding`
### Key Application Industries
`Manufacturing, Finance, Education`
## Pretrained Weights
- [jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3)
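Since `HF_ENDPOINT` must point at a mirror in this environment (see the inference section), the weights can also be fetched ahead of time, for example with `huggingface_hub` (a minimal sketch; the local directory is arbitrary):
```python
# Minimal sketch: pre-download the weights via huggingface_hub
# (honors the HF_ENDPOINT mirror set in the environment; local_dir is arbitrary)
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="jinaai/jina-embeddings-v3",
    local_dir="/path/your_model_path/",  # pass this path to infer_vllm.py --model
)
```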
## Source Repository & Issue Reporting
- https://developer.sourcefind.cn/codes/modelzoo/jina-embeddings-v3_vllm
## References
- https://github.com/jina-ai