# Geneformer
## Paper
Transfer learning enables predictions in network biology
https://www.nature.com/articles/s41586-023-06139-9



## Model Architecture
![img](./media/image1.png)

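Transfer learning has revolutionized fields such as natural language understanding and computer vision by leveraging deep learning models pretrained on large-scale general datasets, which can then be fine-tuned for a wide range of downstream tasks with limited task-specific data. Here, we developed Geneformer, a context-aware, attention-based deep learning model pretrained on a large-scale corpus of about 30 million single-cell transcriptomes, to enable context-specific predictions in network biology settings where data are limited. During pretraining, Geneformer gained a fundamental understanding of network dynamics, encoding network hierarchy in the model's attention weights in a completely self-supervised manner. Fine-tuning on downstream tasks relevant to chromatin and network dynamics with limited task-specific data showed that Geneformer consistently improved predictive accuracy. Applied to disease modeling with limited patient data, Geneformer identified candidate therapeutic targets for cardiomyopathy. Overall, Geneformer is a pretrained deep learning model that can be fine-tuned for a broad range of downstream applications to accelerate the discovery of key network regulators and candidate therapeutic targets.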



### Algorithm Principles
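Pretrained Geneformer architecture. Each single-cell transcriptome is encoded as a rank value encoding and then passed through six layers of transformer encoder units with an input size of 2048 (which fully represents 93% of the rank value encodings in Genecorpus-30M), 256 embedding dimensions, four attention heads per layer, and a feed-forward size of 512. Geneformer uses full dense self-attention across the input size of 2048. Extractable outputs include contextual gene and cell embeddings, contextual attention weights, and contextual predictions.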
![img](./media/image2.png)

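The rank value encoding described above can be illustrated with a minimal Python sketch. It assumes (a detail not spelled out in this README) that each gene's count is divided by a per-gene normalization factor before ranking. The gene names, counts, and factors are invented, and `rank_value_encode` is a hypothetical helper, not a function of the geneformer package.

```python
from typing import Dict, List

MAX_INPUT_SIZE = 2048  # model input size stated in the text above


def rank_value_encode(
    counts: Dict[str, float],
    norm_factors: Dict[str, float],
    max_len: int = MAX_INPUT_SIZE,
) -> List[str]:
    """Return gene names ordered from highest to lowest normalized expression."""
    scored = []
    for gene, count in counts.items():
        # Undetected genes and genes without a factor are left out (assumption)
        if count <= 0 or gene not in norm_factors:
            continue
        scored.append((count / norm_factors[gene], gene))
    # Sort by normalized value, descending; ties are broken by gene name
    scored.sort(key=lambda item: (-item[0], item[1]))
    # Truncate to the model input size
    return [gene for _, gene in scored[:max_len]]


# Toy cell: all values below are invented for illustration only
cell_counts = {"GENE_A": 10.0, "GENE_B": 4.0, "GENE_C": 0.0, "GENE_D": 6.0}
gene_norm_factors = {"GENE_A": 20.0, "GENE_B": 2.0, "GENE_C": 1.0, "GENE_D": 3.0}

encoding = rank_value_encode(cell_counts, gene_norm_factors)
print(encoding)
```

In this toy cell the gene with the largest raw count (`GENE_A`) ends up last, because its normalization factor is large. Ranking on normalized rather than raw values is what lets a consistently highly expressed gene rank below genes that are more distinctive for the cell.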
## Environment Setup
### Docker (Option 1)
Running in Docker is recommended. Pull the provided image and start a container:
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
docker run -dit --shm-size 80g --network=host --name=geneformer --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /opt/hyhal/:/opt/hyhal/:ro image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10 /bin/bash
docker exec -it geneformer /bin/bash
```

Install the dependencies that the Docker image does not include:

```
pip install -r requirements.txt  -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```



### Dockerfile (Option 2)


```
docker build -t geneformer:latest .
docker run -dit --shm-size 80g --network=host --name=geneformer --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /opt/hyhal/:/opt/hyhal/:ro geneformer:latest /bin/bash
docker exec -it geneformer /bin/bash
```

### Anaconda (Option 3)
1. Create a conda virtual environment:
```
conda create -n geneformer python=3.10
conda activate geneformer 
```
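2. The toolkits and deep learning libraries that this project requires for DCU cards can all be downloaded and installed from the 光合开发者社区 (Guanghe developer community):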
- [DTK 24.04.1](https://cancon.hpccube.com:65024/directlink/1/DTK-24.04.1/Ubuntu20.04.1/DTK-24.04.1-Ubuntu20.04.1-x86_64.tar.gz)
- [PyTorch 2.1](https://cancon.hpccube.com:65024/directlink/4/pytorch/DAS1.2/torch-2.1.0+das.opt1.dtk24042-cp310-cp310-manylinux_2_28_x86_64.whl)


Tip: the versions of the DTK driver, torch, and the other tools listed above must correspond to one another exactly.

3. Install the remaining dependencies from requirements.txt:
```
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```
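The version-matching requirement in step 2 can be partly checked before installing anything. The sketch below is a minimal example. It assumes the wheel follows the standard `name-version-pythontag-abitag-platform.whl` layout with no build tag, and it only extracts the `dtk…` fragment from the version string without interpreting it. `parse_wheel_name` and `matches_interpreter` are hypothetical helpers.

```python
import re
import sys
from typing import Dict, Optional, Tuple

# Wheel filename taken from the PyTorch download link above
WHEEL = "torch-2.1.0+das.opt1.dtk24042-cp310-cp310-manylinux_2_28_x86_64.whl"


def parse_wheel_name(filename: str) -> Dict[str, Optional[str]]:
    """Split a wheel filename into its dash-separated fields."""
    stem = filename[: -len(".whl")] if filename.endswith(".whl") else filename
    name, version, python_tag, abi_tag, platform_tag = stem.split("-", 4)
    dtk = re.search(r"dtk(\d+)", version)
    return {
        "name": name,
        "version": version,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
        # Digits following "dtk" in the local version label, if present
        "dtk_build": dtk.group(1) if dtk else None,
    }


def matches_interpreter(python_tag: str, version: Tuple[int, int]) -> bool:
    """True when a CPython tag such as 'cp310' matches (major, minor)."""
    return python_tag == f"cp{version[0]}{version[1]}"


info = parse_wheel_name(WHEEL)
print(info)
print("Wheel matches this interpreter:",
      matches_interpreter(info["python_tag"], sys.version_info[:2]))
```

Run inside the `geneformer` conda environment created in step 1 (Python 3.10), the last line should report a match. A mismatch means the wheel was built for a different Python version than the active one.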

## Dataset
### Install git-lfs
```
sudo apt-get update
sudo apt-get install git-lfs
```

This project uses the [Genecorpus-30M](http://113.200.138.88:18080/aidatasets/ctheodoris/Genecorpus-30M.git) dataset.

### Download the dataset
```
#git clone https://hf-mirror.com/datasets/ctheodoris/Genecorpus-30M 
mkdir -p /path/to/
cd /path/to
git lfs  clone  http://113.200.138.88:18080/aidatasets/ctheodoris/Genecorpus-30M.git
```
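After cloning, it can be worth confirming that the large files were actually fetched and are not still Git LFS pointer stubs. The sketch below is a minimal example. It assumes the clone sits at the placeholder path `/path/to/Genecorpus-30M`, and `is_lfs_pointer` and `find_lfs_pointers` are hypothetical helpers. It relies on the fact that an LFS pointer is a small text file beginning with `version https://git-lfs.github.com/spec/v1`.

```python
import os
from typing import List

LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: str) -> bool:
    """True when the file starts with the Git LFS pointer header."""
    try:
        with open(path, "rb") as handle:
            return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX
    except OSError:
        return False  # unreadable files are not reported as pointers


def find_lfs_pointers(root: str) -> List[str]:
    """Walk `root` (skipping .git) and list files that are still pointers."""
    pointers = []
    for dirpath, dirnames, filenames in os.walk(root):
        if ".git" in dirnames:
            dirnames.remove(".git")  # do not descend into repository metadata
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if is_lfs_pointer(path):
                pointers.append(path)
    return sorted(pointers)


if __name__ == "__main__":
    repo = "/path/to/Genecorpus-30M"  # placeholder path from the commands above
    if not os.path.isdir(repo):
        print(f"{repo} is not a directory; adjust the path first")
    else:
        leftover = find_lfs_pointers(repo)
        print(f"{len(leftover)} file(s) are still LFS pointers")
        for path in leftover:
            print("  ", path)
```

If any files are listed, running `git lfs pull` inside the clone should fetch their contents.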

## Training
### Download the Geneformer model

Download the [Geneformer](http://113.200.138.88:18080/aimodels/ctheodoris/Geneformer.git) model and install the geneformer package:
 
```
cd /path/to
git lfs clone  -b pr146_branch   http://113.200.138.88:18080/aimodels/ctheodoris/Geneformer.git
cd Geneformer
pip install -e . 
```

### Fine-tuning: cell classification
```
cd geneformer/
python  train_cell.py
```
For details, see Geneformer/examples/cell_classification.ipynb.

## Results
![Alt text](./media/image5.png)

### Accuracy
Test data: cell_type_train_data.dataset; accelerator: k100ai-64G, single-card training.  
Test results:
| Device | Accuracy |
| :------: | :------: |
| k100-ai | 0.991 |
| gpu-a800 | 0.990 |

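For reference, accuracy is the fraction of samples whose predicted label matches the true label. The sketch below computes it on invented labels and then compares the two accuracies reported in the table. The 0.005 tolerance is an arbitrary assumption for illustration, not an acceptance criterion of this project, and the cell type names are made up.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Invented labels, for illustration only
true_labels = ["T cell", "B cell", "T cell", "NK cell"]
pred_labels = ["T cell", "B cell", "NK cell", "NK cell"]
toy_accuracy = accuracy(true_labels, pred_labels)
print(f"toy accuracy: {toy_accuracy:.3f}")

# Accuracies copied from the table above
reported = {"k100-ai": 0.991, "gpu-a800": 0.990}
TOLERANCE = 0.005  # assumed threshold, not defined by this project
gap = abs(reported["k100-ai"] - reported["gpu-a800"])
print(f"gap between devices: {gap:.3f}, within tolerance: {gap <= TOLERANCE}")
```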

## Inference
Not available yet.

## Application Scenarios
### Algorithm Category
AI for Science
### Key Application Industries

Scientific research, gene prediction, healthcare

## Source Repository and Issue Feedback

http://developer.sourcefind.cn/codes/modelzoo/geneformer.git

## References
https://hf-mirror.com/ctheodoris/Geneformer  