# Geneformer
# Paper
Transfer learning enables predictions in network biology
https://www.nature.com/articles/s41586-023-06139-9



# Model architecture
![img](./media/image1.png)

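Transfer learning has transformed fields such as natural language understanding and computer vision: deep learning models are pretrained on large-scale general datasets and then fine-tuned for a wide range of downstream tasks that have only limited task-specific data. Geneformer is a context-aware, attention-based deep learning model pretrained on a large-scale corpus of about 30 million single-cell transcriptomes, so that context-specific predictions can be made in network biology settings where data are limited. During pretraining, Geneformer gained a fundamental understanding of network dynamics, encoding network hierarchy in the model's attention weights in a completely self-supervised manner. Fine-tuning on downstream tasks related to chromatin and network dynamics with limited task-specific data showed that Geneformer consistently improved predictive accuracy. Applied to disease modelling with limited patient data, Geneformer identified candidate therapeutic targets for cardiomyopathy. Overall, Geneformer is a pretrained deep learning model that can be fine-tuned for a broad range of downstream applications to accelerate the discovery of key network regulators and candidate therapeutic targets.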



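# Algorithm
Pretrained Geneformer architecture. Each single-cell transcriptome is encoded as a rank value encoding, which then passes through 6 transformer encoder layers with an input size of 2048 (enough to fully represent 93% of the rank value encodings in Genecorpus-30M), 256 embedding dimensions, four attention heads per layer, and a feed-forward size of 512. Geneformer uses full dense self-attention across the input size of 2048. Extractable outputs include contextual gene and cell embeddings, contextual attention weights, and contextual predictions.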
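
The hyperparameters quoted above can be collected in a small structure to make the derived sizes explicit. This is only an illustrative sketch: the class and field names are hypothetical and are not taken from the Geneformer code base; only the numeric values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneformerShape:
    """Hyperparameters quoted in the text (class and field names are illustrative)."""

    num_layers: int = 6
    max_input_size: int = 2048
    embed_dim: int = 256
    num_heads: int = 4
    ffn_dim: int = 512

    @property
    def head_dim(self) -> int:
        # Each attention head works on an equal slice of the embedding.
        return self.embed_dim // self.num_heads

    @property
    def attention_scores_per_layer(self) -> int:
        # Full dense self-attention: one score per (query, key) pair per head.
        return self.num_heads * self.max_input_size ** 2


shape = GeneformerShape()
print(shape.head_dim)                    # 64
print(shape.attention_scores_per_layer)  # 16777216
```

The quadratic term shows why the input size matters: with dense attention, doubling the input length would quadruple the number of attention scores per layer.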
![img](./media/image2.png)
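
The rank value encoding mentioned above can be illustrated with a minimal sketch. It assumes, following the description in the Geneformer paper, that genes are ordered by their expression in the cell scaled by a per-gene normalization factor, and that the result is truncated to the model input size. The function name, gene names, and factor values are hypothetical; the real tokenizer in the geneformer package maps genes to token IDs and handles details that are omitted here.

```python
def rank_value_encode(expression, norm_factor, max_len=2048):
    """Return gene names ordered by normalized expression, highest first."""
    scored = [
        (count / norm_factor[gene], gene)
        for gene, count in expression.items()
        if count > 0 and gene in norm_factor  # undetected genes are dropped
    ]
    # Sort by descending score; ties are broken by gene name for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    # Truncate to the model input size.
    return [gene for _, gene in scored[:max_len]]


# Toy cell: raw counts per gene and hypothetical per-gene normalization factors.
cell = {"GENE_A": 10, "GENE_B": 4, "GENE_C": 0, "GENE_D": 6}
factors = {"GENE_A": 10.0, "GENE_B": 1.0, "GENE_C": 2.0, "GENE_D": 3.0}
print(rank_value_encode(cell, factors))  # ['GENE_B', 'GENE_D', 'GENE_A']
```

In this toy example GENE_A has the highest raw count but ranks last, because its large normalization factor marks it as highly expressed everywhere and therefore less informative for this particular cell.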


# Environment setup
Docker (Option 1)
Running in Docker is recommended. Pull the provided image and start a container:
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
docker run -dit --shm-size 80g --network=host --name=geneformer --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /opt/hyhal/:/opt/hyhal/:ro image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10 /bin/bash
docker exec -it geneformer /bin/bash
```

Install the dependencies that are not included in the Docker image:

```
pip install -r requirements.txt  -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```



Dockerfile (Option 2)


```
docker build -t geneformer:latest .
docker run -dit --shm-size 80g --network=host --name=geneformer --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /opt/hyhal/:/opt/hyhal/:ro geneformer:latest /bin/bash
docker exec -it geneformer /bin/bash
```



Conda (Option 3)

1. Create a conda virtual environment:

```
conda create -n geneformer python=3.10
conda activate geneformer 
```

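2. The toolkits and deep learning libraries that this project needs for DCU accelerators can be downloaded from the Guanghe (光合) developer community: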
- [DTK 24.04.1](https://cancon.hpccube.com:65024/directlink/1/DTK-24.04.1/Ubuntu20.04.1/DTK-24.04.1-Ubuntu20.04.1-x86_64.tar.gz)
- [Pytorch 2.1](https://cancon.hpccube.com:65024/directlink/4/pytorch/DAS1.2/torch-2.1.0+das.opt1.dtk24042-cp310-cp310-manylinux_2_28_x86_64.whl)


Tip: the versions of the DTK driver, torch, and the other tools listed above must match one another exactly.


3. Install the remaining dependencies from requirements.txt:
```
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```
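
Whichever option is used, a quick check from Python can confirm that the interpreter and PyTorch versions are the expected ones before training. The sketch below is not part of the project; the function name is hypothetical, and it assumes that the DCU build of PyTorch reports devices through the standard `torch.cuda` interface, which should be verified on the target machine.

```python
import importlib.util
import sys


def describe_environment():
    """Collect basic facts about the Python/PyTorch setup without failing."""
    info = {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_version": None,
        "accelerator_visible": None,
    }
    if info["torch_installed"]:
        import torch  # imported lazily so the check still runs without torch

        info["torch_version"] = torch.__version__
        # Assumption: the DCU build exposes devices via the torch.cuda API.
        info["accelerator_visible"] = torch.cuda.is_available()
    return info


env = describe_environment()
for key, value in env.items():
    print(f"{key}: {value}")
```

The printed torch version can then be compared by hand with the wheel and DTK versions listed in this section.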

# Dataset
## Install git-lfs
```
sudo apt-get update
sudo apt-get install git-lfs
```

This project uses the [Genecorpus-30M](http://113.200.138.88:18080/aidatasets/ctheodoris/Genecorpus-30M.git) dataset.

## Download the dataset
```
#git clone https://hf-mirror.com/datasets/ctheodoris/Genecorpus-30M 
mkdir -p /path/to/
cd /path/to
git lfs  clone  http://113.200.138.88:18080/aidatasets/ctheodoris/Genecorpus-30M.git
```
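
If git-lfs was not active during the clone, large files are left behind as small text stubs (LFS pointer files) instead of real data. The following sketch, which is not part of the project, scans a clone for such stubs by looking for the standard first line of an LFS pointer file. The function name is hypothetical, and the directory is the `/path/to` placeholder used in the commands above.

```python
from pathlib import Path

# Git LFS pointer files begin with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(repo_dir):
    """Return files under repo_dir that are still Git LFS pointer stubs."""
    root = Path(repo_dir)
    pointers = []
    for path in sorted(root.rglob("*")):
        # Skip Git's own metadata and anything that is not a regular file.
        if ".git" in path.relative_to(root).parts or not path.is_file():
            continue
        with path.open("rb") as handle:
            if handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                pointers.append(path)
    return pointers


if __name__ == "__main__":
    # Placeholder path taken from the download commands.
    for stub in find_lfs_pointers("/path/to/Genecorpus-30M"):
        print("not downloaded:", stub)
```

An empty result suggests the data files were fetched; any listed file can usually be retrieved by running `git lfs pull` inside the clone.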





# Training

## Download the Geneformer model

Download the [Geneformer](http://113.200.138.88:18080/aimodels/ctheodoris/Geneformer.git) model and install the geneformer package:
 
```
cd /path/to
git lfs clone  -b pr146_branch   http://113.200.138.88:18080/aimodels/ctheodoris/Geneformer.git
cd Geneformer
pip install -e . 
```
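
After the editable install, it can be useful to confirm that Python can locate the package before launching a training run. A minimal sketch follows; the helper name is hypothetical, and the module name `geneformer` is assumed from the repository installed above.

```python
import importlib.util


def is_importable(module_name):
    """Return True when Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# "geneformer" is the module name assumed from the repository installed above.
print("geneformer importable:", is_importable("geneformer"))
```

If this prints `False`, the usual causes are running `pip install -e .` in a different environment or from a directory other than the repository root.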

## Fine-tuning: gene classification
```
cd geneformer/
python  train_cell.py
```
For details, see Geneformer/examples/cell_classification.ipynb.

## Result
![Alt text](./media/image5.png)



# Inference
Not available yet.


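# Application scenarios

## Algorithm category
AI for Science

# Industry
Scientific research

# Key application areas

Scientific research, gene prediction, healthcare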

# Source repository and issue reporting

http://developer.sourcefind.cn/codes/modelzoo/geneformer.git

# References

https://hf-mirror.com/ctheodoris/Geneformer  
root's avatar
root committed
139

root's avatar
root committed
140

wangsen's avatar
wangsen committed
141
142
143