"scripts/clean_training_data/janitor_util.cpp" did not exist on "f79927898ef49d2f8c77196a13c085bd2422dab2"
# Paper
Transfer learning enables predictions in network biology
https://www.nature.com/articles/s41586-023-06139-9



# Model Architecture
![img](./media/image1.png)

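Transfer learning has revolutionized fields such as natural language understanding and computer vision by leveraging deep learning models pretrained on large-scale general datasets, which can then be fine-tuned for a wide range of downstream tasks with limited task-specific data. Here, we developed Geneformer, a context-aware, attention-based deep learning model pretrained on a large-scale corpus of about 30 million single-cell transcriptomes, to enable context-specific predictions in network biology settings where data are limited. During pretraining, Geneformer gained a fundamental understanding of network dynamics, encoding network hierarchy in the model's attention weights in a completely self-supervised manner. Fine-tuning on downstream tasks related to chromatin and network dynamics with limited task-specific data showed that Geneformer consistently improved predictive accuracy. Applied to disease modeling with limited patient data, Geneformer identified candidate therapeutic targets for cardiomyopathy. Overall, Geneformer is a pretrained deep learning model that can be fine-tuned for a broad range of downstream applications to accelerate the discovery of key network regulators and candidate therapeutic targets.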



# Algorithm Principles
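Pretrained Geneformer architecture. Each single-cell transcriptome is encoded as a rank value encoding, which then passes through 6 layers of transformer encoder units with an input size of 2048 (enough to fully represent 93% of the rank value encodings in Genecorpus-30M), 256 embedding dimensions, four attention heads per layer, and a feed-forward size of 512. Geneformer uses full dense self-attention across the input size of 2048. Extractable outputs include contextual gene and cell embeddings, contextual attention weights, and contextual predictions.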
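
To make the numbers above concrete, the following Python sketch collects the stated hyperparameters in a small dataclass and derives two quantities from them. `EncoderShape` and its field names are hypothetical and exist only for illustration. The per-head dimension assumes the embedding is split evenly across heads, as in standard multi-head attention, which the description above does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderShape:
    """Hypothetical container for the hyperparameters quoted in the text."""

    num_layers: int = 6
    input_size: int = 2048
    embedding_dim: int = 256
    heads_per_layer: int = 4
    feed_forward_size: int = 512

    @property
    def head_dim(self) -> int:
        # Assumption: the embedding is divided evenly across attention heads.
        return self.embedding_dim // self.heads_per_layer

    @property
    def attention_scores_per_layer(self) -> int:
        # Full dense self-attention scores every token pair, once per head.
        return self.heads_per_layer * self.input_size ** 2


shape = EncoderShape()
print(shape.head_dim)                    # 64
print(shape.attention_scores_per_layer)  # 16777216
```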
![img](./media/image2.png)
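
The rank value encoding step can be sketched as follows. This is a simplified illustration, not the project's tokenizer. The function name, the `gene_medians` lookup (a per-gene normalization factor, assumed here to be each gene's nonzero median expression across the pretraining corpus), the `target_sum` scaling, and the alphabetical tie-break are all assumptions. Each gene's count is normalized by the cell's total counts and by the per-gene factor, genes are sorted in descending order of the result, and the list is truncated to the 2048-token input size.

```python
def rank_value_encode(counts, gene_medians, max_len=2048, target_sum=10_000):
    """Return gene IDs ordered by normalized expression, highest first.

    counts:       dict of gene ID -> raw count for one cell
    gene_medians: dict of gene ID -> normalization factor (assumed to be
                  the gene's nonzero median expression across the corpus)
    """
    total = sum(counts.values())
    if total == 0:
        return []

    scored = []
    for gene, count in counts.items():
        # Skip undetected genes and genes without a normalization factor.
        if count <= 0 or gene not in gene_medians:
            continue
        normalized = count / total * target_sum / gene_medians[gene]
        scored.append((gene, normalized))

    # Highest normalized expression first; ties broken by gene ID so the
    # output is deterministic (an illustrative choice).
    scored.sort(key=lambda item: (-item[1], item[0]))
    return [gene for gene, _ in scored[:max_len]]


cell = {"A": 10, "B": 5, "C": 0, "D": 5}
medians = {"A": 10.0, "B": 1.0, "D": 2.0}
print(rank_value_encode(cell, medians))  # ['B', 'D', 'A']
```

In this toy cell, gene `A` has the highest raw count but ranks last, because its normalization factor is large. Dividing by a corpus-wide factor pushes down genes that are highly expressed everywhere.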

# Environment Setup
## Docker (Option 1)
Running in Docker is recommended. Pull the provided image and start a container:
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
docker run -dit --shm-size 80g --network=host --name=geneformer --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /opt/hyhal/:/opt/hyhal/:ro image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10 /bin/bash
docker exec -it geneformer /bin/bash
```

Install the dependencies that are not included in the Docker image:

```
pip install -r requirements.txt  -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```



## Dockerfile (Option 2)


```
docker build -t geneformer:latest .
docker run -dit --shm-size 80g --network=host --name=geneformer --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /opt/hyhal/:/opt/hyhal/:ro geneformer:latest /bin/bash
docker exec -it geneformer /bin/bash
```

## Conda (Option 3)

1. Create a conda virtual environment:

```
conda create -n geneformer python=3.10
conda activate geneformer 
```

2. The toolkits and deep learning libraries that this project needs for DCU cards can all be downloaded from the Guanghe (光合) Developer Community:
- [DTK 24.04.1](https://cancon.hpccube.com:65024/directlink/1/DTK-24.04.1/Ubuntu20.04.1/DTK-24.04.1-Ubuntu20.04.1-x86_64.tar.gz)
- [Pytorch 2.1](https://cancon.hpccube.com:65024/directlink/4/pytorch/DAS1.2/torch-2.1.0+das.opt1.dtk24042-cp310-cp310-manylinux_2_28_x86_64.whl)


Tip: the versions of the DTK driver, torch, and the other tools listed above must match each other exactly.

3. Install the remaining dependencies from requirements.txt:
```
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```
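
After installation, a short Python check like the one below can confirm which interpreter and torch build are active. It is a sketch under one assumption: that the DCU build of PyTorch reports devices through the usual `torch.cuda` interface. The helper name `describe_environment` is hypothetical. If torch is not installed, the function reports `None` instead of failing.

```python
import importlib.util
import sys


def describe_environment():
    """Summarize the Python version, torch version, and device visibility."""
    info = {
        "python": "%d.%d" % sys.version_info[:2],
        "torch": None,
        "device_available": None,
    }
    # Import torch only if it is installed, so the check never crashes.
    if importlib.util.find_spec("torch") is not None:
        import torch

        info["torch"] = torch.__version__
        # Assumption: the accelerator is exposed through torch.cuda.
        info["device_available"] = torch.cuda.is_available()
    return info


print(describe_environment())
```

With the setup described above, the reported values would be expected to show Python 3.10 and a torch 2.1.0 build. A `device_available` value of `False` usually points to a driver or version mismatch.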

# Download
## Install git-lfs
```
sudo apt-get update
sudo apt-get install git-lfs
```


## Dataset Download
```
#git clone https://hf-mirror.com/datasets/ctheodoris/Genecorpus-30M 
mkdir -p /path/to/
cd /path/to
git lfs clone http://113.200.138.88:18080/aimodels/ctheodoris/Geneformer.git
```

## Geneformer Model Download

Download the model and install geneformer:
 
```
cd /path/to
git lfs clone -b pr146_branch http://113.200.138.88:18080/aimodels/ctheodoris/Geneformer.git
cd Geneformer
pip install -e .
```
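
If git-lfs was not set up correctly before cloning, large files such as model weights remain as small text pointer stubs instead of the real content. The Python sketch below scans a checkout for such stubs by looking for the standard pointer header. The function name `find_lfs_pointers` is hypothetical.

```python
from pathlib import Path

# Every git-lfs pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(root):
    """Return files under root that are still git-lfs pointer stubs."""
    root = Path(root)
    pointers = []
    for path in sorted(root.rglob("*")):
        # Ignore git's own metadata and anything that is not a regular file.
        if ".git" in path.relative_to(root).parts or not path.is_file():
            continue
        with path.open("rb") as handle:
            if handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                pointers.append(path)
    return pointers
```

Calling `find_lfs_pointers("/path/to/Geneformer")` after the clone should return an empty list. If it returns any paths, running `git lfs pull` inside the repository fetches the real files.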



# Model Training
Run cell classification on a single card:
```
cd geneformer/
python train_cell.py
```
For details, see Geneformer/examples/cell_classification.ipynb.


# Inference
Not available yet.


# Source Repository and Issue Feedback

```
http://developer.sourcefind.cn/codes/modelzoo/geneformer.git
```


# References

```
https://hf-mirror.com/ctheodoris/Geneformer  
```
