# Paper
Transfer learning enables predictions in network biology
https://www.nature.com/articles/s41586-023-06139-9



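# Model Architecture

Transfer learning has transformed fields such as natural language understanding and computer vision by using deep learning models pretrained on large-scale general datasets, which can then be fine-tuned for a wide range of downstream tasks that have limited task-specific data. Geneformer is a context-aware, attention-based deep learning model pretrained on a large-scale corpus of about 30 million single-cell transcriptomes, so that it can make context-specific predictions in network biology settings where data are limited. During pretraining, Geneformer gained a fundamental understanding of network dynamics, encoding network hierarchy in the model's attention weights in a completely self-supervised manner. Fine-tuning on downstream tasks relevant to chromatin and network dynamics with limited task-specific data showed that Geneformer consistently improved predictive accuracy. Applied to disease modelling with limited patient data, Geneformer identified candidate therapeutic targets for cardiomyopathy. Overall, Geneformer is a pretrained deep learning model that can be fine-tuned for a broad range of downstream applications to accelerate the discovery of key network regulators and candidate therapeutic targets.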



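# Algorithm
Pretrained Geneformer architecture. Each single-cell transcriptome is encoded as a rank value encoding and then passed through six transformer encoder units with an input size of 2,048 (which fully represents 93% of the rank value encodings in Genecorpus-30M), 256 embedding dimensions, four attention heads per layer, and a feed-forward size of 512. Geneformer uses full dense self-attention across the input size of 2,048. Extractable outputs include contextual gene and cell embeddings, contextual attention weights, and contextual predictions.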
![Alt text](image.png)
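
The rank value encoding described above can be illustrated with a small sketch. It assumes, following the Geneformer paper, that each gene's expression is divided by a per-gene scaling factor before the genes are ordered from highest to lowest and truncated to the input size. The function name, gene names, scaling factors, and token IDs below are hypothetical and are not taken from the Geneformer code base.

```python
from typing import Dict, List

MAX_INPUT_SIZE = 2048  # input size quoted in the architecture description


def rank_value_encode(
    expression: Dict[str, float],
    scaling_factor: Dict[str, float],
    token_ids: Dict[str, int],
    max_len: int = MAX_INPUT_SIZE,
) -> List[int]:
    """Return token IDs of expressed genes, highest normalized expression first."""
    scored = []
    for gene, value in expression.items():
        # Skip genes that are not expressed or not in the (hypothetical) vocabulary.
        if value <= 0 or gene not in token_ids:
            continue
        # Assumption: normalize by a per-gene factor; default to 1.0 if unknown.
        scored.append((value / scaling_factor.get(gene, 1.0), gene))
    # Descending by normalized value; ties broken by gene name for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    # Truncate to the model input size.
    return [token_ids[gene] for _, gene in scored[:max_len]]


# Toy cell with made-up genes, scaling factors, and token IDs.
cell = {"GENE_A": 10.0, "GENE_B": 3.0, "GENE_C": 0.0, "GENE_D": 8.0}
scale = {"GENE_A": 20.0, "GENE_B": 1.5, "GENE_D": 2.0}
vocab = {"GENE_A": 2, "GENE_B": 3, "GENE_C": 4, "GENE_D": 5}
print(rank_value_encode(cell, scale, vocab))  # [5, 3, 2]
```

In this toy cell, GENE_A has the highest raw count but ranks last after normalization, which is the point of ranking on scaled values rather than raw counts.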

# Environment Setup
## Docker (Option 1)
Running in Docker is recommended. Pull the prebuilt image and start a container:
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
docker run -dit --shm-size 80g --network=host --name=geneformer --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /opt/hyhal/:/opt/hyhal/:ro image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10 /bin/bash
docker exec -it geneformer /bin/bash
```

Install the dependencies that are not included in the image:

```
pip install -r requirements.txt  -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```



## Dockerfile (Option 2)


```
docker build -t geneformer:latest .
docker run -dit --shm-size 80g --network=host --name=geneformer --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /opt/hyhal/:/opt/hyhal/:ro geneformer:latest /bin/bash
docker exec -it geneformer /bin/bash
```

## Conda (Option 3)

1. Create a conda virtual environment:
```
conda create -n geneformer python=3.10
conda activate geneformer 
```

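2. The toolkits and deep learning libraries that this project needs for DCU accelerator cards can be downloaded and installed from the 光合 (Guanghe) developer community: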
- [DTK 24.04.1](https://cancon.hpccube.com:65024/directlink/1/DTK-24.04.1/Ubuntu20.04.1/DTK-24.04.1-Ubuntu20.04.1-x86_64.tar.gz)
- [Pytorch 2.1](https://cancon.hpccube.com:65024/directlink/4/pytorch/DAS1.2/torch-2.1.0+das.opt1.dtk24042-cp310-cp310-manylinux_2_28_x86_64.whl)


Tip: the versions of the DTK driver, Python, deepspeed, and the other tools listed above must correspond to one another exactly.

3. Install the remaining dependencies from requirements.txt:
```
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```
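
Because the tool versions have to match, it can help to print what is actually installed once the environment is set up. The following is a minimal sketch using only the Python standard library. The helper name `installed_versions` and the choice of packages to query (`torch`, `deepspeed`) are assumptions for illustration, and the script only reports versions without judging whether they are compatible.

```python
import sys
from importlib import metadata
from typing import Dict, Optional, Sequence


def installed_versions(packages: Sequence[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Python itself is part of the version set that has to match.
    print("python", sys.version.split()[0])
    # Package names here are assumptions; adjust to what requirements.txt lists.
    for name, version in installed_versions(["torch", "deepspeed"]).items():
        print(name, version if version is not None else "NOT INSTALLED")
```

Comparing the printed values against the DTK and PyTorch builds linked above is a quick way to spot a mismatch before training fails.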

# Downloads
## Install git-lfs
```
sudo apt-get update
sudo apt-get install git-lfs
```


## Download the dataset
```
# Replace /path/to with the directory where the dataset should be stored
mkdir -p /path/to/
cd /path/to
git clone https://hf-mirror.com/datasets/ctheodoris/Genecorpus-30M
```
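
If git-lfs was not set up before cloning, large files in the checkout are left as small text pointer stubs instead of real data. The sketch below looks for such stubs by checking whether a file starts with the git-lfs pointer header line. The function name `find_lfs_pointers` is hypothetical, and the check reads only the first bytes of each file.

```python
from pathlib import Path
from typing import List

# First line of every git-lfs pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(root: str) -> List[Path]:
    """Return files under root that are still git-lfs pointer stubs."""
    base = Path(root)
    pointers: List[Path] = []
    for path in sorted(base.rglob("*")):
        # Ignore directories and anything inside the repository's .git folder.
        if not path.is_file() or ".git" in path.relative_to(base).parts:
            continue
        with path.open("rb") as handle:
            if handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                pointers.append(path)
    return pointers
```

Calling `find_lfs_pointers("/path/to/Genecorpus-30M")` should return an empty list when every large file has been downloaded; any paths it returns are files that still need to be fetched.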


## Download the model



### Download the Geneformer model

Download the model and install the geneformer package:

```
cd /path/to
git clone -b pr146_branch https://hf-mirror.com/ctheodoris/Geneformer
cd Geneformer
# Editable install of the package from the cloned repository
pip install -e .
```
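
A quick way to confirm that the install step worked is to ask Python whether the package can be located. This is a minimal sketch, assuming the installed module is named `geneformer` (matching the `geneformer/` directory used in the training step); the helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if module_name can be located on the current import path."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: the package installs under the module name "geneformer".
print("geneformer importable:", is_installed("geneformer"))
```

Run the check from a directory other than the cloned repository. Inside the repository, the local `geneformer/` folder can be found on the import path even when the package is not installed.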





# Training

Run gene classification on a single card:
```
cd geneformer/
python  train.py
```
For details, see Geneformer/examples/cell_classification.ipynb.
```
python train_cell.py    # first replace the dataset path in the .py file
```


# References
https://hf-mirror.com/ctheodoris/Geneformer