# Uni-Fold

## Paper
`Uni-Fold: An Open-Source Platform for Developing Protein Folding Models beyond AlphaFold`
- https://www.biorxiv.org/content/biorxiv/early/2022/08/06/2022.08.04.502811.full.pdf

## Model Architecture
The core of the model is a Transformer-based neural network with two main components, a Sequence to Sequence Model and a Structure Model. The two components are optimized through iterative training to improve prediction accuracy.

![img](./alphafold2.png)

## Algorithm Principle
The model extracts information from protein sequence and structure data and uses a neural network to predict the three-dimensional structure of a protein.

![img](./alphafold2_1.png)

## Environment Setup
A Docker image for training can be pulled from [SourceFind (光源)](https://www.sourcefind.cn/#/service-details):

```
# Pull the prebuilt image
docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:unifold-torch2.1.0-dtk24.04.2-ubuntu20.04-py310

# Start a container with the Uni-Fold code mounted
docker run -it --network=host -v <host path to the unifold code>:<container path to the unifold code> --shm-size=32G --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v /opt/hyhal:/opt/hyhal:ro -v /opt/dtk-24.04.2-runtime/:/opt/dtk/:ro --name <container name> <image ID obtained above> bash

# Inside the container: move the run scripts into the test directory
mv -f <container path to the unifold code>/run_monomer.sh  /root/Uni-Fold/run_monomer.sh
mv -f <container path to the unifold code>/run_multimer.sh /root/Uni-Fold/run_multimer.sh
cd /root/Uni-Fold
```
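
As a rough sketch, the `docker run` invocation can be assembled by a small helper that optionally adds a mount of the AF2 dataset at `/alphafold`, which homosearch needs. The helper name `build_docker_cmd`, the container path `/workspace/unifold`, and the container name `unifold` are assumptions made for illustration, not part of the image. The helper only prints the command so it can be reviewed before running it.

```shell
# Hypothetical helper: prints a `docker run` command, does not execute it.
# Paths containing spaces would need extra quoting.
build_docker_cmd() {
    code_dir=$1      # host path to the Uni-Fold code
    image=$2         # image ID or tag obtained from `docker pull`
    af2_dir=${3:-}   # optional host path to the AF2 dataset

    cmd="docker run -it --network=host"
    cmd="$cmd -v $code_dir:/workspace/unifold"   # container path is an assumption
    cmd="$cmd --shm-size=32G --privileged=true"
    cmd="$cmd --device=/dev/kfd --device=/dev/dri/ --group-add video"
    cmd="$cmd -v /opt/hyhal:/opt/hyhal:ro"
    cmd="$cmd -v /opt/dtk-24.04.2-runtime/:/opt/dtk/:ro"
    if [ -n "$af2_dir" ]; then
        # Only needed when homosearch will be run
        cmd="$cmd -v $af2_dir:/alphafold"
    fi
    cmd="$cmd --name unifold $image bash"
    printf '%s\n' "$cmd"
}

# Example: print the command with the AF2 dataset mounted
build_docker_cmd /data/Uni-Fold IMAGE_ID /data/alphafold
```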

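Running homosearch requires mounting the AF2 dataset at `/alphafold` inside the container. For testing only, this image does not need the AF2 dataset mounted.

This image was tested on a Kunshan K100_AI eco-card node where dtk24.04.2 did not work properly, so dtk-24.04.2-runtime was mounted at `/opt/dtk` during testing. If dtk24.04.2 works on your node, you can skip this mount and change `/opt/dtk` to `/opt/dtk-24.04.2` in `/root/env.sh` and in the files under the Uni-Fold directory.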
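
The path change from `/opt/dtk` to `/opt/dtk-24.04.2` could be scripted roughly as follows. This is a minimal sketch, not a tool shipped with the image: `retarget_dtk` is a hypothetical helper, and the regular expression assumes that any `/opt/dtk` followed by a letter, digit, `-`, `_`, or `.` is a different path that should be left alone. A `.bak` copy of each file is kept.

```shell
# Hypothetical helper: rewrite /opt/dtk to /opt/dtk-24.04.2 in one file.
# Paths that already carry a suffix (e.g. /opt/dtk-24.04.2) are not touched.
retarget_dtk() {
    file=$1
    cp "$file" "$file.bak"   # keep a backup of the original
    sed -E 's#/opt/dtk([^-A-Za-z0-9_.]|$)#/opt/dtk-24.04.2\1#g' \
        "$file.bak" > "$file"
}
```

For example, `retarget_dtk /root/env.sh` would update the environment script, and `grep -rl /opt/dtk /root/Uni-Fold` lists the other files that may need the same treatment.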

Image version dependencies:
* DTK driver: dtk24.04.2
* PyTorch: 2.1.0
* unifold: 2.2.0
* unicore: 0.0.1
* python: 3.10

Test directory:
`/root/Uni-Fold`

The tools listed in requirement.txt are already installed in the image. Load them with:

```
source /root/env.sh
```

Errors such as `rm: cannot remove 'software': No such file or directory` can be ignored; the `rm` commands in the script only try to delete two empty folders.

## Dataset
We recommend the open-source datasets used by [AlphaFold2](http://113.200.138.88:18080/aidatasets/project-dependency/alphafold), including BFD, MGnify, PDB70, Uniclust, UniRef90, and others. The total size is about 2.62 TB. The dataset layout is as follows:
```
$DOWNLOAD_DIR/
    bfd/
        bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
        bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
        bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
        ...
    mgnify/
        mgy_clusters_2022_05.fa
    params/
        params_model_1.npz
        params_model_2.npz
        params_model_3.npz
        ...
    pdb70/
        pdb_filter.dat
        pdb70_hhm.ffindex
        pdb70_hhm.ffdata
        ...
    pdb_mmcif/
        mmcif_files/
            100d.cif
            101d.cif
            101m.cif
            ...
        obsolete.dat
    pdb_seqres/
        pdb_seqres.txt
    small_bfd/
        bfd-first_non_consensus_sequences.fasta
    uniref30/
        UniRef30_2021_03_hhm.ffindex
        UniRef30_2021_03_hhm.ffdata
        UniRef30_2021_03_cs219.ffindex
        ...
    uniprot/
        uniprot.fasta
    uniref90/
        uniref90.fasta
```
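
Before starting a long run, the top-level layout can be checked with a few lines of shell. The sketch below is an illustration only: `check_af2_layout` is a hypothetical helper, and it simply requires every top-level directory named in the listing, without inspecting the files inside them.

```shell
# Hypothetical helper: report top-level dataset directories that are missing.
# Returns success only when every directory from the listing is present.
check_af2_layout() {
    root=$1
    missing=0
    for d in bfd mgnify params pdb70 pdb_mmcif pdb_seqres \
             small_bfd uniref30 uniprot uniref90; do
        if [ ! -d "$root/$d" ]; then
            echo "missing: $root/$d"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Example: DOWNLOAD_DIR falls back to a placeholder path when unset
check_af2_layout "${DOWNLOAD_DIR:-/path/to/database/directory}" \
    || echo "dataset layout is incomplete"
```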

The script `download_all_data.sh` is provided to download the datasets and model files:

```
bash scripts/download/download_all_data.sh /path/to/database/directory
```

## Inference
### Installation
#### Install Uni-Core-main (not needed when using the image)
```
cd Uni-Core-main

python3 setup.py install
```

#### Install Uni-Fold-main (not needed when using the image)
```
pip install -e .
```

### Multi-Card Testing

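If the AF2 dataset is not mounted, the multimer and monomer test scripts report `ValueError: Could not find hmmsearch database /alphafold/pdb_seqres/pdb_seqres.txt` during homosearch, stop searching, and proceed directly to inference. Because this image ships with precomputed search results for some sequences, the error can be ignored and inference runs normally.
The built-in search results are in `/root/Uni-Fold/data` and cover H1036, H1072, and T1024. To run inference on other sequences, mount the AF2 dataset and run homosearch (very time-consuming).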
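
Whether a given target can be run without the AF2 dataset can be checked up front. The following is a sketch under assumptions: `search_mode` is a hypothetical helper, and it assumes that each built-in result appears as an entry named after the target (for example `T1024`) directly under `/root/Uni-Fold/data`, which should be verified against the actual directory contents.

```shell
# Hypothetical helper: decide how features for a target can be obtained.
#   homosearch   the hmmsearch database from the error message exists
#   builtin      a precomputed result for the target exists (layout assumed)
#   unavailable  neither is present
search_mode() {
    target=$1                            # e.g. T1024
    af2_root=${2:-/alphafold}            # AF2 dataset mount point
    data_dir=${3:-/root/Uni-Fold/data}   # built-in search results
    if [ -f "$af2_root/pdb_seqres/pdb_seqres.txt" ]; then
        echo homosearch
    elif [ -e "$data_dir/$target" ]; then
        echo builtin
    else
        echo unavailable
    fi
}

# Example: check one of the targets shipped with the image
search_mode T1024
```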

#### Multimer reference script (adjust the path settings to your environment)
```
bash run_multimer.sh
```

#### Monomer reference script (adjust the path settings to your environment)
```
bash run_monomer.sh
```

## Results
![img](./result_pdb.png)

### Accuracy


## Application Scenarios
### Algorithm Category
Protein structure prediction

### Key Application Industries
Healthcare, scientific research, education

## Source Repository and Issue Reporting
- https://developer.hpccube.com/codes/modelzoo/uni-fold

## References
- https://github.com/dptech-corp/Uni-Fold