# AF2

## Paper
- [https://www.nature.com/articles/s41586-021-03819-2](https://www.nature.com/articles/s41586-021-03819-2)

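## Model architecture
The core of the model is a Transformer-based neural network with two main components, described here as a Sequence to Sequence Model and a Structure Model (the paper calls them the Evoformer and the Structure Module). The two components are optimized through iterative training to improve prediction accuracy.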

![img](./docs/alphafold2.png)

## Algorithm
AlphaFold2 extracts information from protein sequence and structure data and uses a neural network to predict the three-dimensional structure of a protein.

![img](./docs/alphafold2_1.png)

## Environment setup

### Docker

    # With this method you do not need to download this repository: the image already contains runnable code, so you only need to mount the data directories

    docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:alphafold2-dtk24.04.1-py310

    docker run --shm-size 100g --network=host --name=alphafold2 --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <local data path>:<container data path> -v /opt/hyhal:/opt/hyhal:ro -it <your IMAGE ID> bash

    cd /app/alphafold2/dcu_build

    # This step takes a while
    bash build.sh

    source env.sh

    export PATH=/app/softwares/hh-suite/build/bin:/app/softwares/hh-suite/build/scripts:$PATH

### Dockerfile

    # Download this repository first

    docker build -t alphafold2:v1 .

    docker run --shm-size 100g --network=host --name=alphafold2 --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <local data path>:<container data path> -v /opt/hyhal:/opt/hyhal:ro -it <your IMAGE ID> bash

    cd /app/alphafold2/dcu_build

    # This step takes a while
    bash build.sh

    source env.sh

## Dataset
We recommend the open datasets used by AlphaFold2, including BFD, MGnify, PDB70, Uniclust, and UniRef90. Together they take up about 2.62 TB. The expected directory layout is:
```
$DOWNLOAD_DIR/                             
    bfd/  
        bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
        bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata 
        bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex                           
        ...
    mgnify/                                
        mgy_clusters_2022_05.fa
    params/                                
        params_model_1.npz
        params_model_2.npz
        params_model_3.npz
        ...
    pdb70/                                
        pdb_filter.dat
        pdb70_hhm.ffindex
        pdb70_hhm.ffdata
        ...
    pdb_mmcif/                            
        mmcif_files/
            100d.cif
            101d.cif
            101m.cif
            ...
        obsolete.dat
    pdb_seqres/                            
        pdb_seqres.txt
    small_bfd/                           
        bfd-first_non_consensus_sequences.fasta
    uniref30/                            
        UniRef30_2021_03_hhm.ffindex
        UniRef30_2021_03_hhm.ffdata
        UniRef30_2021_03_cs219.ffindex
        ...
    uniprot/                               
        uniprot.fasta
    uniref90/                             
        uniref90.fasta
```
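
Before starting a long inference run, it can be worth checking that the download directory matches the layout above. The following is a minimal sketch and not part of this repository: the function name `missing_dataset_dirs` is hypothetical, and the directory list is simply copied from the tree shown above, so it may be stricter than a particular run requires.

```python
from pathlib import Path

# Top-level directories copied from the layout shown above.
EXPECTED_DIRS = [
    "bfd", "mgnify", "params", "pdb70", "pdb_mmcif",
    "pdb_seqres", "small_bfd", "uniref30", "uniprot", "uniref90",
]


def missing_dataset_dirs(download_dir):
    """Return the expected sub-directories that are absent from download_dir.

    Hypothetical helper: it only checks that each directory exists,
    not that the files inside are complete.
    """
    root = Path(download_dir)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

Calling `missing_dataset_dirs("/path/to/download_dir")` returns an empty list when every listed directory is present.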

The script `download_all_data.sh` downloads the datasets and model files used here:

    ./scripts/download_all_data.sh <dataset download directory>



## Inference

Note: adjust the parameters in the corresponding script before running it.

JAX-based inference scripts are provided for monomers and for multimers.
```bash
    # Enter the project directory
    cd /app/alphafold2
```

### Monomer
```bash
    bash run_monomer.sh
```
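Monomer inference parameters:

- `download_dir`: the dataset download directory.
- `monomer.fasta`: the monomer sequence to predict.
- `--output_dir`: the output directory.
- `model_names`: the names of the models used for inference.
- `--model_preset=monomer`: selects the monomer model configuration.
- `--run_relax=true`: runs the relax step.
- `--use_gpu_relax=true`: runs relax on the GPU (faster, but possibly less stable); `--use_gpu_relax=false` runs it on the CPU (slower, but stable).
- `--use_precomputed_msas=true`: loads existing MSAs; without it, the MSA tools are run by default.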
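
The flags above are set inside `run_monomer.sh`. As a rough illustration of how they combine, the sketch below assembles them into an argument list. It is an assumption-laden example rather than the script's actual contents: the helper name `build_inference_flags` and the path `/tmp/af2_out` are hypothetical, and only flags named in this README are covered.

```python
import shlex


def build_inference_flags(output_dir, model_preset="monomer", run_relax=True,
                          use_gpu_relax=False, use_precomputed_msas=False):
    """Assemble the flags described above into a list of arguments.

    Hypothetical helper: it only covers flags named in this README.
    """
    def as_flag(value):
        return "true" if value else "false"

    flags = [
        f"--output_dir={output_dir}",
        f"--model_preset={model_preset}",
        f"--run_relax={as_flag(run_relax)}",
        f"--use_gpu_relax={as_flag(use_gpu_relax)}",
    ]
    if use_precomputed_msas:
        # Reuse MSAs from an earlier run instead of rerunning the MSA tools.
        flags.append("--use_precomputed_msas=true")
    return flags


# CPU relax (slower but stable), reusing existing MSAs.
# "/tmp/af2_out" is a placeholder output directory.
flags = build_inference_flags("/tmp/af2_out", use_precomputed_msas=True)
print(shlex.join(flags))
```

Keeping `use_gpu_relax` off by default mirrors the trade-off stated above: the CPU path is slower but more stable.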

### Multimer
```bash
    bash run_multimer.sh
```
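Multimer inference parameters: `multimer.fasta` is the multimer sequence file to predict, `--model_preset=multimer` selects the multimer model configuration, and `--num_multimer_predictions_per_model` sets the number of predictions per model. All other parameters are the same as for monomer inference.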
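
Whether an input belongs with the monomer or the multimer script can usually be read off the FASTA file itself. The sketch below assumes that a multimer input holds one FASTA record per chain, while a monomer input holds a single record. The function names `read_fasta` and `suggest_preset` are hypothetical and not part of this repository.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (description, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def suggest_preset(text):
    """Suggest `monomer` for a single record and `multimer` for several."""
    return "multimer" if len(read_fasta(text)) > 1 else "monomer"
```

For example, `suggest_preset(open("multimer.fasta").read())` would return `"multimer"` for a file with two or more records.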

## Result
The `--output_dir` directory is structured as follows:
```
<target_name>/
    features.pkl
    ranked_{0,1,2,3,4}.pdb
    ranking_debug.json
    relaxed_model_{1,2,3,4,5}.pdb
    result_model_{1,2,3,4,5}.pkl
    timings.json
    unrelaxed_model_{1,2,3,4,5}.pdb
    msas/
        bfd_uniclust_hits.a3m
        mgnify_hits.sto
        uniref90_hits.sto
        ...
```
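
Of the files listed above, `ranking_debug.json` holds the per-model confidence values that this README later uses as accuracy figures (`plddts` for monomers, `iptm+ptm` for multimers). A minimal sketch for reading it follows. The function name `load_ranking` is hypothetical, and the presence of an `order` list in the file is an assumption, so the code falls back to sorting by score when it is missing.

```python
import json
from pathlib import Path


def load_ranking(target_dir):
    """Return (metric_name, scores, order) from <target_dir>/ranking_debug.json."""
    data = json.loads((Path(target_dir) / "ranking_debug.json").read_text())
    # Monomer runs store `plddts`; multimer runs store `iptm+ptm`.
    # A KeyError here means the file contains neither key.
    metric = "plddts" if "plddts" in data else "iptm+ptm"
    scores = data[metric]
    # Assumption: an `order` list of model names may be present.
    # If it is not, rank the models by descending score instead.
    order = data.get("order") or sorted(scores, key=scores.get, reverse=True)
    return metric, scores, order
```

The first entry of `order` names the highest-ranked model, and `scores[order[0]]` is its confidence value.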

[View the protein 3D structure](https://www.pdbus.org/3d-view)

ID: 8U23

The predicted structure is shown in blue and the experimental structure in yellow.

![alt text](image.png)

### Accuracy
Test data: [casp15](https://www.predictioncenter.org/casp15/targetlist.cgi), [uniprot](https://www.uniprot.org/)

Accelerator used: 1 × Z100L-32G

1. plddts/iptm+ptm: for monomers, see `plddts` in `<target_name>/ranking_debug.json`; for multimers, see `iptm+ptm` in the same file.

2. Other accuracy metrics are computed with TM-score: [https://zhanggroup.org/TM-score/](https://zhanggroup.org/TM-score/)
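
To make the GDT-TS, GDT-HA, and RMSD columns reported below easier to interpret, here is a simplified sketch of how such values are defined. It is not the TM-score program and will not reproduce its numbers exactly: it assumes the predicted and reference CA coordinates are already superimposed and matched residue by residue, whereas real GDT scoring searches over superpositions. All function names are hypothetical. Scores are on a 0 to 1 scale, as in the table.

```python
import math

GDT_TS_CUTOFFS = (1.0, 2.0, 4.0, 8.0)  # distance cutoffs in angstroms
GDT_HA_CUTOFFS = (0.5, 1.0, 2.0, 4.0)  # stricter "high accuracy" cutoffs


def ca_distances(pred, ref):
    """Per-residue distances between two already superimposed CA traces."""
    if len(pred) != len(ref):
        raise ValueError("traces must have the same number of residues")
    return [math.dist(p, r) for p, r in zip(pred, ref)]


def gdt(distances, cutoffs):
    """Mean over the cutoffs of the fraction of residues within each cutoff."""
    n = len(distances)
    return sum(sum(d <= c for d in distances) / n for c in cutoffs) / len(cutoffs)


def rmsd(distances):
    """Root-mean-square deviation from per-residue distances."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


# Toy example with four residues (distances in angstroms).
toy = [0.4, 1.5, 3.0, 9.0]
print(gdt(toy, GDT_TS_CUTOFFS), gdt(toy, GDT_HA_CUTOFFS), rmsd(toy))
```

Because a single fixed superposition is used, the GDT values from this sketch can only be lower than or equal to what a full superposition search would report.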

Accuracy results:

| Data type | Sequence type | Sequence | Length | GDT-TS | GDT-HA | plddts/iptm+ptm | TM score | MaxSub | RMSD (Å) |
| :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: |
| fp32 | Monomer | T1029 | 125 | 0.434 | 0.256 | 93.984 | 0.471 | 0.297 | 7.202 |
| fp32 | Monomer | T1024 | 408 | 0.664 | 0.470 | 87.076 | 0.829 | 0.518 | 3.516 |
| fp32 | Multimer | H1106 | 236 | 0.203 | 0.144 | 0.860 | 0.181 | 0.151 | 20.457 |



## Application scenarios

### Algorithm category
Protein structure prediction

### Key application industries
Healthcare, scientific research, education

## Pretrained weights

## Source repository and issue feedback
* [https://developer.sourcefind.cn/codes/modelzoo/alphafold2_jax](https://developer.sourcefind.cn/codes/modelzoo/alphafold2_jax)

## References
* [https://github.com/deepmind/alphafold](https://github.com/deepmind/alphafold)