<!--
 * @Author: zhuww
 * @email: zhuww@sugon.com
 * @Date: 2023-04-06 18:04:07
 * @LastEditTime: 2023-11-23 16:01:01
-->
# AF2
## Paper
- [https://www.nature.com/articles/s41586-021-03819-2](https://www.nature.com/articles/s41586-021-03819-2)

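## Model Structure
The core of the model is a Transformer-based neural network with two main components: the Evoformer, which refines the multiple sequence alignment (MSA) and pair representations, and the Structure Module, which produces the 3D coordinates. The network's outputs are fed back through it iteratively (recycling) to improve prediction accuracy.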

![img](./docs/alphafold2.png)

## Algorithm Principle
AlphaFold2 extracts information from protein sequence and structure data and uses a neural network to predict the protein's 3D structure.

![img](./docs/alphafold2_1.png)

## Environment Setup
A Docker image for inference can be pulled from [SourceFind (光源)](https://www.sourcefind.cn/#/service-details):
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:alphafold2-2.3.2-dtk-23.10-py38
# Replace <Image ID> with the ID of the Docker image pulled above
# <Host Path>: path on the host
# <Container Path>: path mapped inside the container
docker run -it --name alphafold --shm-size=32G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v <Host Path>:<Container Path> <Image ID> /bin/bash
```

Image version dependencies:
* DTK driver: dtk23.10
* JAX: 0.3.25
* TensorFlow2: 2.11.0
* Python: 3.8

Activate the environment inside the image:

`source /opt/dtk-23.10/env.sh`

## Dataset
The open-source datasets used by AlphaFold2 are recommended, including BFD, MGnify, PDB70, Uniclust, and UniRef90. Together they take up about 2.62 TB. The dataset directory layout is as follows:
```
$DOWNLOAD_DIR/
    bfd/
        bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
        bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
        bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
        ...
    mgnify/
        mgy_clusters_2022_05.fa
    params/
        params_model_1.npz
        params_model_2.npz
        params_model_3.npz
        ...
    pdb70/
        pdb_filter.dat
        pdb70_hhm.ffindex
        pdb70_hhm.ffdata
        ...
    pdb_mmcif/
        mmcif_files/
            100d.cif
            101d.cif
            101m.cif
            ...
        obsolete.dat
    pdb_seqres/
        pdb_seqres.txt
    small_bfd/
        bfd-first_non_consensus_sequences.fasta
    uniref30/
        UniRef30_2021_03_hhm.ffindex
        UniRef30_2021_03_hhm.ffdata
        UniRef30_2021_03_cs219.ffindex
        ...
    uniprot/
        uniprot.fasta
    uniref90/
        uniref90.fasta
```
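Before starting a long inference run, it can help to confirm that the download finished. The following is a minimal sketch, not part of this repository: `missing_subdirs` is a hypothetical helper, and it assumes the top-level subdirectory names shown in the listing above. Depending on the database preset in use, not every directory may actually be required.

```python
from pathlib import Path

# Top-level subdirectories taken from the layout listed above.
EXPECTED_SUBDIRS = (
    "bfd", "mgnify", "params", "pdb70", "pdb_mmcif", "pdb_seqres",
    "small_bfd", "uniref30", "uniprot", "uniref90",
)


def missing_subdirs(download_dir):
    """Return the expected subdirectories that are absent under download_dir."""
    root = Path(download_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
```

Calling `missing_subdirs("/path/to/download_dir")` returns an empty list when every listed directory is present. It only checks that the directories exist, not that their contents are complete.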

The script download_all_data.sh is provided to download the datasets and model files:

    ./scripts/download_all_data.sh <download_dir>

## Inference
JAX-based inference scripts are provided for both monomers and multimers.
```bash
git clone http://developer.hpccube.com/codes/modelzoo/alphafold2_jax.git  # check out the branch you need
cd alphafold2_jax
```

### Monomer
```bash
./run_monomer.sh
```
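Monomer inference parameters:
* `download_dir`: the dataset download directory
* `monomer.fasta`: the monomer sequence to run inference on
* `--output_dir`: the output directory
* `model_names`: the names of the models used for inference
* `--model_preset=monomer`: selects the monomer model configuration
* `--run_relax=true`: enables the relax step
* `--use_gpu_relax=true`: runs relax on the GPU (faster, but possibly less stable)
* `--use_gpu_relax=false`: runs relax on the CPU (slower, but stable)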

### Multimer
```bash
./run_multimer.sh
```
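Multimer inference parameters: `multimer.fasta` is the multimer sequence file to run inference on; `--model_preset=multimer` selects the multimer model configuration; `--num_multimer_predictions_per_model` sets the number of predictions per model. All other parameters are the same as for monomer inference.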
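
One way to decide which of the two scripts to run is to count the sequences in the input FASTA file. The sketch below is illustrative only and is not part of this repository: `count_fasta_records` and `suggest_model_preset` are hypothetical helpers. It assumes that a multimer input holds one FASTA record per chain and that a monomer input holds exactly one record.

```python
def count_fasta_records(text):
    """Count FASTA records, i.e. header lines that start with '>'."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


def suggest_model_preset(text):
    """Suggest 'multimer' for several records and 'monomer' for exactly one."""
    n = count_fasta_records(text)
    if n == 0:
        raise ValueError("no FASTA records found")
    return "multimer" if n > 1 else "monomer"
```

The returned string matches the values used with `--model_preset`. For example, passing the contents of a file with two `>` header lines yields `"multimer"`.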

## Results
The `--output_dir` directory is structured as follows:
```
<target_name>/
    features.pkl
    ranked_{0,1,2,3,4}.pdb
    ranking_debug.json
    relaxed_model_{1,2,3,4,5}.pdb
    result_model_{1,2,3,4,5}.pkl
    timings.json
    unrelaxed_model_{1,2,3,4,5}.pdb
    msas/
        bfd_uniclust_hits.a3m
        mgnify_hits.sto
        uniref90_hits.sto
        ...
```

View the protein 3D structure: [https://www.pdbus.org/3d-view](https://www.pdbus.org/3d-view)
![img](./docs/result_pdb.png)

## Accuracy
Test data: [casp14](https://www.predictioncenter.org/casp14/targetlist.cgi), [uniprot](https://www.uniprot.org/)

Accelerator used: 1 × Z100L-32G

pLDDT scores: see the `plddts` field in `<target_name>/ranking_debug.json`
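
The scores can be read programmatically. The following is a minimal sketch rather than code from this repository: `load_plddts` is a hypothetical helper, and it assumes that `plddts` is a JSON object mapping each model name to a numeric score.

```python
import json
from pathlib import Path


def load_plddts(target_dir):
    """Return (model_name, pLDDT) pairs sorted from highest to lowest score."""
    # Assumption: "plddts" maps model names to numeric scores.
    data = json.loads((Path(target_dir) / "ranking_debug.json").read_text())
    return sorted(data["plddts"].items(), key=lambda kv: kv[1], reverse=True)
```

Here `target_dir` is the `<target_name>` directory under `--output_dir`. The first pair in the returned list is the model with the highest score.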

Accuracy data:
| Data type | Sequence type | Sequence label | Sequence length | LDDT |
| :------: | :------: | :------: | :------: | :------: |
| fp32 | Monomer | T1026 | 172 | 75.050 |
| fp32 | Monomer | T1053 | 580 | 92.316 |
| fp32 | Monomer | T1091 | 863 | 85.083 |

## Application Scenarios

### Algorithm Category
NLP

### Key Application Industries
Healthcare, scientific research, education

## Source Repository and Issue Feedback
* [https://developer.hpccube.com/codes/modelzoo/alphafold2_jax](https://developer.hpccube.com/codes/modelzoo/alphafold2_jax)

## References
* [https://github.com/deepmind/alphafold](https://github.com/deepmind/alphafold)