<!--
 * @Author: zhuww
 * @email: zhuww@sugon.com
 * @Date: 2023-04-06 18:04:07
 * @LastEditTime: 2023-08-24 09:34:01
-->
# AF2

## Paper
- [https://www.nature.com/articles/s41586-021-03819-2](https://www.nature.com/articles/s41586-021-03819-2)

## Model Structure
The core of the model is a Transformer-based neural network with two main components: a Sequence to Sequence Model and a Structure Model. The two components are optimized through iterative training to improve prediction accuracy.

![img](./docs/alphafold2.png)

## Algorithm Principle
AlphaFold2 extracts information from protein sequence and structure data and uses a neural network model to predict the three-dimensional structure of a protein.

![img](./docs/alphafold2_1.png)

## Environment Setup
The Docker image for inference can be pulled from [光源 (SourceFind)](https://www.sourcefind.cn/#/service-details):
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:alphafold2-2.2.1-centos7.6-dtk-22.04.2-py38
# <Image ID>: replace with the ID of the Docker image pulled above
# <Host Path>: path on the host
# <Container Path>: path the host directory is mapped to inside the container
docker run -it --name alphafold --shm-size=32G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v <Host Path>:<Container Path> <Image ID> /bin/bash
```

Image version dependencies:
* DTK driver: dtk22.04.2
* Jax: 0.3.14
* TensorFlow2: 2.10.0
* python: python3.8

Activate the image environment:

`source /opt/dtk-22.04.2/env.sh`

`source /opt/openmm-hip/env.sh`

Test directory:

`/opt/docker/tests/alphafold`

## Dataset
The open-source datasets used by AlphaFold2 are recommended, including BFD, MGnify, PDB70, Uniclust, and Uniref90. Together they take about 2.2 TB. The directory layout is as follows:
```
$DOWNLOAD_DIR/
    bfd/
        bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
        bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
        bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
        ...
    mgnify/
        mgy_clusters_2018_12.fa
    params/
        params_model_1.npz
        params_model_2.npz
        params_model_3.npz
        ...
    pdb70/
        pdb_filter.dat
        pdb70_hhm.ffindex
        pdb70_hhm.ffdata
        ...
    pdb_mmcif/
        mmcif_files/
            100d.cif
            101d.cif
            101m.cif
            ...
        obsolete.dat
    pdb_seqres/
        pdb_seqres.txt
    small_bfd/
        bfd-first_non_consensus_sequences.fasta
    uniclust30/
        uniclust30_2018_08/
            uniclust30_2018_08_md5sum
            uniclust30_2018_08_hhm_db.index
            uniclust30_2018_08_hhm_db
            ...
    uniprot/
        uniprot.fasta
    uniref90/
        uniref90.fasta
```

A script, `download_all_data.sh`, is provided to download the datasets and model files used here:

    ./scripts/download_all_data.sh <dataset download directory>
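
Because the download is large, it can help to confirm that the expected top-level directories exist before starting inference. The following is a minimal sketch, not part of this repository: the helper name `missing_subdirs` is hypothetical, and the directory list simply mirrors the layout shown above (trim it if you only downloaded some of the databases).

```python
from pathlib import Path

# Top-level sub-directories copied from the dataset layout in this README.
EXPECTED_SUBDIRS = (
    "bfd", "mgnify", "params", "pdb70", "pdb_mmcif",
    "pdb_seqres", "small_bfd", "uniclust30", "uniprot", "uniref90",
)


def missing_subdirs(download_dir):
    """Return the expected sub-directories that are absent under download_dir."""
    root = Path(download_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
```

Calling `missing_subdirs('/path/to/database')` returns an empty list when every directory is present.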

## Inference
Jax-based inference scripts are provided for both monomers and multimers.

Set the `DOWNLOAD_DIR` and `output_dir` paths in `run_alphafold.py`. Make sure the output directory exists and that you have sufficient permissions to write to it.

    # Set to the target directory that holds all downloaded databases
    DOWNLOAD_DIR = '/path/to/database'

    # Path to a directory that will store the results.
    output_dir = '/path/to/output_dir'
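
The requirement that the output directory exists and is writable can be checked up front. This is a small sketch using only the Python standard library; `ensure_writable_dir` is a hypothetical helper name and is not defined in `run_alphafold.py`.

```python
import os


def ensure_writable_dir(path):
    """Create path if it is missing and confirm the current user can write to it."""
    os.makedirs(path, exist_ok=True)
    # Writing files into a directory needs both write and execute permission.
    if not os.access(path, os.W_OK | os.X_OK):
        raise PermissionError(f"no write permission for {path}")
    return os.path.abspath(path)
```

For example, `ensure_writable_dir('/path/to/output_dir')` could be run once before launching inference.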

### Monomer

    python3 run_alphafold.py \
    --fasta_paths=monomer.fasta \
    --output_dir=./ \
    --max_template_date=2020-05-14 \
    --model_preset=monomer \
    --run_relax=true \
    --use_gpu_relax=true

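Alternatively, run `./run_monomer.sh`.

#### Monomer inference parameters
* `monomer.fasta`: the monomer sequence to run inference on.
* `--output_dir`: the output directory.
* `--model_preset`: selects the model configuration.
* `--run_relax=true`: runs the relax step.
* `--use_gpu_relax=true`: runs relax on the GPU (faster, but possibly less stable); `--use_gpu_relax=false` runs relax on the CPU (slower, but stable).
* `--use_precomputed_msas=true`: loads sequences that have already been searched and aligned; if it is omitted, the search and alignment are run by default.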
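
When many sequences need to be processed, the command line can be assembled programmatically. The sketch below only uses the flags documented in this README; the function name `build_monomer_command` and its keyword arguments are assumptions made for illustration.

```python
def build_monomer_command(fasta_path, output_dir, use_gpu_relax=True,
                          use_precomputed_msas=False,
                          max_template_date="2020-05-14"):
    """Build the argument list for a monomer run of run_alphafold.py."""
    cmd = [
        "python3", "run_alphafold.py",
        f"--fasta_paths={fasta_path}",
        f"--output_dir={output_dir}",
        f"--max_template_date={max_template_date}",
        "--model_preset=monomer",
        "--run_relax=true",
        # GPU relax is faster; CPU relax is slower but more stable.
        f"--use_gpu_relax={'true' if use_gpu_relax else 'false'}",
    ]
    if use_precomputed_msas:
        # Reuse MSAs from a previous run instead of searching again.
        cmd.append("--use_precomputed_msas=true")
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains `run_alphafold.py`.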

### Multimer

    python3 run_alphafold.py \
    --fasta_paths=multimer.fasta \
    --output_dir=./ \
    --uniprot_database_path=/data/uniprot/uniprot_trembl.fasta \
    --pdb_seqres_database_path=/data/pdb_seqres/pdb_seqres.txt \
    --pdb70_database_path= \
    --max_template_date=2020-05-14 \
    --model_preset=multimer \
    --run_relax=true \
    --use_gpu_relax=true

Alternatively, run `./run_multimer.sh`.

#### Multimer inference parameters
`multimer.fasta` is the multimer sequence to run inference on, and `/data` is the dataset download path. The other parameters are the same as described for monomer inference.
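
Before a multimer run it can be useful to check how many chains the input file contains. The sketch below assumes that `multimer.fasta` holds one FASTA record per chain, as in upstream AlphaFold-Multimer; `read_fasta` is a hypothetical helper, not a function from this repository.

```python
def read_fasta(path):
    """Parse a FASTA file into a list of (description, sequence) tuples."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Under that assumption, `len(read_fasta('multimer.fasta'))` gives the number of chains in the complex.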

## Results
The `--output_dir` directory is structured as follows:
```
<target_name>/
    features.pkl
    ranked_{0,1,2,3,4}.pdb
    ranking_debug.json
    relaxed_model_{1,2,3,4,5}.pdb
    result_model_{1,2,3,4,5}.pkl
    timings.json
    unrelaxed_model_{1,2,3,4,5}.pdb
    msas/
        bfd_uniclust_hits.a3m
        mgnify_hits.sto
        uniref90_hits.sto
        ...
```
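
The ranking of the predicted models can be read back from `ranking_debug.json`. This sketch assumes the file layout written by upstream AlphaFold: an `order` list plus one table of per-model scores (named `plddts` for monomers or `iptm+ptm` for multimers). These key names are assumptions and should be checked against the files produced by your run; `load_ranking` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_ranking(target_dir):
    """Return [(model_name, score), ...] in ranked order, best model first."""
    data = json.loads((Path(target_dir) / "ranking_debug.json").read_text())
    # Assumed layout: an "order" list and exactly one score table.
    score_key = next(key for key in data if key != "order")
    scores = data[score_key]
    return [(name, scores[name]) for name in data["order"]]
```

Under these assumptions, the first entry of the returned list is the model stored as `ranked_0.pdb`.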

View the 3D protein structure: [https://www.pdbus.org/3d-view](https://www.pdbus.org/3d-view)

![img](./docs/result_pdb.png)

## Accuracy
Test data: [casp14](https://www.predictioncenter.org/casp14/targetlist.cgi), [uniprot](https://www.uniprot.org/)

Accelerator used: one first-generation DCU card (16 GB)

1. Compute the lDDT value with `python3 pkl2plddt.py`, where `data_path` is the path to the pkl file generated by inference.
2. Compute the other accuracy metrics with TM-score: [https://zhanggroup.org/TM-score/](https://zhanggroup.org/TM-score/)
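
As a rough illustration of the first step, a mean per-residue confidence can be read from a result pkl file as sketched below. This is not the contents of `pkl2plddt.py`, which is not shown here. It assumes that the pickled result is a dictionary with a per-residue `plddt` array, as in upstream AlphaFold; `mean_plddt` is a hypothetical name. Loading a real result file also requires the libraries used to create it (for example NumPy) to be importable.

```python
import pickle


def mean_plddt(pkl_path):
    """Average the per-residue pLDDT values stored in a result pkl file."""
    with open(pkl_path, "rb") as handle:
        result = pickle.load(handle)
    # Assumed key: "plddt", one confidence value (0-100) per residue.
    per_residue = [float(value) for value in result["plddt"]]
    return sum(per_residue) / len(per_residue)
```

For example, `mean_plddt('<target_name>/result_model_1.pkl')` would return one number per model that can be compared across models.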

Accuracy results:

| Data type | Sequence type | Sequence label | Sequence length | GDT-TS | GDT-HA | LDDT | TM-score | MaxSub | RMSD |
| :------: | :------: | :------: | :------: |:------: |:------: | :------: | :------: | :------: |:------: |
| fp32 | Monomer | T1026 | 172 | 0.849 | 0.658 | 75.050 | 0.901 | 0.851 | 1.6 |
| fp32 | Monomer | T1053 | 580 | 0.941 | 0.789 | 92.316 | 0.985 | 0.935 | 1.1 |
| fp32 | Monomer | T1091 | 863 | 0.492 | 0.332 | 85.083 | 0.740 | 0.388 | 6.7 |

## Application Scenarios

### Algorithm Category
NLP

### Key Application Industries
Healthcare, scientific research, education

## Source Repository and Issue Feedback
* [https://developer.hpccube.com/codes/modelzoo/alphafold2_jax](https://developer.hpccube.com/codes/modelzoo/alphafold2_jax)

## References
* [https://github.com/deepmind/alphafold](https://github.com/deepmind/alphafold)