<!--
 * @Author: zhuww
 * @email: zhuww@sugon.com
 * @Date: 2023-04-06 18:04:07
 * @LastEditTime: 2023-12-26 15:54:01
-->
# AF2
## Paper
- [https://www.nature.com/articles/s41586-021-03819-2](https://www.nature.com/articles/s41586-021-03819-2)

## Model Structure
The core of the model is a Transformer-based neural network with two main components, a Sequence to Sequence Model and a Structure Model. The two components are optimized through iterative training to improve prediction accuracy.

![img](./docs/alphafold2.png)

## Algorithm Principle
AlphaFold2 extracts information from protein sequence and structure data and uses a neural network model to predict the three-dimensional structure of proteins.

![img](./docs/alphafold2_1.png)

## Environment Setup

### Docker
    
    docker pull image.sourcefind.cn:5000/dcu/admin/base/jax:0.4.23-ubuntu20.04-dtk24.04.1-py3.10

    docker run --shm-size 50g --network=host --name=alphafold2 --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <absolute project path>:/home/ -v /opt/hyhal:/opt/hyhal:ro -it <your IMAGE ID> bash

    # 1. Install general dependencies
    pip install -r requirements_dcu.txt

    pip install dm-haiku==0.0.11 flax==0.7.1 jmp==0.0.2 tabulate==0.8.9 --no-deps jax

    pip install orbax==0.1.6 orbax-checkpoint==0.1.6 optax==0.2.2

    python setup.py install

    # 2. hh-suite and kalign

    git clone https://github.com/soedinglab/hh-suite.git
    mkdir -p hh-suite/build && cd hh-suite/build
    cmake -DCMAKE_INSTALL_PREFIX=. ..
    make -j 4 && make install
    export PATH="$(pwd)/bin:$(pwd)/scripts:$PATH"

    wget https://github.com/TimoLassmann/kalign/archive/refs/tags/v3.4.0.zip
    unzip v3.4.0.zip && cd kalign-3.4.0
    mkdir build 
    cd build
    cmake .. 
    make 
    make test 
    make install

    # 3. openmm + pdbfixer

    sudo apt install doxygen

    wget https://github.com/openmm/openmm/archive/refs/tags/8.0.0.zip

    unzip 8.0.0.zip && cd openmm-8.0.0 && mkdir build && cd build

    cmake .. && make && sudo make install && sudo make PythonInstall

    wget https://github.com/openmm/pdbfixer/archive/refs/tags/1.9.zip

    unzip 1.9.zip && cd pdbfixer-1.9 && python setup.py install 

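Before moving on, it can help to confirm that the builds above ended up on `PATH` and in the active Python environment. The following is only a sketch under stated assumptions: the binary names (`hhblits`, `hhsearch`, `kalign`) and module names (`openmm`, `pdbfixer`) are the ones these projects normally install, and the helper names are hypothetical rather than part of this repository.

```python
import importlib.util
import shutil

# Assumed names: adjust if your builds install different binaries or modules.
REQUIRED_TOOLS = ["hhblits", "hhsearch", "kalign"]
REQUIRED_MODULES = ["openmm", "pdbfixer"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the command-line tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the top-level Python modules that cannot be imported."""
    return [name for name in modules if importlib.util.find_spec(name) is None]
```

If every step succeeded, calling `missing_tools()` and `missing_modules()` inside the container should both return empty lists.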
<!-- ## Environment Setup
A Docker image for inference can be pulled from [SourceFind (光源)](https://www.sourcefind.cn/#/image/dcu/custom):
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:alphafold2-2.3.2-dtk23.10-py38
# Replace <Image ID> with the ID of the Docker image pulled above
# <Host Path>: path on the host
# <Container Path>: mapped path inside the container
docker run -it --name alphafold --privileged --shm-size=32G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v <Host Path>:<Container Path> <Image ID> /bin/bash
```

Image version dependencies:
* DTK driver: dtk23.10
* Jax: 0.3.25
* TensorFlow2: 2.11.0
* python: python3.8 -->

## Dataset
We recommend the open-source datasets used by AlphaFold2, including BFD, MGnify, PDB70, Uniclust, and UniRef90. The full set is about 2.62 TB. The dataset directory is laid out as follows:
```
$DOWNLOAD_DIR/
    bfd/
        bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
        bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
        bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
        ...
    mgnify/
        mgy_clusters_2022_05.fa
    params/
        params_model_1.npz
        params_model_2.npz
        params_model_3.npz
        ...
    pdb70/
        pdb_filter.dat
        pdb70_hhm.ffindex
        pdb70_hhm.ffdata
        ...
    pdb_mmcif/
        mmcif_files/
            100d.cif
            101d.cif
            101m.cif
            ...
        obsolete.dat
    pdb_seqres/
        pdb_seqres.txt
    small_bfd/
        bfd-first_non_consensus_sequences.fasta
    uniref30/
        UniRef30_2021_03_hhm.ffindex
        UniRef30_2021_03_hhm.ffdata
        UniRef30_2021_03_cs219.ffindex
        ...
    uniprot/
        uniprot.fasta
    uniref90/
        uniref90.fasta
```

The script `download_all_data.sh` downloads the datasets and model files used here:

    ./scripts/download_all_data.sh <dataset download directory>

Dataset fast-download center: [SCNet AIDatasets](http://113.200.138.88:18080/aidatasets). The datasets for this project are available through the fast-download channel: [alphafold](http://113.200.138.88:18080/aidatasets/project-dependency/alphafold)

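Because the download is large, a quick check of the directory layout can catch an incomplete download before inference starts. The sketch below assumes the top-level directory names shown in the dataset tree. The helper name `missing_dataset_dirs` is hypothetical, and a given configuration may not need every directory listed.

```python
from pathlib import Path

# Top-level directories copied from the dataset layout in this README.
EXPECTED_SUBDIRS = [
    "bfd", "mgnify", "params", "pdb70", "pdb_mmcif",
    "pdb_seqres", "small_bfd", "uniref30", "uniprot", "uniref90",
]


def missing_dataset_dirs(download_dir, expected=EXPECTED_SUBDIRS):
    """Return the expected subdirectories that are absent under download_dir."""
    root = Path(download_dir)
    return [name for name in expected if not (root / name).is_dir()]
```

An empty list means every expected directory exists. It does not verify that the files inside are complete.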

## Inference
JAX-based inference scripts are provided for monomers and multimers.
```bash
    # Enter the project directory
    cd alphafold2_jax
```

### Monomer
```bash
    ./run_monomer.sh
```
Monomer inference parameters: `download_dir` is the dataset download directory; `monomer.fasta` is the monomer sequence to run inference on; `--output_dir` is the output directory; `model_names` lists the models to run; `--model_preset=monomer` selects the monomer model configuration; `--run_relax=true` enables the relax step; `--use_gpu_relax=true` runs relax on the GPU (faster, but possibly less stable), while `--use_gpu_relax=false` runs it on the CPU (slower, but stable). Adding `--use_precomputed_msas=true` loads existing MSAs; otherwise the MSA tools are run by default.

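To make the flag combinations above concrete, here is a minimal sketch that assembles them into an argument list. It uses only the flags named in the parameter description. How `run_monomer.sh` actually passes them to the Python entry point is not shown in this README, so the helper names `bool_flag` and `monomer_flags` are hypothetical.

```python
def bool_flag(name, value):
    """Render a boolean flag in the --name=true/false form used in this README."""
    return f"--{name}={'true' if value else 'false'}"


def monomer_flags(output_dir, run_relax=True, use_gpu_relax=False,
                  use_precomputed_msas=False):
    """Build the monomer flags described above (illustrative only)."""
    flags = [
        f"--output_dir={output_dir}",
        "--model_preset=monomer",
        bool_flag("run_relax", run_relax),
        bool_flag("use_gpu_relax", use_gpu_relax),
    ]
    # Only request precomputed MSAs explicitly; otherwise the MSA tools run.
    if use_precomputed_msas:
        flags.append(bool_flag("use_precomputed_msas", True))
    return flags
```

For example, `monomer_flags("/data/out", use_precomputed_msas=True)` yields a CPU-relax configuration that reuses existing MSAs.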
### Multimer
```bash
    ./run_multimer.sh
```
Multimer inference parameters: `multimer.fasta` is the multimer sequence to run inference on; `--model_preset=multimer` selects the multimer model configuration; `--num_multimer_predictions_per_model` sets the number of predictions per model. All other parameters are the same as for monomer inference.

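A multimer input is usually a FASTA file with one record per chain. As a sanity check before a long run, the sketch below parses FASTA text and returns the chains it finds. It assumes the standard `>`-header FASTA format, and the helper name `read_fasta` is hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

`len(read_fasta(open("multimer.fasta").read()))` then gives the number of chains. A result of 1 suggests the file is really a monomer input.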
## Results
The `--output_dir` directory is structured as follows:
```
<target_name>/
    features.pkl
    ranked_{0,1,2,3,4}.pdb
    ranking_debug.json
    relaxed_model_{1,2,3,4,5}.pdb
    result_model_{1,2,3,4,5}.pkl
    timings.json
    unrelaxed_model_{1,2,3,4,5}.pdb
    msas/
        bfd_uniclust_hits.a3m
        mgnify_hits.sto
        uniref90_hits.sto
        ...
```

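To see where a run spent its time, `timings.json` can be summarized with a few lines of Python. This sketch assumes the file is a flat JSON object mapping a stage name to a duration in seconds. The helper name `slowest_stages` is hypothetical, and it takes the file's text so it can be tried without a real run.

```python
import json


def slowest_stages(timings_json_text, top=3):
    """Return the `top` (stage, seconds) pairs with the longest durations."""
    timings = json.loads(timings_json_text)
    ranked = sorted(timings.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]
```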
[View the protein 3D structure](https://www.pdbus.org/3d-view)
<div style="display: flex; justify-content: center; align-items: center;">
  <img src="./docs/result_pdb.png" alt="Image">
  <div style="position: absolute; top: 50%; left: 50%; transform: translate(-50%, -50%); background: rgba(0, 0, 0, 0.5); color: #fff; padding: 10px;">
    Red is the ground-truth structure; blue is the predicted structure.
  </div>
</div>

### Accuracy
Test data: [casp15](https://www.predictioncenter.org/casp15/targetlist.cgi), [uniprot](https://www.uniprot.org/)

Accelerator used: one Z100L-32G card

1. plddts/iptm+ptm

For monomers, see `plddts` in `<target_name>/ranking_debug.json`; for multimers, see `iptm+ptm` in `<target_name>/ranking_debug.json`.


2. Other accuracy metrics can be computed with TM-score: [https://zhanggroup.org/TM-score/](https://zhanggroup.org/TM-score/)

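The confidence score from item 1 can be read programmatically. The sketch below assumes that `ranking_debug.json` stores a mapping from model name to score under either `plddts` (monomer) or `iptm+ptm` (multimer), as the text describes. The helper name `best_model` is hypothetical, and it takes the file's text so it can be tried without a real run.

```python
import json


def best_model(ranking_json_text):
    """Return (score_key, model_name, score) for the highest-scoring model."""
    ranking = json.loads(ranking_json_text)
    # Multimer runs report iptm+ptm; monomer runs report plddts.
    key = "iptm+ptm" if "iptm+ptm" in ranking else "plddts"
    scores = ranking[key]
    name = max(scores, key=scores.get)
    return key, name, scores[name]
```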
Accuracy data:

| Data type | Sequence type | Sequence | Length | GDT-TS | GDT-HA | plddts/iptm+ptm | TM score | MaxSub | RMSD |
| :------: | :------: | :------: |:------: |:------: | :------: | :------: | :------: |:------: |:------: |
| fp32 | Monomer | T1029 | 125 | 0.434 | 0.256 | 93.984 | 0.471 | 0.297 | 7.202 |
| fp32 | Monomer | T1024 | 408 | 0.664 | 0.470 | 87.076 | 0.829 | 0.518 | 3.516 |
| fp32 | Multimer | H1106 | 236 | 0.203 | 0.144 | 0.860 | 0.181 | 0.151 | 20.457 |

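To help read the GDT-TS, GDT-HA, and RMSD columns, the sketch below computes the three quantities from per-residue distances (in Å) between predicted and reference positions. It is a simplification under stated assumptions: the distances come from one fixed superposition, whereas the TM-score tool searches over superpositions, and the results are expressed as fractions on the 0 to 1 scale that the table appears to use. The function names are hypothetical.

```python
import math

GDT_TS_CUTOFFS = (1.0, 2.0, 4.0, 8.0)  # standard GDT-TS thresholds in Å
GDT_HA_CUTOFFS = (0.5, 1.0, 2.0, 4.0)  # tighter "high accuracy" thresholds


def gdt(distances, cutoffs):
    """Average, over the cutoffs, of the fraction of residues within each cutoff."""
    if not distances:
        raise ValueError("distances must not be empty")
    n = len(distances)
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    return sum(fractions) / len(cutoffs)


def rmsd(distances):
    """Root-mean-square of the per-residue distances."""
    if not distances:
        raise ValueError("distances must not be empty")
    return math.sqrt(sum(d * d for d in distances) / len(distances))
```

For instance, `gdt([0.4, 1.5, 3.0, 9.0], GDT_TS_CUTOFFS)` averages the fractions 1/4, 2/4, 3/4, and 3/4. A single badly placed residue lowers GDT only slightly, while it can dominate RMSD.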
## Application Scenarios

### Algorithm Category
Protein structure prediction

### Key Application Industries
Healthcare, scientific research, education

## Pretrained Weights
Pretrained weights fast-download center: [SCNet AIModels](http://113.200.138.88:18080/aimodels). The pretrained weights for this project are available through the fast-download channel: [alphafold](http://113.200.138.88:18080/aimodels/findsource-dependency/alphafold-params)

## Source Repository and Issue Feedback
* [https://developer.hpccube.com/codes/modelzoo/alphafold2_jax](https://developer.hpccube.com/codes/modelzoo/alphafold2_jax)

## References
* [https://github.com/deepmind/alphafold](https://github.com/deepmind/alphafold)