# AF2

## Paper
- [https://www.nature.com/articles/s41586-021-03819-2](https://www.nature.com/articles/s41586-021-03819-2)

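## Model Architecture
The core of the model is a Transformer-based neural network with two main components: a sequence-to-sequence model and a structure model (the Evoformer and the structure module in the paper's terminology). The two components are optimized through iterative training to improve prediction accuracy.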

![img](./docs/alphafold2.png)

## Algorithm
AlphaFold2 extracts information from protein sequence and structure data and uses a neural network to predict the three-dimensional structure of a protein.

![img](./docs/alphafold2_1.png)


## Environment Setup

### Docker (Option 1)

    # With this option you do not need to clone this repository: the image already
    # contains runnable code, but the data directories must be mounted.

    docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:alphafold2-dtk24.04.1-py310

    docker run --shm-size 100g --network=host --name=alphafold2 --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <local_data_path>:<container_data_path> -v /opt/hyhal:/opt/hyhal:ro -it <your IMAGE ID> bash

    # The image was built on an Intel CPU; on any other CPU, rebuild hh-suite:
    cd /app/softwares/hh-suite && rm -rf build
    mkdir build && cd build
    cmake -DHAVE_AVX2=1 -DCMAKE_INSTALL_PREFIX=. ..
    make -j 4 && sudo make install
    export PATH="$(pwd)/bin:$(pwd)/scripts:$PATH"
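
The rebuild above passes `-DHAVE_AVX2=1`, which only makes sense if the host CPU supports AVX2. As a minimal sketch, assuming a Linux host that exposes CPU features through `/proc/cpuinfo` (the helper name `cpu_has_flag` is hypothetical and not part of this repository), the flag could be checked before rebuilding:

```python
from pathlib import Path


def cpu_has_flag(cpuinfo_text: str, flag: str) -> bool:
    """Return True if `flag` is listed on a 'flags' (x86) or 'Features' (ARM) line."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() in ("flags", "features"):
            if flag.lower() in value.lower().split():
                return True
    return False


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")  # assumption: Linux host
    if cpuinfo.exists():
        print("avx2 supported:", cpu_has_flag(cpuinfo.read_text(), "avx2"))
    else:
        print("/proc/cpuinfo not found; check CPU features another way")
```

If the flag is absent, the hh-suite build documentation is the place to look for the SIMD options it supports.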

### Docker (Option 2)

    docker pull image.sourcefind.cn:5000/dcu/admin/base/jax:0.4.23-ubuntu20.04-dtk24.04.1-py3.10

    docker run --shm-size 50g --network=host --name=alphafold2 --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <absolute_project_path>:/home/ -v /opt/hyhal:/opt/hyhal:ro -it <your IMAGE ID> bash

    # 1. General dependencies
    pip install -r requirements_dcu.txt

    pip install dm-haiku==0.0.11 flax==0.7.1 jmp==0.0.2 tabulate==0.8.9 --no-deps jax

    pip install orbax==0.1.6 orbax-checkpoint==0.1.6 optax==0.2.2

    python setup.py install

    sudo apt install hmmer -y

    # 2. hh-suite and kalign

    git clone https://github.com/soedinglab/hh-suite.git
    mkdir -p hh-suite/build && cd hh-suite/build
    cmake -DCMAKE_INSTALL_PREFIX=. ..
    make -j 4 && make install
    export PATH="$(pwd)/bin:$(pwd)/scripts:$PATH"

    wget https://github.com/TimoLassmann/kalign/archive/refs/tags/v3.4.0.zip
    unzip v3.4.0.zip && cd kalign-3.4.0
    mkdir build
    cd build
    cmake ..
    make
    make test
    make install

    # 3. openmm + pdbfixer

    sudo apt install doxygen

    wget https://github.com/openmm/openmm/archive/refs/tags/8.0.0.zip

    unzip 8.0.0.zip && cd openmm-8.0.0 && mkdir build && cd build

    cmake .. && make && sudo make install && sudo make PythonInstall

    wget https://github.com/openmm/pdbfixer/archive/refs/tags/1.9.zip

    unzip 1.9.zip && cd pdbfixer-1.9 && python setup.py install
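
After a manual installation like the one above, it can help to confirm that the command-line tools are actually reachable on `PATH`. The following is a minimal sketch, not part of this repository: the helper name `find_missing_tools` is hypothetical, and the tool list assumes the usual executable names shipped by HMMER, hh-suite and kalign.

```python
import shutil

# Executables normally provided by the packages installed above:
# HMMER (jackhmmer, hmmsearch, hmmbuild), hh-suite (hhblits, hhsearch), kalign.
REQUIRED_TOOLS = ("jackhmmer", "hmmsearch", "hmmbuild", "hhblits", "hhsearch", "kalign")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be located on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    print("missing tools:", ", ".join(missing) if missing else "none")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.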


## Dataset
We recommend the open-source datasets used by AlphaFold2, including BFD, MGnify, PDB70, Uniclust, and UniRef90. Together they take about 2.62 TB. The directory layout is as follows:
```
$DOWNLOAD_DIR/
    bfd/
        bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
        bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
        bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
        ...
    mgnify/
        mgy_clusters_2022_05.fa
    params/
        params_model_1.npz
        params_model_2.npz
        params_model_3.npz
        ...
    pdb70/
        pdb_filter.dat
        pdb70_hhm.ffindex
        pdb70_hhm.ffdata
        ...
    pdb_mmcif/
        mmcif_files/
            100d.cif
            101d.cif
            101m.cif
            ...
        obsolete.dat
    pdb_seqres/
        pdb_seqres.txt
    small_bfd/
        bfd-first_non_consensus_sequences.fasta
    uniref30/
        UniRef30_2021_03_hhm.ffindex
        UniRef30_2021_03_hhm.ffdata
        UniRef30_2021_03_cs219.ffindex
        ...
    uniprot/
        uniprot.fasta
    uniref90/
        uniref90.fasta
```
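
Because the download is large, a quick check that the layout above is in place can save a failed run. The sketch below is an illustration under stated assumptions: the helper name `missing_dataset_dirs` is hypothetical, and the directory names are simply copied from the listing above.

```python
from pathlib import Path

# Top-level directories copied from the layout listed above.
EXPECTED_DIRS = (
    "bfd", "mgnify", "params", "pdb70", "pdb_mmcif", "pdb_seqres",
    "small_bfd", "uniref30", "uniprot", "uniref90",
)


def missing_dataset_dirs(download_dir):
    """Return the expected sub-directories that are absent or empty."""
    root = Path(download_dir)
    missing = []
    for name in EXPECTED_DIRS:
        path = root / name
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing
```

A call such as `missing_dataset_dirs("/data/alphafold")` (the path is a placeholder) returns an empty list when every directory exists and contains at least one entry. Not every run needs every directory, so treat a non-empty result as a hint rather than an error.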

The script `download_all_data.sh` downloads the datasets and model files used here:

    ./scripts/download_all_data.sh <dataset_download_dir>

Fast download center for datasets: [SCNet AIDatasets](http://113.200.138.88:18080/aidatasets). The datasets used in this project can be downloaded from the fast channel: [alphafold](http://113.200.138.88:18080/aidatasets/project-dependency/alphafold)


## Inference

Note: edit the parameters in the corresponding script before running it.

JAX-based inference scripts are provided for both monomers and multimers.
```bash
# Enter the project directory
cd /app/alphafold2
```

### Monomer
```bash
bash run_monomer.sh
```
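Monomer inference parameters: `download_dir` is the dataset download directory and `monomer.fasta` is the monomer sequence to predict. `--output_dir` is the output directory, `model_names` lists the models to run, and `--model_preset=monomer` selects the monomer model configuration. `--run_relax=true` enables the relax step; `--use_gpu_relax=true` runs relax on the GPU (faster, but possibly less stable), while `--use_gpu_relax=false` runs it on the CPU (slower, but stable). Adding `--use_precomputed_msas=true` loads existing MSAs; otherwise the MSA tools are run by default.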
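
The boolean flags above are easy to mistype when editing the script by hand. As a minimal sketch, the helper below (hypothetical, not part of this repository) assembles only the flags named in the paragraph above into an argument list; how `run_monomer.sh` actually passes them on is not shown here and is left as an assumption.

```python
def build_inference_flags(output_dir, model_preset="monomer", run_relax=True,
                          use_gpu_relax=False, use_precomputed_msas=False):
    """Assemble the flags described above as a list of command-line arguments."""

    def as_flag(name, value):
        # Boolean flags are written as --name=true / --name=false, as in the text.
        return f"--{name}={'true' if value else 'false'}"

    return [
        f"--output_dir={output_dir}",
        f"--model_preset={model_preset}",
        as_flag("run_relax", run_relax),
        as_flag("use_gpu_relax", use_gpu_relax),
        as_flag("use_precomputed_msas", use_precomputed_msas),
    ]


if __name__ == "__main__":
    # "./output" is a placeholder path.
    print(" ".join(build_inference_flags("./output", use_gpu_relax=True)))
```

Defaulting `use_gpu_relax` to `False` follows the note above that CPU relax is the slower but more stable choice.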

### Multimer
```bash
bash run_multimer.sh
```
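Multimer inference parameters: `multimer.fasta` is the multimer sequence to predict, `--model_preset=multimer` selects the multimer model configuration, and `--num_multimer_predictions_per_model` sets the number of predictions per model. All other parameters are the same as for monomer inference.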
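
A multimer input is a FASTA file with more than one record. The sketch below shows one way to choose between the two presets from the file contents; the helper names are hypothetical, and the rule "one record means monomer, several mean multimer" is an assumption of this example rather than something the scripts enforce.

```python
def count_fasta_records(fasta_text):
    """Count sequence records, i.e. header lines starting with '>'."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


def suggest_model_preset(fasta_text):
    """Suggest 'multimer' for several records and 'monomer' for exactly one."""
    n = count_fasta_records(fasta_text)
    if n == 0:
        raise ValueError("no FASTA records found")
    return "multimer" if n > 1 else "monomer"


if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    example = ">chain_A\nMKTAYIAK\n>chain_B\nGSHMLEDP\n"
    print(suggest_model_preset(example))
```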

## Results
The `--output_dir` directory is structured as follows:
```
<target_name>/
    features.pkl
    ranked_{0,1,2,3,4}.pdb
    ranking_debug.json
    relaxed_model_{1,2,3,4,5}.pdb
    result_model_{1,2,3,4,5}.pkl
    timings.json
    unrelaxed_model_{1,2,3,4,5}.pdb
    msas/
        bfd_uniclust_hits.a3m
        mgnify_hits.sto
        uniref90_hits.sto
        ...
```

[View the protein 3D structure](https://www.pdbus.org/3d-view)

ID: 8U23

The predicted structure is shown in blue and the experimental structure in yellow.

![alt text](image.png)

### Accuracy
Test data: [casp15](https://www.predictioncenter.org/casp15/targetlist.cgi), [uniprot](https://www.uniprot.org/)

Accelerator used: 1 × Z100L-32G

1. plddts/iptm+ptm

For monomers, see `plddts` in `<target_name>/ranking_debug.json`; for multimers, see `iptm+ptm` in the same file.
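
Reading these scores by hand is tedious when there are many targets. The following sketch picks the top-scoring model from a parsed `ranking_debug.json`. It assumes that the value under `plddts` or `iptm+ptm` is a mapping from model name to score; the helper name `best_model` and the model names in the example are hypothetical.

```python
import json


def best_model(ranking):
    """Return (model_name, score) for the top-scoring model in a parsed ranking_debug.json."""
    # Monomer runs store scores under "plddts", multimer runs under "iptm+ptm".
    for key in ("plddts", "iptm+ptm"):
        if key in ranking:
            scores = ranking[key]
            name = max(scores, key=scores.get)
            return name, scores[name]
    raise KeyError("neither 'plddts' nor 'iptm+ptm' found")


if __name__ == "__main__":
    # Illustrative values only, not real results.
    example = json.loads('{"plddts": {"model_1": 90.5, "model_2": 88.1}}')
    print(best_model(example))
```

For a real run, replace the example with `json.load(open(path))`, where `path` points at `<target_name>/ranking_debug.json`.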


2. Other accuracy metrics can be computed with the TM-score tool: [https://zhanggroup.org/TM-score/](https://zhanggroup.org/TM-score/)

Accuracy results:

| Precision | Type | Target | Length | GDT-TS | GDT-HA | plddts/iptm+ptm | TM-score | MaxSub | RMSD (Å) |
| :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: |
| fp32 | Monomer | T1029 | 125 | 0.434 | 0.256 | 93.984 | 0.471 | 0.297 | 7.202 |
| fp32 | Monomer | T1024 | 408 | 0.664 | 0.470 | 87.076 | 0.829 | 0.518 | 3.516 |
| fp32 | Multimer | H1106 | 236 | 0.203 | 0.144 | 0.860 | 0.181 | 0.151 | 20.457 |
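
To make the columns above concrete, the sketch below computes RMSD, a GDT-style score and a TM-score-style value from two lists of matched residue coordinates. It is an illustration only, not the TM-score tool: it assumes the two structures are already superimposed and residue-matched, whereas the real tool searches for the best superposition and can therefore report higher scores. The clamp of `d0` to 0.5 Å for short chains is also an assumption of this sketch.

```python
import math


def distances(pred, ref):
    """Per-residue distances between two equal-length lists of (x, y, z) points."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    return [math.dist(p, r) for p, r in zip(pred, ref)]


def rmsd(pred, ref):
    """Root-mean-square deviation over all matched residues."""
    d = distances(pred, ref)
    return math.sqrt(sum(x * x for x in d) / len(d))


def gdt(pred, ref, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Mean fraction of residues within each cutoff (GDT-TS cutoffs by default)."""
    d = distances(pred, ref)
    return sum(sum(x <= c for x in d) / len(d) for c in cutoffs) / len(cutoffs)


def tm_score(pred, ref):
    """TM-score-style value for a fixed superposition, normalised by chain length."""
    d = distances(pred, ref)
    length = len(d)
    # Length-dependent scale d0; clamped to 0.5 for short chains (sketch assumption).
    d0 = max(1.24 * (length - 15) ** (1 / 3) - 1.8, 0.5) if length > 15 else 0.5
    return sum(1 / (1 + (x / d0) ** 2) for x in d) / length
```

Passing `cutoffs=(0.5, 1.0, 2.0, 4.0)` to `gdt` gives the stricter GDT-HA variant. Scores are returned on a 0 to 1 scale, matching the table above.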



## Application Scenarios

### Algorithm Category
Protein structure prediction

### Key Application Industries
Healthcare, scientific research, education

## Pretrained Weights
Fast download center for pretrained weights: [SCNet AIModels](http://113.200.138.88:18080/aimodels). The pretrained weights used in this project can be downloaded from the fast channel: [alphafold](http://113.200.138.88:18080/aimodels/findsource-dependency/alphafold-params)

## Source Repository and Issue Reporting
* [https://developer.hpccube.com/codes/modelzoo/alphafold2_jax](https://developer.hpccube.com/codes/modelzoo/alphafold2_jax)

## References
* [https://github.com/deepmind/alphafold](https://github.com/deepmind/alphafold)