<!--
 * @Author: zhuww
 * @email: zhuww@sugon.com
 * @Date: 2023-03-31 17:09:07
 * @LastEditTime: 2023-08-24 09:07:01
-->
# FastFold
## Paper
- [https://arxiv.org/abs/2203.00854](https://arxiv.org/abs/2203.00854)

## Model architecture
The model is based on the Transformer architecture and consists of two main modules: the Evoformer (48 blocks) and the Structure module (8 blocks).

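## Algorithm
FastFold builds input features by searching for homologous sequences and templates, then predicts the protein structure with a protein structure prediction model whose inference performance has been optimized.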

## Environment setup
The Docker image for inference can be pulled from [光源 (SourceFind)](https://www.sourcefind.cn/#/service-details):
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:fastfold-0.2.1-centos7.6-dtk-22.10-patch4-py38-latest
docker run -it --name fastfold --shm-size=32G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video image.sourcefind.cn:5000/dcu/admin/base/custom:fastfold-0.2.1-centos7.6-dtk-22.10-patch4-py38-latest /bin/bash
```

Image version dependencies:
* DTK driver: dtk22.10
* Pytorch: 1.10
* fastfold: 0.2.1
* python: python3.8

Activate the environment inside the image:
`source /opt/dtk-22.10/env.sh`
`source /opt/openmm-dtk-22.10/env.sh`

Test directory:
`/opt/docker/test`

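## Datasets
We recommend the open datasets used by AlphaFold2, including BFD, MGnify, PDB70, Uniclust, and UniRef90. Together they take up about 3 TB.

We provide a script, download_all_data.sh, that downloads the datasets and model files:

    ./scripts/download_all_data.sh <dataset download directory>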
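
Because the download is large, it can help to confirm that the expected files are in place before starting inference. The following is a minimal sketch, not part of the repository: the helper name `missing_paths` is hypothetical, and the relative paths are copied from the inference commands in this README. It assumes that `pdb70`, `uniclust30`, and `bfd` are passed as file prefixes, so only their parent directories are checked, and that the model parameters live under `params/` inside the download directory.

```python
from pathlib import Path

# Relative paths taken from the inference commands in this README.
# pdb70, uniclust30 and bfd are assumed to be file prefixes rather than
# single files, so only the containing directory is checked for them.
EXPECTED_PATHS = [
    "pdb_mmcif/mmcif_files",
    "uniref90/uniref90.fasta",
    "mgnify/mgy_clusters_2018_12.fa",
    "pdb70",
    "uniclust30/uniclust30_2018_08",
    "bfd",
    "params/params_model_1.npz",  # assumed location of the model parameters
]


def missing_paths(data_dir):
    """Return the expected entries that do not exist under data_dir."""
    root = Path(data_dir)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]
```

Calling `missing_paths("data")` returns an empty list when every entry is present, and otherwise lists the relative paths that still need to be downloaded.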

## Inference
We provide PyTorch-based inference scripts for both monomers and multimers.
### Monomer

    python inference.py T1024.fasta data/pdb_mmcif/mmcif_files/ \
    --output_dir ./ \
    --gpus 2 \
    --use_precomputed_alignments alignments/ \
    --param_path /data/params/params_model_1.npz  \
    --uniref90_database_path data/uniref90/uniref90.fasta \
    --mgnify_database_path data/mgnify/mgy_clusters_2018_12.fa \
    --pdb70_database_path data/pdb70/pdb70 \
    --uniclust30_database_path data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --bfd_database_path data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --jackhmmer_binary_path `which jackhmmer` \
    --hhblits_binary_path `which hhblits` \
    --hhsearch_binary_path `which hhsearch` \
    --kalign_binary_path `which kalign` \
    --chunk_size 4 \
    --inplace

Alternatively, use `./inference.sh`.

#### Monomer inference parameters
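* `T1024.fasta`: the monomer sequence to run inference on. Replace `data` with the dataset download directory.
* `--output_dir`: the output directory.
* `--gpus`: the number of GPUs to use.
* `--use_precomputed_alignments`: the alignment directory, from which previously computed alignments are loaded. If this flag is omitted, the alignment search is run.
* `--param_path`: the path to the monomer model parameters. It must be consistent with `--model_name`, which defaults to model_1.
* `--chunk_size`: the chunking parameter, set to 4 here and used together with `--inplace` to reduce GPU memory usage.
* `--relaxation`: relaxation is performed by default; add this flag if it is not needed.
* `--save_outputs`: the output .pkl file is not saved by default; add this flag to save it.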


AlphaFold data preprocessing is time-consuming, so we use [ray](https://docs.ray.io/en/latest/workflows/concepts.html) to speed up the preprocessing workflow.
To run inference with the ray workflow, add the `--enable_workflow` argument to the command line or to the `./inference.sh` script.

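The monomer command and its optional flags can also be assembled programmatically, for example when launching several targets in a row. The sketch below is illustrative only: `build_monomer_command` is a hypothetical helper, it uses only flags that appear in this README, and it assumes the model parameters are stored under `params/` inside the dataset directory. The `--relaxation` flag is deliberately left out.

```python
import shutil


def build_monomer_command(fasta, data_dir="data", output_dir="./", gpus=2,
                          alignments_dir="alignments/", chunk_size=4,
                          inplace=True, save_outputs=False,
                          enable_workflow=False):
    """Assemble the monomer inference command as a list of arguments."""
    cmd = [
        "python", "inference.py", fasta, f"{data_dir}/pdb_mmcif/mmcif_files/",
        "--output_dir", output_dir,
        "--gpus", str(gpus),
        # Assumed location of the parameters inside the dataset directory.
        "--param_path", f"{data_dir}/params/params_model_1.npz",
        "--uniref90_database_path", f"{data_dir}/uniref90/uniref90.fasta",
        "--mgnify_database_path", f"{data_dir}/mgnify/mgy_clusters_2018_12.fa",
        "--pdb70_database_path", f"{data_dir}/pdb70/pdb70",
        "--uniclust30_database_path",
        f"{data_dir}/uniclust30/uniclust30_2018_08/uniclust30_2018_08",
        "--bfd_database_path",
        f"{data_dir}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt",
        "--chunk_size", str(chunk_size),
    ]
    for tool in ("jackhmmer", "hhblits", "hhsearch", "kalign"):
        # Mirrors `which <tool>`; falls back to the bare name if not on PATH.
        cmd += [f"--{tool}_binary_path", shutil.which(tool) or tool]
    if alignments_dir is not None:
        # Reuse existing alignments instead of running the search again.
        cmd += ["--use_precomputed_alignments", alignments_dir]
    if inplace:
        cmd.append("--inplace")
    if save_outputs:
        cmd.append("--save_outputs")
    if enable_workflow:
        cmd.append("--enable_workflow")
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains `inference.py`.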
### Multimer
    python inference.py SUGP1.fasta data/pdb_mmcif/mmcif_files/ \
    --output_dir ./ \
    --gpus 2 \
    --use_precomputed_alignments alignments/ \
    --model_preset multimer \
    --uniref90_database_path data/uniref90/uniref90.fasta \
    --mgnify_database_path data/mgnify/mgy_clusters_2018_12.fa \
    --pdb70_database_path data/pdb70/pdb70 \
    --uniclust30_database_path data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --bfd_database_path data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --uniprot_database_path data/uniprot/uniprot_sprot.fasta \
    --pdb_seqres_database_path data/pdb_seqres/pdb_seqres.txt  \
    --param_path data/params/params_model_1_multimer.npz \
    --model_name model_1_multimer \
    --jackhmmer_binary_path `which jackhmmer` \
    --hhblits_binary_path `which hhblits` \
    --hhsearch_binary_path `which hhsearch` \
    --kalign_binary_path `which kalign` \
    --chunk_size 4 \
    --inplace 

Alternatively, use `./inference_multimer.sh`.

#### Multimer inference parameters
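SUGP1.fasta is the multimer sequence to run inference on. `--param_path` is the path to the multimer model parameters and must be consistent with `--model_name`. All other parameters are the same as described for monomer inference.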

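Since `--param_path` and `--model_name` must agree, a small check before launching a run can catch mismatches early. This sketch assumes the naming convention `params_<model_name>.npz`, which is inferred from the two examples in this README (`params_model_1.npz` and `params_model_1_multimer.npz`) and is not confirmed elsewhere; `params_match_model` is a hypothetical helper.

```python
from pathlib import Path


def params_match_model(param_path, model_name):
    """Return True if the file name follows params_<model_name>.npz.

    The naming convention is an assumption inferred from the README examples.
    """
    return Path(param_path).name == f"params_{model_name}.npz"
```

For example, `params_match_model("data/params/params_model_1_multimer.npz", "model_1_multimer")` is true, while pairing the same file with `model_1` is not.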
## Results
The `--output_dir` directory is structured as follows:
```
alignments/
    <target_name>/
        bfd_uniclust_hits.a3m
        mgnify_hits.sto
        uniref90_hits.sto
        ...
{target_name}_{model_name}_output_dict.pkl
{target_name}_{model_name}_unrelaxed.pdb
{target_name}_{model_name}_relaxed.pdb
```
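
The file naming pattern above can be turned into a small helper for locating the outputs of a run. This is a sketch under assumptions: `expected_outputs` is a hypothetical function, and it assumes that the relaxed structure is written only when relaxation runs and that the `.pkl` file is written only when `--save_outputs` is given, as the parameter notes suggest.

```python
from pathlib import Path


def expected_outputs(output_dir, target_name, model_name,
                     relaxed=True, save_outputs=False):
    """List the output files expected for one target and model."""
    root = Path(output_dir)
    stem = f"{target_name}_{model_name}"
    files = [root / f"{stem}_unrelaxed.pdb"]
    if relaxed:
        # Written only when the relaxation step runs (assumption).
        files.append(root / f"{stem}_relaxed.pdb")
    if save_outputs:
        # Written only when --save_outputs is passed (assumption).
        files.append(root / f"{stem}_output_dict.pkl")
    return files
```

After a run, `[p for p in expected_outputs("./", "T1024", "model_1") if not p.exists()]` lists any file that was not produced. Here `T1024` as the target name is itself an assumption about how the name is derived from the FASTA input.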

## Accuracy
Test data: [casp14](https://www.predictioncenter.org/casp14/targetlist.cgi) and [uniprot](https://www.uniprot.org/). Accelerators used: 4 first-generation DCU cards (16 GB).

Accuracy results:

| Data type | Sequence type | Sequence label | Sequence length | GDT-TS | GDT-HA | LDDT | TM score | MaxSub | RMSD |
| :------: | :------: | :------: | :------: |:------: |:------: | :------: | :------: | :------: |:------: |
| fp32 | 单体 | T1026  | 172  | 0.914 | 0.765 | 79.634 | 0.941 | 0.907 | 1.289 |
| fp32 | 单体 | T1053  | 580  | 0.937 | 0.782 | 92.284 | 0.984 | 0.929 | 1.105 |
| fp32 | 单体 | Q9NYK1 | 1046 | 0.907 | 0.744 | 86.642 | 0.962 | 0.905 | 5.757 |

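For readers unfamiliar with the metrics in the table, the sketch below shows how GDT-TS, GDT-HA, and RMSD are commonly defined from per-residue distances between predicted and reference atoms. It is a simplification and not the procedure used to produce the numbers above, which this README does not describe. It assumes a single fixed superposition, whereas standard GDT tools search for the best superposition at each cutoff. All function names are illustrative.

```python
import math


def gdt(distances, cutoffs):
    """Average, over cutoffs, of the fraction of residues within each cutoff."""
    n = len(distances)
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    return sum(fractions) / len(cutoffs)


def gdt_ts(distances):
    """GDT-TS with the usual cutoffs of 1, 2, 4 and 8 angstroms."""
    return gdt(distances, (1.0, 2.0, 4.0, 8.0))


def gdt_ha(distances):
    """GDT-HA with the stricter cutoffs of 0.5, 1, 2 and 4 angstroms."""
    return gdt(distances, (0.5, 1.0, 2.0, 4.0))


def rmsd(distances):
    """Root-mean-square deviation from per-residue distances (angstroms)."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))
```

With the toy distances `[0.4, 0.9, 1.5, 3.0, 9.0]`, `gdt_ts` gives 0.65 and `gdt_ha` gives 0.5, which illustrates why GDT-HA is always at most GDT-TS for the same prediction.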
## Application scenarios

### Algorithm category
NLP

### Application industries
Healthcare, scientific research

### Framework
pytorch

## Source repository and issue feedback
* https://developer.hpccube.com/codes/modelzoo/FastFold

## References
* [https://github.com/deepmind/alphafold](https://github.com/deepmind/alphafold)
* [https://github.com/hpcaitech/FastFold](https://github.com/hpcaitech/FastFold)