<!--
 * @Author: zhuww
 * @email: zhuww@sugon.com
 * @Date: 2023-04-15 14:34:07
 * @LastEditTime: 2023-10-09 17:11:01
-->
# ProteinMPNN
## Paper
- [https://www.biorxiv.org/content/10.1101/2022.06.03.494563v1](https://www.biorxiv.org/content/10.1101/2022.06.03.494563v1)

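## Model Architecture
The model is an MPNN with 3 encoder layers, 3 decoder layers, and a hidden dimension of 128. It takes protein backbone features as input (distances between Cα-Cα atoms, relative Cα-Cα-Cα frame orientations and rotations, and backbone dihedral angles) and predicts the protein sequence autoregressively from the N-terminus to the C-terminus.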

![img](./docs/proteinmpnn.png)
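
To make one of the backbone input features in this section more concrete, here is a minimal sketch of Cα-Cα distances to each residue's nearest neighbors. It is not the repository's implementation: the function name `ca_knn_graph`, the toy coordinates, and the value of `k` are illustrative assumptions.

```python
import math


def ca_knn_graph(ca_coords, k=2):
    """For each residue, return its k nearest C-alpha neighbors as (index, distance) pairs."""
    neighbors = []
    for i, xi in enumerate(ca_coords):
        # Euclidean distance from residue i to every other residue.
        dists = [(math.dist(xi, xj), j) for j, xj in enumerate(ca_coords) if j != i]
        dists.sort()
        neighbors.append([(j, d) for d, j in dists[:k]])
    return neighbors


# Toy backbone (assumption): 5 residues on a straight line, 3.8 angstroms apart.
ca = [(3.8 * i, 0.0, 0.0) for i in range(5)]
graph = ca_knn_graph(ca, k=2)
for residue, nbrs in enumerate(graph):
    print(residue, [(j, round(d, 2)) for j, d in nbrs])
```

A real feature pipeline would also encode the frame orientations and dihedral angles listed above; this sketch covers only the distance part.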

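## Algorithm Principle
ProteinMPNN is a message passing neural network (MPNN) for protein sequence design. The model takes a protein backbone structure (optionally together with partial sequence information) as input and outputs amino acid sequences that are expected to fold into that structure.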

![img](./docs/proteinmpnn_1.png)
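
The autoregressive N-to-C-terminus decoding mentioned under the model architecture can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_logits` is a hypothetical stand-in for the trained decoder (it returns pseudo-random scores instead of scores computed from backbone features), and the temperature value is arbitrary.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_logits(position, decoded_prefix):
    """Hypothetical decoder stand-in: one score per amino acid.

    A real model would compute these from the backbone features and the
    residues decoded so far; here they are deterministic pseudo-random numbers.
    """
    rng = random.Random(position * 1000 + len(decoded_prefix))
    return [rng.gauss(0.0, 1.0) for _ in AMINO_ACIDS]


def sample_sequence(length, temperature=0.1, seed=0):
    """Sample one residue at a time, from the N-terminus to the C-terminus."""
    rng = random.Random(seed)
    seq = []
    for pos in range(length):
        logits = toy_logits(pos, seq)
        top = max(logits)
        # Softmax weights with temperature; subtracting the max avoids overflow.
        weights = [math.exp((x - top) / temperature) for x in logits]
        seq.append(rng.choices(AMINO_ACIDS, weights=weights, k=1)[0])
    return "".join(seq)


print(sample_sequence(12))
```

Lower temperatures concentrate the sampling on the highest-scoring residue at each position, while higher temperatures give more diverse sequences.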

## Environment Setup
Pull the Docker image for inference from [SourceFind (光源)](https://www.sourcefind.cn/#/service-details):
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:proteinmpnn-dtk-24.04.2-patch4-py310
# <Image ID>: replace with the ID of the Docker image pulled above
# <Host Path>: path on the host
# <Container Path>: mapped path inside the container
docker run -it --name proteinmpnn --shm-size=32G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal:ro -v <Host Path>:<Container Path> <Image ID> /bin/bash
```

Image version dependencies:
* DTK driver: dtk24.04.2
* PyTorch: 2.3.0
* Torchvision >= 0.18.1
* Torchaudio >= 2.1.2
* Python: 3.10

Activate the image environment:
`source /opt/dtk-24.04.2/env.sh`

Test directory:
`/opt/ProteinMPNN-main`

## Dataset
The model dataset is [PDB biunits 2021/08/02](https://files.ipd.uw.edu/pub/training_sets/pdb_2021aug02.tar.gz), which is 16.5 GB in size.
A small sample of this dataset for testing, [PDB biunits sample 2021/08/02](https://files.ipd.uw.edu/pub/training_sets/pdb_2021aug02_sample.tar.gz), is 47 MB in size.  
```
pdb_2021aug02_sample/                             
    pdb/
        l3/  
            1l30_A.pt
            1l30.pt
            1l3n_B.pt
            ...
    valid_clusters.txt
    test_clusters.txt
    README
    list.csv
```
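
As a quick sanity check on an extracted copy of the dataset, the `.pt` files in the tree above can be grouped by PDB entry. This is a sketch under assumptions: the helper `index_pt_files` is hypothetical, and it assumes that names like `1l30_A.pt` are per-chain files while names like `1l30.pt` are per-entry files, based only on the naming pattern shown.

```python
from collections import defaultdict
from pathlib import Path


def index_pt_files(dataset_root):
    """Group .pt files under <dataset_root>/pdb/<subdir>/ by PDB ID."""
    entries = defaultdict(lambda: {"entry_file": None, "chain_files": {}})
    for path in sorted(Path(dataset_root, "pdb").glob("*/*.pt")):
        # "1l30_A" -> ("1l30", "_", "A"); "1l30" -> ("1l30", "", "")
        pdb_id, _, chain = path.stem.partition("_")
        if chain:
            entries[pdb_id]["chain_files"][chain] = path
        else:
            entries[pdb_id]["entry_file"] = path
    return dict(entries)


# Example (path is an assumption, adjust to where the archive was extracted):
# index = index_pt_files("/opt/ProteinMPNN-main/pdb_2021aug02_sample")
# print(len(index), "entries;", sorted(index)[:5])
```

Entries whose `entry_file` is `None` or whose `chain_files` is empty would point to an incomplete extraction.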

## Training
### Single Node, Single Card

    pip install python-dateutil
    cd /opt/ProteinMPNN-main/training

    python ./training.py \
           --path_for_outputs "<directory for saved models>" \
           --path_for_training_data "<dataset download and extraction path>/pdb_2021aug02" \
           --num_epochs <number of epochs to train> \
           --save_model_every_n_epochs <interval, in epochs, between saved model weights>

    # Usage example
    #python ./training.py \
    #       --path_for_outputs "./exp_020" \
    #       --path_for_training_data "/opt/ProteinMPNN-main/pdb_2021aug02_sample" \
    #       --num_epochs 100 \
    #       --save_model_every_n_epochs 2
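
When training runs are launched from a script, the same command can be assembled programmatically. The sketch below uses only the four flags shown above; the helper name `build_training_command` is an assumption, and the values are the ones from the usage example.

```python
import shlex


def build_training_command(output_dir, data_dir, num_epochs, save_every):
    """Assemble the training.py argument list from the flags documented above."""
    return [
        "python", "./training.py",
        "--path_for_outputs", output_dir,
        "--path_for_training_data", data_dir,
        "--num_epochs", str(num_epochs),
        "--save_model_every_n_epochs", str(save_every),
    ]


cmd = build_training_command(
    "./exp_020",
    "/opt/ProteinMPNN-main/pdb_2021aug02_sample",
    num_epochs=100,
    save_every=2,
)
print(shlex.join(cmd))
# To launch it, run from /opt/ProteinMPNN-main/training:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.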

## Inference
PyTorch-based inference scripts are provided for both monomers and multimers.
### Monomer

    cd  /opt/ProteinMPNN-main/examples
    ./submit_example_1.sh

### Multimer

    cd  /opt/ProteinMPNN-main/examples
    ./submit_example_2.sh

## Results
```
training/
    exp_020/
        model_weights/
            epoch_last.pt
    log.txt
outputs/
    example_1_outputs/
        seqs/
            5L33.fa
            6MRR.fa
        parsed_pdbs.jsonl
    example_2_outputs/
        seqs/
            3HTN.fa
            4YOW.fa
        assigned_pdbs.jsonl
        parsed_pdbs.jsonl
```

### Accuracy
Test data: `/opt/ProteinMPNN-main/inputs`. Accelerator used: 1 × DCU K100_AI-64G.

Accuracy results:

| Batch size | Data type | Sequence type | Sequence label | Sequence length | Sequence recovery (%) |
| :------: | :------: | :------: |:------: |:------: |:------: |
| 1 | fp32 | Monomer | 5L33 | 106 | 46.23 |
| 1 | fp32 | Monomer | 6MRR | 68  | 57.35 |
| 1 | fp32 | Multimer | 3HTN | 429 | 61.35 |
| 1 | fp32 | Multimer | 4YOW | 693 | 64.32 |
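
Sequence recovery is the percentage of positions at which the designed sequence has the same residue as the native sequence. A minimal sketch of that calculation follows; the function name and the two example sequences are made up for illustration, and this is not the script used to produce the table.

```python
def sequence_recovery(native, designed):
    """Percentage of positions where the designed residue equals the native one."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(native, designed))
    return 100.0 * matches / len(native)


# Made-up 8-residue example: 6 of 8 positions match.
print(sequence_recovery("MKTAYIAK", "MKSAYLAK"))  # 75.0
```

For scale, 49 matching positions out of 106 would give 46.23%, the value listed for the monomer 5L33.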

## Application Scenarios

### Algorithm Category
Protein structure prediction

### Key Application Industries
Healthcare, scientific research, education

## Source Repository and Issue Feedback
* [https://developer.sourcefind.cn/codes/modelzoo/proteinmpnn_pytorch](https://developer.sourcefind.cn/codes/modelzoo/proteinmpnn_pytorch)
## References
* [https://github.com/dauparas/ProteinMPNN](https://github.com/dauparas/ProteinMPNN)