# Uni-Fold

## Paper
`Uni-Fold: An Open-Source Platform for Developing Protein Folding Models beyond AlphaFold`
- https://www.biorxiv.org/content/biorxiv/early/2022/08/06/2022.08.04.502811.full.pdf

## Model Architecture
The core of the model is a Transformer-based neural network with two main components, a Sequence to Sequence Model and a Structure Model. The two components are optimized through iterative training to improve prediction accuracy.

![img](./alphafold2.png)

## Algorithm
A neural network model predicts the three-dimensional structure of a protein from information extracted from protein sequence and structure data.

![img](./alphafold2_1.png)

## Environment Setup
A Docker image for training can be pulled from [光源 (SourceFind)](https://www.sourcefind.cn/#/service-details):

```
docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:unifold-latest
# Replace docker_name with a container name of your choice and imageID with the ID of the pulled image
docker run -it -v /path/your_code_data/:/path/your_code_data/ --shm-size=32G --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name docker_name imageID bash

cd /root/Uni-Fold-main
```

The tools listed in requirement.txt are already installed in the image. Add them to the environment as follows:

```
export PATH=/root/software/hmmer/bin${PATH:+:${PATH}}
export PATH=/root/software/hh-suite-master/bin${PATH:+:${PATH}}
export PATH=/root/software/kalign/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/root/software/hh-suite-master/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```
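
After exporting the paths, it can help to confirm that the alignment tools actually resolve. The following is a minimal sketch, not part of the repository: `missing_tools` is a hypothetical helper, and the command names checked (`jackhmmer` from HMMER, `hhblits` and `hhsearch` from HH-suite, `kalign` from Kalign) are assumed to be the binaries the pipeline needs.

```shell
# Hypothetical helper: print every given command that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed binaries: jackhmmer (HMMER), hhblits and hhsearch (HH-suite), kalign (Kalign).
missing=$(missing_tools jackhmmer hhblits hhsearch kalign)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "All alignment tools found."
fi
```

If any tool is reported as missing, re-check the corresponding `export PATH=...` line for that package.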

## Dataset
The open-source datasets used by [AlphaFold2](http://113.200.138.88:18080/aidatasets/project-dependency/alphafold) are recommended, including BFD, MGnify, PDB70, Uniclust, and Uniref90. Together they take up about 2.62 TB. The directory layout is as follows:
```
$DOWNLOAD_DIR/
    bfd/
        bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
        bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
        bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
        ...
    mgnify/
        mgy_clusters_2022_05.fa
    params/
        params_model_1.npz
        params_model_2.npz
        params_model_3.npz
        ...
    pdb70/
        pdb_filter.dat
        pdb70_hhm.ffindex
        pdb70_hhm.ffdata
        ...
    pdb_mmcif/
        mmcif_files/
            100d.cif
            101d.cif
            101m.cif
            ...
        obsolete.dat
    pdb_seqres/
        pdb_seqres.txt
    small_bfd/
        bfd-first_non_consensus_sequences.fasta
    uniref30/
        UniRef30_2021_03_hhm.ffindex
        UniRef30_2021_03_hhm.ffdata
        UniRef30_2021_03_cs219.ffindex
        ...
    uniprot/
        uniprot.fasta
    uniref90/
        uniref90.fasta
```
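
The layout above can be checked before a run. This is a rough sketch under stated assumptions: `missing_db_dirs` is a hypothetical helper, `DOWNLOAD_DIR` is assumed to point at the database root, and the check only looks for the top-level subdirectories shown in the listing, not for the individual files inside them.

```shell
# Hypothetical helper: print each expected top-level subdirectory that is absent under $1.
missing_db_dirs() {
    root=$1
    for sub in bfd mgnify params pdb70 pdb_mmcif pdb_seqres small_bfd uniref30 uniprot uniref90; do
        [ -d "$root/$sub" ] || printf '%s\n' "$sub"
    done
}

# DOWNLOAD_DIR is an assumption; the current directory is used when it is unset.
missing_db_dirs "${DOWNLOAD_DIR:-.}"
```

An empty output means every listed subdirectory exists. Any name printed is a directory that still needs to be downloaded or linked.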

The script download_all_data.sh downloads the datasets and model files used here:

```
bash scripts/download/download_all_data.sh /path/to/database/directory
```
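
Because the datasets take about 2.62 TB, checking free disk space before starting the download can save a failed run. The sketch below is illustrative only: `has_free_space` is a hypothetical helper, `DB_DIR` is an assumed variable for the target directory, and the 2700 GiB threshold is an assumption rounded up from the stated size (compressed archives may temporarily need more).

```shell
# Hypothetical helper: succeed if the filesystem holding directory $1 has at least $2 GiB available.
has_free_space() {
    # df -Pk prints one POSIX-format line per filesystem; field 4 is the available space in KiB.
    avail_kb=$(df -Pk "$1" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}

# DB_DIR is an assumption; point it at the database directory before running.
DB_DIR=${DB_DIR:-.}
if has_free_space "$DB_DIR" 2700; then
    echo "Enough free space in $DB_DIR"
else
    echo "Less than 2700 GiB free in $DB_DIR"
fi
```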

## Inference
### Installation
#### Install Uni-Core-main (not needed when using the image)
```
cd Uni-Core-main
export CUDA_HOME=/opt/dtk-22.04.2
python3 setup.py install
```

#### Install Uni-Fold-main (not needed when using the image)
```
pip install -e .
```
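
A quick import check can confirm that both installations succeeded. This is a sketch under assumptions: `py_module_available` is a hypothetical helper, and the module names `unicore` and `unifold` are assumed to be the import names of Uni-Core and Uni-Fold.

```shell
# Hypothetical helper: succeed if python3 can import module $1.
py_module_available() {
    python3 -c "import $1" >/dev/null 2>&1
}

# unicore and unifold are assumed import names for Uni-Core and Uni-Fold.
for mod in unicore unifold; do
    if py_module_available "$mod"; then
        echo "$mod: importable"
    else
        echo "$mod: NOT importable"
    fi
done
```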

### Multi-Card Testing
#### Multimer reference script (adjust the path settings to your environment)
```
sh run_multimer.sh
```

#### Monomer reference script (adjust the path settings to your environment)
```
sh run_monomer.sh
```

## Results
![img](./result_pdb.png)

### Accuracy


## Application Scenarios
### Algorithm Category
Protein structure prediction

### Key Application Industries
Healthcare, scientific research, education

## Source Repository and Issue Reporting
- https://developer.hpccube.com/codes/modelzoo/uni-fold

## References
- https://github.com/dptech-corp/Uni-Fold