---

title: Iterative_masking


keywords: fastai
sidebar: home_sidebar

summary: "Use MSA Transformer to generate synthetic protein sequences by iteratively masking the same MSA."
description: "Use MSA Transformer to generate synthetic protein sequences by iteratively masking the same MSA."
nb_path: "00_core.ipynb"
---
<!--

#################################################
### THIS FILE WAS AUTOGENERATED! DO NOT EDIT! ###
#################################################
# file to edit: 00_core.ipynb
# command to build the docs after a change: nbdev_build_docs

-->

<div class="container" id="notebook-container">
        
    {% raw %}
    
<div class="cell border-box-sizing code_cell rendered">

</div>
    {% endraw %}

    {% raw %}
    
<div class="cell border-box-sizing code_cell rendered">

</div>
    {% endraw %}

    {% raw %}
    
<div class="cell border-box-sizing code_cell rendered">

<div class="output_wrapper">
<div class="output">

<div class="output_area">


<div class="output_markdown rendered_html output_subarea ">
<h2 id="IM_MSA_Transformer" class="doc_header"><code>class</code> <code>IM_MSA_Transformer</code><a href="" class="source_link" style="float:right">[source]</a></h2><blockquote><p><code>IM_MSA_Transformer</code>(<strong><code>iterations</code></strong>=<em><code>None</code></em>, <strong><code>p_mask</code></strong>=<em><code>None</code></em>, <strong><code>filename</code></strong>=<em><code>None</code></em>, <strong><code>num</code></strong>=<em><code>None</code></em>, <strong><code>filepath</code></strong>=<em><code>None</code></em>)</p>
</blockquote>
<p>Class that implements the iterative masking algorithm.</p>

</div>

</div>

</div>
</div>

</div>
    {% endraw %}

    {% raw %}
    
<div class="cell border-box-sizing code_cell rendered">

<div class="output_wrapper">
<div class="output">

<div class="output_area">


<div class="output_markdown rendered_html output_subarea ">
<h4 id="IM_MSA_Transformer.Batch_MSA" class="doc_header"><code>IM_MSA_Transformer.Batch_MSA</code><a href="__main__.py#L303" class="source_link" style="float:right">[source]</a></h4><blockquote><p><code>IM_MSA_Transformer.Batch_MSA</code>(<strong><code>use_pdf</code></strong>=<em><code>False</code></em>, <strong><code>simplified</code></strong>=<em><code>False</code></em>, <strong><code>repetitions</code></strong>=<em><code>2</code></em>, <strong><code>sample_all</code></strong>=<em><code>False</code></em>, <strong><code>T</code></strong>=<em><code>1</code></em>, <strong><code>phylo</code></strong>=<em><code>False</code></em>)</p>
</blockquote>
<p>Generate a full MSA by calling the iterative MSA generator defined in <code>self.NEW_MSA</code>
with different input MSAs.</p>
<p>Note: use this function with <code>simplified</code>=False only if you need the tokens on CUDA (i.e. if you want to compute embeddings
     or contacts); otherwise use <code>simplified</code>=True.</p>
<p>The variable <code>self.iterations</code> must be a numpy array specifying the iterations at which
the tokens are saved. Its last element gives the maximum number of iterations to run.</p>
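<p>The role of <code>self.iterations</code> can be sketched as follows. This is a minimal illustration, not the library's code: the helper <code>run_with_checkpoints</code>, the toy <code>step</code> function and the 1-based iteration count are assumptions, and a plain list stands in for the numpy array to keep the sketch dependency-free.</p>

```python
def run_with_checkpoints(iterations, step):
    """Apply `step` repeatedly and keep a snapshot at each listed iteration.

    `iterations` is a sorted sequence of positive iteration counts. The
    library expects a numpy array; a list is used here (assumption).
    """
    save_at = set(int(i) for i in iterations)
    max_iter = int(iterations[-1])  # last element = total number of iterations
    state = 0                       # hypothetical stand-in for the token tensor
    snapshots = {}
    for it in range(1, max_iter + 1):
        state = step(state)         # one masking-and-prediction round
        if it in save_at:
            snapshots[it] = state   # save only at the requested iterations
    return snapshots


# Toy "masking step" that only counts how many times it was applied.
snapshots = run_with_checkpoints([10, 50, 200], step=lambda s: s + 1)
```

<p>Here the loop runs 200 times and keeps three snapshots, taken after iterations 10, 50 and 200.</p>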
<p><code>repetitions</code>: number of times <code>self.NEW_MSA()</code> is repeated, each time with a different input MSA.</p>
<p><code>use_pdf</code>: if True, the function samples each token from the pdf given by the logits
            instead of taking the argmax (greedy sampling).</p>
<p><code>sample_all</code>: if True, all new tokens are obtained from the logits (both
            masked and non-masked positions); if False, the non-masked tokens
            are left untouched and only the masked ones are changed.</p>
<p><code>T</code>: temperature used when sampling from the pdf of the output logits.</p>
<p><code>phylo</code>: if True, the start sequences are sampled according to the phylogeny weights instead of uniformly at random.</p>
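<p>The interplay of <code>use_pdf</code>, <code>T</code> and <code>sample_all</code> can be illustrated with a small pure-Python sketch. The helper names (<code>softmax</code>, <code>pick_token</code>, <code>update_row</code>) are hypothetical, and dividing the logits by <code>T</code> before the softmax is the usual convention, assumed here rather than taken from the source.</p>

```python
import math
import random


def softmax(logits, T=1.0):
    """Convert logits to probabilities at temperature T (assumed convention)."""
    scaled = [x / T for x in logits]
    top = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits, use_pdf=False, T=1.0, rng=random):
    """Return a token index: the argmax if use_pdf is False, else a sample."""
    if not use_pdf:
        return max(range(len(logits)), key=lambda i: logits[i])
    return rng.choices(range(len(logits)), weights=softmax(logits, T), k=1)[0]


def update_row(tokens, mask, logits, use_pdf=False, sample_all=False, T=1.0, rng=random):
    """Replace masked positions only, or every position if sample_all is True."""
    return [
        pick_token(row_logits, use_pdf, T, rng) if (sample_all or is_masked) else tok
        for tok, is_masked, row_logits in zip(tokens, mask, logits)
    ]


tokens = [3, 1, 2]
mask = [False, True, False]
logits = [[0.1, 2.0, 0.3, 0.0], [1.5, 0.2, 0.1, 0.0], [0.0, 0.0, 0.1, 3.0]]

greedy_masked = update_row(tokens, mask, logits)                 # only position 1 changes
greedy_all = update_row(tokens, mask, logits, sample_all=True)   # every position changes
sampled = update_row(tokens, mask, logits, use_pdf=True, T=0.5, rng=random.Random(0))
```

<p>Lower values of <code>T</code> concentrate the sampling distribution on the highest logit, so sampling approaches the greedy choice as <code>T</code> goes to zero.</p>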

</div>

</div>

</div>
</div>

</div>
    {% endraw %}
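<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The effect of the <code>phylo</code> option of <code>Batch_MSA</code> can be pictured as a weighted draw of start sequences. The sketch below is only an illustration: the helper <code>sample_start_indices</code> is hypothetical, the weights are taken as given, and drawing without replacement is an assumption, since the source does not say how the library draws.</p>

```python
import random


def sample_start_indices(n_seqs, num, weights=None, rng=random):
    """Pick `num` distinct row indices, uniformly or proportionally to `weights`."""
    if weights is None:
        return rng.sample(range(n_seqs), num)   # plain random choice
    remaining = list(range(n_seqs))
    w = list(weights)
    chosen = []
    for _ in range(num):
        # Draw one position proportionally to the remaining weights,
        # then remove it so that it cannot be drawn again.
        pos = rng.choices(range(len(remaining)), weights=w, k=1)[0]
        chosen.append(remaining.pop(pos))
        w.pop(pos)
    return chosen


rng = random.Random(0)
uniform_pick = sample_start_indices(4, 2, rng=rng)
# Hypothetical weights: sequence 0 has weight zero and is never selected.
weighted_pick = sample_start_indices(4, 3, weights=[0.0, 1.0, 1.0, 0.5], rng=rng)
```

<p>With these weights, the three selected indices are always 1, 2 and 3 in some order.</p>
</div>
</div>
</div>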

    {% raw %}
    
<div class="cell border-box-sizing code_cell rendered">

<div class="output_wrapper">
<div class="output">

<div class="output_area">


<div class="output_markdown rendered_html output_subarea ">
<h4 id="IM_MSA_Transformer.Context_MSA" class="doc_header"><code>IM_MSA_Transformer.Context_MSA</code><a href="__main__.py#L448" class="source_link" style="float:right">[source]</a></h4><blockquote><p><code>IM_MSA_Transformer.Context_MSA</code>(<strong><code>depth</code></strong>=<em><code>None</code></em>, <strong><code>ancestor</code></strong>=<em><code>None</code></em>, <strong><code>context</code></strong>=<em><code>None</code></em>, <strong><code>use_pdf</code></strong>=<em><code>False</code></em>, <strong><code>simplified</code></strong>=<em><code>False</code></em>, <strong><code>sample_all</code></strong>=<em><code>False</code></em>, <strong><code>print_all</code></strong>=<em><code>True</code></em>, <strong><code>T</code></strong>=<em><code>1</code></em>)</p>
</blockquote>
<p>Generate a new MSA by context generation, iteratively masking the original ancestor sequence
with <code>self.generate_MSA_context</code>. It masks <code>ancestor</code> (the original sequence) and uses the sequences in <code>context</code> as the context MSA.</p>
<p>Note: use this function with <code>simplified</code>=False only if you need the tokens on CUDA (i.e. if you want to compute embeddings
     or contacts); otherwise use <code>simplified</code>=True.</p>
<p>The variable <code>self.iterations</code> must be a numpy array specifying the iterations at which
the tokens are saved. Its last element gives the maximum number of iterations to run.
If <code>print_all</code>=True, the generated sequences are saved at each iteration.</p>
<p><code>ancestor</code>:     input sequence to be masked iteratively.</p>
<p><code>context</code>:      context MSA (not masked).</p>
<p><code>use_pdf</code>:      if True, the function samples each token from the pdf given by the logits
                instead of taking the argmax (greedy sampling).</p>
<p><code>sample_all</code>:   if True, all new tokens are obtained from the logits (both
                masked and non-masked positions); if False, the non-masked tokens
                are left untouched and only the masked ones are changed.</p>
<p><code>T</code>:            temperature used when sampling from the pdf of the output logits.</p>
<p><code>depth</code>:        number of sequences to generate; if None, the depth is the number of ancestor sequences.</p>
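<p>The masking pattern described above, where only the ancestor is masked and the context rows are left intact, can be sketched in plain Python. The helper <code>mask_ancestor</code>, the <code>#</code> mask symbol and the choice to place the ancestor in the first row are assumptions made for illustration; the library works on token tensors with the model's own mask token.</p>

```python
import random

MASK = "#"  # hypothetical mask symbol; the real model uses its own mask token


def mask_ancestor(ancestor, context, p_mask, rng=random):
    """Return an input MSA whose first row is the masked ancestor.

    Each ancestor position is masked independently with probability p_mask;
    the context rows are copied unchanged.
    """
    masked = "".join(MASK if p_mask > rng.random() else aa for aa in ancestor)
    return [masked] + list(context)


ancestor = "MKTAYIAK"
context = ["MKSAYIAR", "MRTAYLAK"]
msa = mask_ancestor(ancestor, context, p_mask=0.25, rng=random.Random(1))
```

<p>In an iterative scheme, the model's prediction for the first row would replace the ancestor before the next masking round, while <code>context</code> stays the same.</p>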

</div>

</div>

</div>
</div>

</div>
    {% endraw %}

    {% raw %}
    
<div class="cell border-box-sizing code_cell rendered">

</div>
    {% endraw %}

    {% raw %}
    
<div class="cell border-box-sizing code_cell rendered">

<div class="output_wrapper">
<div class="output">

<div class="output_area">


<div class="output_markdown rendered_html output_subarea ">
<h4 id="gen_MSAs" class="doc_header"><code>gen_MSAs</code><a href="__main__.py#L6" class="source_link" style="float:right">[source]</a></h4><blockquote><p><code>gen_MSAs</code>(<strong><code>filepath</code></strong>:"Path of the input directory", <strong><code>filename</code></strong>:"Name of the input file(s)", <strong><code>new_dir</code></strong>:"Name of the output directory", <strong><code>pdf</code></strong>:"Should I sample tokens from the pdf ? (bool)", <strong><code>T</code></strong>:"Which is the sampling Temperature from the pdf ? (only when <code>pdf</code> is True)", <strong><code>sample_all</code></strong>:"Should I sample all tokens or just the masked ones ? (True = sample all tokens)", <strong><code>Iters</code></strong>:"Number of total iterations to generate the new tokens", <strong><code>pmask</code></strong>:"Masking probability", <strong><code>num</code></strong>:"Size of the batches MSAs which the MSA-Transformer receives as input", <strong><code>depth</code></strong>:"Number of batches (of size num) that you want to generate", <strong><code>generate</code></strong>:"How should I generate sequences ? False (=Batch generation) or Linear with context (=linear-ran/linear-tot-ran), <code>-ran</code> means that the context MSA is sampled randomly (once) while <code>-tot-ran</code> means that it is sampled randomly each time.", <strong><code>print_all</code></strong>:"Should I print the MSA after each iteration ? (bool)", <strong><code>range_vals</code></strong>:"First and last index of the sequences that you want to use as ancestors", <strong><code>phylo_w</code></strong>:"Should I sample the starting sequences from the phylogeny weights ? (bool)")</p>
</blockquote>
<p>Generate a new MSA with either batch generation or context generation. It shuffles the initial MSA and uses different slices as batch MSAs.</p>
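<p>The shuffle-and-slice step can be pictured with the following sketch, where <code>num</code> is the batch size and <code>depth</code> the number of batches, as in the signature above. The helper <code>batch_slices</code> is hypothetical, and raising an error when the MSA is too small is an assumption, not documented behaviour.</p>

```python
import random


def batch_slices(msa, num, depth, rng=random):
    """Shuffle the rows of `msa` and return `depth` consecutive slices of size `num`."""
    if num * depth > len(msa):
        # Assumed behaviour: refuse rather than reuse sequences.
        raise ValueError("not enough sequences for the requested batches")
    rows = list(msa)          # copy so the caller's MSA is not reordered
    rng.shuffle(rows)
    return [rows[i * num:(i + 1) * num] for i in range(depth)]


msa = ["seq%d" % i for i in range(10)]
batches = batch_slices(msa, num=3, depth=3, rng=random.Random(0))
```

<p>Each batch would then serve as one input MSA for the generator, so no sequence is shared between batches.</p>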

</div>

</div>

</div>
</div>

</div>
    {% endraw %}

<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Build-library">Build library<a class="anchor-link" href="#Build-library"> </a></h2>
</div>
</div>
</div>
</div>