---
title: Iterative_masking
keywords: fastai
sidebar: home_sidebar
summary: "Use MSA Transformer to generate synthetic protein sequences by iteratively masking the same MSA."
description: "Use MSA Transformer to generate synthetic protein sequences by iteratively masking the same MSA."
nb_path: "00_core.ipynb"
---
class IM_MSA_Transformer
IM_MSA_Transformer(iterations=None,p_mask=None,filename=None,num=None,filepath=None)
Class that implements the iterative masking algorithm.
IM_MSA_Transformer.Batch_MSA
IM_MSA_Transformer.Batch_MSA(use_pdf=False,simplified=False,repetitions=2,sample_all=False,T=1,phylo=False)
Generates a full MSA by calling the iterative MSA generator, self.NEW_MSA, once per repetition,
each time with a different input MSA.
Use simplified=False only if you need the tokens on the CUDA device (i.e. if you want to compute
embeddings or contacts); otherwise use simplified=True.
The variable self.iterations must be a numpy array listing the iterations at which the tokens
are saved. Its last element is the total number of iterations to run.
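The checkpoint schedule described above can be sketched as follows. This is a minimal illustration, not the library's code: `run_with_checkpoints` and `step` are hypothetical names, `step` stands in for one masking and prediction pass, and counting iterations from 1 is an assumption.

```python
def run_with_checkpoints(step, state, iterations):
    """Apply `step` repeatedly and save a copy of the state at each checkpoint.

    `iterations` plays the role of self.iterations: the iteration counts at
    which to save, with the last element giving the total number of passes.
    `step` is a hypothetical stand-in for one masking/prediction pass.
    """
    checkpoints = set(int(i) for i in iterations)
    max_iter = int(iterations[-1])
    saved = {}
    for it in range(1, max_iter + 1):  # assumption: iterations are counted from 1
        state = step(state)
        if it in checkpoints:
            saved[it] = list(state)  # copy so later passes cannot alter it
    return saved


# Toy step that increments every token id (stand-in for the model).
saved = run_with_checkpoints(lambda s: [t + 1 for t in s], [0, 0, 0], [10, 50, 200])
print(sorted(saved))  # [10, 50, 200]
```

The function only indexes and iterates over `iterations`, so a numpy array such as `np.array([10, 50, 200])` can be passed in the same way as the list used here.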
repetitions: number of times self.NEW_MSA() is called, each time with a different input MSA.
use_pdf: if True, the tokens are sampled from the probability distribution given by the logits;
if False, the argmax of the logits is taken (greedy decoding).
sample_all: if True, all the new tokens are obtained from the logits (both
the masked and the unmasked ones); if False, the unmasked tokens
are left untouched and only the masked ones are changed.
T: temperature used when sampling from the pdf of the output logits.
phylo: if True, the starting sequences are sampled according to the phylogeny weights instead of at random.
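The interplay of use_pdf, sample_all and T can be illustrated with a small standard-library sketch. The helper names (`softmax`, `pick_token`, `update_tokens`) are hypothetical, and the assumption that T divides the logits before a softmax is the usual convention rather than something this page confirms; the library itself works on model tensors.

```python
import math
import random


def softmax(logits, T=1.0):
    """Temperature-scaled softmax over the logits of one position."""
    scaled = [x / T for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits, use_pdf=False, T=1.0, rng=random):
    """Greedy argmax by default; sample from softmax(logits / T) if use_pdf."""
    if not use_pdf:
        return max(range(len(logits)), key=logits.__getitem__)
    return rng.choices(range(len(logits)), weights=softmax(logits, T))[0]


def update_tokens(tokens, mask, logits, use_pdf=False, sample_all=False, T=1.0, rng=random):
    """Return new tokens; unmasked positions are kept unless sample_all is True."""
    return [
        pick_token(row, use_pdf, T, rng) if (sample_all or masked) else tok
        for tok, masked, row in zip(tokens, mask, logits)
    ]


tokens = [5, 7, 9]
mask = [False, True, False]
logits = [[0.1, 2.0, 0.3], [3.0, 0.2, 0.1], [0.0, 0.5, 4.0]]
print(update_tokens(tokens, mask, logits))                   # [5, 0, 9]
print(update_tokens(tokens, mask, logits, sample_all=True))  # [1, 0, 2]
```

Under this convention a small T concentrates the distribution on the argmax, while a large T flattens it towards uniform sampling.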
IM_MSA_Transformer.Context_MSA
IM_MSA_Transformer.Context_MSA(depth=None,ancestor=None,context=None,use_pdf=False,simplified=False,sample_all=False,print_all=True,T=1)
Generates a new MSA by context generation: the masking is iterated on the original ancestor sequence
using self.generate_MSA_context. Only ancestor (the original sequence) is masked; the sequences in context are used as the context MSA.
Use simplified=False only if you need the tokens on the CUDA device (i.e. if you want to compute
embeddings or contacts); otherwise use simplified=True.
The variable self.iterations must be a numpy array listing the iterations at which the tokens
are saved. Its last element is the total number of iterations to run.
If print_all=True, the generated sequences are saved at every iteration.
ancestor: input sequence to be masked iteratively.
context: context MSA (not masked).
use_pdf: if True, the tokens are sampled from the probability distribution given by the logits;
if False, the argmax of the logits is taken (greedy decoding).
sample_all: if True, all the new tokens are obtained from the logits (both
the masked and the unmasked ones); if False, the unmasked tokens
are left untouched and only the masked ones are changed.
T: temperature used when sampling from the pdf of the output logits.
depth: number of sequences to generate; if None, the depth is the number of ancestor sequences.
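How one input for context generation might be assembled is sketched below, under stated assumptions: `mask_ancestor` and the `MASK` symbol are hypothetical, each ancestor position is masked independently with probability p_mask, and the masked ancestor is placed in the first row. The page does not confirm either choice.

```python
import random

MASK = "<mask>"  # placeholder mask symbol, an assumption for this sketch


def mask_ancestor(ancestor, context, p_mask, rng=random):
    """Build one model input: a masked copy of `ancestor` on top of `context`.

    Each ancestor position is masked independently with probability p_mask
    (an assumption); the context sequences are passed through unchanged.
    """
    mask = [rng.random() < p_mask for _ in ancestor]
    masked = [MASK if m else tok for tok, m in zip(ancestor, mask)]
    return [masked] + [list(seq) for seq in context], mask


ancestor = list("MKVLA")
context = ["MKILA", "MRVLG"]
msa, mask = mask_ancestor(ancestor, context, p_mask=0.4, rng=random.Random(0))
print(len(msa), sum(mask))  # number of rows, number of masked positions
```

Iterating would then mean feeding `msa` to the model, replacing the masked positions of the first row with the predicted tokens, and masking the updated row again.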
gen_MSAs
gen_MSAs(
    filepath:"Path of the input directory",
    filename:"Name of the input file(s)",
    new_dir:"Name of the output directory",
    T:"Which is the sampling Temperature from the pdf ? (only when
    sample_all:"Should I sample all tokens or just the masked ones ? (True = sample all tokens)",
    Iters:"Number of total iterations to generate the new tokens",
    pmask:"Masking probability",
    num:"Size of the batches MSAs which the MSA-Transformer receives as input",
    depth:"Number of batches (of size num) that you want to generate",
    generate:"How should I generate sequences ? False (=Batch generation) or Linear with context (=linear-ran/linear-tot-ran), -ran means that the context MSA is sampled randomly (once) while -tot-ran means that it is sampled randomly each time.",
    print_all:"Should I print the MSA after each iteration ? (bool)",
    range_vals:"First and last index of the sequences that you want to use as ancestors",
    phylo_w:"Should I sample the starting sequences from the phylogeny weights ? (bool)")
Generates a new MSA with either Batch generation or Context generation. The initial MSA is shuffled and different slices of it are used as batch MSAs.
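The shuffle-and-slice step can be pictured with the following sketch. `batch_slices` is a hypothetical helper, not part of the library; `num` and `depth` mirror the gen_MSAs parameters, and dropping incomplete trailing batches is an assumption.

```python
import random


def batch_slices(msa, num, depth, rng=random):
    """Shuffle a copy of `msa` and return up to `depth` disjoint slices of size `num`."""
    shuffled = list(msa)  # copy so the caller's MSA keeps its order
    rng.shuffle(shuffled)
    return [
        shuffled[i * num:(i + 1) * num]
        for i in range(depth)
        if (i + 1) * num <= len(shuffled)  # assumption: skip incomplete batches
    ]


msa = [f"seq{i}" for i in range(10)]
batches = batch_slices(msa, num=3, depth=3, rng=random.Random(0))
print([len(b) for b in batches])  # [3, 3, 3]
```

Each slice would then serve as the input MSA for one call of the batch generator.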