README: Small improvements, info regarding templates

Commit 9e32781f authored Oct 25, 2023 by Sachin Kadyan

Showing with 4 additions and 2 deletions

README.md +4 -2

--- a/README.md
+++ b/README.md
@@ -235,13 +235,13 @@ at once. The `run_pretrained_openfold.py` script can enable this config option w
 #### SoloSeq Inference
 To run inference for a sequence using the SoloSeq single-sequence model, you can either precompute ESM-1b embeddings in bulk, or you can generate them during inference.

-For generating ESM-1b embeddings in bulk, use the provided script: `scripts/precompute_embeddings.py`. The script takes a directory of FASTA files and generates ESM-1b embeddings in the same format and directory structure as required by SoloSeq. Following is an example command to use the script:
+For generating ESM-1b embeddings in bulk, use the provided script: `scripts/precompute_embeddings.py`. The script takes a directory of FASTA files (one sequence per file) and generates ESM-1b embeddings in the same format and directory structure as required by SoloSeq. Following is an example command to use the script:

 ```bash
 python scripts/precompute_embeddings.py fasta_dir/ embeddings_output_dir/
 ```

-In the same per-label subdirectories inside `embeddings_output_dir`, you can also place `*.hhr` files (outputs from HHSearch), which can contain the details about the structures that you want to use as templates. If you do not place any such file, templates will not be used and only the ESM-1b embeddings will be used to predict the structure.
+In the same per-label subdirectories inside `embeddings_output_dir`, you can also place `*.hhr` files (outputs from HHSearch), which can contain the details about the structures that you want to use as templates. If you do not place any such file, templates will not be used and only the ESM-1b embeddings will be used to predict the structure. If you want to use templates, you need to pass the PDB MMCIF dataset to the command.

 Now, you are ready to run inference:
 ```bash
@@ -271,6 +271,8 @@ python3 run_pretrained_openfold.py \
    --kalign_binary_path lib/conda/envs/openfold_venv/bin/kalign \
 ```

+For generating template information, you will need the UniRef90 and PDB70 databases and the JackHmmer and HHSearch binaries. 
+
 SoloSeq allows you to use the same flags and optimizations as the MSA-based OpenFold. For example, you can skip relaxation using `--skip_relaxation`, save all model outputs using `--save_outputs`, and generate output files in MMCIF format using `--cif_output`.

 **NOTE:** Due to the nature of the ESM-1b embeddings, the sequence length for inference using the SoloSeq model is limited to 1022 residues. Sequences longer than that will be truncated.
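
The two input constraints mentioned in this change (one sequence per FASTA file, and the 1022-residue limit) can be checked before running `scripts/precompute_embeddings.py`. The following is a minimal sketch of such a pre-flight check, not part of OpenFold: the helper names `read_fasta` and `check_fasta_dir` are hypothetical, and the accepted file extensions (`.fasta`, `.fa`) are an assumption about how the input directory is laid out.

```python
import sys
from pathlib import Path

MAX_LEN = 1022  # SoloSeq sequence-length limit stated in the README note
FASTA_SUFFIXES = {".fasta", ".fa"}  # assumption: extensions used in fasta_dir/


def read_fasta(path):
    """Return a list of (header, sequence) pairs parsed from one FASTA file."""
    records = []
    header, chunks = None, []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_fasta_dir(fasta_dir):
    """Return human-readable warnings for files that break the constraints."""
    problems = []
    for path in sorted(Path(fasta_dir).iterdir()):
        if path.suffix.lower() not in FASTA_SUFFIXES:
            continue
        records = read_fasta(path)
        if len(records) != 1:
            problems.append(
                f"{path.name}: expected 1 sequence, found {len(records)}"
            )
        for header, seq in records:
            if len(seq) > MAX_LEN:
                problems.append(
                    f"{path.name}: '{header}' has {len(seq)} residues "
                    f"(> {MAX_LEN}) and would be truncated"
                )
    return problems


if __name__ == "__main__" and len(sys.argv) == 2:
    for problem in check_fasta_dir(sys.argv[1]):
        print(problem)
```

Saved under a name of your choosing (for example `check_fasta.py`, a hypothetical filename), it could be run as `python check_fasta.py fasta_dir/`; no output means no file violated either constraint.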